When we think of a line, we think of something like this: ______ . It’s flat. Maybe it slants up, maybe it slants down, but it’s straight, continues in a predictable direction and has one number to describe its slope. In statistics, when we use the term “linear model,” we are not necessarily describing a straight line. A statistical linear model can describe the classic straight line, but many linear models are represented by curvilinear graphs instead. Both shapes in this picture are “linear”:
Why does this merit a blog post? For the third time this month, someone has been surprised to hear me say that “linear” does not mean “straight line.”
What is a linear model?
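In statistics, “linear” describes how the parameters enter the model, not the shape of the fitted curve. A model is linear when the expected response (the y variable) is a linear combination of the parameters: each parameter multiplies an explanatory term, and the terms are added together. The explanatory variables can be discrete or continuous, and a term can itself be a transformed variable, such as x squared or log(x). That is how a linear model can draw a curve.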
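To make that concrete, here is a minimal sketch in plain Python (not R, and not what any particular package does internally). The data are made up and noise-free, and the helper name fit_linear_model is hypothetical. It fits y = b0 + b1*x + b2*x^2 by ordinary least squares. The fitted shape is a parabola, but the model is linear because b0, b1 and b2 are simply multiplied by known columns and added up.

```python
def fit_linear_model(columns, y):
    """Ordinary least squares by solving the normal equations (X'X) b = X'y.

    `columns` is a list of explanatory columns, one per parameter.
    """
    k = len(columns)
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(ci * yi for ci, yi in zip(columns[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Made-up data that follow a curve: y = 2 + 0.5*x - 0.3*x^2
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 + 0.5 * xi - 0.3 * xi ** 2 for xi in x]

# Three additive terms: an intercept column, x, and x squared
design = [[1.0] * len(x), x, [xi ** 2 for xi in x]]
b0, b1, b2 = fit_linear_model(design, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

The x squared column is just another explanatory term as far as the fitting is concerned, which is why a curved fit can still come from a linear model.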
Why do we use linear models?
Ecologists use linear models because most of the time they are extremely useful for predicting actual response variables. We use them because they work! We can check whether a model works for our own data set by looking at its residuals. A residual is the difference between an observed data point and the value the model predicts for it. If we plot the residuals, we should see a random scatter around a flat line at zero. If we see a U-shape or an inverted U-shape instead, the model as specified may not be a good fit. A U-shape means that the residuals are not randomly distributed across the data set, which indicates non-linearity.
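Here is a small sketch of that residual check, again in plain Python with made-up numbers and a hypothetical helper called fit_straight_line. It fits a straight line to data that actually follow y = x^2 and then looks at the residuals in order of x.

```python
def fit_straight_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up curved data: y = x^2
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [xi ** 2 for xi in x]

intercept, slope = fit_straight_line(x, y)

# Residual = observed value minus predicted value
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print([round(r, 6) for r in residuals])
# prints [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle. That is the U-shape: the residuals still add up to zero, but they are clearly not a random scatter.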
How do we deal with non-linearity?
Most of the time we don’t. Non-linear models are more difficult to deal with. Instead, we can apply non-linear transformations that turn a non-linear relationship between variables into a linear one.
What transformations do we use?
There are five basic nonlinear transformations: exponential, quadratic, reciprocal, logarithmic and power. For a handy-dandy chart on how to perform each one, see the chart here. Applying a transformation to either the independent or dependent variable (sometimes both) changes the relationship between them. We use the transformation to increase linearity, and thus we need to re-check the residual plots after transformation to ensure that we get closer to the random scatterplot for residuals (we don’t want to exacerbate that U-shape). Plotting and checking is always a good test.
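As one example of transform-then-recheck, the sketch below uses made-up, noise-free data that grow exponentially, y = 3 * exp(0.4 * x). It fits a straight line before and after taking the log of y. The helper name fit_straight_line and the numbers are assumptions for illustration only. Real data will have noise, so the residuals after transformation will scatter around zero rather than vanish.

```python
import math


def fit_straight_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals_of(x, y, intercept, slope):
    """Observed minus predicted, in the same order as x."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up exponential data
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0 * math.exp(0.4 * xi) for xi in x]

# Before transformation: the straight-line residuals form a U-shape
raw_intercept, raw_slope = fit_straight_line(x, y)
raw_residuals = residuals_of(x, y, raw_intercept, raw_slope)

# After a logarithmic transformation of y: log(y) = log(3) + 0.4 * x
log_y = [math.log(yi) for yi in y]
log_intercept, log_slope = fit_straight_line(x, log_y)
log_residuals = residuals_of(x, log_y, log_intercept, log_slope)

print("raw residual signs:", ["+" if r > 0 else "-" for r in raw_residuals])
print("largest residual after log transform:", max(abs(r) for r in log_residuals))
```

On the log scale the slope recovers the growth rate (0.4) and the intercept is log(3), so predictions need to be back-transformed with exp() to return to the original units.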
Key take-away: linearity can imply straight lines but we need to be careful where we look for those lines. Looking in the residuals we want a flat line with zero slope. Looking at our model fit, those curvilinear lines are a-okay.
Mathematical ecology club is moving through more complicated examples of how and when to transform data, and what R does internally when we specify linear models. If you are interested in learning how to use statistical models correctly with your data, and in understanding what you are doing with each R command, pick up a copy of the Marc Kéry book and join us on Sunday mornings at Saints.
More stats posts to come!