## 3. Regression III

### Introduction

Thus far in this series we’ve been focusing on how to perform a linear regression analysis: what error-minimization models are available, the circumstances under which different linear models may be appropriate, and the effects of data standardization. By this time you should understand that linear regression is actually a more complex topic than is usually portrayed in textbooks, but you should also feel confident that this complexity can be turned to your advantage in planning and carrying out such analyses. Before we plunge even deeper into this subject, though, it would be a good idea to devote at least one column to the topic of interpreting the results of a regression analysis. Here we’ll see, as we have before, that things are not as straightforward as they might seem.

### Result Interpretation

Let’s start by considering the very first regression result we obtained for our example trilobite data from the first essay in this series (see Regression 1 and the Regression 1 worksheet). In that first analysis we used the standard least-squares error-minimization model to estimate the following linear relation between glabellar length (x) and glabellar width (y) that minimized deviation in the latter.

(3.1) \[y = 0.64x + 2.51\]

Is this a good model of variation within those data? Is it adequate for the purpose of data characterization? Or could these numbers be hiding some underlying pattern that might affect our interpretation of these data? By just looking at (or presenting) the raw slope and intercept coefficients it’s difficult to know what to make of this result other than to accept that this is the equation of the straight line which minimizes deviation along the glabellar-width axis. The problem, of course, is that the same line may be calculated for a large number of data geometries (Fig. 1).

Given the vastly different interpretations we would give to the glabellar length-width relationship under these different circumstances (e.g., a regular and predictable relation in Fig. 1A, no relation whatsoever in Fig. 1D), and the failure of the estimated regression lines to distinguish between these alternative patterns per se, it seems clear that we need to look beyond the simple calculation of a linear regression line’s equation if we are to use regression to gain a deeper understanding of our variables (which, after all, is the real point of any quantitative data analysis). Fortunately, our friends over in the statistics department have anticipated our problem and relatively easy-to-apply solutions are available.

Obviously, the four cases illustrated in Figure 1 differ from one another primarily in the magnitude and distribution of the deviations between the linear model’s predictions—represented by the regression line—and the observed data points. Although we previously referred to these deviations as errors, they are also often referred to as ‘residuals’ in the sense of being the portion of the observed data left over after the (in this case) linear pattern has been accounted for, or ‘explained’, by the regression model. Residuals are easiest to calculate for least-squares regressions, where they are defined as the difference between the observed y-value (\(y_i\)) and the y-value predicted by the regression equation (\(\hat{y}_i\))^{1},

(3.2a) \[y'_i = y_i-\hat{y}_i\]

(3.2b) \[y'_i = y_i - (mx_i+b)\]

where m and b represent the least-squares regression slope and y-intercept respectively. Figure 2 illustrates the residual plots for each of the alternative data geometries shown in Figure 1.
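The residual calculation in equation 3.2b is easy to express in code. The sketch below is a minimal illustration (the data values are invented for demonstration):

```python
def residuals(x, y, m, b):
    """Least-squares residuals: observed y minus the y predicted
    by the regression line y-hat = m*x + b (equation 3.2b)."""
    return [yi - (m * xi + b) for xi, yi in zip(x, y)]

# Invented example: points scattered about the line y = 0.5x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [1.6, 1.9, 2.6, 2.9]
print(residuals(x, y, 0.5, 1.0))  # deviations above/below the line
```

Plotting these residual values against *x* produces exactly the kind of diagnostic scatter shown in Figure 2.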

Although the data patterns in Figure 1 were devised to be obvious, it is, in most instances, a good idea to inspect residual plots like these to determine whether a linear regression analysis has been an appropriate model for the characterization of any particular dataset. If your residual plot looks like the pattern shown in Figure 2A, a linear regression analysis was appropriate. More specifically, in order to be regarded as appropriate the residual plot should (1) exhibit no linear trend in itself, (2) form a scatter about a flat, horizontal line through the residual bivariate mean whose deviations approximate a normal distribution, and (3) be arranged such that the standard deviation of variability in the residual values is the same everywhere along the regression line, regardless of the x-axis interval selected (a feature known by the rather scary term ‘homoscedasticity’). Any other residual plot geometry is problematic.

In Figure 2B, even though the data pattern contains a consistent linear trend, the fact that the residuals exhibit such a well-defined, non-random structure suggests the linear model did not provide an appropriate characterization. In this case, the marked wave-like pattern indicates that the positions of successive points are dependent on the positions of preceding points (the signal of strong autocorrelation). The form of this particular autocorrelated pattern indicates use of a curvilinear regression model would be preferable.

Figure 2C represents a problem of a different kind; one that is even more common than autocorrelation. In this instance the standard deviation of the residuals from the linear model increases as a function of the *x*-variable’s magnitude. This is the ‘heteroscedastic’ condition. Pronounced heteroscedasticity usually means that large variable values are not being estimated with the same precision as smaller values. Alternatively, this situation could signal the effects of a fundamental structural constraint, such as allometry. Regardless of its origin, the greater variability of the data at one or the other end of the distribution affects the precision with which a linear relation can be estimated. The usual solution to this problem is to apply a data transformation (e.g., a logarithmic transform) that minimizes the effect of such scaling differences, and then repeat the regression analysis on the transformed data.
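The transform-then-refit step might be sketched as follows (a hypothetical illustration; the variable names are my own, and the data are invented):

```python
import math

def log_transform(values):
    """Natural-log transform, commonly used to stabilize variance
    when spread increases with magnitude (heteroscedasticity)."""
    return [math.log(v) for v in values]

# After transforming one or both variables, the regression is
# simply re-run on the transformed values.
lengths = [5.1, 10.0, 21.5]   # hypothetical measurements (mm)
print(log_transform(lengths))
```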

Finally, the somewhat artificial example of Figure 2D is there to remind us that linear regression analysis can find structure in data that deviate only slightly from being structure-less. In this instance, the appearance of a non-zero regression slope is due solely to a slight asymmetry in the density of the data distribution. The corresponding residual plot also suggests marked non-linear (and autocorrelated) structure.

Graphically determining whether a linear regression model is appropriate for a bivariate dataset is only the beginning. From here it is appropriate to ask how well the linear model fits these data. This question is usually approached by partitioning the variance exhibited by the observed data into parts. Once again, the calculations are most easily approached for least-squares regression analysis, since the variance of interest is subsumed entirely by a single variable (the *y*-axis).

As we discussed last time, the variance is based on calculation of the sum of the squared deviations from the sample mean.

(3.3) \[SS_T = \sum(y_i - \overline{y})^2\]

If this quantity is taken to be the total sum of squares (\(SS_T\)), another quantity can be defined as the sum of squares of the regression (\(SS_R\)) and calculated as the sum of the squared deviations of the *y*-values estimated by the regression equation (\(\hat{y}_i\)) from the sample mean.

(3.4) \[SS_R = \sum(\hat{y}_i - \overline{y})^2\]

In the simplest case, where all data points lie precisely on the regression line, \(SS_T = SS_R\). In all practical cases, though, there will be some degree of deviation between the observed and estimated points. The difference between these values, then, is a measure of the dispersion of residual values about the regression line, a quantity termed the sum of squares of the error.

(3.5) \[SS_E = SS_T - SS_R\]

While the \(SS_E\) is a measure of the absolute error of the regression, an arguably more useful comparative statistic can be determined from the ratio of \(SS_R\) to \(SS_T\). This is the so-called \(r^2\) value, also termed the coefficient of determination.

(3.6) \[r^2 = \frac{SS_R}{SS_T}\]

Whereas the \(SS_E\) can adopt any positive value and is sensitive to scaling differences between the variables of different regression analyses, \(r^2\) is a measure of relative dispersion and is constrained to vary between 0 and 1. Thus, this statistic measures the goodness of fit between the data and a linear model in a manner that is comparable across analyses involving different datasets. Because it is a ratio, \(r^2\) also has no units (whereas the \(SS_T\), \(SS_R\), and \(SS_E\) unit is the square of whatever unit the variable was measured in). For this reason \(r^2\) is often reported as a percentage. If \(r^2 = 0\), no linear relation exists between *x* and *y*. If \(r^2 = 1\), a scatterplot of *x* and *y* forms a perfectly straight line.

If you read statistical textbooks you will often come across statements to the effect that the square root of \(r^2\) is the ‘correlation coefficient’ (usually represented symbolically by *r*). This is not precisely true. We will discuss the correlation coefficient in detail in a future essay. For now, suffice it to say that *r* is related to \(r^2\) mathematically, but differs in that *r* may adopt any value between –1 and 1 whereas \(r^2\) is constrained to vary between 0 and 1 (see equations 3.3–3.6). Thus, \(\sqrt{r^2}\), which is technically called the multiple correlation coefficient and which I will refer to as \(r_m\), is also constrained to vary between 0 and 1, unlike the Pearson product-moment correlation coefficient (*r*).
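The variance partition in equations 3.3 through 3.6 can be computed directly. The sketch below is a minimal illustration; note that the shortcut \(SS_E = SS_T - SS_R\) holds exactly only when the slope and intercept are the least-squares estimates:

```python
def regression_fit_stats(x, y, m, b):
    """Partition the total variance for a least-squares fit
    y-hat = m*x + b (equations 3.3-3.6)."""
    n = len(y)
    y_bar = sum(y) / n
    y_hat = [m * xi + b for xi in x]
    ss_t = sum((yi - y_bar) ** 2 for yi in y)      # total sum of squares (3.3)
    ss_r = sum((yh - y_bar) ** 2 for yh in y_hat)  # regression sum of squares (3.4)
    ss_e = ss_t - ss_r                             # error sum of squares (3.5)
    r2 = ss_r / ss_t                               # coefficient of determination (3.6)
    return ss_t, ss_r, ss_e, r2

# Perfect fit: all variance is 'explained' by the regression
print(regression_fit_stats([1, 2, 3], [2, 4, 6], 2, 0))  # (8.0, 8.0, 0.0, 1.0)
```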

**r**If you have had a course in statistics or read the regression section of a statistics text you will know that the ‘significance’ of a least-squares linear regression is typically tested with an *F*-statistic. The *F*-distribution (which was named for the eminent British statistician Sir Ronald Fisher) is the distribution of values expected by drawing two samples of varying sizes randomly from a normal distribution and calculating, for all possible combinations of such samples, the ratio of their variances. Because the precision with which the population variance can be estimated by samples varies with the size of the sample, the form of the *F*-distribution changes for different combinations of sample sizes. Hence, the large number of tables devoted to the *F*-distribution in textbook appendices.

Basically, the *F*-distribution tabulates how different the ratios of sample variances drawn from the same normal distribution can be expected to be. If the observed variance ratio exceeds (say) 95 per cent of these expected values, one is justified in rejecting the null hypothesis that the samples were drawn from the same distribution. An *F*-test is typically applied to regression problems by comparing the ratio of the variances attributed to the regression and residual (or error) sums of squares. By choosing this statistic to examine we are implicitly asking the question ‘Did performing a linear regression analysis have any effect on reducing the deviation of observations from the sample mean?’ If the regression analysis ‘explains’ a large component of variation observed in the sample, this should be reflected in a marked difference between \(SS_R\) and \(SS_E\), such that the distribution of the values used in their calculation differs from the ‘single sample’ expectations of the *F*-test.

In order to perform this *F*-test we first need to change the \(SS_R\) and \(SS_E\) values into variances. As we saw in the last essay, this is done by turning these component estimates of the total deviation into mean values. The mean squared deviation (= variance) of the total deviation is calculated by dividing \(SS_T\) by one less than the total sample size (\(n-1\); see the previous essay for an explanation of this convention). The mean squared deviation (= variance) of the values estimated by the regression is calculated by dividing \(SS_R\) by one less than the number of factors involved in the regression’s calculation. Since, for a linear regression, there are two such factors (the slope and the *y*-intercept), this value is (2-1), or 1. The mean squared deviation (= variance) of the residual values is calculated by dividing \(SS_E\) by the difference between these two denominators, \((n-1)-1\), or \(n-2\). Once these values are in hand, the *F*-statistic is calculated as follows.

(3.7) \[ F = \frac{SS_R/1}{SS_E/(n-2)} \]
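Equation 3.7 is one line of code. Fed the sums of squares reported for the trilobite example later in this essay (\(SS_R = 223.95\), \(SS_E = 62.97\), \(n = 18\)), it reproduces the tabulated *F*-statistic:

```python
def f_statistic(ss_r, ss_e, n):
    """F-ratio for a linear regression (equation 3.7): the regression
    mean square (df = 1) over the error mean square (df = n - 2)."""
    return (ss_r / 1) / (ss_e / (n - 2))

# Sums of squares from the trilobite glabellar regression
print(round(f_statistic(223.95, 62.97, 18), 1))  # 56.9
```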

To illustrate these calculations on a real dataset, let us return to the trilobite glabellar data from the first essay of this series.

| | Genus | Length (mm) | Width (mm) |
|---|---|---|---|
| 1 | Acaste | 5.10 | 3.46 |
| 2 | Balizonma | 4.60 | 6.53 |
| 3 | Calymene | 12.98 | 14.15 |
| 4 | Ceraurus | 7.90 | 5.32 |
| 5 | Cheirurus | 12.83 | 12.96 |
| 6 | Cybantyx | 16.41 | 13.08 |
| 7 | Cybeloides | 6.60 | 6.84 |
| 8 | Dalmanites | 10.00 | 9.12 |
| 9 | Delphion | 8.08 | 10.77 |
| 10 | Narroia | 15.67 | 9.25 |
| 11 | Ormathops | 4.53 | 4.11 |
| 12 | Phacopdina | 6.44 | 6.94 |
| 13 | Pricyclopyge | 21.53 | 14.64 |
| 14 | Ptychoparia | 12.82 | 9.36 |
| 15 | Rhenops | 22.27 | 17.56 |
| 16 | Sphaerexochus | 4.93 | 6.21 |
| 17 | Trimerus | 16.35 | 15.02 |
| 18 | Zachanthoides | 13.41 | 8.51 |

The equation of the least-squares regression through these data is \(y = 0.64x + 2.51\). What we now want to know is: Is this a good linear model of these data? Is a linear model appropriate for these data? And has this linear model had any significant effect on the error structure of these data?

We start graphically, by calculating the least-squares regression residuals and plotting these on a scatter diagram (Fig. 3).
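The fit and its residuals can be computed directly from the data table above. The sketch below reproduces the published slope and intercept (0.64 and 2.51 after rounding):

```python
lengths = [5.10, 4.60, 12.98, 7.90, 12.83, 16.41, 6.60, 10.00, 8.08,
           15.67, 4.53, 6.44, 21.53, 12.82, 22.27, 4.93, 16.35, 13.41]
widths  = [3.46, 6.53, 14.15, 5.32, 12.96, 13.08, 6.84, 9.12, 10.77,
           9.25, 4.11, 6.94, 14.64, 9.36, 17.56, 6.21, 15.02, 8.51]

n = len(lengths)
x_bar = sum(lengths) / n
y_bar = sum(widths) / n

# Least-squares slope and intercept (y on x)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(lengths, widths))
sxx = sum((x - x_bar) ** 2 for x in lengths)
m = sxy / sxx
b = y_bar - m * x_bar
print(round(m, 2), round(b, 2))  # 0.64 2.51

# Residuals of the kind plotted in Figure 3
resid = [y - (m * x + b) for x, y in zip(lengths, widths)]
```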

Overall these residuals look pretty good. While there is an aspect of heteroscedasticity in the fact that the two most extreme values are separated from the remainder by what appears to be a noticeable gap, and both ends of the scatter exhibit a somewhat smaller deviation from the regression line than the middle group of points, I would not judge these to be serious problems for this dataset. Such effects could easily be a by-product of the dataset's small size. Based on this graphic evidence, I would conclude that a linear model was appropriate. But is this particular linear model a good fit? To assess that we use equations 3.3 to 3.6 to complete the following table.

| Regression Equation | \(y = 0.64x + 2.51\) |
|---|---|
| \(SS_T\) | 286.92 mm^{2} |
| \(SS_R\) | 223.95 mm^{2} |
| \(SS_E\) | 62.97 mm^{2} |
| \(r^2\) | 0.78 |

Based on these statistics there does seem to be a large amount of the total dispersion accounted for by the regression (\(SS_R\)) (implying a corresponding marked reduction in the residual dispersion \(SS_E\)). The \(r^2\) value estimates the proportional variance accounted for by the regression at 78 per cent and provides us with a means for comparing this result to the results of other regression analyses. But is this error reduction result good enough to be considered statistically significant? To ask this question another way, ‘Can I assure myself that the difference between \(SS_R\) and \(SS_E\) reflected by the linear trend accounts for a meaningful proportion of the variance in the population from which the sample was drawn as measured against the expectation of results that might be obtained by chance alone?’ To answer this question we use the terms of equation 3.7 to complete the following table.

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F-statistic |
|---|---|---|---|---|
| Regression | 223.95 mm^{2} | 1 | 223.95 mm^{2} | 56.90 |
| Error | 62.97 mm^{2} | 16 | 3.93 mm^{2} | |
| Total | 286.92 mm^{2} | 17 | 16.88 mm^{2} | |

Consulting a table of *F*-statistic critical values for a significance level (one-tailed) of 0.05, degrees of freedom (numerator) of 1, and degrees of freedom (denominator) of 16 we see that any *F*-statistic \(\geq 4.49\) is considered adequate for rejecting the hypothesis that there is no difference between the variance attributable to the regression and the residual (or error) variance with a 95 per cent certainty. Since our calculated *F*-statistic is much greater than this critical value, the least-squares regression of these data has passed the statistical significance test.
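If you have SciPy available, the critical value can be looked up instead of consulting a printed table (a sketch; it assumes the `scipy` package is installed):

```python
from scipy.stats import f

# 95th percentile of the F-distribution with 1 and 16 degrees of freedom
critical = f.ppf(0.95, dfn=1, dfd=16)
print(round(critical, 2))  # 4.49

# Our observed F-statistic of 56.90 far exceeds this critical value
print(56.90 > critical)    # True
```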

Up to this point we’ve been focusing on least-squares regression because the equations used to quantify the quality of such regressions are fairly straightforward, easy to calculate, and backed up by extensive theoretical investigation. In no small measure, this is because least-squares regression concerns itself with deviation in only a single variable. Reduced major axis (RMA) and major axis (MA) regressions are different beasts entirely because they both involve estimation of the joint variation of both *x* and *y* variables. This is much more complex and, for some reason, has not been explored extensively by theoreticians in the context of regression analysis. With a little imagination and a bit of plane geometry, though, one can imagine what an equivalent set of calculations would look like for the RMA and MA cases.

Reasoning from analogy, the univariate, least-squares \(SS_T\) and \(SS_R\) parameters, both of which are represented as squared deviations (distances) from \(\overline{y}\), need to be replaced by equivalent squared deviations of the observed and estimated data points referenced to the bivariate or joint mean (\(\overline{x},\overline{y}\)). In the case of the \(SS_T\) this is a simple sum of squares of the straight-line (Euclidean) distances between each point in the dataset and the bivariate mean.

(3.8) \[SS_T = \sum\left[(x_i-\overline{x})^2 + (y_i-\overline{y})^2\right]\]

The \(SS_R\) value, then, represents an equivalent quantity after the estimated positions of the observed data points have been projected onto the linear regression line.

(3.9) \[SS_R = \sum\left[(\hat{x}_i-\overline{x})^2 + (\hat{y}_i-\overline{y})^2\right]\]

The trick in dealing with RMA and MA regressions lies in obtaining these estimated (\(\hat{x}_i, \hat{y}_i\)) values. Recall that, unlike least-squares regressions, both RMA and MA regressions estimate regression lines by minimizing the combined lengths of the line segments between the observed data points and the regression model arranged such that each segment is oriented perpendicular to the model (see Regression 1, Fig. 4; Regression 2, Fig. 2). Given these geometric relations, the estimated (\(\hat{x}_i, \hat{y}_i\)) values for a RMA or MA regression must be obtained by projecting the observed data points onto the regression line in a manner that respects this error-minimization geometry. One way of accomplishing this is by determining the intersection point between the regression line and the equations of a series of lines that pass through each data point and have slopes perpendicular to that of the regression line (Fig. 4).

Equations you can use to project data points (\(x_i,y_i\)) onto a RMA or MA regression line follow.

Slope of a line perpendicular to regression line \(y = mx +b\)

(3.10) \[m_{residual} = \frac{-1}{m}\]

Intercept of a line perpendicular to regression line \(y = mx +b\) and passing through point (\(x_i,y_i\))

(3.11) \[b_{residual} = y_i - [(x_i)(m_{residual})]\]

Projected \(x_i\)-value (\(\hat{x}_i\)) onto regression line \(y = mx +b\).

(3.12) \[\hat{x}_i = (-1)\left[\frac{(b-b_{residual})}{(m-m_{residual})}\right]\]

Projected \(y_i\)-value (\(\hat{y}_i\)) onto regression line \(y = mx +b\).

(3.13) \[\hat{y}_i = m\hat{x}_i + b\]
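Equations 3.10 through 3.13 translate directly into code. The sketch below (the function name is my own) projects a point perpendicularly onto a regression line:

```python
def project_onto_line(xi, yi, m, b):
    """Perpendicular projection of the point (xi, yi) onto the line
    y = m*x + b (equations 3.10-3.13). Assumes m != 0."""
    m_res = -1.0 / m                    # slope perpendicular to the line (3.10)
    b_res = yi - xi * m_res             # intercept of the perpendicular through the point (3.11)
    x_hat = -(b - b_res) / (m - m_res)  # x-coordinate of the intersection (3.12)
    y_hat = m * x_hat + b               # y-coordinate of the intersection (3.13)
    return x_hat, y_hat

# The point (0, 2) projects onto the line y = x at (1, 1)
print(project_onto_line(0.0, 2.0, 1.0, 0.0))  # (1.0, 1.0)
```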

Once these values have been obtained, quantities analogous to \(SS_E\) and/or \(r^2\) can be calculated (see equations 3.5 and 3.6). The Excel worksheet that accompanies this column summarizes these calculations for the trilobite glabellar data. As you will see, use of the RMA and MA approaches to estimating a linear model improves the overall fit to these data markedly.

Let me close this column with a few observations, an encouragement, and an exception. The complications involved in obtaining (\(\hat{x}_i, \hat{y}_i\)) values for RMA and MA regressions (equations 3.10 through 3.13) go some way to explaining why few statistical textbooks discuss these regression models. Whereas \(\hat{y}_i\) can be obtained by a simple subtraction and \(\hat{x}_i = x_i\) in the world of least squares, a rather complicated and non-intuitive geometric construction was necessary to accomplish the same estimations for linear models determined under the RMA and MA conventions. The methods I have outlined above, however, show that, contrary to many textbooks, it is not correct to say that RMA and MA regressions cannot be used to estimate the values of such variables. Least-squares regression still has the edge in terms of being able to estimate a value of *y* in the absence of any information other than the corresponding value of *x*. But, given an observed point (*x*, *y*) and the equation of a RMA or MA regression line, either \(\hat{x}\) or \(\hat{y}\) or both can be estimated; the calculations required are just a bit more involved. To the extent that the regression relation is the ‘true’ relation between two variables, and deviations from the regression model are ‘true’ error, the equations I develop above allow RMA and MA regressions to be used in a manner completely analogous to least-squares regressions.

Moreover, because \(\hat{x}\) and \(\hat{y}\) can be estimated for RMA and MA regressions, significance testing using the classic *F*-test approach is also possible for these models. The alternative *F*-test method I have proposed here is ad hoc only in the sense that other approaches to these calculations are possible. I have modelled my approach on a procedure that is analogous geometrically to the *F*-test used in least-squares regression analysis. If you are willing to assume—or if you have evidence that—both variables are distributed in a bivariate normal manner, you may apply the *F*-test I’ve outlined with as much confidence as the least-squares *F*-test. If you suspect your data may not follow normal distributions you may still apply it as a rough-and-ready tool for indicating (not demonstrating) the statistical significance of a RMA or MA regression, just as you would a least-squares *F*-test. I created the RMA-MA versions of the *F*-test to fill an obvious gap in the pantheon of regression analysis techniques and I would be interested to know how they perform. The results obtained for the trilobite dataset appear reasonable; even a bit better than I had hoped.

As for the exception, I began this discussion by citing an example of an autocorrelated data pattern (Fig. 1B) that, I claimed, should not be analyzed by linear regression methods because to do so violated the assumption that the residuals are independent of one another. I stand by that claim in the sense that, if you did attempt to characterize such patterns using linear models, your models would ignore what is arguably the most interesting aspect of these patterns. Still, this is not to say that linear methods are completely useless as part of an analytic approach to such data.

Figure 5A shows the well-known Sepkoski (2002) Phanerozoic extinction-intensity data. Note that this dataset, like the example shown in Figure 1B, is composed of a complexly autocorrelated data sequence representing the contrast between so-called mass-extinction and background-extinction events, all superimposed on a declining trend: the background extinction gradient. When Jack Sepkoski and David Raup first published their family-level extinction data they recognized these patterns and wanted a way to separate them so as to better estimate the relative magnitudes of the extinction-intensity peaks. To do this they selected the standard approach to removing a linear trend from such data and performed a least-squares linear regression analysis of percent extinction magnitude on time. The resultant residual data (shown for the 2002 genus-level dataset in Fig. 5B) served as the basis for most of their (and others’) subsequent analyses. The point is, these quantitative tools, which are extraordinarily useful in and of themselves, become even more so when deployed in sympathetic combination with other such tools.
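The detrending step described above can be sketched as follows: fit a least-squares line to intensity versus time, then analyze the residuals (the values here are invented for illustration, not Sepkoski's data):

```python
def detrend(t, values):
    """Remove a linear trend by least squares and return the residuals,
    as in detrending background extinction intensity through time."""
    n = len(t)
    t_bar = sum(t) / n
    v_bar = sum(values) / n
    m = sum((ti - t_bar) * (vi - v_bar) for ti, vi in zip(t, values)) / \
        sum((ti - t_bar) ** 2 for ti in t)
    b = v_bar - m * t_bar
    return [vi - (m * ti + b) for ti, vi in zip(t, values)]

# Invented series: a declining trend with one superimposed 'peak'
times = [0, 1, 2, 3, 4]
intensity = [10, 8, 9, 4, 2]   # note the anomalously high value at t = 2
print(detrend(times, intensity))
```

The residuals retain the peaks and troughs while the background gradient is removed.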

## Reader's Comments

There were several interesting responses, comments, and questions that came in since last time. Perhaps the most technical was the following.

"Do you know of a simple method for determining whether a value within a set of data differs significantly from the mean for that data set? Relatively small numbers (n = 7 to 14)." -- Anonymous

There are two ways to answer this question, both of which are relevant to the material I covered in the last newsletter. As with most statistical questions, the first step to getting a clear handle on the correct approach is to get a clear idea of what question you’re really asking. This question can be read as asking ‘What is the probability that an observation was drawn from a distribution characterized by a given mean and standard deviation?’ If this is the question you’re interested in, let’s take the following set of observations (also supplied by the questioner)…

| n | x | n | x |
|---|---|---|---|
| 1 | 2.42 | 7 | 5.09 |
| 2 | 2.45 | 8 | 5.19 |
| 3 | 4.52 | 9 | 5.26 |
| 4 | 5.00 | 10 | 5.48 |
| 5 | 5.06 | 11 | 6.60 |
| 6 | 5.08 | 12 | 7.19 |

…and let’s say we want to know how likely we were to draw the observation 8.81 from this population. The question is not idle. The average interval between values is 0.43 whereas 8.81 is 1.62 units from the largest previously recorded value. That having been said, there is one interval (2.45-4.52) in which the difference is larger than 1.62. What to do?

The first thing to check is where 8.81 falls in terms of standard deviation units from the mean. We discussed this last time in the section on data standardization. Conversion to standard deviation units is accomplished by calculating the z-score of 8.81, which involves subtracting the sample mean (5.17) from this value and dividing the difference by the sample standard deviation (1.38). This z-score is 2.63, which means that 8.81 is 2.63 standard deviation units from the sample mean. If we are willing to assume the sample was drawn from a normal population, we can look up the proportion of the normal distribution that lies beyond 2.63 standard deviation units above the mean. This value is 0.0043. If we are willing to accept a 5 per cent chance of being wrong we can regard it as very unlikely that the value of 8.81 was part of the distribution from which we drew our sample, provided our sample (1) was drawn from a single population, (2) was drawn from a population of values that did conform to the normal distribution and (3) is composed of values that were drawn randomly and independently from that distribution. If we are unhappy with assumption 2 (and few palaeontological populations are normally distributed) we can relax it by using Chebyshev's Theorem to estimate a non-parametric version of the same test. This theorem holds that the fraction of any dataset lying within k standard deviations of the mean is at least:

(3.14) \[p = 1 - \frac{1}{k^2}\]

Applying this equation to the example data we can say that at least 0.86 (86%) of a population with a mean of 5.17 and a standard deviation of 1.38 would lie within 2.63 standard deviations of the mean. That, of course, means that as much as 0.14 (14%) of the distribution could lie beyond 2.63 standard deviations above the mean. Thus, we are left with an ambiguous result. The value 8.81 would seem to lie outside the bounds of a normal distribution so constituted, but perhaps not outside the bounds of a non-normal distribution. The obvious next step (other than to collect a larger sample) would be to check whether the 12-point sample conforms to the expectations of a normal distribution. Unfortunately, a discussion of procedures for normality testing will need to await a future column.
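The two calculations just described can be sketched as follows (using the sample mean of 5.17 and standard deviation of 1.38 quoted above; the function names are my own):

```python
def z_score(value, mean, sd):
    """Distance of a value from the mean in standard-deviation units."""
    return (value - mean) / sd

def chebyshev_within(k):
    """Chebyshev's bound (equation 3.14): at least this fraction of ANY
    distribution lies within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

z = z_score(8.81, 5.17, 1.38)   # roughly 2.63-2.64, depending on rounding
print(round(chebyshev_within(2.63), 2))  # 0.86
```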

## Reference

Sepkoski, J. J., Jr. 2002. A compendium of fossil marine animal genera. In D. Jablonski and M. Foote (eds.), Bulletins of American Palaeontology, 363. Palaeontological Research Institution, Ithaca, New York, 563 pp.

## Footnotes

^{1}Throughout this essay I will be assuming a ‘y on x’ least squares regression. For an ‘x on y’ least squares regression reverse the x’s and y’s in the equations.