Appendix 2E Correlation and regression

Sumit Dey-Chowdhury

A2E.1 Introduction

This appendix introduces correlation and regression, which are techniques for investigating the statistical relationship between two, or more, variables. We will cover: the correlation coefficient and how to test its statistical significance, the coefficient of rank correlation, regression analysis, the goodness of fit of a regression line, testing the significance of regression coefficients, and multiple regression.

A2E.2 Correlation

correlation
A measure of the statistical relationship between two variables.

The association between two variables can be measured using the correlation coefficient. This gives us both the direction of the relationship, and an indication of its strength. Figure A2E.1 shows three scatter plots of pairs of observations (Xi, Yi), each representing a different type of relationship between two variables.

Figure A2E.1 Three broad types of correlation between two variables X and Y: (a) positive correlation; (b) negative correlation; (c) zero correlation

In panel a, the scatter points to a positive relationship between the two variables X and Y. That is, as the values of X become larger, they broadly coincide with larger values of Y. Although the scatter points do not line up to form a perfect linear relationship, the positive relationship between the two variables seems clear when we look at it. This suggests a positive correlation between the two variables: one increases when the other does. An example of two economic variables that share a positive correlation would be consumption and income, in the form of an upward-sloping consumption function.

In panel b, the correlation between the variables X and Y appears to be negative. High values of X are generally associated with low values of Y, and vice versa. Again, the nature of the correlation appears to be strong. An example of two economic variables that are negatively correlated would be prices and quantities for a particular product or service, in the form of a conventional downward-sloping demand curve.

Panel c shows no obvious relationship between the two variables. There is little association between the levels of X and Y: the data points appear to be randomly located and do not align in any particular direction. In this example, we would suggest that the two variables are uncorrelated. This relationship might apply to any two variables that are independent of each other, for example the price of cheese and the sales of garden furniture.

A2E.3 Calculating the correlation coefficient

Although a scatter plot of the data provides a visual representation of the direction and strength of the correlation between two variables, we can discover more about the nature of the relationship by calculating a correlation coefficient.

Pearson correlation coefficient
Also known as Pearson’s r, the correlation coefficient is a numerical measure of correlation between two sets of data. It is calculated by dividing the covariance of the two variables by the product of their standard deviations.

The Pearson correlation coefficient, r, between two variables (X and Y) can be calculated by dividing the covariance between the two variables by the product of their standard deviations:

\[r = \frac{\sum\limits_{i=1}^{n} \; \left(X_{i} − \overline{X}\right)\left(Y_{i} − \overline{Y}\right)}{\sqrt{\sum\limits_{i=1}^{n} \; \left(X_{i} − \overline{X}\right)^{2} × \sum\limits_{i=1}^{n} \; \left(Y_{i} − \overline{Y}\right)^{2}}}\]

where, for a sample consisting of n pairs of data (Xi, Yi) for i = 1, 2, …, n, \(\overline{X}\) is the arithmetic mean of the X observations and \(\overline{Y}\) is the arithmetic mean of the Y observations.

Alternatively, the following formula can also be used, and is often preferred, as it is easier to calculate:

\[r = \frac{n\sum\limits_{i=1}^{n} \; X_{i}Y_{i} − \left(\sum\limits_{i=1}^{n} \; X_{i}\right)\left(\sum\limits_{i=1}^{n} \; Y_{i}\right)}{\sqrt{\left(n\sum\limits_{i=1}^{n} \; X_{i}^{2} − \left(\sum\limits_{i=1}^{n} \; X_{i}\right)^{2}\right)\left(n\sum\limits_{i=1}^{n} \; Y_{i}^{2} − \left(\sum\limits_{i=1}^{n} \; Y_{i}\right)^{2}\right)}}\]

The correlation coefficient was developed in the 1880s by Karl Pearson, a mathematician and biostatistician.

The Pearson correlation coefficient, r, measures both the direction and strength of association between two variables X and Y, such that:

−1 ≤ r ≤ +1

If r > 0, there is a positive correlation between the two variables, as shown in Figure A2E.1a. The closer the correlation coefficient is to positive 1, the stronger the positive correlation. In this case the data in the scatter chart becomes more closely aligned to an upward-sloping line of best fit.

If r < 0, the two variables are negatively correlated, as shown in Figure A2E.1b. The strength of the negative correlation increases as the correlation coefficient approaches −1 and the data becomes more closely aligned to a downward-sloping line of best fit.

If r = 0, then the two variables exhibit zero correlation. In practice, if two variables share a very weak positive or negative correlation, then the scatter plot between them will be similar to one of zero correlation. That is, it would be very difficult to identify any form of relationship between the two variables, such as in Figure A2E.1c. Therefore, we generally think of two variables as being uncorrelated if r ≈ 0.

To give an example of how the correlation coefficient is calculated, consider the data shown in Figure A2E.2. This shows 10 pairs of observations (Xi, Yi) for i = 1, 2, …, 10. Plotting the data in a scatter chart shows a relatively good positive relationship between the two. However, calculating the correlation coefficient will provide us with a number that can both confirm whether the relationship is positive, and if so, the relative strength of the correlation. Figure A2E.3 shows the necessary workings to calculate the Pearson correlation coefficient.

Obs. X Y
1 1 3
2 4 7
3 5 4
4 7 13
5 8 8
6 10 13
7 12 15
8 13 12
9 15 19
10 19 16

Figure A2E.2 A sample of data consisting of 10 pairs of observations (Xi, Yi)


Obs. X Y X² Y² XY
1 1 3 1 9 3
2 4 7 16 49 28
3 5 4 25 16 20
4 7 13 49 169 91
5 8 8 64 64 64
6 10 13 100 169 130
7 12 15 144 225 180
8 13 12 169 144 156
9 15 19 225 361 285
10 19 16 361 256 304
Totals (∑) 94 110 1,154 1,462 1,261

Figure A2E.3 Workings required to calculate the Pearson correlation coefficient


From Figure A2E.3, we have calculated the following: \(n = 10, \sum\limits_{i}^{n} \; X_{i} = 94, \sum\limits_{i}^{n} \; Y_{i} = 110, \sum\limits_{i}^{n} \; X_{i}^{2} = 1,154, \sum\limits_{i}^{n} \; Y_{i}^{2} = 1,462 \text{ and } \sum\limits_{i}^{n} \, X_{i}Y_{i} = 1,261\)

Putting these values into the equation for the Pearson correlation coefficient gives:

\[r = \frac{(10 × 1,261) − (94 × 110)}{\sqrt{(10 × 1,154 − (94)^{2})(10 × 1,462 − (110)^{2})}}\] \[r = \frac{2,270}{\sqrt{6,814,080}} = 0.8696\]

The correlation coefficient of r = 0.8696 confirms a positive correlation between the two variables X and Y. As the correlation coefficient is close to 1, we can also suggest that the correlation is strongly positive.
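
For readers who want to check the arithmetic, the calculation can be reproduced in a few lines of code. The sketch below (an illustration, not part of the original workings) uses Python with NumPy, applies the computational formula for r to the data in Figure A2E.2, and cross-checks against NumPy's built-in correlation function.

```python
import numpy as np

# Sample data from Figure A2E.2
x = np.array([1, 4, 5, 7, 8, 10, 12, 13, 15, 19], dtype=float)
y = np.array([3, 7, 4, 13, 8, 13, 15, 12, 19, 16], dtype=float)
n = len(x)

# Computational formula for Pearson's r
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) * (n * np.sum(y ** 2) - np.sum(y) ** 2))
r = num / den

# Cross-check against NumPy's built-in correlation matrix
print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))  # 0.8696 0.8696
```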

It does not matter which variable we place on which axis: the correlation between Y and X is the same as between X and Y. The coefficient is also independent of the units of measurement. If either or both variables were multiplied or divided by a positive constant, then the correlation coefficient we calculate would remain the same.

A2E.4 Is the correlation coefficient statistically significant?

So far, we have based the strength of the association between two variables on the size of the correlation coefficient. A strong positive correlation is where the correlation coefficient approaches positive 1, and a strong negative correlation when the correlation coefficient approaches negative 1.

However, this basic rule of thumb does not give us enough information to determine whether the correlation coefficient is significantly different from zero. Most correlation exercises are based on a sample of data drawn from a population, so we are again using sample statistics to make an inference about “true” population parameters. It is not sufficient to assume that a sample correlation necessarily applies to the entire population: a sample may produce a sizeable correlation coefficient purely by chance.

If the sample size is small, then even a strong positive or negative correlation coefficient may be unrepresentative of the population as a whole. And even if our sample size is sufficiently large, what size of correlation coefficient would indicate a significant association between two variables? Could this be 0.3, 0.5 or 0.7?

To resolve these questions, we can use the correlation coefficient calculated from a sample of data to test whether the coefficient for the population is statistically different from zero at some level of significance. This can be done by setting up a hypothesis test and following the principles that we outlined in Appendix 2D.

The unknown population correlation coefficient is represented by the parameter ρ and the hypothesis test for the population coefficient is defined by:

H0: ρ = 0
H1: ρ ≠ 0

When testing for correlation, we are always testing whether to accept or reject a null hypothesis that asserts no genuine correlation between X and Y. The alternative hypothesis, H1, would be that there is a non-zero correlation coefficient between the two variables. As we can reject the null hypothesis in each direction (implying a positive or negative correlation), this will be a two-tail test.

The test statistic based on the calculated sample correlation coefficient r is:

\[t = \frac{r \sqrt{n − 2}}{\sqrt{1 − r^{2}}}\]

This test statistic is distributed as a t-distribution with n − 2 degrees of freedom. From our example using the data in Figure A2E.2, we have estimated that r = 0.8696, and given n = 10, we can calculate the test statistic as:

\[t = \frac{0.8696\sqrt{10 − 2}}{\sqrt{1 − \left(0.8696\right)^{2}}} = 4.9814\]

A critical value for the test statistic can be taken from the t-distribution table – see Figure A2D.11 of Appendix 2D. Choosing to set the significance level of the test at 5%, meaning there is a one-in-twenty chance of rejecting a true null hypothesis, we are looking for a critical value from the table where α = 0.025. These are the t-values that cut off 2.5% at the top and bottom of the distribution. Given there are n = 10 observations, the test statistic is based on v = 10 − 2 = 8 degrees of freedom. For these values of v and α, the critical value for the test statistic is t* = ±2.306.

As the test statistic (t = 4.9814) lies outside the acceptance range for the critical test value (t* = ± 2.306), then we can reject the null hypothesis of no correlation between X and Y at the 5% significance level.
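
The test can also be verified numerically. The following Python sketch (an illustration, assuming SciPy is available) reproduces the test statistic and looks up the two-tail 5% critical value of the t-distribution with 8 degrees of freedom.

```python
import numpy as np
from scipy import stats

r, n = 0.8696, 10

# Test statistic for H0: rho = 0, distributed as t with n - 2 degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

# Two-tail 5% critical value (2.5% in each tail), v = n - 2 = 8
t_crit = stats.t.ppf(0.975, df=n - 2)

print(round(t_stat, 4), round(t_crit, 3))  # 4.9814 2.306
```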

One of the key considerations when interpreting the correlation coefficient is that correlation does not imply causation. Finding a significant positive or negative correlation coefficient between X and Y does not imply that X causes Y. It may be that Y is causing X, or that they influence each other.

spurious correlation
Where two or more variables are found to be associated with each other but are not causally related. The association may be a coincidence, or due to a common unseen external factor.

Our statistical relationship may also be a spurious correlation. This is the finding of a statistically significant correlation between two variables when in the real world there is no relationship at all. This might happen if two variables, which are independent of each other, are both influenced in a similar way by another unrelated variable. The most common cause of spurious correlation is time. It is common to find variables that show trends or patterns over time. For example, if one variable trended upwards, and another downwards, then a statistically significant negative correlation may be found even if the two variables are completely unrelated.

In summary, there are four reasons why we might find a statistically significant and non-zero correlation coefficient r between two variables X and Y:

  1. Y influences X.
  2. X influences Y.
  3. X and Y influence each other.
  4. X and Y are unrelated, but both influenced by another variable, Z.

The calculation of a correlation coefficient, unfortunately, does not help us to distinguish between these alternatives.

A2E.5 The coefficient of rank correlation

Spearman’s coefficient of rank correlation
A measure of correlation between the rankings of two variables.

Another method for calculating a correlation coefficient between two variables X and Y is Spearman's rank correlation coefficient. Instead of looking at the actual values of X and Y, the correlation coefficient is based on the rankings of the observations from highest to lowest. That is, the X observation with the highest value is given the ranking of 1, and the observation with the lowest value is given the bottom ranking. The same is done for the Y observations. If the two variables are positively correlated, we would expect pairs of X and Y observations to show broadly similar rankings. If negatively correlated, the pairs of X and Y observations would show broadly opposite rankings.

Developed by Charles Edward Spearman, a psychologist, Spearman’s coefficient of rank correlation between two variables is equal to the Pearson correlation between the rank values of those two variables.

Spearman’s rank approach evaluates the monotonic relationship between two variables: as one variable increases in size, does the other variable increase, decrease, or show no systematic change? It is therefore often preferred to the Pearson method, which measures the association between the actual values of the two variables, when the relationship between the variables is non-linear.

The rank correlation coefficient can be calculated using the following formula:

\[r = 1 − \frac{6\sum\limits_{i}^{n} \; d_{i}^{2}}{n(n^{2} − 1)}\]

where \(d_{i} = \text{Rank}\left( X_{i} \right) − \text{Rank}\left( Y_{i} \right)\)

Using the same data shown in Figure A2E.3, the necessary calculations for the rank correlation coefficient are set out in Figure A2E.4. Note that if two or more observations have the same value, the usual practice is to assign each the average of the ranks they would otherwise occupy. For example, observations 4 and 6 both have the same Y value. As these coincide with the 4th and 5th ranks, each is assigned a ranking of 4.5.

Obs. X Y Rank X Rank Y d d²
1 1 3 10 10 0 0
2 4 7 9 8 1 1
3 5 4 8 9 −1 1
4 7 13 7 4.5 2.5 6.25
5 8 8 6 7 −1 1
6 10 13 5 4.5 0.5 0.25
7 12 15 4 3 1 1
8 13 12 3 6 −3 9
9 15 19 2 1 1 1
10 19 16 1 2 −1 1
Totals (∑) 94 110 55 55 0 21.5

Figure A2E.4 Calculating the Spearman’s rank correlation coefficient


Given n = 10 and \(\sum\limits_{i}^{n} \; d_{i}^{2} = 21.5\), the correlation coefficient is:

\[r = 1 − \frac{(6 \times 21.5)}{10( 10^{2} − 1 )} = 0.8697\]
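
The ranking and the rank correlation formula can also be reproduced in code. The sketch below is an illustrative Python version (not from the original text) that uses SciPy's rankdata function to assign average ranks to ties, as in Figure A2E.4.

```python
import numpy as np
from scipy import stats

x = np.array([1, 4, 5, 7, 8, 10, 12, 13, 15, 19], dtype=float)
y = np.array([3, 7, 4, 13, 8, 13, 15, 12, 19, 16], dtype=float)

# Rank from highest (1) to lowest, averaging tied values, as in Figure A2E.4
rank_x = stats.rankdata(-x)   # negate so the largest value gets rank 1
rank_y = stats.rankdata(-y)

d = rank_x - rank_y
n = len(x)
r_s = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(round(r_s, 4))  # 0.8697
```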

As with the Pearson coefficient, Spearman’s rank approach also indicates a strong positive relationship between the two variables. The statistical significance of the correlation coefficient can also be tested. If the true population rank correlation coefficient is denoted by ρ, we test the null hypothesis (H0) that it is equal to zero against the alternative that it is different from zero at a given level of significance.

H0: ρ = 0
H1: ρ ≠ 0

The test statistic is simply the calculated rank correlation coefficient from the sample of data, which in our example is r = 0.8697. The critical value for this test statistic is not drawn from the normal or t-distributions, but from tables specific to the Spearman rank correlation coefficient, an extract of which is shown in Figure A2E.5.

n α = .10 α = .05 α = .01
5 .900
6 .829 .886
7 .714 .786 .929
8 .643 .738 .881
9 .600 .700 .833
10 .564 .648 .794
11 .536 .618 .818
12 .497 .591 .780
13 .475 .566 .745
14 .457 .545 .716
15 .441 .525 .689
16 .425 .507 .666
17 .412 .490 .645
18 .399 .476 .625
19 .388 .462 .608
20 .377 .450 .591
21 .368 .438 .576
22 .359 .428 .562
23 .351 .418 .549
24 .343 .409 .537
25 .336 .400 .526
26 .329 .392 .515
27 .323 .385 .505
28 .317 .377 .496
29 .311 .370 .487
30 .305 .364 .478

Figure A2E.5 Critical values for Spearman’s rank correlation coefficient


Note that for a sample size of n, two-sided critical values are given for significance levels of α = 0.1, 0.05 and 0.01.

To test the null hypothesis at the 5% significance level, the critical value for the sample test statistic is found from Figure A2E.5 for n = 10 and α = 0.05 giving r* = 0.648. Because the rank correlation coefficient we calculated exceeds this value, we can reject the null hypothesis and assert that the rank correlation coefficient is different from zero at the 5% significance level.
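
As an aside, statistical software will usually report a p-value for the rank correlation directly. The snippet below is an illustrative check using scipy.stats.spearmanr; note that its p-value is based on a large-sample approximation, whereas the exact small-sample critical values in Figure A2E.5 are preferable for n = 10.

```python
from scipy import stats

x = [1, 4, 5, 7, 8, 10, 12, 13, 15, 19]
y = [3, 7, 4, 13, 8, 13, 15, 12, 19, 16]

# spearmanr reports the rank correlation and an approximate two-sided p-value
rho, p_value = stats.spearmanr(x, y)
print(round(rho, 4), p_value)  # rho is approx. 0.8697
```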

A2E.6 Regression analysis

regression analysis
A statistical process for estimating the relationships between a dependent variable (often called the ‘outcome’ or ‘response’ variable) and one or more independent variables (often called ‘explanatory’ variables).

Regression analysis is the process of summarising the relationship between variables using a line of best fit, or regression line. For example, if we had pairs of observations (Xi, Yi), we could estimate a regression line of the form:

\[Y_{i} = a + bX_{i}\]

where a and b are the estimated regression line coefficients.

In this model, the variable Y has been set as the dependent variable and variable X as the independent variable. The model informs us as to whether X influences Y, and if so, how might changes in X affect the value of Y. Regression analysis also enables us to test the goodness of fit of the regression line to the sample data, and whether the estimated coefficients are statistically significant.

So how do we go about choosing the coefficients a and b to define a good line of best fit between X and Y? A good regression line is one that minimises the sum of squared errors between the observed data points and the fitted points on the line. To see how, consider the sample of data presented in Figure A2E.2 consisting of ten pairs of X–Y observations. We know this data is positively correlated, so the regression line of best fit will most likely be upward sloping.

Figure A2E.6 shows a scatter of the data points with a purported regression line. The regression errors are the differences between the actual values of Y and the predicted values of Y, given the regression line equation and the value of X. These errors (ei) are simply shown as the vertical distance from the observation to the regression line where:

\[e_{i} = Y_{i} − a − bX_{i}\]

Figure A2E.6 The intuition behind the ordinary least squares regression line


ordinary least squares (OLS) regression
This is the basic approach to estimating a linear regression model, where the regression coefficients are calculated to minimise the sum of squared errors between the actual dependent variable data and the predicted data, given the regression coefficients and the actual data for the independent variables.

Ordinary least squares (OLS) regression is where the coefficients a and b are calculated to minimise the sum of squared errors:

\[\sum_{i=1}^{n} \; e_{i}^{2}\]

This gives us two formulae to estimate the slope coefficient b and the intercept coefficient a:

\[b = \frac{n\sum\limits_{i}^{n} \; X_{i}Y_{i} − \sum\limits_{i}^{n} \; X_{i}\sum\limits_{i}^{n} \; Y_{i}}{n\sum\limits_{i}^{n} \; X_{i}^{2} − \left(\sum\limits_{i}^{n} \; X_{i} \right)^{2}}\]

and

\[a = \overline{Y} − b\overline{X}\]

Using these two formulae, we can calculate the regression coefficients for the data in Figure A2E.2 we have been using throughout this appendix. For this we have already calculated the following summations and averages:

\(n = 10, \sum\limits_{i}^{n} \; X_{i} = 94, \sum\limits_{i}^{n} \; Y_{i} = 110, \sum\limits_{i}^{n} \; X_{i}^{2} = 1,154, \sum\limits_{i}^{n} \; Y_{i}^{2} = 1,462, \sum\limits_{i}^{n} \; X_{i}Y_{i} = 1,261, \overline{X} = \frac{94}{10} = 9.4, \text{ and } \overline{Y} = \frac{110}{10} = 11\)

The slope coefficient b is calculated as:

\[b = \frac{\left(10 \times 1,261\right) − \left(94 \times 110\right)}{\left(10 \times 1,154\right) − 94^{2}} = 0.8395\]

and the intercept coefficient a is calculated as:

\[a = 11 − (0.8395 \times 9.4) = 3.1087\]

Therefore, the OLS regression line for our sample of data is:

\[Y = 3.1087 + 0.8395X\]

The intercept term, in this case a = 3.1087, tells us the value of Y when X = 0. The slope coefficient, b = 0.8395, tells us the amount we expect Y to change for every unit increase in the value of X.
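
The summation formulae for a and b translate directly into code. The following Python/NumPy sketch (an illustration, not part of the original appendix) reproduces the estimates for the data in Figure A2E.2.

```python
import numpy as np

# Sample data from Figure A2E.2
x = np.array([1, 4, 5, 7, 8, 10, 12, 13, 15, 19], dtype=float)
y = np.array([3, 7, 4, 13, 8, 13, 15, 12, 19, 16], dtype=float)
n = len(x)

# OLS slope and intercept from the summation formulae
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = y.mean() - b * x.mean()

print(round(a, 4), round(b, 4))  # 3.1087 0.8395
```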

Two conditions necessary for OLS regression to produce unbiased and efficient estimates are that the error term is normally distributed and that errors are statistically independent of each other. If these conditions are not met, transformation of the data or a change in the modelling approach may be required.

How it’s done Prediction

Once estimated, the regression line may be used to find predicted values of the dependent variable Y given a value of the independent variable X. For example, using the regression model estimated in this appendix, we can find the predicted value for Y given X = 25:

\[\widehat{Y} = 3.1087 + 0.8395(25) = 24.0962\]

Predicted values within the range of sample data for X are known as interpolations, whereas those formed for values of the independent variable(s) outside of the sample range are described as extrapolations.
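
A minimal illustration in Python, using the coefficients estimated above (the helper name predict is just for illustration):

```python
a, b = 3.1087, 0.8395   # coefficients estimated above

def predict(x_new):
    # Fitted value; x_new outside the sample range of X (1 to 19) is an extrapolation
    return a + b * x_new

print(round(predict(25), 4))  # 24.0962
```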

How it’s done Calculating elasticity

The coefficient b in the linear regression model is described as the slope coefficient, and gives the change in the dependent variable Y for a unit change in the independent variable X. The concept of elasticity, on the other hand, measures the proportionate change in Y following a proportionate change in X.

The elasticity (\(\eta\)) of Y given X is:

\[\eta = \frac{\Delta Y / Y}{\Delta X / X} = \frac{\Delta Y}{\Delta X} \times \frac{X}{Y}\]

Using the regression slope b as the estimate of \(\frac{\Delta Y}{\Delta X}\) and evaluating at the sample means gives:

\[\eta = b \times \frac{\overline{X}}{\overline{Y}}\]

In the example calculations used in this appendix, we have b = 0.8395, \(\overline{X}\) = 9.4, and \(\overline{Y}\) = 11. This gives the elasticity between Y and X as: \(\eta = 0.8395 \times \frac{9.4}{11} = 0.7174\)

The slope coefficient tells us that a 1 unit change in X leads to 0.8395 units change in Y. The elasticity tells us that a 1% change in X leads to a 0.7174% change in Y.
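
The same calculation as a small illustrative Python snippet:

```python
b, x_bar, y_bar = 0.8395, 9.4, 11.0   # slope and sample means from the worked example

eta = b * x_bar / y_bar               # elasticity evaluated at the means
print(round(eta, 4))                  # 0.7174
```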

A2E.7 Assessing the regression line goodness of fit

coefficient of determination
This is characterised by R2 (pronounced R-squared) and is the proportion of the variation in the dependent variable that can be explained by the variation in the independent variable or variables.

We are not just interested in the relationship between X and Y described by the regression equation, but also how good the fit of this line is to the data. The coefficient R2, commonly known as the coefficient of determination, is such a measure of goodness of fit. R2 lies between the values of 0 and 1 (0 ≤ R2 ≤ 1) such that a poorly fitting regression equation will have a low R2 (close to 0), whereas a regression equation that fits the data very well will have a high R2 (close to 1).

The intuition behind the R2 statistic is demonstrated in Figure A2E.7. The values for the dependent Y variable vary around its mean average \(\overline{Y}\), but how much of that variation can be explained by the regression line, and how much by the error term? The higher the proportion of total variation in the variable Y explained by the regression line, the higher the R2. Likewise, the higher the proportion of total variation in the variable Y explained by the error term, the lower the R2.

Figure A2E.7 The method underlying the calculation of R2


The total variation in the Y variable is given by the total sum of squares (TSS). This can be separated into the variation accounted for by the regression line, known as the regression sum of squares (RSS), and the variation accounted for by the error term, known as the error sum of squares (ESS).

TSS = RSS + ESS

The coefficient of determination R2 is the proportion of TSS accounted for by RSS.

\[R^{2} = \frac{\text{RSS}}{\text{TSS}}\]

Using the data from Figure A2E.2 (represented in Figure A2E.7) we can calculate the total sum of squares as the sum of the square deviations of Y from its average (\(\overline{Y}\)):

\[\text{TSS} = \sum_{i}^{n} \; \left( Y_{i} − \overline{Y}\right)^{2}\] \[= \sum_{i}^{n} \; Y_{i}^{2} − n\overline{Y}^{2} = 1,462 − \left( 10 \times 11^{2} \right) = 252\]

The error sum of squares can be calculated as the sum of the squared deviations of Y from the fitted values on the regression line (\(\widehat{Y}_{i} = a + bX_{i}\)):

\[\text{ESS} = \sum_{i}^{n} \; \left( Y_{i} − \widehat{Y}_{i} \right)^{2}\] \[= \sum_{i}^{n} \; Y_{i}^{2} − a\sum_{i}^{n} \; Y_{i} − b \sum_{i}^{n} \; X_{i}Y_{i}\] \[= 1,462 − \left(3.1087 \times 110\right) − \left(0.8395 \times 1,261\right) = 61.4335\]

The easiest way of calculating RSS is to deduct the ESS from TSS:

RSS = 252 – 61.4335 = 190.5665

Now, we can use our calculations of RSS and TSS to derive the coefficient of determination, R2.

\[R^{2} = \frac{\text{RSS}}{\text{TSS}} = \frac{190.5665}{252} = 0.7562\]

This tells us that 75.62% of the variation in Y is explained by the variation in X through the regression model. The remaining 24.38% is left to be explained by other factors (or pure random variation). As this value of R2 is relatively close to 1, it suggests the model fits the data relatively well. That is, most of the points in the scatter plot are close to the regression line.

An interesting feature of the R2 statistic is that in a regression of Y as the dependent variable and X as the independent variable, the coefficient of determination is equal to the square of the Pearson correlation coefficient between X and Y. Recall from Figure A2E.3 that this coefficient was calculated as r = 0.8696. Therefore:

\[R^{2} = 0.8696^{2} = 0.7562\]

This relationship helps us to better understand how correlation and regression relate to each other. If R2 = 0, there is no linear relationship between X and Y: all of the variation in Y is accounted for by the error term and none by the regression line (b = 0). This corresponds to r = 0 where there is zero correlation between the two variables.

Likewise, if R2 = 1, then it is the case that the regression line fits the data perfectly, so the error terms are zero. This corresponds to r = ±1 so the data exhibits either a perfect positive or negative correlation.
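
A short Python sketch (illustrative, not from the original text) that computes TSS, ESS, RSS and R2 for the sample data, and confirms the link with the Pearson coefficient, might look like this:

```python
import numpy as np

# Data from Figure A2E.2 and the OLS coefficients estimated earlier
x = np.array([1, 4, 5, 7, 8, 10, 12, 13, 15, 19], dtype=float)
y = np.array([3, 7, 4, 13, 8, 13, 15, 12, 19, 16], dtype=float)
a, b = 3.1087, 0.8395

y_hat = a + b * x                       # fitted values from the regression line

tss = np.sum((y - y.mean()) ** 2)       # total sum of squares (252)
ess = np.sum((y - y_hat) ** 2)          # error sum of squares (approx. 61.43)
rss = tss - ess                         # regression sum of squares

r_squared = rss / tss
r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient

print(round(r_squared, 4), round(r ** 2, 4))  # both approx. 0.7562
```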

A2E.8 Testing the significance of the regression coefficients

While R2 provides a measure of how well the regression equation fits the data, we may also be interested in the significance of particular regression coefficients. For example, suppose we were to estimate a linear regression model of the form Y = a + bX because we are particularly interested in how changes in X impact Y. This might be because Y is an outcome variable such as health quality or household consumption, and X is a variable that can be influenced by policy such as health expenditure or household disposable income. In this instance, the coefficient b can provide us with useful information.

We can test the significance of individual regression coefficients using the method of hypothesis testing set out fully in Appendix A2D. When we estimate the regression coefficients a and b (\(Y_{i} = a + bX_{i}\)) we are typically using sample data, in which case the coefficients are point estimates of the true but unknown parameters α and β. These would be the coefficients we would calculate if the model was estimated on the entire population data set (\(Y_{i} = \alpha + \beta X_{i}\)). If our aim is to establish how changes in X impact on changes in Y, we have an estimate of b, but it is really β that we want to know.

standard error of regression coefficients
The standard error of a regression coefficient is the standard deviation of its point estimate.

Hypothesis testing allows us to make inferences about the population coefficients based on sample estimates, such as whether α and β are significantly different from zero. However, to carry out a hypothesis test we first need to calculate the standard errors of the regression coefficients a and b.

The variance of the slope coefficient b is given by:

\[s_{b}^{2} = \frac{s_{e}^{2}}{\sum\limits_{i}^{n} \; \left( X_{i} − \overline{X}\right)^{2}}\]

where the estimated variance of the error term is:

\[s_{e}^{2} = \frac{\sum\limits_{i}^{n} \; e_{i}^{2}}{n − 2} = \frac{\text{ESS}}{n − 2}\]

To make the calculation easier, we can also write the denominator as:

\[\sum_{i}^{n} \; \left( X_{i} − \overline{X}\right)^{2} = \sum_{i}^{n} \; X_{i}^{2} − n\overline{X}^{2}\]

Using our previous calculations for the regression model based on the data in Figure A2E.2, we have ESS = 61.4335, n = 10, \(\sum\limits_{i}^{n} \; X_{i}^{2} = 1,154\) and \(\overline{X} = 9.4\).

Therefore, the variance of the error term is:

\[s_{e}^{2} = \frac{\sum\limits_{i}^{n} \; e_{i}^{2}}{n − 2} = \frac{\text{ESS}}{n − 2} = \frac{61.4335}{8} = 7.6792\]

In which case the variance of the regression coefficient b is:

\[s_{b}^{2} = \frac{s_{e}^{2}}{\sum\limits_{i}^{n} \; \left( X_{i} − \overline{X}\right)^{2}} = \frac{s_{e}^{2}}{\sum\limits_{i}^{n} \; X_{i}^{2} − n\overline{X}^{2}} = \frac{7.6792}{1,154 − \left( 10 \times {9.4}^{2} \right)} = 0.0284\]

And the standard error of b is simply the square root of the variance:

\[s_{b} = \sqrt{0.0284} = 0.1685\]

We can find the variance of the intercept coefficient a using the formula:

\[s_{a}^{2} = s_{e}^{2} \times \left( \frac{1}{n} + \frac{\overline{X}^{2}}{\sum\limits_{i}^{n} \; \left( X_{i} − \overline{X}\right)^{2}} \right)\]

Using the values already calculated:

\[s_{a}^{2} = 7.6792 \times \left( \frac{1}{10} + \frac{ {9.4}^{2}}{1,154 − \left( 10 \times {9.4}^{2} \right)} \right) = 3.2773\]

Hence the standard error of a is \(s_{a} = \sqrt{s_{a}^{2}} = \sqrt{3.2773} = 1.8103\)

Having calculated the standard errors, we can now test whether β and/or α are significantly different from zero at the 5% level. In reality we are generally more interested in the statistical significance of the slope coefficient than the intercept, but we show both here for completeness.

First consider the slope coefficient β. The null hypothesis H0 is of a zero slope coefficient (that is, X does not influence Y). The alternative hypothesis (H1) is that the coefficient is non-zero and that X does influence Y (the influence can be positive or negative).

H0: β = 0
H1: β ≠ 0

The test statistic, based on b which is our sample estimate of β, is:

\[t = \frac{b − \beta_{H0}}{s_{b}}\]

Using the calculated values for b = 0.8395, \(\beta_{H0} = 0\) and \(s_{b} = 0.1685\)

\[t = \frac{0.8395 − 0}{0.1685} = 4.9821\]

The test statistic is distributed as a t-distribution with n − k − 1 degrees of freedom, where k is the number of independent variables in the model. Here k = 1, so the respective critical values for the 5% significance level can be found from the t-distribution table (Figure A2D.11 of Appendix 2D) where v = 8 and α = 0.025 as t* = ±2.306. As the test statistic falls outside this range, the null hypothesis that β = 0 can be rejected at the 5% significance level.

We can also test whether the intercept coefficient is different from zero at the 5% significance level where the null and alternative hypotheses are:

H0: α = 0
H1: α ≠ 0

The test statistic, based on the estimated sample coefficient a is:

\[t = \frac{a − \alpha_{H0}}{s_{a}}\]

Using the values for a = 3.1087, \(\alpha_{H0} = 0\) and \(s_{a} = 1.8103\) gives:

\[t = \frac{3.1087 − 0}{1.8103} = 1.7172\]

As this lies within the range of critical values t* = ±2.306, we are unable to reject the null hypothesis that α = 0 at the 5% significance level.
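
The standard errors and t-statistics for both coefficients can be reproduced with a few lines of code. The sketch below is an illustrative Python/NumPy version (SciPy is assumed only for the critical value), not part of the original workings.

```python
import numpy as np
from scipy import stats

x = np.array([1, 4, 5, 7, 8, 10, 12, 13, 15, 19], dtype=float)
y = np.array([3, 7, 4, 13, 8, 13, 15, 12, 19, 16], dtype=float)
n = len(x)
a, b = 3.1087, 0.8395

residuals = y - (a + b * x)

s_e2 = np.sum(residuals ** 2) / (n - 2)              # variance of the error term (approx. 7.68)
sxx = np.sum(x ** 2) - n * x.mean() ** 2             # sum of squared deviations of X (270.4)

s_b = np.sqrt(s_e2 / sxx)                            # standard error of b (approx. 0.1685)
s_a = np.sqrt(s_e2 * (1 / n + x.mean() ** 2 / sxx))  # standard error of a (approx. 1.8103)

t_b = b / s_b                                        # approx. 4.98
t_a = a / s_a                                        # approx. 1.72
t_crit = stats.t.ppf(0.975, df=n - 2)                # two-tail 5% critical value, 2.306

print(round(t_b, 2), round(t_a, 2), round(t_crit, 3))
```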

A2E.9 Multiple regression

So far, we have only considered a regression model containing one independent or explanatory variable X for the dependent variable Y. This is known as a single regression analysis – for an obvious reason. In practice, there may be a number of possible explanatory variables for the dependent variable Y, so ideally, we would like to extend the regression model to allow for these.

Suppose we had k explanatory variables (X1, X2, …, Xk) where the estimated impact of each on the dependent variable Y is given by its own specific slope coefficient (b1, b2, …, bk). This is an example of multiple regression analysis.

\[Y = a + b_{1}X_{1} + b_{2}X_{2} + \ldots + b_{k}X_{k}\]

The ordinary least squares (OLS) method for calculating the coefficients in multiple regression analysis follows exactly the same principle as in single regression analysis, that is, the coefficients are chosen to minimise the sum of squared errors between the actual values of Y and the fitted values from the regression model. However, the calculations are considerably more complex, requiring the use of matrix algebra to solve for all k coefficients simultaneously. This requires the use of a computer and some statistical software.
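
To illustrate the mechanics without specialist software, the sketch below sets up the design matrix and lets NumPy's least-squares routine carry out the matrix algebra. The household figures in it are made up purely for illustration and are not taken from the survey data discussed below.

```python
import numpy as np

# Hypothetical illustration: a handful of households with weekly consumption (£),
# weekly disposable income (£) and household size. The numbers are made up.
consumption = np.array([120.0, 310.0, 450.0, 95.0, 280.0, 330.0])
income = np.array([200.0, 650.0, 900.0, 150.0, 400.0, 700.0])
size = np.array([1.0, 3.0, 4.0, 1.0, 2.0, 3.0])

# Design matrix: a column of ones for the intercept, then the explanatory variables
X = np.column_stack([np.ones_like(income), income, size])

# OLS solves min ||y - Xb||^2; lstsq handles the matrix algebra for us
coeffs, *_ = np.linalg.lstsq(X, consumption, rcond=None)
a, b1, b2 = coeffs
print(a, b1, b2)
```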

Examples of statistical software commonly used to model economic data are R, SPSS, STATA, Eviews, TSP, Oxmetrics, LIMDEP, RATS and SHAZAM.

Figure A2E.8 shows an extract of data taken from the Living Cost and Food (LCF) Survey for 4012 UK households:

Household number Total weekly consumption (£) Total weekly disposable income (£) Number of people in household
1 75.78 173.27 1
2 392.16 934.27 5
3 665.37 1,018.29 2
… … … …
4010 97.04 139.94 1
4011 325.63 320.70 2
4012 276.35 529.52 3

Figure A2E.8 A sample of household-level data for consumption and disposable income


Office for National Statistics – Living Cost and Food Survey 2017

ONS Resource

The Living Cost and Food Survey is a longstanding survey of UK households. It collects detailed information on household expenditure patterns.

Using the data shown in Figure A2E.8, we aim to estimate a household consumption model, which relates household weekly consumption to disposable income and size:

Consumption = a + b1Income + b2Size

The regression model was estimated using the SPSS software and the results are presented in Figure A2E.9.

Variable Coefficient Standard error t-value p-value
Constant 52.470 9.601 5.465 0.000
Income 0.517 0.012 44.086 0.000
Size 36.494 3.661 9.969 0.000

Figure A2E.9 Estimation of a household-level consumption function


Note that the dependent variable is consumption.

The estimated household consumption model is:

Consumption = 52.47 + 0.517 × Income + 36.494 × Size

Firstly, we can interpret the estimated coefficients. The coefficient on income, b1 = 0.517, indicates that an extra £1 of weekly disposable income is associated with roughly an extra £0.52 of weekly consumption, holding household size constant. The coefficient on size, b2 = 36.494, indicates that an additional household member is associated with roughly £36.49 of extra weekly consumption, holding income constant. The constant term of 52.47 is the predicted weekly consumption when both income and size are zero.

In a multiple regression framework, the slope coefficients b1 and b2 are what we call marginal coefficients: they show the respective impact of changes in household disposable income and size on household consumption, while keeping everything else constant.

p-value
A p-value is a measure of the probability that an observed difference could have occurred just by random chance. The lower the p-value, the greater the statistical significance of the observed difference.

Secondly, we are interested in the statistical significance of the estimated coefficients, particularly the slope or marginal coefficients. SPSS software routinely reports standard errors and t-statistics which are shown in Figure A2E.9 along with p-values. A p-value is the smallest significance level for which the null hypothesis that the respective coefficient is equal to zero can be rejected. For example, if p ≤ 0.05, we could reject the null hypothesis that the coefficient is equal to zero at the 5% significance level. If p ≤ 0.01, we could reject that null hypothesis at the 1% significance level. The results in Figure A2E.9 show that all of the regression coefficients are statistically different from zero, even at very low significance levels.

OLS is a standard approach for estimating linear models, but what might we do if we thought the relationship between the dependent and independent variables was non-linear? For instance, it is a widely held view that the marginal propensity to consume is not constant across income levels and is likely to fall as income increases. That is, as disposable income rises across different households, consumption also rises, but at a declining rate.

The relationship between household size and consumption may also be non-linear. For example, the marginal increase in household consumption as size increases from one person to two persons may be different (probably larger) from when household size increases from four people to five people.

In this case, regression analysis can continue to be carried out using OLS, but it might be sensible to transform the data by taking the logarithms of the model variables:

\[\ln\left(\text{Consumption}\right) = a + b_{1} \ln\left(\text{Income}\right) + b_{2} \ln\left(\text{Size}\right)\]

where ln() is the natural logarithm of each variable.
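
As a rough illustration of the mechanics (again using made-up household figures, not the survey sample), the log-log specification can be estimated by transforming each variable with np.log and then applying OLS exactly as before:

```python
import numpy as np

# Hypothetical data for illustration only
consumption = np.array([120.0, 310.0, 450.0, 95.0, 280.0, 330.0])
income = np.array([200.0, 650.0, 900.0, 150.0, 400.0, 700.0])
size = np.array([1.0, 3.0, 4.0, 1.0, 2.0, 3.0])

# Take logs of every variable, then estimate by OLS as before
X = np.column_stack([np.ones_like(income), np.log(income), np.log(size)])
coeffs, *_ = np.linalg.lstsq(X, np.log(consumption), rcond=None)
print(coeffs)  # intercept, income elasticity, size elasticity
```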

In Appendix 2B, the properties of this log-transformation are shown to linearise the data, making it possible to model non-linear relationships in a linear regression model. When both the dependent and independent variables are transformed in this way, we refer to the regression model as a log-log or double-log model. The results for this new specification are presented in Figure A2E.10.

Variable Coefficient Standard error t-value p-value
Constant 2.105 0.068 30.5759 0.000
ln(Income) 0.570 0.012 48.565 0.000
ln(Size) 0.315 0.016 19.749 0.000

Figure A2E.10 The log-log household consumption function model


The dependent variable is ln(Consumption).

The estimated model again shows the coefficients to be different from zero at the 1% significance level. The main difference between the results in Figures A2E.9 and A2E.10 is how we interpret the regression coefficients b1 and b2.

In the original model they show the marginal impact of a change in the independent variable on the dependent variable – an extra £1 of disposable income, or an extra person – on household consumption in £. In the log-log model, the regression coefficients can be interpreted as elasticities. This is a useful property of using logarithms in regression analysis.

The elasticity of household consumption with respect to household disposable income is estimated to be 0.57, meaning a 10% increase in household income results in a 5.7% increase in household consumption. In terms of household size, the estimated coefficient of 0.315 suggests a 10% increase in household size raises household consumption by around 3.15%.

A2E.10 Summary

Correlation and regression analysis are two approaches to measuring the degree of association between two or more variables. A correlation coefficient between two variables can be calculated using Pearson’s coefficient or Spearman’s rank coefficient; the latter may be more useful if the underlying sample data is highly skewed. The statistical significance of both correlation coefficients can be tested. However, the existence of correlation, even if significant, does not necessarily imply causality. Care should also be taken to avoid interpreting spurious correlations as causal relationships.

Regression analysis, on the other hand, is more focused on measuring the effect of one or more variables on another, for example, does X influence Y and, if so, how? The size of this influence can be estimated from a linear regression model of the form Y = a + bX, defined by the intercept a and the slope coefficient b. The technique of ordinary least squares (OLS) estimates these regression coefficients by minimising the sum of squared errors between the actual observations and those predicted by the regression model. A measure of how well the regression model fits the data is given by the coefficient of determination, also called R2 (R-squared).

Of particular interest is the slope coefficient b which measures the responsiveness of Y to changes in X. In a multiple regression model, where there is more than one independent or explanatory variable, these slope coefficients measure the marginal impact of each on the dependent variable. If the data is transformed into logarithms, then these estimated coefficients can be interpreted as elasticities, giving the proportionate response of Y to a proportionate change in X. Hypothesis tests can also be carried out to test the statistical significance of the estimated regression coefficients. Regression analysis with multiple explanatory variables and large samples is best carried out using computer software, and there are a number of packages that can perform these and more advanced analyses.