Appendix 2A Summary statistics

A2A.1 Introduction

Summary statistics can be used to provide information about a data series, specifically about the distribution of the data points. We can estimate the “centre” of the data – known as the central tendency – using the arithmetic and geometric means, the median, or the mode. We are also likely to be interested in the dispersion – how “spread out” a set of observations is. For this, we consider various data ranges and the variance.

population
A population is the set of all data points in the study of interest. This could be the population of all people in a country, the total number of workers in a firm, or the total number of firms in the economy.
sample
A subset of the population for which data has been collected.

Most summary statistics are calculated for a sample of data. So, at the outset it will be helpful to define what we mean by a population and a sample.

A2A.2 Measuring an average

arithmetic mean
The sum of all values in a sample of size n, divided by the number of observations n. The arithmetic mean is often referred to as the average, even though other forms of averages exist.

Suppose there are n observations in a sample $$\left(X_{1}, X_{2}, \ldots, X_{n}\right)$$. The arithmetic mean of this sample $$\left(\overline{X}\right)$$ is obtained by summing all values and dividing by n:

$\overline{X}=\frac{\sum_{i=1}^{n} X_{i}}{n}$

The arithmetic mean is often called simply the mean or the average, but it is important to remember that there are other averages, such as the median, the mode and the geometric mean.
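As a minimal sketch of the formula above, the following Python snippet computes the arithmetic mean by hand and checks it against the standard library. The `earnings` values are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical sample of annual earnings in pounds (illustrative values only).
earnings = [18_000, 22_500, 27_000, 31_000, 95_000]

# Sum all values and divide by the number of observations n.
n = len(earnings)
arithmetic_mean = sum(earnings) / n

# The standard library function should agree with the manual calculation.
library_mean = mean(earnings)

print(arithmetic_mean)
```

Note that four of the five hypothetical values lie below the result, because the single large value pulls the mean upwards.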

The arithmetic mean gives equal weight to all data points. Sometimes we want to give more importance to some observations than others. For example, if we wanted to calculate average inflation rates experienced by consumers in the EU, we would want to give higher weight to countries with larger populations.

weighted mean
The sum of each value in a data set multiplied by the weight assigned to it, divided by the sum of the weights. When the sum of the weights is 1, this is known as a share weighted mean.

The general formula to calculate a weighted mean is given by:

$\overline{X}^{w}=\frac{\sum_{i=1}^n w_{i}X_{i}}{\sum_{i=1}^n w_{i}}$

Each value of $$X_{i}$$ is multiplied by its respective weight $$w_{i}$$. The sum of these products is then divided by the sum of the weights across the n observations.

In the EU inflation example, with n countries, $$X_{i}$$ is the price change in country i and $$w_{i}$$ is the population of country i. Multiplying inflation, measured as a growth rate, by population, measured as a number of people, produces large numbers, so it is necessary to divide by the total population across all countries to get back to growth rate units.

A special case of a weighted mean is where the weights are shares that sum to 1. In that case this would give a share weighted mean:

$\overline{X}^{S}=\sum_{i=1}^{n} s_{i}X_{i}$

where the n individual shares $$\left(s_{1}, s_{2}, \ldots, s_{n}\right)$$ sum to 1:

$\sum_{i=1}^{n} s_{i} = 1$

In the inflation example, $$s_{i}$$ would be the share of each country’s population in the total EU population.
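The two weighted mean formulas can be sketched in Python as follows. The inflation rates and populations are made-up numbers for three imaginary countries, not real EU data, and the helper name `weighted_mean` is an assumption introduced for this example.

```python
# Hypothetical inflation rates (% per year) and populations (millions)
# for three made-up countries; these are not real EU figures.
inflation = [2.0, 3.0, 5.0]
population = [80.0, 15.0, 5.0]


def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# General weighted mean, using raw populations as weights.
general = weighted_mean(inflation, population)

# Share weighted mean: first convert the weights to shares that sum to 1.
total_population = sum(population)
shares = [w / total_population for w in population]
share_weighted = sum(s * x for x, s in zip(inflation, shares))

# Unweighted arithmetic mean, for comparison.
unweighted = sum(inflation) / len(inflation)

print(general, share_weighted, unweighted)
```

Both weighted versions give the same answer (about 2.3%), which is well below the unweighted mean (about 3.3%) because the largest hypothetical country has the lowest inflation.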

geometric mean
Given n data points, the geometric mean is the nth root of their product.

A third type of average is the geometric mean $$\left(\overline{X}^{G}\right)$$, which is generally used when variables are growing rapidly, for example the exponential increase in human population. It is calculated for the n observations $$\left(X_{1}, X_{2}, \ldots, X_{n}\right)$$ by the formula:

$\overline{X}^{G}=\left(\prod_{i=1}^n X_{i}\right)^{\frac{1}{n}}$

In this formula, $$\Pi$$ denotes the product operator. The geometric mean is calculated by multiplying the n observations together ($$X_{1} \times X_{2} \times \ldots \times X_{n}$$) and taking the nth root of the product.

There is a problem with the arithmetic mean: it is sensitive to very large numbers. When there are outliers or data that increase rapidly, the geometric mean may be preferable. Similarly, with data series that are growing over time, more recent observations tend to dominate. The geometric mean is generally used instead of the arithmetic mean when calculating average growth rates for a data series.

As a general rule, if we calculate a geometric mean and an arithmetic mean for the same data set of non-negative numbers, the geometric mean will always be less than or equal to the arithmetic mean. This is often seen when comparing the consumer price index (CPI) and the retail price index (RPI) described in Chapter 1. The CPI makes greater use of the geometric mean, while the RPI makes greater use of the arithmetic mean. This results in a negative formula effect: using the same data and weights, the inflation rate given by the CPI will be lower than (or, at most, equal to) that given by the RPI.
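The sketch below illustrates the geometric mean and its relationship to the arithmetic mean. It assumes growth is expressed as hypothetical annual growth factors (1.10 meaning growth of 10%), which is one common way of averaging growth rates and is not spelled out in the text above.

```python
from statistics import geometric_mean, mean

# Hypothetical annual growth factors (1.10 means +10%); illustrative only.
growth_factors = [1.10, 1.50, 0.80, 1.20]
n = len(growth_factors)

# Multiply the n observations together, then take the nth root.
product = 1.0
for x in growth_factors:
    product *= x
geo = product ** (1 / n)

# Arithmetic mean of the same values, for comparison.
arith = mean(growth_factors)

# The standard library (Python 3.8+) should agree with the manual result.
library_geo = geometric_mean(growth_factors)

print(geo, arith)
```

Compounding the geometric mean over the four periods reproduces the total growth (`geo ** n` equals `product`, up to rounding), which the arithmetic mean would overstate.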

A2A.3 The median

When there are outliers, the median may also provide a more robust measure of central tendency. Data on earnings, for example, often contain a small proportion of individuals whose earnings are far above those of the typical person. These observations raise the arithmetic mean in a way that distorts the story the data tell.

To demonstrate, Figure A2A.1 shows a sample of UK annual earnings for 800 individuals selected from the overall population.

Earnings are shown for each individual, arbitrarily numbered from person 1 to person 800. A few individuals have very high earnings, but most observations are bunched between £20,000 and £40,000 per annum.

The arithmetic mean for this sample is £29,556, but this is influenced by the relatively few very high earnings at the top end of the earnings scale, and so may not be a good measure of earnings for the typical person. Instead, it might be better to find the earnings of the person in the middle of this sample, who has the same number of people who earn more than her as earn less.

histogram
A graphical representation of the distribution of data, created by grouping values into a number of bands and drawing a bar for each band whose area represents the frequency of observations in that band.

If we suspect that there are outliers that are distorting the mean, we might look at a histogram of the sample. This is a convenient tool for showing how data is distributed. This plots how much of the data sample, or the frequency, falls in different bands. When all the bands are of equal width, then a histogram will look very similar to a normal bar chart. In Figure A2A.2, the earnings data for the 800 individuals displayed in Figure A2A.1 is organised into a histogram showing the number of people who fit into bands corresponding to £1,000 of earnings. Note that the horizontal axis shows the mid-points of these £1,000 earnings bands.

In Figure A2A.2, the earnings bands with the highest frequencies are in the range from £24,000 to £35,000. The earnings distribution is also shown to be positively skewed: the observations are bunched to the left, with a long tail of less frequent observations to the right. This eyeball test supports our suspicion that using the arithmetic mean to represent typical earnings might be misleading, as the calculation of the arithmetic mean is affected by the relatively few outliers on the right of the earnings distribution.

median
The median value corresponds to the middle data point in the sample when the observations have been arranged in order from lowest to highest. That is the value of the particular observation where half the observations are below its value and half are above.

Therefore, it might be more appropriate to use the median. When the data is positively skewed, the arithmetic mean is usually greater than the median. In the earnings distribution shown in Figure A2A.2, we calculate the median as £29,150 (to find the median of 800 data points arranged in order, see the box in Figure A2A.4). In this case, the median is below the arithmetic mean of £29,556, as we would expect.
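To illustrate how the median resists outliers, here is a small Python sketch using a hypothetical sample of six earnings values with one very high earner. With an even number of observations, the median is taken as the average of the two middle values.

```python
from statistics import median

# Hypothetical annual earnings in pounds, including one very high value.
earnings = [95_000, 18_000, 27_000, 22_500, 31_000, 24_000]

# Arrange the observations from lowest to highest.
ordered = sorted(earnings)
n = len(ordered)

if n % 2 == 1:
    # Odd n: the single middle observation.
    med = ordered[n // 2]
else:
    # Even n: average the two middle observations.
    med = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

mean_value = sum(ordered) / n

print(med, mean_value)
```

In this positively skewed example the mean (£36,250) sits well above the median (£25,500), and only one of the six individuals earns more than the mean.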

A2A.4 Percentiles

percentile
The value of an observation where a given percentage of all the observations are below or equal to this value. The median value is also the 50th percentile.

The median is actually a special case of a percentile (it is the 50th percentile). Other percentiles can be calculated by ordering the data by value from lowest to highest. The 10th percentile, for example, is the value of the observation where at least 10% of observations are lower than or equal to this value. Percentiles are useful measures if we are interested in a particular part of the distribution, such as those who are located towards the bottom of the earnings distribution.

How it’s done: How to calculate percentiles

The first step is to order the n observations in the sample from lowest to highest values such that $$X_{1} \leq X_{2}\ \leq \ \ldots\ \leq \ X_{n}$$.

The position i of the observation $$X_{i}$$ corresponding to the pth percentile in the ordered sample is given by:

$i=\left\lbrack\frac{p}{100}\right\rbrack n$

In the example used in this appendix of the earnings distribution, the number of observations is n = 800. The observation which corresponds to the 25th percentile will be the 200th $$\left(X_{200}\right)$$, as:

$\left\lbrack\frac{25}{100}\right\rbrack 800 = 200$

Figure A2A.3 shows the same earnings data from Figure A2A.1, but this time the 800 observations have been ordered from lowest earnings to highest earnings. The median value corresponds to the middle observation, which is located between the 400th and 401st observations (there are 399 observations below the 400th observation and 399 observations above the 401st observation). In this case, an average of the two is taken, giving a median wage of £29,150.

Likewise, the 10th percentile is the value of the observation that is 10% along the ordering, which lies between the 80th and 81st observations. Again, an average of the two is taken, giving a value of £13,925. This tells us that 10% of the individuals in the sample earn this amount or less.
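The positional rule used in these examples can be sketched in Python. The function name `percentile` and the 20-value sample are hypothetical. When the position is a whole number, the sketch averages that observation with the next one, as in the median and 10th percentile examples above; when it is not, rounding the position up is an assumption made here, and statistical software often uses other interpolation conventions.

```python
import math


def percentile(values, p):
    """Percentile by position i = (p / 100) * n; assumes whole-number p."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    if (p * n) % 100 == 0:
        # Whole-number position: average the ith and (i + 1)th observations.
        i = (p * n) // 100  # 1-based position in the ordering
        return (ordered[i - 1] + ordered[i]) / 2
    # Assumption: otherwise round the position up to the next observation.
    return ordered[math.ceil(p * n / 100) - 1]


# Hypothetical ordered sample of 20 earnings values: 1,000, 2,000, ..., 20,000.
sample = list(range(1_000, 21_000, 1_000))

median_value = percentile(sample, 50)  # between the 10th and 11th values
tenth = percentile(sample, 10)         # between the 2nd and 3rd values

# Position of the 25th percentile in a sample of 800, as in the text.
position_25 = 25 * 800 // 100

print(median_value, tenth, position_25)
```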

A2A.5 The mode

mode
The value that occurs with the most frequency in the data.

A well-known measure of central tendency, but one rarely used in practice, is the mode.

The mode is difficult to use when dealing with continuous quantitative data and is more applicable to categorical variables, where these take on one of a limited number of possible values, such as occupation or level of highest qualification. For example, in the earnings data, each individual’s earnings are likely to be recorded to the exact pound and pence, in which case relatively few of the 800 individuals in the sample will earn exactly the same amount. Therefore, it might be better to think about the modal class, that is the most frequently occurring earnings band. Looking at Figure A2A.2, the modal class for the earnings data is the £29,000 to £30,000 band.
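A short Python sketch shows why the modal class is more informative than the mode for data recorded to the penny. The earnings values are hypothetical, and bands are labelled by their lower edge, which is a choice made for this example.

```python
from collections import Counter

# Hypothetical earnings recorded to the penny (illustrative values only).
earnings = [24_310.55, 24_875.10, 24_120.00, 27_450.25, 31_002.40, 18_640.75]

# Every exact value occurs once, so the mode of the raw data is uninformative.
highest_value_count = max(Counter(earnings).values())

# Group into £1,000 bands, labelled by lower edge (24,310.55 -> 24,000).
band_width = 1_000
band_counts = Counter(int(x // band_width) * band_width for x in earnings)

# The modal class is the band containing the most observations.
modal_lower, modal_count = band_counts.most_common(1)[0]

print(highest_value_count)
print(f"Modal class: £{modal_lower:,} to £{modal_lower + band_width:,}")
```

Here no exact value repeats, yet three of the six hypothetical individuals fall in the £24,000 to £25,000 band.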

A2A.6 The range

Useful summary statistics do not just refer to measures of central tendency, but also to how the data is distributed or spread out. To look at this, we calculate measures of dispersion.

Figure A2A.4 shows the original sample of earnings in blue, and a second sample of 800 individuals from the same population in red. Just from looking at the data, we may conclude that the newer sample of earnings data is less dispersed than the original sample. The new sample has relatively fewer observations at the extreme ends, and so it appears less spread out than the original.

range
A simple measure of dispersion given by subtracting the minimum value in the data from the maximum value.

The range is the simplest measure of dispersion in a data sample: it is the largest value minus the smallest. In the original sample of earnings, the lowest was recorded at £3,000 and the highest at £102,400, giving a range of £99,400. In the new sample, the lowest recorded earnings are £10,200 and the highest are £61,600. This gives a much lower range of £51,400 in the new sample.

The range is a crude measure, and being based on only two observations at either end of the distribution, it is obviously sensitive to outliers. Suppose we removed the two lowest and two highest observations in the original sample, so it has 796 observations instead of 800. This small change in the data would reduce the maximum value to £87,400 and increase the minimum to £4,100, giving a range of £83,300 instead of £99,400. The difference in the range between the two samples would now be much narrower. The range can still be useful as a descriptive statistic – ironically, we can use it to check for outliers or errors in the data.
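The sensitivity of the range to a handful of extreme values can be sketched as follows. The minimum and maximum values before and after trimming (£3,000, £4,100, £87,400 and £102,400) come from the text; every other value in the list, including the second-lowest and second-highest, is invented to make the example runnable.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


# Small stand-in for the original sample. Only 3,000, 4,100, 87,400 and
# 102,400 are taken from the text; the remaining values are hypothetical.
earnings = [3_000, 3_600, 4_100, 18_000, 27_000, 36_500, 87_400, 95_000, 102_400]

full_range = value_range(earnings)

# Remove the two lowest and two highest observations.
trimmed = sorted(earnings)[2:-2]
trimmed_range = value_range(trimmed)

print(full_range, trimmed_range)
```

Dropping just four observations reduces the range from £99,400 to £83,300, matching the figures quoted above.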

A2A.7 Position measures of dispersion

position measures of dispersion
A measure of dispersion between two positional points in the sample. Two common measures are the interquartile range: a measure of spread that uses the difference between the 25th and 75th percentiles; and the 90:10 range: a measure of spread that uses the difference between the 10th and 90th percentiles.

As an alternative to the range, we could use differences between percentiles of the distribution to measure the spread of the data. Two frequently reported position measures of dispersion are the interquartile range – the difference between the 1st and 3rd quartile values (the 25th and 75th percentiles of the distribution), and the 90:10 range – the difference between the 10th and 90th percentiles.

In Figure A2A.5, these measures of dispersion have been calculated for the two earnings distributions shown in Figure A2A.4. Both the interquartile range and the 90:10 range indicate a lower spread for the newer sample, but the differences between the two samples are much smaller than the difference in their ranges.

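| Measure (£ thousands) | Original sample | New sample |
|---|---|---|
| Average of 200th and 201st observations (25th percentile) | 22.1 | 24.0 |
| Average of 600th and 601st observations (75th percentile) | 35.1 | 34.7 |
| Interquartile range | 13.0 | 10.7 |
| Average of 80th and 81st observations (10th percentile) | 13.9 | 17.3 |
| Average of 720th and 721st observations (90th percentile) | 45.1 | 43.5 |
| 90th minus 10th percentile | 31.2 | 26.2 |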

Figure A2A.5 Annual earnings for two samples of UK population, percentile measures of spread


As with the calculation of the median, these measures do not use information about the shape of the distribution, but instead concentrate on specific points of the distribution. These types of position measures of dispersion can be useful if we are simply interested in a direct comparison of two points in the distribution.
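The arithmetic behind Figure A2A.5 can be reproduced with a few lines of Python. The percentile values are those reported in the figure and the ranges are those given in the text, read here as thousands of pounds (consistent with the 10th percentile of £13,925 reported earlier); the dictionary layout and function name are choices made for this sketch.

```python
# Percentile values as reported in Figure A2A.5 (read as £ thousands).
percentiles = {
    "original": {10: 13.9, 25: 22.1, 75: 35.1, 90: 45.1},
    "new": {10: 17.3, 25: 24.0, 75: 34.7, 90: 43.5},
}

# Ranges from the text (max minus min), also in £ thousands.
ranges = {"original": 99.4, "new": 51.4}


def position_spreads(p):
    """Return (interquartile range, 90:10 range), rounded to one decimal."""
    return round(p[75] - p[25], 1), round(p[90] - p[10], 1)


iqr_orig, r9010_orig = position_spreads(percentiles["original"])
iqr_new, r9010_new = position_spreads(percentiles["new"])

# Gap between the two samples under each measure of dispersion.
gap_range = round(ranges["original"] - ranges["new"], 1)
gap_iqr = round(iqr_orig - iqr_new, 1)
gap_9010 = round(r9010_orig - r9010_new, 1)

print(iqr_orig, iqr_new, r9010_orig, r9010_new)
print(gap_range, gap_iqr, gap_9010)
```

The gap between the samples is £48,000 when measured by the range, but only £2,300 for the interquartile range and £5,000 for the 90:10 range.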

A2A.8 Variance

variance
The expectation of the squared deviation of a random variable from its mean. It measures how far a set of numbers deviates from their average value.

The most commonly used measure of dispersion, however, is the variance. Unlike positional measures of dispersion, the variance uses information on the shape of the distribution and is calculated using all the observations in the sample.

Dispersion measures show how spread out the data are, so a useful way to estimate dispersion would be to look at differences between our observations and the arithmetic mean for the sample. These are known as deviations from the mean. Some will be positive (the observation is higher than the mean) and some will be negative (the observation is lower).

However, it follows from the definition of the arithmetic mean that the deviations from the mean always sum to zero:

$\sum_{i=1}^{n} \left(X_{i} − \overline{X}\right) = 0$

Therefore, we cannot simply add up the deviations from the mean to measure dispersion.
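The zero-sum property can be checked directly from the definition of the arithmetic mean. Since $$\overline{X}=\frac{\sum_{i=1}^{n} X_{i}}{n}$$, the sum of the observations equals $$n\overline{X}$$, so:

$\sum_{i=1}^{n} \left(X_{i} - \overline{X}\right) = \sum_{i=1}^{n} X_{i} - n\overline{X} = n\overline{X} - n\overline{X} = 0$

The positive and negative deviations cancel exactly, whatever the shape of the distribution.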

Instead, the calculation of the variance is based on the sum of squared deviations from the mean. That is, for each observation we calculate its value, minus the arithmetic mean of the sample, square this, and then sum across the observations. Finally, we divide by the sample size minus 1 (n − 1).

The formula used to calculate the variance (denoted as $$s^{2}$$) of a sample is:

$s^{2}=\frac{\sum_{i=1}^{n} (X_{i} − \overline{X})^{2}}{n − 1}$

Note that in this calculation we divide by n − 1 rather than n. This is because we have already used information from the sample to estimate the mean, so only n − 1 independent pieces of information (known as degrees of freedom) remain for calculating the variance.
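The steps of the variance calculation can be sketched in Python. The eight-value `sample` is hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
from statistics import variance

# Hypothetical sample chosen so the arithmetic is easy to check by hand.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

mean_value = sum(sample) / n

# Deviations from the mean: some positive, some negative, summing to zero.
deviations = [x - mean_value for x in sample]
sum_of_deviations = sum(deviations)

# Square each deviation and sum, then divide by n - 1.
sum_of_squares = sum(d ** 2 for d in deviations)
sample_variance = sum_of_squares / (n - 1)

# Dividing by n instead gives a smaller value.
divided_by_n = sum_of_squares / n

# statistics.variance also uses the n - 1 divisor.
library_variance = variance(sample)

print(sum_of_deviations, sample_variance, divided_by_n)
```

For this sample the mean is 5, the squared deviations sum to 32, and the sample variance is 32/7 (about 4.57), compared with 4.0 if we had divided by n.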

standard deviation
The square root of the variance. It has the advantage of being measured in the same units as the arithmetic mean.

The standard deviation for a sample (denoted $$s=\sqrt{s^{2}}$$) is simply the square root of the sample variance:

$s=\sqrt{\frac{\sum_{i=1}^{n} \left(X_{i} − \overline{X} \right)^{2}}{n − 1}}$

By taking the square root of the variance to calculate the standard deviation, we get back to units that are comparable with the arithmetic mean.

coefficient of variation
The standard deviation of the sample divided by the mean. It helps to compare dispersion for samples with different means.

We can also calculate the coefficient of variation (CV) which scales the standard deviation (s) by the sample arithmetic mean ($$\overline{X}$$). This can be a useful measure for comparing dispersion if arithmetic mean values vary significantly across different samples.

$\text{CV} = \frac{s}{\overline{X}}$

Figure A2A.6 shows the values of the sample variance, standard deviation and coefficient of variation for the two earnings distributions shown in Figure A2A.4. Each of these three measures shows less dispersion in the new sample of earnings data compared to the original.

| | Original sample | New sample |
|---|---|---|
| Mean | 29,556 | 30,010 |
| Variance | 170,712,883 | 91,102,283 |
| Standard deviation | 13,066 | 9,545 |
| Coefficient of variation | 0.44 | 0.32 |

Figure A2A.6 Measures of dispersion for two samples of UK earnings data
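As a consistency check on Figure A2A.6, the standard deviation and coefficient of variation can be recomputed from the reported means and variances. This sketch uses only the figures in the table (reading the new sample mean as 30,010); the dictionary layout is a choice made for the example.

```python
import math

# Means and variances as reported in Figure A2A.6
# (the new sample mean is read as 30,010).
figures = {
    "original": {"mean": 29_556, "variance": 170_712_883},
    "new": {"mean": 30_010, "variance": 91_102_283},
}

results = {}
for name, f in figures.items():
    # Standard deviation is the square root of the variance.
    sd = math.sqrt(f["variance"])
    # Coefficient of variation scales the standard deviation by the mean.
    cv = sd / f["mean"]
    results[name] = (round(sd), round(cv, 2))

print(results)
```

The recomputed values, 13,066 and 0.44 for the original sample and 9,545 and 0.32 for the new sample, agree with the table.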


A2A.9 Summary

This appendix has covered the summary statistics most commonly used to describe the central tendency and dispersion of a sample of data. These are the statistics that you will use for the vast majority of your work with national statistics. Other summary statistics that are beyond the scope of this appendix, such as skewness and kurtosis, can be used to describe the shape of a sample distribution. In Appendix 2E we will investigate degrees of statistical dependence between two or more measures.