On 13 August 2007, as the global financial crisis began, the Chief Financial Officer of Goldman Sachs, David Viniar, told the Financial Times “we were seeing things that were 25-standard deviation moves several days in a row”. Taken literally, Viniar’s statement was incredible; our universe has not existed long enough for there to have been several days on which 25 standard deviation events could plausibly occur. Mr Viniar has a degree in economics from Union College and an MBA from Harvard Business School, so he would have known this, unless he had forgotten the content of his elementary courses. But what he really meant, although his claim was wrapped up in statistical jargon, was that the moves in prices were vastly larger than anything his risk managers had previously experienced or thought possible.
The probability of even one 25 standard deviation event – the man who is over 11 feet tall – is so low that several lines of zeros would be needed before you reached a significant digit. If you bought one ticket for the National Lottery, won the jackpot, and repeated the trick twenty times in a row, that would be an event of comparable improbability. Or, more likely, a cause for investigation of the integrity of the National Lottery: if you did win the jackpot on twenty consecutive occasions, most people would question the lottery operator’s claim that each draw was a random selection from all valid entries. If you enter a coin-tossing game and the coin comes up heads fifty times in a row, you may have witnessed a run of extraordinary statistical improbability, of the kind Mr Viniar supposed he had encountered. But there are other explanations you might consider first.
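The arithmetic can be checked in a few lines of Python. The sketch below assumes daily price moves are normally distributed (the assumption implicit in Viniar’s claim), and the jackpot odds of roughly 1 in 45 million per draw are an illustrative assumption, not a statement about any actual lottery.

```python
import math

# Tail probability of a 25 standard deviation move under a normal
# distribution: P(Z > 25) = erfc(25 / sqrt(2)) / 2.
p_25_sigma = 0.5 * math.erfc(25 / math.sqrt(2))

# How many zeros follow the decimal point before the first
# significant digit.
leading_zeros = -math.floor(math.log10(p_25_sigma)) - 1

# Twenty consecutive jackpots, assuming (hypothetically) odds of
# 1 in 45 million per draw.
p_twenty_jackpots = (1 / 45_000_000) ** 20

# Fifty heads in a row with a fair coin.
p_fifty_heads = 0.5 ** 50

print(f"P(25 sigma event) ~ {p_25_sigma:.2e}")   # ~ 3e-138
print(f"leading zeros     = {leading_zeros}")    # 137
print(f"P(20 jackpots)    ~ {p_twenty_jackpots:.2e}")
print(f"P(50 heads)       ~ {p_fifty_heads:.2e}")
```

The first figure has 137 zeros after the decimal point before its first significant digit – rather more than one day’s trading in the history of the universe would allow, let alone several in a row.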
Estimates of populations drawn from samples are only as good as the methods employed to construct the samples. Random sampling from a large population of individuals is used for many purposes, of which opinion polling to predict election results has been the most recently controversial example. When polling began, the difficulty of constructing representative samples was not well understood. The greatest fiasco in polling history was the Literary Digest prediction of the result of the 1936 US presidential election. The magazine anticipated a landslide victory by the Republican candidate, Alf Landon, based on a survey of the voting intentions of 2.3 million electors. The result was indeed a landslide; incumbent president Franklin Roosevelt won every state in the Union except Maine and Vermont, and secured the largest majority ever achieved in the electoral college.
The magazine had sent out questionnaires to about ten million people, drawn from its own subscription list, registers of telephone subscribers and automobile owners, and similar sources. But – especially in the aftermath of the Great Depression – these groups were not representative of the American population. Roosevelt was a polarising figure, and the 2.3 million people who chose to respond to the Literary Digest enquiry included many more of those outraged by his New Deal policies than of those – typically poorer households – who supported them.
The landslide buried the reputation of the Literary Digest, which closed soon after. But at the same time it made the reputation of the then little-known George Gallup, who correctly predicted the result by using the methods of quota sampling, which sought to match the characteristics of his respondents to known characteristics of the American population as a whole. Within a couple of decades, the name of Gallup was almost synonymous with political polling.
Quota sampling is not the same as random sampling from a population. Rather, it uses a model to estimate, from the answers actually received, what the answers would have been had the respondents been a random selection from the population. Modern pollsters know that their sample is not in any sense random, and now use sophisticated and complex models to adjust for their failure to achieve randomness. But this confronts the pollsters, and those who want to use their results, with the problem that Mr Viniar had failed to recognise: the probability derived from the model has to be compounded with the probability that the model is itself true. And we have no means of deriving the latter probability, or indeed of attaching meaning to such a probability. We can usefully say things like “the pollsters are very experienced”, or “the model has worked well in the past”. But these are statements about confidence and judgement, not about probabilities. The attempt to attach a confidence interval to an opinion poll result is thus very difficult to justify.
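The kind of adjustment such models make can be illustrated with a toy weighting (post-stratification) exercise. Everything below – the groups, their population shares, and the responses – is invented for illustration; real pollsters weight on many more characteristics at once.

```python
# Toy post-stratification: reweight an unrepresentative sample so that
# each group counts in proportion to its known population share.
# All figures are invented for illustration.

population_share = {"young": 0.30, "middle": 0.45, "old": 0.25}

# Raw sample: (group, says_will_vote_A). Young voters are badly
# under-represented, as they often are in real samples.
sample = (
    [("young", True)] * 20 + [("young", False)] * 30 +
    [("middle", True)] * 120 + [("middle", False)] * 130 +
    [("old", True)] * 180 + [("old", False)] * 120
)

def weighted_support(sample, population_share):
    # Weight each respondent by (population share) / (sample share)
    # of their group, then take the weighted mean of the responses.
    n = len(sample)
    group_counts = {}
    for group, _ in sample:
        group_counts[group] = group_counts.get(group, 0) + 1
    total = 0.0
    for group, vote in sample:
        weight = population_share[group] / (group_counts[group] / n)
        total += weight * vote
    return total / n

raw = sum(v for _, v in sample) / len(sample)         # 53.3%
adjusted = weighted_support(sample, population_share)  # 48.6%
print(f"raw support:      {raw:.1%}")
print(f"weighted support: {adjusted:.1%}")
```

The adjustment flips an apparent majority into a minority – but only on the model’s unverifiable premise that, within each group, the people who answered resemble the people who did not. That premise is precisely the “probability that the model is itself true” to which no number can be attached.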
Nor is that the last of the problems. We need to believe that the answer to a question about voting intentions is a reliable guide to actual voting behaviour. We know that the honesty of answers depends in large part on the nature of the question asked; for example, we know from looking at aggregate statistics that people are much better at reporting their consumption of milk than their consumption of alcohol. And then there is the requirement to translate shares of the popular vote into an electoral outcome. For a referendum – such as the UK vote on Brexit in 2016, in which all that mattered was the vote count on each side – this translation is straightforward (albeit that many pollsters got the result wrong). But when a president is selected by an electoral college, or the composition of a government depends on results in individual constituencies, an additional modelling exercise is required. In both the votes of 2016 – the US presidential election and the Brexit referendum – the failure of the pollsters to anticipate the result was the consequence of the failure of their models to translate the raw data into an accurate prediction – the same problem that had defeated Mr Viniar.
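Why that extra modelling step matters can be seen in a toy first-past-the-post example. The five constituencies and vote shares below are invented; the “uniform swing” rule – shift every constituency’s share by the same amount – is only the crudest of the translation models pollsters actually use.

```python
# Toy illustration: seats depend on how the vote is distributed across
# constituencies, not just on the national share. Figures are invented.

# (constituency, share of the vote for party A)
baseline = [
    ("Seat 1", 0.52), ("Seat 2", 0.51), ("Seat 3", 0.51),
    ("Seat 4", 0.30), ("Seat 5", 0.30),
]

def seats_won(shares, swing=0.0):
    # Uniform national swing: add the same change in vote share to
    # every constituency, then count first-past-the-post wins.
    return sum(1 for _, s in shares if s + swing > 0.5)

national = sum(s for _, s in baseline) / len(baseline)
print(f"national share:   {national:.0%}")             # 43%
print(f"seats won:        {seats_won(baseline)}/5")    # 3/5 - a majority
print(f"after -2pt swing: {seats_won(baseline, -0.02)}/5")  # 0/5
```

With 43 per cent of the vote the party holds a majority of seats, and a two-point error in the estimated national share wipes out every one of them: small errors in the polling model become large errors in the predicted outcome.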
So mistrust not just the polls, but the confidence intervals supposedly attached to them. Remember Mr Viniar’s error.