Hoping that high quantity can compensate for low quality is a classic mistake in the burgeoning field of big data, says Xiao-Li Meng, a professor of statistics at Harvard who’s the founding editor-in-chief of the 2-year-old Harvard Data Science Review.

In a perfectly random sample there’s no correlation between someone’s opinion and their chance of being included in the data. If there’s even a 0.5% correlation—i.e., a small amount of selection bias—the nonrandom sample of 2.3 million will be no better than the random sample of 400, Meng says.

That’s a reduction in effective sample size of 99.98%.
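The arithmetic behind these figures can be checked directly. A minimal sketch, using Meng's approximation that a biased sample of size n from a population of size N with data defect correlation ρ behaves like a random sample of roughly n / ((N − n)ρ²), and assuming a voting-eligible population of about 231 million (the figure Meng uses for the 2016 election; both N and the approximation are assumptions here, not stated in the article):

```python
# Effective sample size under selection bias, per Meng's approximation:
#   n_eff ≈ n / ((N - n) * rho**2)
# Assumed figures: N is the ~231 million voting-eligible US population,
# n the 2.3 million nonrandom responses, rho the 0.5% defect correlation.
N = 231_000_000
n = 2_300_000
rho = 0.005

n_eff = n / ((N - n) * rho**2)
reduction = 1 - n_eff / n

print(f"effective sample size: {n_eff:.0f}")  # ~400
print(f"reduction: {reduction:.2%}")          # ~99.98%
```

With these inputs the 2.3 million responses carry about as much information as a random sample of 400, matching the figures quoted above.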

That’s not just theory: Statisticians estimate that there was a 0.5% correlation contaminating the 2016 presidential polls, presumably because supporters of Donald Trump were slightly less likely to express their preference to pollsters. That’s why so many pollsters were caught by surprise when Trump won. The 2020 polls suffered similar problems.

[...]

Meng compares data analysis to testing the saltiness of a large vat of soup. If the soup is well stirred, all you need is a tiny bit—less than a teaspoon—to tell how salty it is. In data terms, you’re taking a random sample of the soup vat. If the soup isn’t well stirred, you could drink gallons of it and still not know its average saltiness, because the part you didn’t taste might be different from the part you did taste.

Meng isn’t the first to stress the risk of selection bias. His contribution is in quantifying it. He has created what he calls a “data defect index” and has developed a formula that’s simple by the standards of mathematical statistics. It says that relative bias is proportional to the data defect correlation times the square root of the population size. The bigger the population that’s under study (i.e., the bigger the vat of soup), the bigger the potential problem.
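The identity behind that formula can be verified numerically. A sketch, with a made-up response mechanism: the error of a nonrandom sample mean equals ρ between inclusion and opinion, times sqrt((1 − f)/f) where f = n/N is the sampling fraction, times the population standard deviation. Since sqrt((1 − f)/f) = sqrt(N/n − 1) grows with N at a fixed sample size, the bias grows with the square root of the population size.

```python
import numpy as np

rng = np.random.default_rng(42)

# Finite population of N values (think: 1 = supports a candidate; continuous here).
N = 100_000
Y = rng.normal(size=N)

# Hypothetical biased response mechanism: units with larger Y answer slightly more often.
p = 0.01 + 0.002 * (Y > 0)
R = rng.random(N) < p          # inclusion indicator: who ends up in the data
f = R.mean()                   # realized sampling fraction n/N

# Actual error of the nonrandom sample mean...
error = Y[R].mean() - Y.mean()

# ...equals the decomposition exactly: rho_{R,Y} * sqrt((1-f)/f) * sigma_Y,
# with all moments taken over the finite population (ddof=0).
rho = np.corrcoef(R.astype(float), Y)[0, 1]
predicted = rho * np.sqrt((1 - f) / f) * Y.std()

assert np.isclose(error, predicted)
```

The agreement is exact (up to floating point) because the decomposition is an algebraic identity, not an approximation; the approximation only enters when ρ is treated as roughly constant across populations.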


To continue the metaphor, if the soup isn't well stirred, "average saltiness" is less meaningful: even if calculated correctly, it doesn't tell you how salty the next spoonful will be, which might depend on whether you take it from the top or the bottom.

Similarly, averages over diverse groups are often less meaningful than they appear, even when the data analysis gets the average right. For example, a population-wide COVID statistic tells you less about your own risk than the same statistic within your age group, because the risk depends heavily on age. Likewise, a nationwide average can be less meaningful than a statewide one when some states are doing much worse than others.
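A toy illustration of the point, with made-up numbers (not real COVID rates): the population-wide average lands on a risk level that applies to almost no individual group.

```python
# Hypothetical per-group risk rates and population shares (illustrative only).
rates  = {"under 40": 0.0005, "40-69": 0.005, "70+": 0.05}
shares = {"under 40": 0.5,    "40-69": 0.4,   "70+": 0.1}

# The overall average is correct arithmetic, yet matches no group:
# it's ~15x the under-40 rate and ~1/7 the 70+ rate.
overall = sum(rates[g] * shares[g] for g in rates)
print(f"{overall:.4f}")  # 0.0073
```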


It reminds me of the story of how the US Air Force tried to design seats for its jets by averaging each body measurement across pilots. The resulting seat fit no one comfortably, and that failure led to adjustable seats.
