Some time ago, a bit of news came out: “Coffee drinkers have a higher cancer rate than non-coffee drinkers.” My immediate response was “false correlation.” My wife asked me why professionals would issue a report that was not only flawed, but flawed so badly that it took me seconds to reach that conclusion. That’s actually the tough part of this. Let me explain the easy part: how false correlations can easily produce bad results.

I’ll use a population of 10,000 random coffee drinkers. They are surveyed and discovered to have a cancer rate of 5%. Those conducting the study find a 3.3% rate of cancer in the general population, and even with margins of error (I’ll pass on that math for this article) the 5% rate they see is significant: about 50% more instances of cancer than expected.

Let’s look at the data that might have revealed the error made.

  • Cigarette smokers show a 10% rate of cancer
  • Non-smokers show a 1% rate of cancer
  • Random population shows 25% smokers

You can already see where this is going. The study was conducted to understand whether coffee drinkers had a higher rate of cancer. So far, I’ve jumped to data regarding smoking. To continue:

  • Coffee drinkers reveal a 50% rate of smoking (Not likely the case, I exaggerate a bit to make a point)

Now, in a population of 10,000 coffee drinkers, the 5,000 smokers would be expected to have 500 cancer instances and the 5,000 non-smokers another 50, for 550 in total. When this is taken into account, the 500 cases actually observed (the 5% rate) are a bit lower than expected. New headline? “Coffee may reduce cancer rates for smokers.”
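For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the made-up figures from this article (10% cancer rate for smokers, 1% for non-smokers, 25% smokers in the general population, 50% among coffee drinkers). The function name `blended_rate` is my own invention for illustration, not something from the study.

```python
def blended_rate(smoker_share, smoker_rate=0.10, nonsmoker_rate=0.01):
    """Cancer rate expected from smoking alone, as a weighted average.

    The default rates are the illustrative numbers from this article,
    not real epidemiological data.
    """
    return smoker_share * smoker_rate + (1 - smoker_share) * nonsmoker_rate


# Expected rates if smoking were the only factor at work.
general = blended_rate(0.25)  # 25% smokers in the general population
coffee = blended_rate(0.50)   # 50% smokers among coffee drinkers (exaggerated)

sample_size = 10_000
expected_cases = round(coffee * sample_size)  # cases predicted by smoking alone
observed_cases = round(0.05 * sample_size)    # the 5% rate the survey found

print(f"General population, expected rate: {general:.2%}")
print(f"Coffee drinkers, expected rate:    {coffee:.2%} ({expected_cases} cases)")
print(f"Coffee drinkers, observed:         {observed_cases} cases")
```

The general-population figure works out to 3.25%, which is the 3.3% quoted earlier after rounding. Once the smoking rate among coffee drinkers is plugged in, smoking alone predicts more cases than the survey found, and that is the whole trick behind the misleading headline.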

The above is a bit of an exaggeration. The truth is, press releases rarely offer the data behind the conclusion, leaving the reader to decide whether the headline is worth researching or giving any credence at all. It was only a few months later that I read new reports that reached the same conclusion I did: smoking was more common among coffee drinkers than among non-drinkers. In the end, even the so-called pros make mistakes.

Do you have any examples of similar errors in data analysis and false correlation? Send me a note; I’d be happy to print it.