English: Modified version of File:CDC-11214-swine-flu.jpg for landscape aspect. (Photo credit: Wikipedia)

For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.
― Richard P. Feynman

The drive toward “big data” continues to fascinate, but as with other buzzwords that have captured the imagination, it’s incumbent upon the adults to stop and contemplate what’s really going on here. Some have sized up the risks to our society nicely, while others have looked at the more practical issues with analyzing large data sets. I recently came across a story in Science from earlier this year about big data which captured real problems with the concept itself.

The story is about how Google’s flu trends service, which analyzes undisclosed search terms to approximate flu intensity in particular regions, substantially overestimated the prevalence of flu for a number of years.

Google’s lack of transparency about how its flu trends works is just the tip of the algorithmic iceberg, as many other problems are identified by those studying the issues involved in this particular implementation of big data. These problems include:

  • “Big data hubris,” or the implicit assumption that big data are a substitute for traditional data gathering and analysis. They are, at best, a supplement to data gathered through actual surveillance of actual behavior or characteristics. As proxies, big data solutions have yet to prove they are reliable, and may always be in the position of having to prove it, which makes them a supplement, not a replacement.
  • The lack of static inputs into big data sets. In the case of Google’s search algorithm, it received 86 substantial modifications in June and July 2012 alone. The terms used in flu trends were also modified repeatedly over the years, but in ways the users of the data do not know or understand. Finally, users of the Google search engine are entering these search terms for a variety of reasons — it turns out, often searching for cold remedies — further confounding the reliability of the inputs into the data sets.
  • Data sets not purpose-built for epidemiology. In Google’s case, the search engine is built to support speed of response and advertising revenues. Those are its primary goals. As the authors of the report in Science put it, Google flu trends “bakes in an assumption that relative search volume for certain terms is statically relevant to external events, but search behavior is not just exogenously determined, it is also endogenously cultivated by the service provider.” In other words, Google has its thumb on the scale in order to support its primary business goals. You can see this thumb with auto-complete, which guides users toward search terms that others have already made popular.
  • “In the wild” systems are open for abuse and manipulation. Google and other publicly available online services are subject to all sorts of outside pressures, from “astroturf” marketing and political campaigns to filtering by oppressive governments to denial of service attacks. These make the data less reliable. And when data are known to be produced by these systems which carry broad societal and commercial weight, the temptation to manipulate these data may lead to purposeful attacks and manipulations.
  • Replication is not required and may not be possible. This returns us to the lack of transparency with the Google data, as well as its granularity. The data in Google flu trends is presented as-is, and cannot be analyzed by outside parties. With the likelihood that most large data sets will have proprietary or privacy issues around them, these limitations may dog big data for a long time.
  • Big data needs to be validated, and real-world data sets will be needed to provide validation. As mentioned in the first point, the CDC data for flu prevalence were not improved upon by Google flu trends. This is a major point. Big data may not ever do more than generate hypotheses or prove supplemental. How much money and intellectual effort do we invest in things that are inferior or secondary?
  • Good data don’t have to be big data. Not only can smaller data sets tell us things big data never could, but statistical techniques can make relatively modest data sets sufficient for many questions. Data appropriateness and accuracy are better qualifiers.

It’s worth noting that a story in the New York Times covering the limitations of Google flu trends still presents the Google data without qualification toward the top of the page, as if they were completely reliable and legitimate. The siren song of big data solutions may be stronger than our requirement that they be true.

There’s a certain bizarre aspect to these complaints about Google flu trends. While Google’s algorithms weren’t able to beat real-world surveillance in monitoring flu outbreaks, the company’s AdSense system, a proprietary source of big data that operates without external validation and generates billions annually in relevance-based ad matches, is very effective. But what if the AdSense algorithms were put up against an analogous reality-based relevance system? Would they do so well? Sometimes, the ship in the bottle sails the smoothest.

Looking at big data from this perspective, we can see a different issue — namely, a lack of validation of Google’s own algorithms except in the circular realm of their business uses. That is, because Google’s advertising business is commercially successful, the algorithms are assumed to work well. But if they were to be validated by external measures of relevancy, they may not perform to reasonable expectations. Big data standing alone can create a world all its own.

Watching the baseball playoffs now, I was reminded of sabermetrics, which was revived in the modern era by the Oakland A’s, who have been consistently high performing despite a lower payroll, thanks to their proficiency with the practice. One carryover from Bill James’ work that made its way into mainstream baseball statistics is on-base percentage, or OBP. This came from the insight of an expert (essentially, that a walk is as good as a single) and carried over into data analysis and statistics, providing a marginal improvement over traditional measures like runs batted in (RBI), slugging percentage, and so forth.
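To make the OBP idea concrete, here is a minimal sketch in Python using the standard definition of the statistic: hits plus walks plus hit-by-pitch, divided by at-bats plus walks plus hit-by-pitch plus sacrifice flies. The function names and the two season lines are hypothetical, invented purely for illustration and not drawn from any real player.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Standard OBP: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    if chances == 0:
        return 0.0  # no plate appearances counted, so nothing to measure
    return (hits + walks + hit_by_pitch) / chances


def batting_average(hits, at_bats):
    """Traditional batting average: hits per at-bat (walks are ignored)."""
    return hits / at_bats if at_bats else 0.0


# Hypothetical season lines: same hits and at-bats, very different walk totals.
patient_hitter = dict(hits=150, walks=90, hit_by_pitch=0,
                      at_bats=500, sacrifice_flies=0)
free_swinger = dict(hits=150, walks=30, hit_by_pitch=0,
                    at_bats=500, sacrifice_flies=0)

for name, line in (("patient hitter", patient_hitter),
                   ("free swinger", free_swinger)):
    avg = batting_average(line["hits"], line["at_bats"])
    obp = on_base_percentage(**line)
    print(f"{name}: AVG {avg:.3f}, OBP {obp:.3f}")
```

Both hypothetical hitters have the same batting average, but OBP separates them because it counts a walk the same way it counts a single. The theory came first, and the formula simply encodes it.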

Strong theoretical constructs like OBP guide data gathering and analysis, and ultimately provide value because they make sense. “Making sense” is the key to any data endeavor. Freestyle data swimming may generate waves, but to get to genuine meaning, there has to be an initial hypothesis or theory of the numbers. Then, how much data do you need to test the idea?

We can’t turn off our brains yet. We need to think a little harder about data now than ever before — its provenance, what we provide and who receives what we provide, how our data privacy is protected, and so forth. We also need to think hard about the questions we really have about the world. As the brief history of Google flu trends shows, there are major limitations still facing the field, not the least of which is whether big, indirect data can be any better than properly sized direct data measurements aided by statistical extrapolation.

Kent Anderson

Kent Anderson is the CEO of RedLink and RedLink Network, a past-President of SSP, and the founder of the Scholarly Kitchen. He has worked as Publisher at AAAS/Science, CEO/Publisher of JBJS, Inc., a publishing executive at the Massachusetts Medical Society, Publishing Director of the New England Journal of Medicine, and Director of Medical Journals at the American Academy of Pediatrics. Opinions on social media or blogs are his own.


6 Thoughts on "More Data, More Problems — Lessons from the Limitations of Google Flu Trends"

Well said, Kent. There seem to be two hype-driven confusions with big data. First is the idea that finding patterns is what science does, which we might call the inductive fallacy. Finding patterns is important but the real job of science is to explain them, not just find them. Explanation normally involves discovering and understanding an underlying mechanism, which requires a lot of creativity. Statistics do not explain.

Then, as you point out, there is the false assumption that all data is equally accurate. If data was not collected for the purpose of finding the pattern in question then it may be highly inaccurate for that purpose. Call this the dirty data problem. Science often involves the progressive refinement of data collection in order to focus on the pattern in question. So once again big data might provide a starting point for science but that is all.

In short, big data is not the new science that some people claim.

Yes, but. It has always been my understanding that big data is not intended to take the place of traditional data; it is intended to reveal general patterns that may help to generate a hypothesis in the absence of one. When such hypotheses are tested, they may well turn out to be wrong but lead us in new directions that wouldn’t otherwise have revealed themselves.

Then you have what I would consider to be a more cautious and appropriate understanding of big data.

However, I don’t think that’s what some proponents hope. They hope that by analyzing Twitter, Google searches, and other large sets of data exhaust, they can substitute big data solutions for real polling, real surveys, real monitoring, real epidemiology, and so forth.

If “big data” were widely believed to be merely supplementary and a source of some possible new hypotheses, then I think the buzz around it would be much quieter.

Indeed, my impression is that big data is frequently hyped as revolutionary, a new science, etc. Perhaps we need some data on this.

The core problem with big data is just that – there is so much data that many of the ‘interesting patterns’ you see are statistical flukes or effects so weak that they can only be detected with vast datasets. By contrast, if you can find strong evidence for some real world phenomenon with a few hundred datapoints you’re onto something – only a relatively strong (i.e. biologically interesting) effect would be detectable with that sample size.
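The fluke problem is easy to demonstrate with a small simulation. The sketch below is purely illustrative and uses made-up random numbers, not flu or search data: it generates one noise "outcome" series and 2,000 noise "predictor" series, then counts how many predictors correlate with the outcome strongly enough to look significant. The function names, the seed, and the sizes are assumptions chosen for the example. The 0.279 cutoff is the approximate two-sided 5% critical value of Pearson's r for 50 points.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_flukes(n_points=50, n_candidates=2000, threshold=0.279, seed=0):
    """Correlate many pure-noise predictors against one pure-noise outcome.

    Returns (strongest |r| found, number of predictors with |r| > threshold).
    Every "hit" is a fluke by construction, since nothing here is related.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_points)]
    strengths = []
    for _ in range(n_candidates):
        predictor = [rng.gauss(0, 1) for _ in range(n_points)]
        strengths.append(abs(pearson_r(predictor, outcome)))
    hits = sum(1 for r in strengths if r > threshold)
    return max(strengths), hits


best, hits = count_flukes()
print(f"strongest spurious |r|: {best:.2f}")
print(f"noise predictors passing the 5% cutoff: {hits} of 2000")
```

Roughly one noise predictor in twenty should clear the cutoff by chance alone, so the more candidate signals a data set offers, the more impressive-looking flukes it will contain.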
