“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
– Sherlock Holmes, from The Adventure of the Copper Beeches, 1892
Well over a century after Sir Arthur Conan Doyle’s famous detective lamented his lack of data, we live amidst a superabundance: Big Data, the Quantified Self, Evidence-Based Medicine, Article Level Metrics, and Fourth Paradigm Science to name but a few of the data-related buzzwords that are (over) used today. Which is all to say, we sure do like to measure things. And for the most part all this measuring is a good thing. I’ll take medicine based on evidence rather than anecdote. I’m glad we have computers capable of processing the vast and complex datasets associated with climate modeling and particle physics. And I do like to know how far and how fast I go when I get on my bike, except at the beginning of the season when that information is typically a bit demoralizing.
Businesses are using more data than ever to inform decision making, though the truly large Big Data in the business world is limited to companies like Google, Facebook, Amazon and the like that have largely online products and services with user bases in the hundreds of millions. When you try to track all the ways those hundreds of millions of users interact with your products, services, and each other, it creates a lot of data. But as Richard Padley notes in a thoughtful recent article on the hype and substance of Big Data in STM and scholarly publishing, most businesses (and certainly most publishers) do not have to contend with the scale and complexity of truly Big Data.
While the technical challenges may be less daunting with smaller data sets, there remain challenges in interpreting data and in using it to make informed decisions. Perhaps the most daunting challenge is in understanding the limitations of the dataset: What is being measured and, just as importantly, what is not being measured? What inferences and conclusions can be drawn and what is mere conjecture? Where are the bricks and mortar solid and where does the foundation give way beneath our feet?
And while many organizations have created the position of Data Scientist to help answer these questions, I sometimes think that “Data Detective” may be the better term. In homage to Doyle, who gave us the greatest of detectives, I explore these issues through three cases, demonstrating how slippery and confounding data can be.
1. The Case of the Missing Mediums
My household was recently selected to become a “Nielsen Household,” an experience that I found to be illuminating. Nielsen, of course, are the people that produce the television ratings in the United States. If you hear that a television show was “the most watched on Thursday night” or you hear that Sochi’s was the “most-watched Opening Ceremony for a non-live Winter Games since the 1994 Lillehammer Games”– those data come from Nielsen. Or, rather, from people like me that tell Nielsen what we watched.
There are two types of Nielsen households: There are diary households that fill out a simple, pencil and paper booklet, self-reporting who watched what and when (see image above); and there are meter households, where a device connected to one’s television records the household’s television viewings and relays them digitally to Nielsen for analysis. Mine was the former sort, self-reporting via paper and pencil.
When we agreed to participate, we had no idea what, precisely, we would be recording—and what we would not be recording. But we agreed to participate as we thought it would be nice to be able to give our favorite television shows a big, Nielsen-weighted “like.” Upon receiving the diary, however, it rapidly became clear that our television viewing is largely incompatible with Nielsen’s system of measurement.
Nielsen’s diary (and metering system, for that matter) captures only television shows (broadcast or cable) viewed either in real-time or in “near-time” (within the week viewing window captured by the diary) via DVR recording. Nielsen excludes from the diary any shows watched via DVR after the week viewing window. So, for example, if I record a full season of the BBC’s brilliant new Sherlock series on my DVR and wait to watch it in a single binge weekend next month, that would not be captured by Nielsen. Also excluded are any shows watched via OnDemand (or similar cable company offerings). So if I record a show and watch it on my DVR the next day, it counts as far as Nielsen is concerned; if Comcast records it for me and I watch it via their OnDemand streaming service in the same timeframe, it does not count.
Similarly baffling, any shows watched via an app, like HBO Go or the new (and wonderful) PBS app on Apple TV, are not captured. So if I watch Charlie Rose on my local PBS affiliate, Nielsen counts it. If I watch the same show on the same night via the PBS app, they don’t count it. Any shows watched via Netflix, Hulu, Amazon, iTunes or other streaming services are likewise excluded from Nielsen’s tallies. House of Cards? Forgetaboutit. Also excluded are shows watched on a device other than a television (such as laptops or tablets). This includes shows watched online not only via sites like YouTube but also the websites of traditional media outlets.
Due to the omission of these viewing mediums, I calculate that Nielsen misses well over 90% of our television viewing. The only things we watch via live or near-time TV are news (and much less so with the recent launch of the PBS app that includes The News Hour and Charlie Rose) and sports (we are moderate sports viewers – usually only watching the occasional big game or special event, such as the Olympics). Everything else we watch via apps such as PBS or HBO Go, OnDemand, or various streaming services. I doubt we are unique in this regard.
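For readers who want to see the arithmetic behind a claim like this, here is a minimal sketch in Python. The weekly hours are invented purely for illustration (they are not actual diary figures), and the channel names are my own labels; the only thing taken from the discussion above is which kind of viewing the diary records.

```python
# Hypothetical weekly viewing hours by channel (illustrative numbers only).
weekly_hours = {
    "live_or_near_time_tv": 1.0,   # broadcast/cable, live or DVR within the week
    "on_demand": 3.0,              # cable company OnDemand
    "apps": 5.0,                   # e.g. network apps on a set-top device
    "streaming_services": 6.0,     # subscription streaming
}

# The only channel the diary records, per the description above.
MEASURED_CHANNELS = {"live_or_near_time_tv"}


def coverage(hours_by_channel, measured_channels):
    """Return the fraction of total viewing hours that fall in measured channels."""
    total = sum(hours_by_channel.values())
    if total == 0:
        raise ValueError("no viewing hours recorded")
    measured = sum(
        hours for channel, hours in hours_by_channel.items()
        if channel in measured_channels
    )
    return measured / total


share = coverage(weekly_hours, MEASURED_CHANNELS)
print(f"measured: {share:.1%}, missed: {1 - share:.1%}")
```

With these made-up hours, only one hour in fifteen is visible to the diary. The point is less the specific number than the habit: before quoting a rating, work out what share of the activity the instrument can see at all.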
So when Nielsen proclaims that a given show was “the most watched show on Thursday night,” what they really should be saying is that the show was “the most watched show among those shows that were watched by people in the United States on traditional television sets, using either broadcast or cable connections, either live or in ‘near-time’ via DVR, on Thursday night.” I don’t see how they can say much of anything useful about a show’s total viewership, as their methods do not capture large (and growing) swaths of viewing mechanisms, nor do they capture time-shifting outside their viewing window, either via DVR or via OnDemand or other streaming services.
One can only assume that these data limitations are well known to the real customers of Nielsen’s data: the media buyers who purchase advertising space from network and cable television stations. Presumably they are either not buying ads in the many viewing mediums excluded by Nielsen or else are relying on other data sources for those purchases. The continued existence of the Nielsen ratings, despite the exclusion of most modern forms of television viewing, indicates that media buyers are at least reasonably satisfied with the product.
And while the media buyers may be satisfied, the Nielsen ratings (or more precisely, the misuse of the Nielsen ratings) create a significant distortion in public perception of the television market. Nielsen ratings are widely perceived, and often trumpeted by television networks (when the ratings are in their favor), as measuring the television viewing audience, when they are in fact only measuring a limited swath of that audience. So next time you think your favorite show that was just cancelled didn’t get a fair shake and simply must have a larger audience – you might just be right. The problem might not be the audience; the problem might be the network’s ability to quantify and monetize the audience.
2. The Case of Siloed Conjecture
The challenges of ascertaining a clear picture of basic audience data in television, however, are nothing compared to the trade publishing business, where a recent initiative by Hugh Howey highlights just how imperfect our knowledge of that industry is.
Hugh Howey, of course, is the author of the popular (and utterly engrossing) Silo Series. It is fitting that another writer of mysteries (of a sort) is at the center of this case. Howey is well-known, however, not only for what he writes, but for how he publishes his work. The Silo Series (which begins with Wool, a short detective story) was self-published as an ebook via Amazon’s Kindle Direct system. Howey turned the story into a franchise following the remarkable success of Wool and subsequently turned down seven-figure offers from traditional publishing houses (he eventually agreed to a deal with Simon and Schuster for print distribution only, retaining digital rights).
Howey has become an icon of the self-publishing movement and is a self-professed publishing geek. Recently, he helped to drop what can only be described as a “data bomb” on the trade publishing continent, one of the more prominent land masses in the world of publishing. This data bomb (which can be downloaded in Excel format) comes in the form of the website AuthorEarnings.com and consists of data scraped from Amazon’s website, including Amazon sales rank, Kindle ebook sales rank, average rating, and various other metadata including the publisher, title, and author (though the title and author have been redacted by Howey et al.) for ebooks sold via the online retailer. Combined with additional data sent to Howey et al. from various authors over time, they claim to be able to estimate book sales revenue (from Amazon), with reasonable accuracy, if given a book’s Amazon and Kindle sales ranks along with the book’s price. These data have been supplied in two dumps so far, one covering the top 7,000 ebook titles sold via Amazon and the second the top 54,000. A third report covering the top 5,400 ebooks sold via Barnes & Noble.com was just released this week.
Howey’s data project stems from frustration with a lack of accurate data regarding ebook sales. He writes:
You may have heard from other reports that e-books account for roughly 25% of overall book sales. But this figure is based only on sales reported by major publishers. E-book distributors like Amazon, Barnes & Noble, Kobo, the iBookstore, and Google Play don’t reveal their sales data. That means that self-published e-books are not counted in that 25%.
Neither are small presses, e-only presses, or Amazon’s publishing imprints. This would be like the Cookie Council seeking a report on global cookie sales and polling a handful of Girl Scout troops for the answer—then announcing that 25% of worldwide cookie sales are Thin Mints. But this is wrong. They’re just looking at Girl Scout cookies, and even then only a handful of troops. Every pronouncement about e-book adoption is flawed for the same reason. It’s looking at only a small corner of a much bigger picture. (It’s worth noting that our own report is also limited in that it’s looking only at Amazon—chosen for being the largest book retailer in the world—but we acknowledge and state this limitation, and we plan on releasing broader reports in the future.)
The data from Howey et al. – especially with the expansion to include Barnes & Noble – is a welcome addition to the Swiss cheese of data available regarding book sales. However, as Howey himself notes, it also has limitations. Unfortunately, Howey leaps past those limitations to draw a number of unsupported (based on the data he provides) comparisons between the revenues of independently published authors and those published by either traditional publishers or directly by Amazon.
Other commentators, most notably Mike Shatzkin of The Shatzkin Files and Sunita of Dear Author, have already pointed out the flaws in Howey’s analysis, and their critiques are well worth reading in their entirety (and, in the case of Shatzkin’s article, Howey’s response in the comments). The most salient critiques, in my humble opinion, from Shatzkin and Sunita include:
- The data from Howey et al. are based on a single day of sales activity at Amazon and then extrapolated to an annual basis – a practice that is both conceptually and statistically problematic.
- Howey does not factor in the advances received by authors published by traditional publishers, which can account for most of such authors’ revenues.
- The data, as Howey himself notes, are limited to ebooks only and only those sold by Amazon and Barnes and Noble.
- Howey’s analysis only factors in top line revenues, whereas self-published authors have to do the work of both a publisher and an author.
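To make the first critique concrete, here is a small Python sketch of why annualizing a single day of sales is shaky. Everything in it is hypothetical: the daily sales curve (a launch spike that decays to a floor) is invented, and the `annualize` helper is simply "one day times 365," which is my simplification and not a description of the method Howey et al. actually use.

```python
def annualize(one_day_units, days=365):
    """Naive extrapolation: assume every day of the year looks like the sampled day."""
    return one_day_units * days


# Hypothetical daily unit sales for one title: a launch spike of 400 units
# decaying 3% per day, with a floor of 5 units per day.
daily_units = [max(5, round(400 * 0.97 ** day)) for day in range(365)]

# What the title "really" sold over the year under this invented curve.
actual_annual_units = sum(daily_units)

# The annual estimate you would get by sampling a single day at three points.
estimates = {day: annualize(daily_units[day]) for day in (0, 30, 180)}

print("actual annual units:", actual_annual_units)
for day, estimate in estimates.items():
    ratio = estimate / actual_annual_units
    print(f"sampled on day {day:3d}: estimate {estimate:6d} ({ratio:.2f}x actual)")
```

Depending on whether the snapshot lands on launch day or six months later, the same title is badly overestimated or badly underestimated. Averaging over many titles softens this, but only if the sampled day is not systematically unusual for one group of books relative to another.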
Much as Nielsen’s data is often used to make a claim it does not support (overall viewing of a show) without significant qualification, Howey draws conclusions as to author revenues that are not supported by the data he presents. It is entirely possible that Howey’s claims are directionally accurate, and hopefully additional data will be collected from other sources (including a full year of Amazon and Barnes & Noble data from the AuthorEarnings.com crawler) that will provide a (more) sound basis for the question he seeks to answer: For which authors and which types of books does it make sense to publish independently versus via an established publisher? However, at this juncture we are left with something of a cliffhanger and must keep this case open.
3. The Case of the Dumbfounding Downloads
Howey’s data sent me searching for a case closer to the landmass of professional and scholarly publishing that I typically inhabit and survey (I picture this landmass as more of a mist-shrouded island of numerous ecosystems as opposed to a full continent – the New Zealand of the publishing world perhaps). In casting about, I uncovered several open cases worthy of exploration, though the one I kept coming back to is the Case of the Dumbfounding Downloads (The Case of the Confounding Citations was a close second).
The professional and scholarly community, due to the efforts of industry-supported organizations like NISO, COUNTER, CrossRef, and, most recently, ORCID, has developed a great deal of clarity around many of its metrics. Even proprietary metrics like the Impact Factor are (mostly) well understood, albeit misused upon occasion. It is one of the most seemingly straightforward metrics, however, that poses some of the greatest challenges in use: article usage.
Following PLOS’s lead, an increasing number of journals (including eLife and Springer’s BioMed Central titles among many others) are providing metrics on article usage. PLOS prudently and accurately calls these “article views” and is careful to qualify where the article view took place, whether on the journal website or via PubMed Central. Moreover, they qualify views by format (HTML, PDF, and XML) and specify that the PDF and XML versions were “downloaded” as opposed to “viewed.”
PLOS should be commended for pioneering the use of article level metrics in general and especially this usage information. As an author, I absolutely want to know how many people are reading my article. (In fact, I am probably checking the usage of this Scholarly Kitchen article right now… and now… and again now. Don’t be bashful, hit “refresh” a few times while you read this post). And as a reader, I find these data interesting: “Why are so many people reading this paper?”
And while I am a wholehearted supporter of article level metrics, it occurs to me that there is at least as much information about article usage that is not captured, as is captured, by article level metrics.
For example, as PLOS accurately notes, article level metrics only include downloads from the publisher’s website (and in the case of PLOS, also from PubMed Central). They do not typically include:
- Downloads via aggregators, such as Ovid, ProQuest, or EBSCO.
- Readership of abstracts via indexing systems such as PubMed, Scopus, and Web of Science.
- Downloads from institutional repositories.
- Downloads from the author’s own website.
- Reading via coursepacks.
- Readings via document delivery services such as Reprints Desk or Infotrieve.
- Article rentals via DeepDyve or ReadCube.
- Translations or republication in other venues.
- Copies obtained via sharing – either by requesting a PDF from the author, reading via formal or informal journal clubs, using the Twitter “#icanhasPDF” hashtag, or via sharing with a colleague.
- Reading of individual print subscriptions (which is still a significant method of reading for many clinical medical journals).
- Reading of print library copies.
- Reading of the article in a mobile app—either the journal’s own app (which may or may not be included in article level metrics depending on the publisher) or that of a third party like Zinio or Kindle.
There are probably many other uncounted viewings I am missing. And on the other side of the equation, article level metrics may or may not be inflated by inclusion of downloads by robots crawling on behalf of search engines and other indexing systems, further complicating matters (PLOS is quite transparent about what is excluded and posts a list here; in other cases, however, it is hard to tell).
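As a rough illustration of the robot problem, the following Python sketch counts downloads from a tiny, made-up access log with and without robot filtering. The log records, the article identifiers, and the three-word marker list are all hypothetical; real publishers rely on much longer, maintained exclusion lists, and this is not a description of how PLOS or any other publisher actually does it.

```python
from collections import Counter

# Hypothetical, deliberately simplistic markers for robot user agents.
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_robot(user_agent):
    """Flag a request as robotic if its user agent contains a known marker."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in ROBOT_MARKERS)


# Made-up access log: (article id, user agent) pairs.
access_log = [
    ("article-1", "Mozilla/5.0 (Windows NT 10.0)"),
    ("article-1", "ExampleBot/1.0"),
    ("article-1", "Mozilla/5.0 (Macintosh)"),
    ("article-2", "example-crawler/2.1"),
    ("article-2", "Mozilla/5.0 (X11; Linux x86_64)"),
]


def count_downloads(records, exclude_robots):
    """Tally downloads per article, optionally skipping robot traffic."""
    counts = Counter()
    for article_id, user_agent in records:
        if exclude_robots and is_robot(user_agent):
            continue
        counts[article_id] += 1
    return counts


raw_counts = count_downloads(access_log, exclude_robots=False)
filtered_counts = count_downloads(access_log, exclude_robots=True)
print("raw:     ", dict(raw_counts))
print("filtered:", dict(filtered_counts))
```

Two publishers reporting "downloads" for the same article could differ simply because one applies a filter like this and the other does not, which is why knowing what has been excluded matters as much as the number itself.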
Most importantly, article downloads do not equal reading. An article could be downloaded but never read. Or maybe it was only lightly skimmed. Alternately, an article could be downloaded and subsequently read several times, copiously annotated, and discussed at length over beers in the back of a Thai restaurant just off campus.
Moreover, even if we knew whether an article was read, that would not necessarily tell us how useful the paper was to the reader. To go back to the medical example, the value of a paper is ultimately measured by whether it causes a change in practice that leads to improved care—an assessment that may take years to make.
And so while I am an ardent fan of article level metrics, I am also on guard against their misuse. They tell us a good deal about a couple of particular use cases, but in some cases they may leave out as much, if not more, article reading than they include. Much as Nielsen can tell us only about viewership of live or near-time television, in the US, on traditional television sets, using either broadcast or cable connections, article level usage metrics can only tell us about views and downloads via the journal website and (in some cases) PubMed Central. Unlike the world of television, however, publishers are not yet issuing press releases telling us what the “most viewed article on Thursday night” was (it is the small things I’m grateful for).
We tend to measure what we can. The problem is that we are increasingly called to base business decisions on data. This is well and good, and decisions based on data are likely to be more sound than those based on opinion or conjecture. However, not all things can be measured, and often things that either cannot be, or are not, measured are just as important as (or more important than) those things that are measured. When making business decisions based on data, it is imperative to ascertain not only exactly what is being measured but also what is not being measured. Elementary? Perhaps, but it is sometimes necessary to test your foundations before building an edifice.
7 Thoughts on "Data Detectives: Investigating What is, and What is Not, Measured"
The quote that immediately comes to mind, from William Bruce Cameron’s 1963 text “Informal Sociology: A Casual Introduction to Sociological Thinking” (often wrongly attributed to Albert Einstein):
“It would be nice if all of the data which sociologists require could be enumerated because then we could run them through IBM machines and draw charts as the economists do. However, not everything that can be counted counts, and not everything that counts can be counted.”
“publishers are not yet issuing press releases telling us what the ‘most viewed article on Thursday night’ was.”
They’re getting close…
Nice post! I find that arguments about what the evidence shows become solipsistic when statements of fact are not preceded by a statement of limitation, such as “While we are limited to measuring readership by counting downloads from the publisher’s site, the data strongly suggest that…” Without that limitation clause, the author invites a barrage of attacks that point out every single source of limitation, however insignificant.
The real challenge for those relying on limited data sources (all of us) is to make a convincing argument that in spite of the limitations of our data source, we can still provide some new and valuable information to support those making business and editorial decisions.
This careful language of limitation is one of the things that often separates scientific writing from the brute force assertions typical of the rest of the world. Of course it drives policy makers nuts. The hard part is estimating the degree of the limitation, which is actually a scientific issue in itself. For example, how good are downloads as a proxy for total usage? In some science fields the accuracy of proxies is a major research issue.
University presses have typically used sales data on previously published monographs to make decisions about print runs for new books. (I’m sure publishers in other sectors do the same.) The more such data are broken down by field or even subfield, and the more they are aggregated by different periods of time (sales in the first year, sales in the first three years, sales in the first five years, etc.), the more reliable they become as bases for making such decisions. The AAUP has also aggregated sales data from its member presses by size of press to allow those presses to compare their overall performance with that of their peers. Changing technology has had some effect on this practice, as the advent of POD and SRDP has lessened the need for presses to project total expected sales in determining what a first print run should be. Digital printing has allowed them to reduce their risks in inventory, which has also improved their cash flow.