Late last year Nature Publishing Group embarked on an experiment to test the idea of allowing users and selected media outlets to share articles over email and social media in a controlled way. Predictably, this unprecedented action in STM generated a fair amount of interest and some debate. One of the more interesting topics was the link between digital content delivery and data.
Some commentators have asked whether tools like ReadCube and Mendeley are really about gathering individual data on users and using it for some nefarious purpose. I can understand why some people might have that concern, particularly in light of revelations regarding the use and misuse of big data by various governments. In this post, I will address the difference between gathering personal data to draw conclusions about individuals (which you could call snooping) and anonymous, aggregated data, gathered for commercial reasons, to allow companies to understand and better serve the needs of their market. While some might accuse me of drawing a pedantic distinction between use cases, I hope to explain why the difference is extremely important.
It’s been about 18 months since Edward Snowden first began to release details of the extent to which the NSA is able to mine data from Internet giants like Google and Facebook. In June 2013, just days after the revelations were made, I attended a keynote address at the NASIG conference by Siva Vaidhyanathan, Chair of the Department of Media Studies at the University of Virginia and author of The Googlization of Everything (And Why We Should Worry). He likened the activity of the NSA to a secret court that tries crimes in absentia based on circumstantial evidence that the government finds on the internet, punishing people by placing them on a ‘list’ and reducing their personal freedom, without the right to a defense or appeal.
While that description may be dramatic, I have some sympathy for it because evidence suggests that I’m on one such list. Whenever I travel to the US I’m not able to check in online and usually have to check in again for each connecting flight. I also have about a fifty percent chance of being selected for secondary screening by the immigration service whenever I land, in which case I’m taken to a waiting room and, after some length of time (usually less than 30 minutes, though it once took three hours), I’m thanked for my patience and my passport is returned to me. Although this is a comparatively minor inconvenience, it’s a little frustrating that it keeps happening.
The danger here lies in the fact that using computer algorithms to identify individuals will inevitably lead to false positives. Simply put, the problem is that the number of ordinary people is so large compared to the number of criminals or terrorists that even with an extremely low false positive rate, the number of innocent people on a computer-generated list is still likely to be orders of magnitude bigger than the number of legitimate suspects. Aggregating and anonymizing data not only removes the risk of unfairly or inaccurately targeting individuals, it also delivers information that is often more commercially useful.
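The base-rate arithmetic behind this point is easy to illustrate. The numbers below are entirely hypothetical, chosen only to show the shape of the problem: even a screening algorithm with a 0.1% false positive rate and perfect recall produces a list dominated by innocent people.

```python
# Illustrative base-rate arithmetic (all figures hypothetical).
population = 300_000_000      # ordinary, innocent people screened
suspects = 3_000              # genuine suspects in that population
false_positive_rate = 0.001   # 0.1% of innocents wrongly flagged
true_positive_rate = 1.0      # assume every genuine suspect is caught

false_positives = (population - suspects) * false_positive_rate
true_positives = suspects * true_positive_rate

# Innocent people on the list outnumber genuine suspects ~100 to 1,
# even though the algorithm is "99.9% accurate" on innocents.
ratio = false_positives / true_positives
print(round(ratio))  # ~100
```

With these assumptions, roughly 300,000 innocent people land on the list alongside 3,000 genuine suspects; tightening the false positive rate helps, but the imbalance in the underlying populations keeps the list mostly innocent.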
Take Google’s GPS navigation app for example. By aggregating the data from everybody’s smartphone, Google is able to spot traffic jams when they happen and predict future traffic loads. By passing this aggregated data back to the individual users, their own smartphones can re-route them to make better use of the road network, saving both time and fuel. Importantly, when identifying group behavior in this way, outliers (like the person who stopped to buy milk and forgot to turn their GPS off) are small in number and effectively ignored.
Anonymous, aggregated usage metrics for journals are another example of a data application that is non-threatening, and useful to both publishers and end users. Take the monitoring of turn-aways from abstract pages, for instance. Publishers can use this data not only for lead generation but also to identify emerging markets, enabling them to prepare and develop services that are tailored to new customer needs. Taking this idea a step further, statistical analysis of journal usage and the identification of correlations in readership could potentially be used to identify emerging fields of scholarship. Without invading anyone’s privacy, the use of data here adds value for the academic community and assists in the advancement of scholarship.
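To make the turn-away example concrete, here is a minimal sketch of what such aggregation might look like. The records and field names are invented for illustration; the point is that each event retains only coarse attributes (journal, country), never a user identity, and the analysis operates purely on counts.

```python
from collections import Counter

# Hypothetical anonymized turn-away events: each record keeps only the
# journal and the requester's country -- no user identity is retained.
turn_aways = [
    {"journal": "J. Example Sci.", "country": "BR"},
    {"journal": "J. Example Sci.", "country": "BR"},
    {"journal": "J. Example Sci.", "country": "IN"},
    {"journal": "Example Lett.", "country": "BR"},
]

# Aggregate by country to spot where demand is outstripping access --
# a possible signal of an emerging market.
by_country = Counter(event["country"] for event in turn_aways)
print(by_country.most_common(1))  # [('BR', 3)]
```

Because the aggregation discards everything but the counts, no individual reader can be reconstructed from the result, which is precisely the distinction this post is drawing.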
There are many other ways in which anonymous aggregated data can be used to benefit both publishers and the academic community. Analysis of altmetrics, such as media coverage and mentions in policy documents, can supply a different, but no less rich, set of insights. By monitoring trends in what’s being discussed inside and outside of academia, editorial and public relations professionals can keep in touch with what various communities are interested in. For publishers that have public engagement or education as part of their mission, as many learned societies do, this data can offer readouts on how well the public or policy makers are engaging with the content that they publish.
There is a growing trend towards exchanging data for functionality in many industries and publishing is no exception. At the same time, there is a need to discuss the impact and ethics of internet derived big data. The problem with the current discourse is that it conflates two entirely separate use cases and types of data. On the one hand, we have the creation of personal profiles without the individual’s consent and on the other, there is the identification of market trends using anonymized data. If we are to have this conversation in a meaningful and rational way as a society, we need to be clear about the various types and use cases of such data and the differences between them.
As commercial ventures, publishers need to analyze the needs of their users as a whole, not the habits of individual users. There is not generally a need to maintain a link between the identity of the individual and the data being aggregated, unless it’s for a specific function, like building a social networking profile. Even if there were utility in snooping, publishers aren’t the security services, and in my experience with technology solutions, publishers are keen to comply with legal and ethical standards when it comes to data protection. So, to those worried that publishers are spying on them: fear not. They’re not snooping, because doing so would make no sense.