Outliers and the Importance of Anonymity: Usage Data Versus Snooping on Your Customers

Late last year Nature Publishing Group embarked on an experiment to test the idea of allowing users and selected media outlets to share articles over email and social media in a controlled way. Predictably, this unprecedented action in STM generated a fair amount of interest and some debate. One of the more interesting topics was the link between digital content delivery and data.

Some commentators have asked whether software like ReadCube and Mendeley are really about gathering individual data on users and using it for some nefarious purpose. I can understand why some people might have that concern, particularly in light of revelations regarding the use and misuse of big data by various governments. In this post, I will address the difference between gathering personal data to draw conclusions about individuals (which you could call snooping) and anonymous, aggregated data, gathered for commercial reasons, to allow companies to understand and better serve the needs of their market. While some might accuse me of drawing a pedantic distinction between use-cases, I hope to explain how the difference is extremely important.

It’s been about 18 months since Edward Snowden first began to release details of the extent to which the NSA is able to mine data from Internet giants like Google and Facebook. In June 2013, just days after the revelations were made, I attended a keynote address at the NASIG conference by Siva Vaidhyanathan, Chair of the Department of Media Studies at the University of Virginia and author of The Googlization of Everything (And Why We Should Worry). He likened the activity of the NSA to a secret court that tries crimes in absentia based on circumstantial evidence that the government finds on the internet, punishing people by placing them on a ‘list’ and reducing their personal freedom, without the right to a defense or appeal.

While that description may be dramatic, I have some sympathy for it because evidence suggests that I’m on one such list. Whenever I travel to the US I’m not able to check in online and usually have to check in again for each connecting flight. I also have about a fifty percent chance of being selected for secondary screening by the immigration service whenever I land. I’m taken to a waiting room and, after some length of time (usually less than 30 minutes, though it once took three hours), I’m thanked for my patience and my passport is returned to me. Although this is a comparatively minor inconvenience, it’s a little frustrating that it keeps happening.

The danger here lies in the fact that using computer algorithms to identify individuals will inevitably lead to false positives. Simply put, the number of ordinary people is so large compared to the number of criminals or terrorists that, even with an extremely low false positive rate, the number of innocent people on a computer-generated list is still likely to be orders of magnitude bigger than the number of legitimate suspects. Aggregating and anonymizing data not only removes the risk of unfairly or inaccurately targeting individuals, it also delivers information that is often more commercially useful.
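
To put rough numbers on the false-positive problem, here is a minimal sketch in Python. Every figure in it (the population size, the number of genuine suspects, and both error rates) is an assumption invented for illustration, not a real statistic, and `flagged_counts` is a hypothetical helper name.

```python
# All numbers below are hypothetical and chosen only to illustrate the argument.
POPULATION = 300_000_000      # people whose data is screened (assumed)
TRUE_SUSPECTS = 3_000         # genuine persons of interest (assumed)
FALSE_POSITIVE_RATE = 0.001   # 0.1% of innocent people wrongly flagged (assumed)
TRUE_POSITIVE_RATE = 0.99     # 99% of genuine suspects correctly flagged (assumed)


def flagged_counts(population, true_suspects, false_positive_rate, true_positive_rate):
    """Return (false positives, true positives, share of the list that is genuine)."""
    innocent = population - true_suspects
    false_positives = innocent * false_positive_rate
    true_positives = true_suspects * true_positive_rate
    genuine_share = true_positives / (true_positives + false_positives)
    return false_positives, true_positives, genuine_share


false_pos, true_pos, genuine_share = flagged_counts(
    POPULATION, TRUE_SUSPECTS, FALSE_POSITIVE_RATE, TRUE_POSITIVE_RATE
)

print(f"Innocent people flagged: {false_pos:,.0f}")
print(f"Genuine suspects flagged: {true_pos:,.0f}")
print(f"Share of the list that is genuine: {genuine_share:.2%}")
```

Under these assumed inputs, the list holds roughly a hundred innocent people for every genuine suspect, even though the algorithm is wrong about an innocent person only one time in a thousand.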

Take Google’s GPS navigation app, for example. By aggregating the data from everybody’s smartphone, Google is able to spot traffic jams when they happen and predict future traffic loads. By passing this aggregated data back to individual users, Google lets their smartphones re-route them to make better use of the road network, saving both time and fuel. Importantly, when identifying group behavior in this way, outliers (like the person who stopped to buy milk and forgot to turn their GPS off) are small in number and effectively ignored.

Anonymous, aggregated usage metrics for journals is another such example of a data application that is non-threatening, and useful to both publishers and end users. Take the monitoring of turn-aways from abstract pages, for instance. Publishers can use this data not only for lead generation but also to identify emerging markets, enabling them to prepare and develop services that are tailored to new customer needs. Taking this idea a step further, statistical analysis of journal usage and the identification of correlations in readership could potentially be used to identify emerging fields of scholarship. Without invading anyone’s privacy, the use of data here adds value for the academic community and assists in the advancement of scholarship.
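
As a rough illustration of what aggregating turn-away data might look like, here is a minimal Python sketch. The event records, the field layout, the `aggregate_turnaways` function, and the `min_count` threshold are all hypothetical. In particular, suppressing small cells is a precaution I have added for the example, not something any particular publisher is known to do.

```python
from collections import Counter

# Hypothetical raw turn-away events: (user_id, country, journal).
raw_events = [
    ("u1", "BR", "Journal A"),
    ("u2", "BR", "Journal A"),
    ("u3", "BR", "Journal A"),
    ("u1", "BR", "Journal B"),
    ("u4", "IN", "Journal A"),
    ("u5", "IN", "Journal A"),
    ("u6", "KE", "Journal B"),
]


def aggregate_turnaways(events, min_count=2):
    """Count turn-aways per (country, journal) and drop the user identifiers.

    Cells with fewer than min_count events are suppressed, because a count
    of one can still point to a single person (an assumed safeguard).
    """
    counts = Counter((country, journal) for _user_id, country, journal in events)
    return {cell: n for cell, n in counts.items() if n >= min_count}


summary = aggregate_turnaways(raw_events)
for (country, journal), n in sorted(summary.items()):
    print(country, journal, n)
```

Only the summary needs to be kept. Once the raw events are discarded, there is no per-user record left to hand over or to misuse.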

There are many other ways in which anonymous aggregated data can be used to benefit both publishers and the academic community. Analysis of altmetrics, such as media coverage and mentions in policy documents, can supply a different but no less rich set of insights. By monitoring trends in what’s being discussed inside and outside of academia, editorial and public relations professionals can keep in touch with what various communities are interested in. For publishers that have public engagement or education as a part of their mission, as many learned societies do, this data can offer readouts on how well the public or policy makers are engaged with the content that they publish.

There is a growing trend towards exchanging data for functionality in many industries, and publishing is no exception. At the same time, there is a need to discuss the impact and ethics of internet-derived big data. The problem with the current discourse is that it conflates two entirely separate use cases and types of data. On the one hand, we have the creation of personal profiles without the individual’s consent; on the other, there is the identification of market trends using anonymized data. If we are to have this conversation in a meaningful and rational way as a society, we need to be clear about the various types and use cases of such data and the differences between them.

As commercial ventures, publishers need to analyze the needs of their users as a whole, not the habits of individual users. There is not generally a need to maintain a link between the identity of the individual and the data being aggregated, unless it’s for a specific function, like building a social networking profile. Even if there were utility in snooping, publishers aren’t the security services and, in my experience with technology solutions, publishers are keen to comply with legal and ethical standards when it comes to data protection. So, to those worried that publishers are spying on them: fear not. They’re not snooping because doing so would make no sense.

Phill Jones

@phillbjones

Phill Jones is a co-founder of MoreBrains Consulting Cooperative. MoreBrains works in open science, research infrastructure and publishing. As part of the MoreBrains team, Phill supports a diverse range of clients from funders to communities of practice, on a broad range of strategic and operational challenges. He's worked in a variety of senior and governance roles in editorial, outreach, scientometrics, product and technology at such places as JoVE, Digital Science, and Emerald. In a former life, he was a cross-disciplinary research scientist at the UK Atomic Energy Authority and Harvard Medical School.

Discussion

15 Thoughts on "Outliers and the Importance of Anonymity: Usage Data Versus Snooping on Your Customers"

A great topic, and one that is developing right under our noses, leading me to question this thought:

“Anonymous, aggregated usage metrics for journals is another such example of a data application that is non-threatening . . .”

A lot of this post is predicated on the idea that there are two types of data — specific and anonymous. Again, I question that idea.

In a recent issue of “Science,” themed as “The End of Privacy,” there was a study showing that very few data points are needed to identify a unique individual even as they change retailers and locations (http://www.sciencemag.org/content/347/6221/536.abstract). Basically, you can track someone with just 2-3 interactions with supposedly anonymous data. Women were easier to track because they had particular shopping characteristics, and people were easier to identify as they became wealthier (because they bought more expensive things, and there are relatively fewer of those on the market). Earlier, the famous story of how researchers used two anonymized data sets (insurance rolls and voter rolls) to identify the medications the governor of Massachusetts was taking still resonates.

In short, it seems it actually takes very little data to identify a person out of a data set, and correlation is easy to accomplish. So I question whether there is really anything such as “anonymous data” and whether cross-journal data analysis using cookies, tokens, or computer fingerprinting (or all combined) can actually lead to non-anonymized usage data.

I initially wondered at the extreme valuation of Mendeley (http://scholarlykitchen.sspnet.org/2013/04/08/a-matter-of-perspective-elsevier-acquires-mendeley-or-mendeley-sells-itself-to-elsevier/), but now I get it. Libraries don’t give Elsevier the data it wants, so it bought a library full of user data, and one that is growing all the time. It’s living up to the old adage, “If something is free, you are the product.” Adding Mendeley data to its own data, and using techniques that are the norm in sophisticated e-commerce like those mentioned above, could give them quite a bit of data. And when computer fingerprinting, Mendeley usage data, and login data are combined, it’s likely pretty easy to tell who is doing what, and even who they are.

Now, does it make sense to go to this level of granularity? It might, especially if computers can do it quickly, cheaply, and easily. Let’s pretend I’m Elsevier. If I get you to login to Mendeley at work and at home, I get a lot more information than just your login if I want. I can tell what browser you’re using, the bit-depth on your monitor, your IP address, the version of OS you’re using (and what OS you’re using), and a few other things. Add this to my data mix, and when you go to a journal of mine through a supposedly anonymous proxied site license, I could backtrack through the data and cross-reference login data from Mendeley to see who you are. Suddenly, I know who you are, what institution you’re affiliated with, and so forth. Over time, I know most of the individuals using that site license. It has the potential to be pretty non-anonymous. Cross-tab this all with a nice third-party data set using 1-3 non-threatening variables (postal code, age, email, phone), and you can see a lot more.

Elsevier isn’t the only one thinking like this. We all are. Digital Science has to be, too. Getting past “anonymous” is a big issue for the industry, and we can do it now.

Is it non-threatening? Probably. But I don’t think digital products, in aggregate, are quite as anonymous as we believe, and they’re getting less anonymous all the time.

By Kent Anderson
Feb 16, 2015, 9:04 AM

How much would Pfizer pay to know what papers scientists at Novartis are reading and vice versa? There may be a new business model here…

By David Crotty
Feb 16, 2015, 10:16 AM

The business model could be a shakedown. A company could say that they WILL reveal what researchers are looking at unless you pay it to turn that feature off. I am sure the pharmaceutical companies have thought about this extensively.

By Joseph Esposito
Feb 16, 2015, 10:29 AM

Which gets back to the notion of “privacy is the new luxury.” Would you pay extra for a private version of Facebook where you could control what’s done with your information and you are never sold out to advertisers? Would you pay extra for a private subscription to a journal where your usage is not tracked and data on what you’re reading is not sold to the highest bidder?

By David Crotty
Feb 16, 2015, 10:34 AM

I think the answer to that is “Yep.” I have long wanted an ad-free NY Times, optimized for my use, not for advertisers. It would be wonderful if this option were available. But I suspect that the cost of these services would sometimes be prohibitive. I don’t wish to be cavalier about the privacy issues, though. The Snowden affair was wrenching, and I certainly don’t like or trust the people at Google.

By Joseph Esposito
Feb 16, 2015, 10:43 AM

Thanks Kent,

I’m not suggesting that it’s not possible to identify individuals from patterns of usage behavior. What I’m trying to convey is that aggregated analysis of market trends is what publishers want and need to understand.

It’s more of a question of use-case in my mind. The NSA are looking to find specific people, whereas Digital Science, for instance, is looking to understand the market better.

By Phill Jones
Feb 16, 2015, 9:52 AM

I think the problem is that, as Kent suggests, one can’t enable one use without also enabling the other. It’s not even a question of trust–I may trust Digital Science or Elsevier to leave things on an anonymous level and that may be their intentions, but all it takes is one bad apple at the company to go off program and you’re suddenly seeing a very different use of the same data (http://www.theverge.com/2014/11/18/7240215/uber-exec-casually-threatens-sarah-lacy-with-smear-campaign).

Even worse, many uses are out of the control of the company collecting the data. We know that governments constantly issue subpoenas for data from internet companies (and try to keep these permanently secret http://www.reuters.com/article/2015/02/06/us-yahoo-privacy-ruling-idUSKBN0LA23S20150206), so no matter how good your intentions, that same data can be used for other purposes.

By David Crotty
Feb 16, 2015, 10:14 AM

I agree that there are absolutely issues to think about and be careful about. What it comes down to in both cases is how much individual data a company is keeping or, as a technology vendor, keeping and passing back to the content provider. At the end of the day, once data has been aggregated and reduced, there aren’t discrete records for each user that could be used to identify an individual. I personally don’t think that individual data is particularly useful commercially (unless you’re talking about a social networking site, in which case it’s a user experience feature). Some publishers may want to know things like geographical regions or universities where usage is occurring, but that data can be supplied just as a number, rather than as a set of metadata records.

Publishers in my experience are concerned about data privacy, as this discussion shows, and what I’m suggesting is that in order to get the important value from data without invading privacy, it’s possible to reduce data to its various meaningful components and discard the individual records. You don’t have to worry about the NSA making you give them information that you don’t have.

By Phill Jones
Feb 17, 2015, 8:20 AM

Phill, thanks for the thought-provoking post on this timely issue. I wrote about what you might call personal profiles or user accounts earlier this month (scholarlykitchen.sspnet.org/2015/02/05/data-for-discovery/). I didn’t get into opt-in vs. opt-out, but I’m curious: Do you not think that offering these types of services is valuable to publishers and other information resource providers?

By Roger C. Schonfeld
Feb 16, 2015, 11:25 AM

Thanks Roger,

There are two sorts of profiles, aren’t there? There’s your Facebook profile, where you voluntarily tell Facebook a whole lot of very personal and identifiable things about yourself and your family because you want to share those things with the world. Then there’s the sort of profile that Google creates about you, which you can’t edit, see, or ask to have deleted. Google uses this for targeted advertising; the NSA use it as a way to populate their watch lists and deny me advance check-in on inbound flights to the US.

I don’t see an ethical problem surrounding the first one because it’s obvious what you’re doing when you fill the profile in and people can decide what they want to share.

By Phill Jones
Feb 17, 2015, 8:32 AM

I think I see the distinction you are drawing, Phill. But in the Facebook case, there is also a tremendous amount of usage/behavior tracking that accompanies the information intentionally shared. Both are used to provide the service and of course to target advertising. I wonder if that is a better analogy to the type of approach that is likely to emerge around scholarly profiles, hopefully in a more transparent and user-controlled way than Facebook.

By Roger C. Schonfeld
Feb 17, 2015, 12:22 PM

There might be a third type – public but compiled from online information rather than created by the user? Though one that may be able to be “claimed”? E.g., ZoomInfo (which can result in some hilarity, as that site has erroneously “inferred” from various information sources that I’m the President of the University of Illinois). If the inaccuracies there started to affect my discovery experience or, ethically, worse … I’d likely be very irritated – right now it just seems to be why I get some not-so-useful-to-me mass market mailings.

By Lisa Janicke Hinchliffe
Feb 17, 2015, 7:05 PM

Thoughtful post, Phill. While most scientific and technical publishers are likely looking at aggregated and anonymized data (at the moment – this is a fast-changing space), there is a big exception here with regard to digital advertising and medical publishing. Pharma companies (to give one example) are presently seeking to advertise to lists of specific, named individuals (meaning the pharma company comes to the publishers with a list of individuals they would like to reach). There is a lot of pressure to do so or alternatively come up with other ways to target advertisements to qualified individuals (e.g. prescribing internal medicine doctors in the US). I should note that this is not a new phenomenon, and it has long existed in print. Medical publishers have long sent different versions of their journals (confusingly called “books”) with different sets of ads to different individuals.

By Michael Clarke
Feb 16, 2015, 10:45 AM

Targeted advertising is a great point. I see that sort of thing more like pushing specific content to people based on information that they have willingly provided. Particularly, if you’re using a named list. That’s a very different use-case in my mind to reconstructing the identity of a person based on fragments of information gleaned from usage patterns and then singling them out.

By Phill Jones
Feb 17, 2015, 8:38 AM

As others have observed, useful anonymized aggregated data requires tracking individual data. I’m much more on the side of “let’s make quality, useful services” than “let’s throw away all the data and create a generic experience that no one cares that much for” so I’m glad to hear of data use that improves experience. My props in advance to the publisher/content provider that figures out how to usefully share that data back in a way that helps a subscribing library improve services for their users. No doubt this would add value for the institution on top of the value of the content itself…