
Online stores like Amazon, online music sources like Spotify, Last.fm, and Pandora, movie sites like Netflix, and book sites like LibraryThing and GoodReads all use collaborative filtering (aggregated data with no personally identifiable information) to generate recommendations and “you might also like” suggestions.
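For readers curious about the mechanics, here is a minimal sketch of item-based collaborative filtering, with invented customers and items; no real site's algorithm is this simple, but the principle is the same:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories; a real system aggregates millions of these.
purchases = {
    "alice": {"book_a", "book_b", "album_c"},
    "bob":   {"book_a", "album_c", "movie_d"},
    "carol": {"book_b", "album_c", "movie_d"},
}

# Count how often each pair of items appears in the same customer's history.
co_counts = defaultdict(lambda: defaultdict(int))
for items in purchases.values():
    for x, y in combinations(sorted(items), 2):
        co_counts[x][y] += 1
        co_counts[y][x] += 1

def related_items(item, top_n=3):
    """A 'customers who bought this also bought' list: pure aggregates, no names."""
    ranked = sorted(co_counts[item].items(), key=lambda kv: -kv[1])
    return [other for other, _ in ranked[:top_n]]

print(related_items("book_a"))  # e.g. ['album_c', 'book_b', 'movie_d']
```

Notice that the output contains no customer names at all, which is exactly why such lists have been treated as safe to publish.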

A recent paper presented at an IEEE conference by authors from Princeton, the University of Texas at Austin, and Stanford is a real eye-opener. In it, the authors demonstrate that:

Recommender systems based on collaborative filtering . . . may leak information about the behavior of individual users to an attacker with limited auxiliary information. . . . Our work concretely demonstrates the risk posed by data aggregated from private records and undermines the widely accepted dichotomy between “personally identifiable” individual records and “safe,” large-scale, aggregate statistics.

The researchers constructed their “attacks” carefully: they used only public information passively observable on any of the online systems studied. They did not create fake customers or enter purchases or ratings into these systems. Their attacks relied only on indirect access to customer records: “the records are fed into a complex collaborative filtering algorithm and the attacker’s view is limited to the resulting outputs.” They claim this approach, based solely on what anyone else can see, sets their findings apart.

While the algorithms developed to uncover information lurking in large collaborative filtering data sets are similar to collaborative filters themselves, the researchers note that they are geared not to predict future events but to infer past ones. This makes them more accurate than Bayesian predictors: it is the difference between making educated guesses and observing the effects of actual transactions.

There are layers of information within collaborative filtering systems: the raw information itself, the changes in a user’s information over time, and a user’s auxiliary information (star ratings, likes, tweets, and shares). By combining all of this information, a relatively robust set of inferences can emerge: rating an item likely means you bought it; deviations from recommendations are telling; posting an item to your Facebook page signals a stronger affinity with it; and so forth.
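The “changes over time” layer is worth dwelling on, since it is what the attacks watch. A minimal sketch, with hypothetical list snapshots, of how one might detect items appearing on or climbing up a public related-items list:

```python
def rank_deltas(before, after):
    """Compare two snapshots of a public related-items list. Items that
    newly appear or climb between snapshots form the 'changes over time'
    layer described above. The lists here are hypothetical."""
    pos_before = {item: i for i, item in enumerate(before)}
    deltas = {}
    for i, item in enumerate(after):
        old = pos_before.get(item)
        deltas[item] = "new" if old is None else old - i  # positive = climbed
    return deltas

monday  = ["album_x", "book_y", "movie_z"]
tuesday = ["album_x", "movie_z", "book_y", "album_q"]
print(rank_deltas(monday, tuesday))
# {'album_x': 0, 'movie_z': 1, 'book_y': -1, 'album_q': 'new'}
```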

Then, math, math, and more math. I won’t pretend to understand it all. One tau symbol on the other side of a union symbol, and I’m pretty much done. But the logic I could derive in the midst of the mystery math seemed pretty solid.

As one of the authors explains in a blog post:

Consider a user Alice who’s made numerous purchases, some of which she has reviewed publicly. Now she makes a new purchase which she considers sensitive. But this new item, because of her purchasing it, has a nonzero probability of entering the “related items” list of each of the items she has purchased in the past, including the ones she has reviewed publicly. And even if it is already in the related-items list of some of those items, it might improve its rank on those lists because of her purchase. By aggregating dozens or hundreds of these observations, the attacker has a chance of inferring that Alice purchased something, as well as the identity of the item she purchased.
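In other words, each public related-items list leaks a little, and aggregation does the rest. A toy illustration of that aggregation step (the delta data and threshold are hypothetical; the paper’s actual scoring is far more careful about popularity and noise):

```python
from collections import Counter

def infer_candidates(list_deltas, threshold=2):
    """Aggregate changes across the related-items lists attached to a target's
    publicly reviewed items. An item that newly appears on, or climbs up, many
    of those lists at roughly the same time becomes a candidate inference for
    the target's latest, unobserved purchase. Purely illustrative."""
    votes = Counter()
    for deltas in list_deltas:  # one dict per monitored list, e.g. from rank_deltas()
        for item, delta in deltas.items():
            if delta == "new" or (isinstance(delta, int) and delta > 0):
                votes[item] += 1
    return [item for item, v in votes.most_common() if v >= threshold]

# Hypothetical deltas observed on the lists of three items Alice reviewed publicly:
observed = [
    {"album_q": "new", "movie_z": 1},
    {"album_q": "new", "book_y": -1},
    {"album_q": 2},
]
print(infer_candidates(observed))  # ['album_q'] -- the likely new purchase
```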

The section of the paper on Amazon’s collaborative filtering is especially interesting, since the researchers had a hard time testing their inferences: they lacked a “ground-truth oracle,” that is, true data about what really happened. In three cases outlined, their inference engine suggested that each user had purchased a certain item particular to them. Within a month of each inference, every user had reviewed the item in question, a seemingly solid proxy for having purchased it. In one case, it was an older R&B album; in another, a gay-themed movie; in a third, an action-fantasy movie. While this reads like prediction, remember that these algorithms infer past events; the later reviews simply served as confirmation. It does, however, suggest that someone outside of Amazon, using only public data, could have determined with some accuracy what each of these people had purchased, some of their personal traits, or a little of both.

The author writing the blog post also notes a few caveats to quell the panic these conclusions might instill:

It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different. . . . We also found that users of larger sites are much safer, because the statistical aggregates are computed from a larger set of users. . . . It underscores the fact that modern systems have vast “surfaces” for attacks on privacy, making it difficult to protect fine-grained information about their users. Unintentional leaks of private information are akin to side-channel attacks: it is very hard to enumerate all aspects of the system’s publicly observable behavior which may reveal information about individual users.

Returning to the language I prefer, there are many great phrases in this paper: sybil history, ground-truth oracle, and AUX tracks. It’s clear these researchers loved this study and are excited by what they’ve unearthed.

We are all publishers now, but in a new, data-rich sense. What we need to recognize is that we’re all publishing data nearly all the time, without knowing it, and perhaps in ways that smart people with strong algorithms can harvest and use — either to help us, or to exploit us. And the traditional dichotomy between personal information and anonymous, aggregated data . . . well, that distinction may be blurring.

Kent Anderson

Kent Anderson is the CEO of RedLink and RedLink Network, a past-President of SSP, and the founder of the Scholarly Kitchen. He has worked as Publisher at AAAS/Science, CEO/Publisher of JBJS, Inc., a publishing executive at the Massachusetts Medical Society, Publishing Director of the New England Journal of Medicine, and Director of Medical Journals at the American Academy of Pediatrics. Opinions on social media or blogs are his own.

Discussion

5 Thoughts on "Patterns In and Across Aggregated Data — Is ‘Anonymous’ Collaborative Filtering Really Safe?"

To quote the artist Banksy, “I don’t know why people are so keen to put the details of their private life in public. They forget that invisibility is a superpower.”

And to quote the artist David Kremers, “Privacy is the new luxury.”

In the future, everyone will be anonymous for 15 minutes. (Also Banksy, I believe.)

It may as well be a law* that the web collects the properties of its users. A further law could be that, given the opportunity, the web tends to expose those properties to others. Once we understand this, we’ll have to figure out how to live under those conditions. One might want to lobby for human rights legislation to extend into this area.

*a law of nature, not a legal one.

I still think this is the missing business model for social media. Use it for free and get sold out to anyone interested in paying. Or pay a nominal monthly fee and gain control over how your information is released. Hence the Kremers quote above. I’d be much more likely to use things like Facebook if offered a better level of control, and I’d be willing to pay for it.

I suspect this doesn’t work because the whole dataset stops working if the supernodes of the network go dark, so to speak. The reason I think this is that Facebook only makes a couple of bucks per user per year; one would think that a freemium model with a 5% conversion rate would significantly alter that revenue stream.

Philosophically speaking, it’s the onward use of the data outside of the systems that concerns me. There are all sorts of protections for citizens in the physical world that need to be thought about in the digital one.

Probably true: those whom advertisers would most want to target are the ones who can afford to opt out of targeting. Though if you have 750M users, charge $20 per year for premium service, and get a 10% buy-in, that’s $1.5B in revenue.
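A quick check of that arithmetic, using only the figures cited in the comment above:

```python
users = 750_000_000   # user base cited in the comment
fee = 20              # dollars per year for the premium tier
buy_in = 0.10         # 10% conversion to the paid tier
print(f"${users * buy_in * fee / 1e9:.1f}B")  # $1.5B
```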
