Online stores like Amazon, music services like Spotify, Last.fm, and Pandora, movie sites like Netflix, and book sites like LibraryThing and GoodReads all use collaborative filtering (aggregated data with no personally identifiable information) to generate recommendations and “you might also like” suggestions.
A recent paper presented at an IEEE conference by authors from Princeton, the University of Texas at Austin, and Stanford is a real eye-opener. In it, the authors demonstrate that:
Recommender systems based on collaborative filtering . . . may leak information about the behavior of individual users to an attacker with limited auxiliary information. . . . Our work concretely demonstrates the risk posed by data aggregated from private records and undermines the widely accepted dichotomy between “personally identifiable” individual records and “safe,” large-scale, aggregate statistics.
The researchers constructed their “attacks” carefully: they used only information passively available to any member of the public from the online systems studied. They did not create fake customers or enter purchases or ratings into these systems. Their attacks relied only on indirect access to customer records: “the records are fed into a complex collaborative filtering algorithm and the attacker’s view is limited to the resulting outputs.” They claim this approach, based solely on what anyone else can see, sets their findings apart.
While the algorithms developed to uncover information lurking in large collaborative filtering data sets are similar to collaborative filters themselves, the researchers note that they are geared not to predict future events but to infer past ones. This makes them more accurate than Bayesian predictors: it is the difference between making educated guesses and observing the effects of actual transactions.
There are layers of information within collaborative filtering systems: the raw information itself, the changes in a user’s information over time, and a user’s auxiliary information (star ratings, likes, tweets, and shares). By combining all of this information, a relatively robust set of inferences can emerge: rating an item likely means you bought it, deviations from recommendations are telling, posting an item on your Facebook page signals affinity for it, and so forth.
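The idea of stacking weak signals into a stronger inference can be sketched in a few lines. This is purely illustrative and not the paper’s method; the signal names and weights below are invented for the example. Treating each signal as independent evidence, the combined score is one minus the probability that every signal is a false alarm:

```python
# Illustrative sketch of combining weak auxiliary signals into one
# inference score. The signal names and weights are hypothetical,
# not taken from the paper.
SIGNAL_WEIGHTS = {
    "rated": 0.6,        # a public rating strongly suggests a purchase
    "shared": 0.3,       # sharing the item socially suggests affinity
    "rank_change": 0.4,  # the item climbed a related-items list
}

def inference_score(signals):
    """Combine independent weak signals: 1 - prod(1 - w_i)."""
    p_all_false = 1.0
    for s in signals:
        p_all_false *= 1.0 - SIGNAL_WEIGHTS.get(s, 0.0)
    return 1.0 - p_all_false
```

Under these made-up weights, a rating alone yields 0.6, while a rating plus a social share yields 0.72: each extra observation only strengthens the inference, which is why aggregating many small leaks matters.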
Then, math, math, and more math. I won’t pretend to understand it all. One tau symbol on the other side of a union symbol, and I’m pretty much done. But the logic I could derive in the midst of the mystery math seemed pretty solid.
Consider a user Alice who’s made numerous purchases, some of which she has reviewed publicly. Now she makes a new purchase which she considers sensitive. But this new item, because of her purchasing it, has a nonzero probability of entering the “related items” list of each of the items she has purchased in the past, including the ones she has reviewed publicly. And even if it is already in the related-items list of some of those items, it might improve its rank on those lists because of her purchase. By aggregating dozens or hundreds of these observations, the attacker has a chance of inferring that Alice purchased something, as well as the identity of the item she purchased.
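The aggregation step described above can be sketched as follows. This is a toy reconstruction, not the paper’s actual algorithm: the function, its threshold, and the data shapes are all assumptions made for illustration. The attacker snapshots the public related-items lists of Alice’s publicly reviewed items at two points in time, then counts, for each candidate item, how many of those lists it newly entered or climbed:

```python
# Toy sketch (not the paper's algorithm): infer a target user's purchase
# by watching the public "related items" lists of items the user has
# publicly reviewed. Names and the min_signals threshold are hypothetical.

def infer_purchase(before, after, public_items, min_signals=2):
    """Score candidate items by counting related-list changes.

    before/after: dicts mapping each publicly reviewed item to its
    ranked "related items" list at two observation times.
    Returns candidates observed in at least min_signals lists.
    """
    scores = {}
    for item in public_items:
        old, new = before.get(item, []), after.get(item, [])
        for candidate in new:
            if candidate in public_items:
                continue  # skip items the user already reviewed
            appeared = candidate not in old
            climbed = (candidate in old and
                       new.index(candidate) < old.index(candidate))
            if appeared or climbed:
                scores[candidate] = scores.get(candidate, 0) + 1
    return {c: s for c, s in scores.items() if s >= min_signals}
```

A single list change proves nothing, since other customers’ purchases move these lists too; the inference only becomes plausible when the same candidate item surfaces across dozens or hundreds of lists tied to one user, which is exactly the aggregation the paragraph describes.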
The section of the paper on Amazon’s collaborative filtering is especially interesting, since the researchers had a hard time testing their inferences because they lacked a “ground-truth oracle,” essentially true data about what really happened. In three cases outlined, their inference engine suggested that each user had purchased a certain item particular to them. Within a month of each prediction, every user had reviewed their predicted item, a seemingly solid proxy for having purchased it. In one case, it was an older R&B album; in another, a gay-themed movie; in a third, an action-fantasy movie. While this sounds like prediction, remember that these are inferences about events that had already occurred. It does, however, suggest that by using public data, someone outside Amazon could have determined with some accuracy what each of these people had bought, some of their personal traits, or a little of both.
One of the paper’s authors, writing in a blog post about the study, also notes a few caveats to quell the panic these conclusions might instill:
It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different. . . . We also found that users of larger sites are much safer, because the statistical aggregates are computed from a larger set of users. . . . It underscores the fact that modern systems have vast “surfaces” for attacks on privacy, making it difﬁcult to protect ﬁne-grained information about their users. Unintentional leaks of private information are akin to side-channel attacks: it is very hard to enumerate all aspects of the system’s publicly observable behavior which may reveal information about individual users.
Returning to the language I prefer, there are many great phrases in this paper: sybil history, ground-truth oracle, and AUX tracks. It’s clear these researchers really loved this study and are excited by what they’ve unearthed.
We are all publishers now, but in a new, data-rich sense. What we need to recognize is that we’re all publishing data nearly all the time, without knowing it, and perhaps in ways that smart people with strong algorithms can harvest and use — either to help us, or to exploit us. And the traditional dichotomy between personal information and anonymous, aggregated data . . . well, that distinction may be blurring.