Let us cast our minds back to 2003 and a quote from John Battelle that neatly summed up why Google, and a slew of other tech companies down in the valley, would go on to command the astronomical valuations that persist to this day:
The Database of Intentions is simply this: The aggregate results of every search ever entered, every result list ever tendered, and every path taken as a result. It lives in many places, but three or four places in particular hold a massive amount of this data (ie MSN, Google, and Yahoo). This information represents, in aggregate form, a place holder for the intentions of humankind – a massive database of desires, needs, wants, and likes that can be discovered, subpoenaed, archived, tracked, and exploited to all sorts of ends. Such a beast has never before existed in the history of culture, but is almost guaranteed to grow exponentially from this day forward. This artifact can tell us extraordinary things about who we are and what we want as a culture. And it has the potential to be abused in equally extraordinary fashion.
This is one of the most prescient statements I think I have ever read about the world that we live in today. And in looking back a paltry 11 and a bit years since those words, one has to realize that the Database of Intentions was in fact in its infancy.
What John is talking about in the above quote is the idea that Google et al. would know, in an abstract sense, about human desires. For example – N% of a population is interested in Y and Z and A. Those interests have changed over the last XXXX period from C and D and E. So the algorithms show ‘them’ more Y and less C. And Google makes money off the advertising.
This was a world that existed after the Patriot Act (for the USA) and RIPA (for the UK). This was also a world where Y, Z and A were ‘just’ statistical clusters of co-occurring words; the machine didn’t really have a clue what those words actually were about. A bunch of dumb algorithms and a bunch of parallel processing. It was a world before the rise of the social graph (people, aka Facebook); the geospatial graph (driven by cheap GPS chips in mobile devices, and increasingly ubiquitous wifi); and the knowledge graph (concepts and so on). So let us update that quote with all those things and see what it looks like now.
The Database of Intentions is simply this: It’s a model of you. And everyone and everything you contact in space and time.
It’s the aggregate of your searches, your results and the paths you took, indexed against your location, your time of day; further indexed against the semantic understanding of who you are; what you are interested in and what, or who your interactions are with.
This information, in aggregate form, IS the intentions of humankind – a massive database of desires, needs, wants, and likes that can be discovered, (not even) subpoenaed, archived, tracked, and exploited to all sorts of ends. Such a beast has never before existed in the history of culture, but is almost guaranteed to grow exponentially from this day forward. This artifact can tell us extraordinary things about who we are and what we want as a culture. And it has the potential to be abused in equally extraordinary fashion.
And here’s the thing… I really want to exploit that database.
And so do you.
You are a Funding Agency – representing the long arm of the taxpayer, and you really want to do the ROI of science thing; demonstrate that you give money to the most impactful work by the most effective scholars. You totally want at that database, because that’s how you are going to figure out just what impact truly is.
You believe #Altmetrics is the future of measurement of the multitudinous elements of scholarly output. You totally want at that database. Actually, you are building that database – harvesting the data from all sorts of places, trying to find the signal in the noise and show that a positive tweet from professor X from the research powerhouse of Y about a scholar is just as useful a measure as an A-list publication score.
You want to expose data from researchers – from the raw material to the curated and quality-controlled outputs; you want to match the who to the what and the how.
You are a University – you want to know who brings in the money from those funding agencies – who’s the rockstar and who needs to clear their lab; how can you ensure you score big at the next RAE?
You are a Librarian – You’ve paid for a bunch of stuff, now what exactly is it being used for and by whom, and does that justify the cost?
You are a Researcher… You need to maximize the ‘bang’ from your research; gotta get it in front of the right people for that tenure process. You need to come up with something that demonstrates your ‘impact’. You could really use something that filters out the cr*p, lets you scan over the literature more efficiently; lets you keep an eye out for the opposition sneaking in with a result from left-field; something that tells you if you’ve registered an ‘impression’ with that big prof in your field.
You are a Publisher – You need more usage; more visits; more eyeballs. Whether you are #OA or #Tollgate, #Freemium or #Subscription, you need to get the stuff to the right people at the right time. You need to enhance serendipity, because that drives the value. You need that 360-degree view of ‘the customer’ so you are building that database as well. You want to push the relevance, and a knowledge graph and a social graph are a wicked combination if you can crack it.
All these things require surveillance. You can’t measure ‘impact’ without it. If you want to measure the ‘impact’ of a person’s work, you have to gather the data on where it goes, how far it spreads, the impressions, the penetration, and the cascade of actions that it triggered, and you have to set those in some sort of cultural model. A really good relevancy model for usage must relate the semantics of the stuff to the interests of the potential readers, and that requires data be parsed from both. Where do you get the reader interest data from? Why, their clickstream, of course. This is why we are tracked constantly in our browsers so as to feed the beast that serves the ads. As a (rather important) sidebar, one might want to ponder why it is that, with all this tracking, the ad-supported news business is in such terrible trouble; isn’t the data worth infinitely more than taking a blind shot on a half-page ad in the print newspaper? Shouldn’t we be living in a golden age of journalism powered by surveillance-based advertising?
Of course, surveillance has a bit of an image problem right now. Whether you merely find it creepy, or perhaps you worry about the capabilities of the state to blunder around the data, the cost-benefit equation is, more often than not, tilted firmly against you.
And it’s reached the outer spiral arm of the internet we call home. Us. The community of scholarly publishers and associated interests. It emerged rather strongly at the most recent SSP Annual Meeting, coming up as a question in more than one session. It was called ‘impact’ or ‘analytics’ or ‘metrics’ but it’s surveillance at the end of the day.
And I believe firmly that it’s an important tool, and that if we can do it right; do it transparently; do it respectfully; explain the benefits; allow the control to reside with the surveilled; be honorable and open with our intentions; then maybe we can all benefit here. But hope is not enough.
I think we should have a debate about how this data should be used in a scholarly context. We should look to derive a set of principles about how such data is to be used. Perhaps we need to think about badges and accreditation for ‘good surveillance’ practices. We must be able to explain to our users why the transaction is in the interests of all the stakeholders, and understand and allow for control over the collection and use of such data.
There’s more here (much more). I hope that whether you agree with my views (and they are very much my personal views) or not, you do conclude that discourse in this area is much needed. We are somewhere close to the general internet conditions that prompted John to write those words in a blog post in ’03. Let’s try and think this through for the benefit of all the members of the scholarly community.