
It is a truth universally acknowledged that any discussion of identity and identity provision will instantly send an auditorium of conference participants to sleep, make readers of an RSS feed instantly skip to the next item, and allow email update subscribers in hot pursuit of “inbox zero” to delete the offending message without the slightest twitch of guilt.

Well, I hope you decide to stick around, because it’s time to get to grips with some of the issues of identity in a digital world.

We start, like most things digital and Internet, with Google. But don’t worry, there’s a very direct link to the scholarly information business coming up.

Not long after Google launched their new service, Google+, a furious debate around the subject of identity was ignited. Google had decided that any user of the Google+ service must use their ‘Real Name’. This was described as “the name you commonly use.”

When I signed up for Google+, it was pretty obvious to me that Google was asking me to identify myself as clearly and unambiguously as possible. However, it turns out that there were some edge cases: people whose commonly used name was either a single word or contained an unusual combination of characters. Google’s automated “dodgy identity detector” was flagging these as spam or other poor behaviour and, to cut a long story short, Google closed the accounts down. It’s fair to say that Google didn’t handle the issue very well (they’ve basically admitted as much), but in the ensuing furor some interesting debating points emerged.

Question: When is anonymous use of a service to be expected?

Answer: If you are in Germany, the answer, it seems, is always. The German Telemediengesetz (German Act for Telecommunications & Media Services) specifies anonymous/pseudonymous access in section 4, subsection 13 of the act. Other countries may have their own requirements. Then there are the cultural aspects to consider. There’s also the question of the circumstances under which a business that is interested in selecting for a particular type of user should be obligated to provide services to individuals it is not interested in serving (think about that one, it isn’t quite as obvious as it seems).

Question: Regardless of whether it’s expected, is it in fact possible to have truly pseudonymous/anonymous access to a service in an always connected world?

Answer: It’s early days here, but so far the evidence would seem to be that it simply is not possible to be truly anonymous when one is connected to the web. Say you want to search the USA for ‘subversives’. You could use the Amazon wishlist feature in order to locate the addresses of people who read ‘subversive books’. Or you could read this forensic look at how your browser can be fingerprinted to a very high degree of accuracy, and thus used to identify you (pdf). Bet it didn’t occur to you that your browser is effectively a Globally Unique Identifier, did it? Now you know why those annoying Flash ads keep following you around when you browse the web.
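To make the fingerprinting point concrete, here is a minimal sketch of the idea, not the method used in the study linked above. It assumes a site can read a handful of settings from the browser; the attribute names and values are hypothetical, and the `fingerprint` function is my own illustration.

```python
import hashlib


def fingerprint(attributes: dict) -> str:
    """Hash a browser's reported attributes into one short identifier."""
    # Sort the keys so the same attributes always produce the same string.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical values; a real browser exposes many more attributes than this.
visitor = {
    "user_agent": "ExampleBrowser/5.0 (ExampleOS 10.6)",
    "screen": "1440x900x24",
    "timezone": "UTC+1",
    "fonts": "Arial,Gill Sans,Helvetica,Times",
    "plugins": "Flash 10.3,QuickTime 7.7",
}
same_visitor_later = dict(visitor)
different_visitor = dict(visitor, fonts="Arial,Helvetica,Times")

print(fingerprint(visitor) == fingerprint(same_visitor_later))
print(fingerprint(visitor) == fingerprint(different_visitor))
```

No name, login or cookie is involved. The same combination of settings gives the same identifier on every visit, while a single difference in the installed fonts is enough to tell two visitors apart.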

Question: To what extent should the debate about identity be carried out on social network platforms built to leverage that very thing? If we use a service without a full and complete understanding of the consequences of the choice, are we forgoing rights we might later wish to assert?

Answer: The terms of service we all sign up to when we elect to use a new product are inimical to understanding (assuming they’ve been written to be understood), given that it’s very hard to think through the consequences of one’s possible patterns of use before one has even kicked the tires, so to speak. I’d suggest that enlightened statespersons should debate these complex issues, but this is the 21st century, and there don’t seem to be any around in the political arena. This one will be decided by those who engage most effectively with it.

I find it somewhat odd that some users of Google+ were objecting to Google’s insistence on a real name as if they were in some way able to use Google’s services whilst retaining anonymity. Google certainly has a very good idea of who you are, regardless of whether you are logged into an account or not. They just don’t have your name.

So why would they want to insist on a real name? Twitter doesn’t insist on it. I’m told Facebook does, but apparently isn’t overly zealous about following up on the rule. Other social networks of various flavours allow whatever strange and wonderful combination of letters, numbers, and special characters you care to design.

Well, for one thing, Google+ is an identity service. It just happens that one of the first things they’ve implemented with it is a method of allowing you to build a network of your contacts, with whom you can exchange information.

If you are on Google+, you may well already be aware of what Google is doing with the identity information they have collected so far. Links shared by your network are being highlighted, even if those links didn’t originate in Google+. Take a look at the image below, where two members of my Google+ network show up next to links they have commented on. I note with interest the ability of Google to connect the dots between Google+ and FriendFeed (which is owned by Facebook). Sarah Perez is linked to via her Google profile (a component of Google+). Content of a potentially higher value is being highlighted via your network.

(Thanks to Jill O’Neill and John Blossom for permission to use their names and images.)

The intention is pretty clear. Google wants to be able to leverage the database of intentions it has built up about you and your friends and colleagues for all sorts of services. One of those services looks to be a totally personalised Google information delivery experience. I’d guess that a fairly major play on transactional services won’t be far behind. Throw in a tightly integrated mobile experience and some hardware, and Google has got a rather interesting end-to-end view of their users, to put it mildly. (Update: whilst writing this, I was made aware of Google Wallet, a mobile-based payment system.) This is potentially a view that can follow the user seamlessly across their interactions with, well actually, with the modern world. The Internet isn’t a place you visit from time to time; it’s the base fabric of the modern world. I imagine Google has hordes of data scientists chomping at the bit to extract monetisable insights out of the datastream. Allowing for the special characters and other tricks that spammers often use would be a disaster for the data being accumulated from Google+ activity.

I’ve previously written about how companies are trying to create and sell metrics attached to individuals in order to sell the “reputation” that accumulates with any user of a social network. I’m not on Facebook as I don’t want to have anything to do with a company whose privacy statement was longer than the US Constitution. I am, however, a user of both Twitter and Google+. I’ve been trying to rationalise why I’m comfortable with Google’s use of my data, why I trust Google more with data that can be used to identify me purely on my habits and my particular set of interests. I’ve also been thinking about how the more you use Google the more it learns about your interests in order to serve better results to you. Amazon does much the same, profiling its users in order to better understand what they are interested in, so that it can sell more stuff to them.

Effective personalisation and relevance are of great interest to the scholarly publisher, and there are a number of neat offerings out there. I think a really effective personalisation and recommendation system is of massive benefit to the time-poor researcher or student, as anything that can increase the chances of a serendipitous discovery in the scholarly literature brings massive benefits. Allowing users to transit across various scholarly holdings in a meaningful way would also bring massive benefits to all (there’s a reason they use Google first). But there are a couple of big problems.

Limited Data: Personalisation systems are limited to the offerings of the publisher/institution. This is a problem that cuts both ways: publishers can only profile users based on the data (journals and articles) that they are able to present. Users can only see “relevant” material from within the holdings of the publisher, or perhaps the larger holdings that their institution has access to. Given that any given field of scholarly research spreads across multiple publisher portfolios, all parties are at a disadvantage when compared to Google or Amazon (or Facebook or Twitter, for that matter) in terms of how good, and therefore useful, their offerings can be. Then of course there are the privacy issues to consider.

Spam: To date, publishers have been even worse at leveraging the social graph of their users than Google has. Now it’s true that the reasons for this are many and varied, but a couple of publishers’ experiments had everything going for them, and yet they have been discontinued. I’m talking about Connotea and 2Collab. In a nutshell, these were link storage tools, allowing scholars to store, tag and share links to research. Both had a clear value to users. Both were swamped by spam. 2Collab closed this year. Connotea seems to be still open but looks to be overrun by spam (for a perspective on this, see this article from the Kitchen archives). Two well-funded publishers struggled to deal with this. Clearly it’s a very tricky problem to solve.

Seamless access to institutional holdings: IP-based recognition has always fundamentally used the wrong tool for the job. IP ranges are there to enable machines to communicate with each other, not to be used as authentication methods. Granted, it’s the best solution out there right now, but aside from the major issues in maintaining and updating complex lists of numeric codes, IP addresses identify the machine, not the user. And if the user moves from device to device, then matters get even more complex, as IP authentication will cheerfully withhold access from a user who decides to use their iPad or other device, even when they are physically sat at a terminal which does have access. Athens and Shibboleth are hardly a vision of the future, are they? I know that’s being harsh, and I know a lot of hard work went into the protocols and all that, but it’s basically a suboptimal user experience when compared with Facebook. Just to be fair, OpenID isn’t exactly a barrel of laughs either. Wouldn’t it be fantastic if one could sign in to an identity service and then use that to seamlessly authenticate access to any services that could make use of that identity? I bet Google will be more than happy to allow a Google+ identity to be used for exactly that purpose. But it is a general-purpose identity, and perhaps not the most suitable for the scholarly community.
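The machine-versus-user problem is easy to see in a minimal sketch of IP-based recognition. This is an illustration only, not any particular publisher’s implementation: the licensed ranges are hypothetical documentation-only addresses, and `has_access` is a name I’ve made up for the purpose.

```python
import ipaddress

# Hypothetical licensed ranges for an imaginary institution.
LICENSED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def has_access(client_ip: str) -> bool:
    """Return True if the connecting machine's address is in a licensed range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in LICENSED_RANGES)


# The same researcher, sat in the same chair, on two different devices.
library_terminal = "192.0.2.17"     # desktop on the campus network
tablet_on_mobile = "203.0.113.45"   # tablet on a mobile data connection

print(has_access(library_terminal))
print(has_access(tablet_on_mobile))
```

The check never asks who is connecting, only where the connection comes from. The terminal gets in and the tablet does not, even though the person and their entitlement are identical.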

Researchers are also showing interest in the possibilities of a well-configured identity service. The altmetrics movement is essentially predicated on being able to append various signifiers of scholarly output and reputation to an identity. Work is also being done on further uses for a researcher identity. At the recent irisc2011 identity workshop in Helsinki, a breakout panel debated them. They concluded that researcher IDs would greatly improve the manuscript submission process (this is a less than optimal experience, apparently). Researcher profiles (IDs with the researcher’s metadata appended to them) were also wanted for grant applications, and of course metrics to support the breadth of a scholar’s outputs. Just to be clear on this: altmetrics is about tilting at the windmills of peer review and impact factor, two things that act as a bulwark to the disruption of the business of scholarly publishing.

ORCID was the system of choice for experimentation. ORCID is to authors what the DOI is to the articles they publish: a system for disambiguating author names and supplying them with an unambiguous identity that can be used for various things. Like the DOI before it, this is one of the most important developments occurring in scholarly publishing. It is a very good thing indeed. But part of me thinks that the current ideas for using it don’t go far enough, fast enough.
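For the curious, here is a minimal sketch of what “unambiguous” can mean at the level of the identifier itself. It assumes the 16-character format ORCID publishes, in which the final character is an ISO 7064 MOD 11-2 check character; the function names are my own, and the test value is the sample iD used in ORCID’s documentation.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(identifier: str) -> bool:
    """Check the length, characters and check character of a hyphenated iD."""
    compact = identifier.replace("-", "").upper()
    if len(compact) != 16:
        return False
    if not all(character in "0123456789" for character in compact[:15]):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


print(is_well_formed_orcid("0000-0002-1825-0097"))
print(is_well_formed_orcid("0000-0002-1825-0098"))
```

A check like this only catches typing errors. It says nothing about whether the identifier has actually been issued to anyone, which is the job of the registry behind it.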

People adopt things that provide obvious, clearly understood benefits to them. Things that make the pain of learning how to use them worthwhile. So take a look at what Amazon has achieved in terms of providing an identity service for users of its offerings: relevance marketing, serendipity analysis, Whispersync. Look at how Facebook and Twitter have colonised the business of sharing links, and how Google has concluded that an identity service is vital in order to capture the same signals in order to further improve its search algorithm. Look at the difficulties a user of our wares has if they want to move from device to device whilst consuming our content.

Now, look at ORCID.

So here’s a quick vision of a possible future:

The researcher wakes in the morning and picks up their mobile device. They’ve already configured it with their ORCID credentials so the device can either supply them upon request, or any read/note/store applications can make use of the same credentials in order to allow them to get on with the business of keeping up with the competition. Speaking of which, there’s a competitive intelligence application that keeps an eye on the outputs of competing researchers. Overnight, it has run a series of searches and sorted and categorised the results for them to scan through. It’s learned what areas they like to pay most attention to. Some important items have already had various sections of text and imagery highlighted for closer inspection. Some articles and snippets of information are queued for later consumption, others are tagged to be distributed to the researcher’s lab workers.

As the researcher moves from their house to their place of work and switches devices, the information moves with them, again via their ORCID credentials. In fact, the same credentials have not only allowed them access to all of their institution’s holdings, but various publisher apps and platforms are updating and reconfiguring information for them based not only on their activities, but also on the activities of their ORCID network. DOI resolver data, appended to the identity of the researcher, allows much better precision and recall algorithms to help them filter through the torrent of research. The network effect is in full force.

Later, when they attend a conference at another institution, their access to scholarly resources moves with them. They also have control over exactly how much of their clickstream data is to be used to enhance their information discovery activities.

The publisher has had to employ a bunch of data scientists in order to better understand what their users are doing. Usage is up, way up. Business development is plowing through the data and surfacing a multitude of product ideas and partnerships based on opportunities to derive customised products for the emerging areas of research. Other systems are predicting these emerging areas and listing the most active researchers, ranked by their various scholarly metrics: a self-assembling editorial board for a journal that doesn’t exist as yet, even though the topics for discussion are already being surfaced.

What I’ve described above is not only technically possible, it’s already happening in other areas. Identity-driven data is big business; that’s why Google just spent over $500 million on Google+. There’s a massive opportunity here to build something that offers clear benefits to publishers, scholars and libraries alike. If we don’t do it, somebody else will.

Don’t believe me? Take a look at what else Google has been up to.

David Smith

David Smith is a frood who knows where his towel is, more or less. He’s also the Head of Product Solutions for The IET. Previously he has held jobs with ‘innovation’ in the title and he is a lapsed (some would say failed) scientist with a publication or two to his name.

Discussion

8 Thoughts on "It's About Time We Discussed the Business of Identity"

Hi David,

To play "devil’s advocate" and to pick a few nits:

2Collab and Connotea were not overrun by spam initially. This problem grew gradually worse as each seemed to fall off the radar for their respective owners. There was an initial rush by publishers to put together social media sites like these, and after a while, it became evident that there was little uptake by the community, nor an obvious business model for monetization, and they were mostly back-burnered. That’s when the spam problem really grew, perhaps the online equivalent to weeds growing through cracks in the sidewalk in front of an abandoned building.

“Business development is plowing through the data and surfacing a multitude of product ideas and partnerships based on opportunities to derive customised products for the emerging areas of research”

Will researchers really be willing to pay for these products? It’s important to remember that the things you’re discussing, Google Plus, Facebook, etc., make no direct money from the actual product itself. Google is an advertising company and it uses its various ventures as a means to sell ads. Facebook has a somewhat mixed business model, but as many have remarked in the past, if they tried to charge for using Facebook itself, the site would likely die very quickly and the users would move on to the next free network down the block. So if these sorts of things are going to be monetized, it won’t come from the things themselves, and publishers must instead try to align them with other activities that actually do bring in money.

These two statements seem contradictory:

“Speaking of which, there’s a competitive intelligence application that keeps an eye on the outputs of competing researchers. Overnight, it has run a series of searches and sorted and categorised the results for them to scan through. It’s learned what areas they like to pay most attention to.”

and

“They also have control over exactly how much of their clickstream data is to be used to enhance their information discovery activities.”

Phil Davis wrote a post a while back about the secrecy that most scientists practice, about how there’s a competitive advantage (in an extremely competitive marketplace) for keeping one’s activities to oneself until the point where they’re ready for publication.

If one allows the users to control the exposure of their activities, then nearly all are going to select the absolute minimum, which ruins the network effect and the use as a discovery tool.

And finally, I can’t say that I share your trust level for Google (or for any company really). Companies do what’s in their own best interest, and it’s always important to remember the oft quoted, “If you are not paying for it, you’re not the customer; you’re the product being sold.” Your needs come second to the actual paying customer, in Google’s case, the advertisers.

There’s an excellent interview with Cory Doctorow out recently that discusses this misguided trust, how we’re selling ourselves short, and how Google and Facebook don’t present you with the best possible information that you’re seeking, but instead with the answer that best fits Google’s and Facebook’s commercial goals.

And that raises a key question here: if these sorts of technologies are being used to sell something else, then can we really trust them for use in scholarly research? If the results are biased toward selling you a product, then that creates a conflict of interest that makes them much less valuable for the user.

Ok, lot to reply to there.

1) The spam thing. One of G+’s reasons for a ‘real name’ was to combat spam. I think I did mention in passing that there were other issues surrounding 2Collab and Connotea (in fact when writing that part, I had your article in mind), but once spam gets into a system and isn’t flushed then the utility of that system plummets. It’s arguable whether a ‘real name’ solves the spam problem, but I think it is a genuine reason.

2) New product ideas based on data mining. I didn’t say you had to sell to researchers! We don’t (much) sell to researchers right now. We sell to others, they (researchers) consume those outputs. This is an argument about value and where to find it, I think. I think your point about aligning with activities that bring in money is right on the mark. That’s actually the whole point. Google, Facebook, etc. think there is money to be made in identity control. After I wrote this, I read a story about how the UK Govt is considering the Facebook ID as a sign-in service. Let’s just think about that for a moment (assuming it’s true and not wild speculation). Once you have the control you can make the money.

3) My contradictory statements: So this speaks to whether systems can be built to make the trade-off worthwhile to the user. Hide your clickstream and you don’t get the serendipity service that your competitor might be using to read the paper that gives the crucial insight…. Also, there are many ways to look at competitor analysis. I should have perhaps connected that bit to the altmetrics aspect of things more directly – You need more than impact factor for this to work. Also, for length reasons, I left out a chunk on the desire to be credited with work, whilst wanting to comment on it anonymously…

4) Trust. Absolutely. And Cory’s thinking is excellent. There needs to be trust between all players in the scholarly process. An identity service that everybody owns that gives clear benefits to all players is what is needed. I don’t think Google and Facebook will supply that necessarily, for all the reasons you suggest. And as I stated in my 1st para, these issues tend to send people to sleep, especially when “OMG Farmville!” and whatever. But this is no longer a world full of 5 to 15% solutions to issues (3 publishers controlling 42% is not a monopoly, people. One search engine with a 90% reach, THAT’S a monopoly!). The network effect tends to deliver one big winner. There are ways to hedge against that outcome. http://store.steampowered.com/about/ for example is an online download system for computer games. Multiple publishers of games use it. It’s an ID system for gamers and it handles abuse, piracy and all sorts whilst still allowing a competitive market for the products. Game publishers seem to have withstood the disruption that’s allegedly hammered the music and movie businesses. That’s because they were smart at figuring out some things about how to serve their users in a digital world.

Appreciate, I’ve abbreviated some of your points in my answers. Hope you get the gist.

The most recent issue of NISO’s Information Standards Quarterly magazine is all about organizational and people identifiers and includes articles on ISNI (an ISO standard for a Name Identifier), ORCID, the Names Project, and others. Geoff Bilder has an opinion piece on Identities and Trust, which is very relevant to some of the issues brought up in this blog. It’s all available in open access here: http://www.niso.org/publications/isq/2011/v23no3/

Full disclosure: I am the Managing Editor of this magazine.

There’s a somewhat related article in this week’s New Yorker about a drive to provide IDs to millions of people in India, many of whom never have had an ID in the sense we think about ID.

http://www.newyorker.com/reporting/2011/10/03/111003fa_fact_parker [subscribers only]

It’s fascinating to think that while we’re worried about protecting our identities, there are others who barely have any at all, and how revelatory the notion is (and how many accommodations they’re willing to make to get one).

Fair enough. Still, it’s refreshing to see such a healthy respect for print.

I love the story about the identity business in India. Raises the same issues about what constitutes informed consent, when some of these people are illiterate. I think this article in Wired is about the same thing: http://www.wired.com/magazine/2011/08/ff_indiaid/all/1

In James Gleick’s rather marvelous “The Information” there is a section on the arrival of the telephone and the build-out of the ‘telephone network’ and the rapid need for a listing of names and numbers to allow people to be called. There was anxiety about a) the effect on privacy of the spoken word being transmitted to another location where anybody could be listening, and b) the impersonal nature of a number instead of a name to enable communication. On reading this section, it struck me how those two things seem to contradict each other. But then privacy and identity have always been a complex emotional business.
