
Of all the plays in the Web 2.0 space, the ones that appear to have real workflow utility are publisher linking networks like Connotea and 2Collab. They promised to make it easier to file research, discover related research, and engage in discussions about common research interests.

Recently, two little hints I came across suggested that publisher linking networks may be choking to death on spam.

The first was a tweet from Matthew Todd, in which Todd complained about spam in Connotea’s RSS feeds, finishing with “Does anyone still even work there?” Now, this is clearly one person’s perspective, but it comes from an observant and frequent user.

At the same time, I discovered that Elsevier’s 2Collab has ceased accepting new users because the high level of spam was making it difficult to serve users.

I decided to ask Howard Ratner of Nature about the Connotea issues noted by Matthew Todd (Nature developed Connotea). Howard was nice enough to direct me to Ian Mulvany, their Connotea expert, who gave me a clear indication that, yes indeed, some very bright and dedicated people are working on Connotea. But they are struggling with the spam issues.

Mulvany notes the two big effects spam has on Connotea. First, it reduces the overall utility of the site as a social ranking engine for science. Second, Connotea has to support an artificially high usage volume, with a significant proportion of the volume coming from automated users, or spammers.

Connotea seems to use the current best method for controlling spam — as Mulvany puts it, “a series of heuristic rules . . . that trigger spammy scores on user accounts. If an account goes beyond a certain threshold, then the account is made invisible to the outside world.”
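To make the idea concrete, here is a minimal sketch of threshold-based heuristic scoring of the kind Mulvany describes. It is not Connotea's actual code: the `Account` fields, the individual rules, their weights, and the threshold value are all hypothetical, chosen only to show how several weak signals add up to a score that hides an account.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # All fields are hypothetical signals, not Connotea's real data model.
    username: str
    bookmarks_per_hour: float
    distinct_domains: int
    total_bookmarks: int
    visible: bool = True


SPAM_THRESHOLD = 3.0  # assumed cutoff; a real service would tune this


def spam_score(account: Account) -> float:
    """Add up points from a few illustrative heuristic rules."""
    score = 0.0
    # Rule 1: posting far faster than a person plausibly could.
    if account.bookmarks_per_hour > 30:
        score += 2.0
    # Rule 2: many bookmarks that all point at a single domain.
    if account.total_bookmarks >= 10 and account.distinct_domains <= 1:
        score += 2.0
    # Rule 3: username that ends in a long run of digits.
    trailing_digits = len(account.username) - len(account.username.rstrip("0123456789"))
    if trailing_digits >= 4:
        score += 1.0
    return score


def apply_visibility(account: Account, threshold: float = SPAM_THRESHOLD) -> bool:
    """Hide the account from the outside world once its score exceeds the threshold."""
    account.visible = spam_score(account) <= threshold
    return account.visible


bot = Account("cheapdeals20091", bookmarks_per_hour=120, distinct_domains=1, total_bookmarks=500)
researcher = Account("mtodd", bookmarks_per_hour=2, distinct_domains=40, total_bookmarks=300)
apply_visibility(bot)
apply_visibility(researcher)
```

In this sketch no single rule hides an account; only a combination pushes the score past the cutoff, which keeps one unusual habit from hiding a legitimate user.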

One difference between Connotea and 2Collab that’s driving a harder reaction at Elsevier’s service is that 2Collab is tightly integrated with Scopus, so spam hitting user profiles ripples through much more of Elsevier’s social software environment. At Nature, Connotea is still separate from Nature.com, so the effects of spam are isolated to Connotea.

However, the presence of spam is something Nature will have to deal with if they ever want to integrate profiles across Connotea and Nature.com.

Scopus and 2Collab use heuristic filtering, as well, but it seems to be based more on keywords and URL strings.
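A keyword-and-URL filter of this general kind might look like the sketch below. It is an illustration only, not Elsevier's implementation: the blocked keywords, the URL fragments, and the function name `looks_like_spam` are all made up for the example.

```python
# Hypothetical block lists; a real service would maintain far larger ones.
BLOCKED_KEYWORDS = ("casino", "replica watches", "payday loan")
BLOCKED_URL_FRAGMENTS = ("/cheap-", "?aff=")


def looks_like_spam(title: str, url: str) -> bool:
    """Flag a bookmark if its title or URL contains a blocked string."""
    lowered_title = title.lower()
    if any(keyword in lowered_title for keyword in BLOCKED_KEYWORDS):
        return True
    lowered_url = url.lower()
    return any(fragment in lowered_url for fragment in BLOCKED_URL_FRAGMENTS)


caught = looks_like_spam("Best online casino bonus", "http://example.com/a")
clean = looks_like_spam("Protein folding review", "http://example.org/paper")
# The same pitch written in Cyrillic matches none of the English keywords.
missed = looks_like_spam("Казино онлайн", "http://example.org/page")
```

A filter like this catches only the strings it lists, so a title in another script or a slightly reworded URL passes straight through.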

Heuristic filters are notorious for creating a cat-and-mouse game with spammers. They’re built on rules that, under attack, reveal themselves. Once a spammer sees the rules, the spammer can work around them. This is what’s happening at Nature, according to Mulvany:

. . . we are not coping well at the moment with non-Latin character spam. Russian spam is quite a problem at the moment for us.

These experiences lead me to ask a bigger question: Can publishers sustain these networks with their own technologies?

Heuristic filters are prone to prolonged wars of attrition, and the spammers have the advantage.

Commercial linking sites don’t seem to have these problems. Why? They use heuristic filters, too.

Well, they supplement their heuristic filters with community filtering.

Digg, WordPress, and other major providers of linking and authoring services have filters that learn from millions of users, day in and day out. A spam attack in Digg or WordPress lasts very briefly, and barely even registers as an annoyance. A few thousand users identify it as spam in a matter of minutes, and the system shuts it down.
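The core of community filtering can be sketched in a few lines. This is a simplified illustration, not how Digg or WordPress actually work: the `CommunityFilter` class, its method names, and the flag count are assumptions, and real systems also weigh how trustworthy each reporter is and learn from the flagged content.

```python
from collections import defaultdict


class CommunityFilter:
    """Hide an item once enough distinct users have reported it as spam."""

    def __init__(self, flags_to_hide: int = 3):
        # Assumed cutoff; at real scale this would be far higher.
        self.flags_to_hide = flags_to_hide
        # Maps item id -> set of user ids who flagged it (sets ignore repeats).
        self._flaggers = defaultdict(set)

    def flag(self, item_id: str, user_id: str) -> bool:
        """Record one user's spam report and return whether the item is now hidden."""
        self._flaggers[item_id].add(user_id)
        return self.is_hidden(item_id)

    def is_hidden(self, item_id: str) -> bool:
        return len(self._flaggers.get(item_id, ())) >= self.flags_to_hide


community = CommunityFilter(flags_to_hide=3)
community.flag("post-1", "alice")
community.flag("post-1", "alice")  # a repeat report from the same user adds nothing
community.flag("post-1", "bob")
hidden = community.flag("post-1", "carol")  # third distinct reporter hides the item
```

Counting distinct reporters rather than raw reports is the design choice that matters here: it means one account cannot bury an item alone, and it is also why the approach needs a large active user base to react quickly.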

But commercial offerings have a different problem: when something is wrongly identified as spam, either through heuristics or community action, it can disappear forever. We’ve had problems like that with the Akismet spam filter on WordPress. Commenters have to email us when a comment goes unposted for hours, and we have to dig it out of the spam filter. It’s rarely a problem, and from a user’s perspective, these little glitches rarely intrude on the utility or usability of a site, so they are categorically less problematic. In fact, they’re usually invisible, unlike the more prominent spam the heuristic filters are missing on Connotea and 2Collab.

In the STM community, two noble attempts to advance collaborative reference sharing may be choking on spam. Without the scale to create a community filter to complement their heuristic filters, their future options may be defined by spam more than by their original aspirations.
