
Of all the plays in the Web 2.0 space, the one with what appears to be real workflow utility has been the notion of publisher linking networks like Connotea and 2Collab. They promised to make it easier to file research, discover related research, and engage in discussions about common research interests.

Recently, two little hints I came across suggested that publisher linking networks may be choking to death on spam.

The first was in the form of a tweet from Matthew Todd, in which Todd complained about spam in Connotea’s RSS feeds, finishing with “Does anyone still even work there?” Now, this is clearly one person’s perspective, but it comes from an observant and frequent user.

At the same time, I discovered that Elsevier’s 2Collab has ceased accepting new users because the high level of spam was making it difficult to serve users.

I decided to ask Howard Ratner of Nature about the Connotea issues noted by Matthew Todd (Nature developed Connotea). Howard was nice enough to direct me to Ian Mulvany, their Connotea expert, who gave me a clear indication that, yes indeed, some very bright and dedicated people are working on Connotea. But they are struggling with the spam issues.

Mulvany notes the two big effects spam has on Connotea. First, it reduces the overall utility of the site as a social ranking engine for science. Second, Connotea has to support an artificially high usage volume, with a significant proportion of the volume coming from automated users, or spammers.

Connotea seems to use the current best method for controlling spam — as Mulvany puts it, “a series of heuristic rules . . . that trigger spammy scores on user accounts. If an account goes beyond a certain threshold, then the account is made invisible to the outside world.”
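To make the idea concrete, here is a minimal sketch of threshold-based heuristic scoring of the kind Mulvany describes. It is not Connotea's actual code: the rules, the point values, the keyword list, and the threshold are all hypothetical, chosen only to illustrate how rule hits add up to a "spammy score" that hides an account.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple
from urllib.parse import urlparse

# Hypothetical values for illustration only; a real service would tune these.
SPAM_KEYWORDS = ("pills", "casino", "replica")
VISIBILITY_THRESHOLD = 5.0


@dataclass
class Account:
    username: str
    bookmarks: List[str] = field(default_factory=list)  # bookmarked URLs
    visible: bool = True


def _domains(account: Account) -> set:
    """Distinct host names across the account's bookmarks."""
    return {urlparse(url).netloc for url in account.bookmarks}


# Each hypothetical rule: (description, points added when it fires, predicate).
Rule = Tuple[str, float, Callable[[Account], bool]]
RULES: List[Rule] = [
    ("posts in bulk", 2.0, lambda a: len(a.bookmarks) > 50),
    ("every bookmark points at one domain", 3.0,
     lambda a: len(a.bookmarks) >= 3 and len(_domains(a)) == 1),
    ("spam keyword in a URL", 4.0,
     lambda a: any(k in u.lower() for u in a.bookmarks for k in SPAM_KEYWORDS)),
]


def spam_score(account: Account) -> float:
    """Sum the points of every rule that fires for this account."""
    return sum(points for _, points, matches in RULES if matches(account))


def apply_threshold(account: Account,
                    threshold: float = VISIBILITY_THRESHOLD) -> bool:
    """Hide the account from the outside world once its score exceeds the threshold."""
    account.visible = spam_score(account) <= threshold
    return account.visible
```

Note that a legitimate researcher who bookmarks only one journal's site would still trip the single-domain rule in this sketch, which hints at why fixed rules invite both false positives and workarounds.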

One difference between Connotea and 2Collab that’s driving a harder reaction at Elsevier’s service is that 2Collab is tightly integrated with Scopus, so spam hitting user profiles ripples through much more of Elsevier’s social software environment. At Nature, Connotea is still separate from Nature.com, so the effects of spam are isolated to Connotea.

However, the presence of spam is something Nature will have to deal with if they ever want to integrate profiles across Connotea and Nature.com.

Scopus and 2Collab use heuristic filtering, as well, but it seems to be based more on keywords and URL strings.

Heuristic filters are notorious for creating a cat-and-mouse game with spammers. They’re built on rules that, under attack, reveal themselves. Once a spammer works out the rules, the filter can be tricked. This is what’s happening at Nature, according to Mulvany:

. . . we are not coping well at the moment with non-Latin character spam. Russian spam is quite a problem at the moment for us.

These experiences lead me to ask a bigger question: Can publishers sustain these networks with their own technologies?

Heuristic filters are prone to prolonged wars of attrition, and the spammers have the advantage.

Commercial linking sites don’t seem to have these problems. Why? They use heuristic filters, too.

Well, they supplement their heuristic filters with community filtering.

Digg, WordPress, and other major providers of linking and authoring services have filters that learn from millions of users, day in and day out. A spam attack in Digg or WordPress lasts very briefly, and barely even registers as an annoyance. A few thousand users identify it as spam in a matter of minutes, and the system shuts it down.
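A rough sketch of what community filtering means in practice follows. This is an assumption-laden illustration, not how Digg or WordPress actually implement it: the class name, the method names, and the flag threshold are hypothetical. The only idea it demonstrates is that an item disappears once enough distinct users have flagged it.

```python
from collections import defaultdict
from typing import Dict, Set


class CommunityFilter:
    """Hide an item once enough distinct users have flagged it as spam."""

    def __init__(self, flag_threshold: int = 3) -> None:
        # Hypothetical threshold; a large site could afford a much higher one.
        self.flag_threshold = flag_threshold
        self._flags: Dict[str, Set[str]] = defaultdict(set)

    def flag(self, item_id: str, user_id: str) -> bool:
        """Record one user's flag and report whether the item is now hidden."""
        # A set ignores repeat flags from the same user, so one person
        # cannot hide an item alone.
        self._flags[item_id].add(user_id)
        return self.is_hidden(item_id)

    def is_hidden(self, item_id: str) -> bool:
        return len(self._flags.get(item_id, ())) >= self.flag_threshold
```

The design depends on scale: with millions of users the threshold is reached in minutes, while on a niche site the same item could sit unflagged for days.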

But commercial offerings have a different problem: when something is wrongly identified as spam, either through heuristics or community action, it can disappear forever. We’ve had problems like that with the Akismet spam filter on WordPress. Commenters have to email us when a comment goes unposted for hours, and we have to dig it out of the spam filter. It’s rarely a problem, and from a user’s perspective, these little glitches rarely intrude on the utility or usability of a site, so they are categorically less problematic. In fact, they’re usually invisible, unlike the more prominent spam the heuristic filters are missing on Connotea and 2Collab.

In the STM community, two noble attempts to advance collaborative reference sharing may be choking on spam. Without the scale to create a community filter to complement their heuristic filters, their future options may be defined by spam more than by their original aspirations.

Kent Anderson

Kent Anderson is the CEO of RedLink and RedLink Network, a past-President of SSP, and the founder of the Scholarly Kitchen. He has worked as Publisher at AAAS/Science, CEO/Publisher of JBJS, Inc., a publishing executive at the Massachusetts Medical Society, Publishing Director of the New England Journal of Medicine, and Director of Medical Journals at the American Academy of Pediatrics. Opinions on social media or blogs are his own.


Discussion

24 Thoughts on "Are Publisher Linking Networks Like 2Collab and Connotea Choking to Death on Spam?"

Ian underestimates the damage spam has done to Connotea. If I show it to a senior colleague, when they see the spam, they’re gone, as in their view, it has descended into “social” and is no longer a valuable research tool. This is why the anti-spam approach that CiteULike has taken is so much smarter.

I wonder if the problems for online reference management efforts like 2Collab go beyond just spam. Spam is obviously a huge problem. But they’re also faced with questions of limited usefulness due to the lack of a robust customer base and questions of whether this type of discovery is of value to their customers.

Organization of references has always been an annoying and tedious problem for researchers, with EndNote being the standard (and generally inadequate) way of doing it. The new software available, particularly from leaders like Mekentosj (Papers) and Mendeley, is vastly superior and tremendously useful.

Papers is wildly popular, but has no online social component. Mendeley has a reported 100,000 members in their online network, but apparently 2-5X that many download and use their desktop program without ever connecting to the online system. From those using the online systems, most I speak with find them of limited value for discovery. And for the unconnected, most I speak with have too much to read already and aren’t interested in adding to their piles.

Mendeley is the only one of these companies that puts out any numbers at all as far as usage goes, and they announced their year-end most-popular articles here. I’m not a statistician, so it’s hard for me to tell what it says that the most common article among 100,000 users was only common to 74 of those users, but that number seems low. I’m willing to bet I could go to a focused science meeting of 300 people and find at least that many there with one common basic paper on their reference lists. So perhaps the issue is that the network needs to be more robust, needs many more members to be more representative and a better discovery tool (assuming that’s something people want). It’s also interesting that 9 of the 10 papers are all focused on one general area of research, computational approaches to biology (systems biology, informatics and the like). Which again shows that different fields are more comfortable/interested than others in using online tools like this, but for those outside of that field, sparse representation may limit their usefulness.

Great, thanks. Is there any information available on the number of active members on CiteULike (accounts used in the last 30 days or something like that)?

Dave

The total number of registered user accounts is 279,307, of those 58,545 are identified as spammers. The number of users who post stuff every month is in the 10s of thousands.
The Website receives around 900,000 unique visitors per month, of which 70% or so bounce (only visit one page) so the real number of browsers is 300,000 or so.

I’m not sure that community filtering is the difference between scholarly sites and places like del.icio.us (where as far as I can tell there’s very little community policing). Rather it’s that Connotea and 2Collab are niche sites with small audiences so it doesn’t take much for spam to get noticed.

IMHO the way forward is to separate the scholarly material from the more general web page bookmarks, like Bibsonomy and CiteULike do.

Your heuristic can then go something like “any papers you bookmark will be publicly visible but if you bookmark a website it won’t be seen unless at least x other users have also bookmarked it, where x is based on how spammy our existing system thinks it is”.
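The rule proposed above could be sketched roughly as follows. The mapping from spam score to x is an assumption made for illustration: the comment only says x depends on how spammy the system thinks the bookmark is. The function names, the 0-to-1 score range, and the `max_required` cap are hypothetical.

```python
import math


def required_cobookmarks(spam_score: float, max_required: int = 10) -> int:
    """Map a spam score in [0, 1] to x, the number of other users required.

    Assumed mapping: x grows linearly with the score, up to max_required.
    """
    clamped = min(max(spam_score, 0.0), 1.0)
    return math.ceil(clamped * max_required)


def is_publicly_visible(kind: str, other_bookmarkers: int,
                        spam_score: float) -> bool:
    """Papers are always public; websites need x other users to have bookmarked them."""
    if kind == "paper":
        return True
    return other_bookmarkers >= required_cobookmarks(spam_score)
```

Under this sketch a website the system considers clean (score 0) is visible immediately, while a suspicious one stays private until enough other users vouch for it by bookmarking it too.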

I’ve used Connotea for a long time but not as often these days (I use del.icio.us much more and keep up with Connotea posts with RSS.) Spam must be part of the reason the site slows to a crawl on frequent occasions.

CiteULike is tempting but I prefer the del.icio.us model Connotea is built on. I’ll keep my fingers crossed they work things out.

@David,

At CiteULike, we make all our data (suitably anonymized) available for download and your own analysis. It’s a daily dump so you can get history too.

Excellent, great to know. Like many Highwire journals, we have direct links on our papers to add them to a CiteULike account. But from what I’ve heard from Highwire, these buttons are rarely used. Is this the general case across the board?

I’m assuming this is because so few read papers as the html version and instead download the pdf.

My guess is that most of our users are in the habit of using our own bookmarklet rather than an on-page button. Or they may use other techniques. However, if you contact us “off-line” (support09_AT_citeulike.org) I can extract the actual number of posts for your site. We would match by URL but perhaps there are a few stragglers with DOI only so let us know if there’s a DOI prefix/pattern/regex that’s specific to you. We also have an API for you to get real-time stats on individual DOIs or PMIDs

Is it really all down to spam? I think the spam problem at 2collab and connotea is an excuse for deeper issues.

I would argue (and have argued) that there are some fatal flaws in the concept here.

First, tagging papers is an incredibly tedious and time consuming process, and one which results in a very poor organizational system for the user. Full text search of the pdfs in one’s collection is a vastly superior and faster way to find what you want. Tagging seems necessary for the social parts of these reference managers, for helping other people find papers you think are interesting. And that’s problematic in that tagging things in your own collection does you no good, it’s for the benefit of others. You can reap those same benefits from others without doing the tagging work yourself. Given the crowded schedule of most scientists, it’s hard to justify spending hours and hours, if not days or weeks adding tags to references.

Other issues involve the size of the networks: none has yet reached the threshold of being a really useful discovery tool, and there’s still the question of whether this is a useful function that people want. As I said above, most scientists I talk to are looking for ways to filter out papers, not ways to add more papers to their piles.

Dave

In the last 28 days citeulike users copied 9,859 articles directly from each other’s libraries. That’s the end result of all the social discovery.

I agree that the tagging is a pain for some, but it’s not required. We have search too.

I think it’s horses-for-courses here. I’m an old-fashioned folders person and don’t like GMail for that reason. But we have users with many thousands of individual tags. We get lots of requests for more tagging features such as hierarchical tags. It seems some people really do like ’em.

Search and tags are not mutually exclusive.

We do also have full text search of PDFs but see very few hits on that for some reason (I don’t think that functionality of the site is structured very well but that doesn’t explain the small number of PDF searches, IMHO.)

Yeah, the deeper issue is that the promise of academic social bookmarking hasn’t been fulfilled by web-only first-gen systems and nowadays CiteULike, Connotea and 2Collab look a bit crap when compared to “best of both worlds” reference management systems like Zotero, Papers and Mendeley.

Yes you can share data between systems but really, why would *most* users bother unless it’s to do a one-off migration?

Making the jump into more full fat reference management is a significant investment and working out how to do it in a way that cares for existing users and offers something truly useful and compelling takes time, I imagine.

Erm Euan, I and our users completely disagree that Citeulike looks a bit crap compared to other tools. We are growing stronger than ever on every metric. For example, our posts via bookmarklet (our main internal metric) just had a record week. Citeulike is the only reference tool that has actually deployed a working collaborative filtering recommendation system. Our pageloads are fast. We continue to roll out new features. We deal with spam effectively. We actually believe that a 100% web based solution is a strong contender in this space and appeals to a significant cross section of scientists and academics.

Yes, sorry, ‘crap’ was too provocative an adjective. I didn’t mean to imply that the web based services don’t do what they set out to do well. I should have said ‘limited’.

> appeals to a significant cross section of scientists and academics

Without taking anything away from what CuL has accomplished so far – it’s a great system and innovative in lots of cool ways – it and Connotea have been going for almost five years. You’d have thought that if a truly significant number of academics were going to use social bookmarking regularly uptake would be better by now.

What do you think the ratio of active EndNote users to CiteULike (& Connotea, not singling out any one site) users is…? Do you think that ratio has changed in a noticeable way over the past four years?

What’s your current growth rate like compared to, say, Mendeley’s? Or Zotero’s?

IMHO the web based plays don’t give average users everything they expect from a reference management solution (techy users who are OK with a certain amount of faffing are fine).

Not difficult to find evidence to back this up – am guessing that your users explicitly ask for desktop integration quite frequently… Connotea users certainly did a couple of years back.

Don’t want to sound negative, I think this space is tremendously exciting and that social bookmarking will genuinely improve the way researchers work but IMHO web based services need to quit resting on their laurels and raise their game.

… which is why spam filters may be just one of many changes that 2Collab and Connotea have to make to continue to be useful.

” I didn’t mean to imply that the web based services don’t do what they set out to do well.”

I did. Except for citeulike of course 😉

I hope NPG’s strategy wasn’t to replace Endnote with connotea! We encourage our users to use whatever desktop app they choose. And I don’t think that social bookmarking and reference management are the same thing.

As for comparing citeulike to other new solutions, when you look at our citegeist posting data and compare it to others, citeulike holds up very well, no? (For links, see Mr. Crotty’s comment above.)

Out of all the many “social” initiatives for scientists we’ve seen in recent years, citeulike has a strong claim to being a leader e.g.:
http://friendfeed.com/danielmietchen/c6254b29/quick-comparison-of-alexa-web-traffic-to-social

I’m excited that our recommender system (http://blog.citeulike.org/?p=11) has an 18% acceptance rate (ratio of accepted to rejected articles), and I’m not aware of any other live collaborative filter system for research papers. It works so well because of the data citeulike has collected.

Now the real challenge that you don’t mention yet we all face is how is it all going to be paid for? I must thank our sponsor Springer at this point.

Notwithstanding my earlier comments about not trying to replace reference managers, they are, it seems to me, institutionally mandated and paid for and really “enterprise software” models. If those decisions and budgets were put in the hands of the end users you would almost certainly see a different picture emerge today (I’m not advocating this because I don’t believe it will happen).

Where citeulike is really strong vs. its current competition is that it has managed to do all this with a ruthlessly low cost base. I think that’s the right strategy in this space right now.

5 years on, as Neil kindly says here: http://friendfeed.com/mfenner/ed934b88/are-publisher-linking-networks-like-2collab (we don’t pay people to say nice things) “CiteULike is alive, kicking and going from strength to strength.” I don’t call that resting on our laurels.

I find myself in the extremely uncomfortable position of wanting to quote Margaret Thatcher (“Turn if you want to” etc.) but I’ll settle for Hunter S. Thompson instead: “Res ipsa loquitur”

Fair enough. There’s certainly something to be said for sticking to your guns! Good luck over the next five years. 🙂

Just catching up here, responding to a variety of comments above. Thanks to all from CiteULike who have posted thoughts and data. I’ve requested access to data downloads, and will have to see what I can make of it. I’m still trying to wrap my head around what some of the numbers posted here mean.

Although CiteULike doesn’t offer a summary of 2009, it’s interesting to me that the top paper in Mendeley is also the top paper in CiteULike’s last 28 day popularity listing (Uri Alon’s “How to choose a good scientific problem”). Given the differences in time scales, it’s hard to judge, though there are 3 papers on Mendeley’s top 10 that turn up in CiteULike’s all time top list, and 2 of them are on the last 28 days list. I wonder how much overlap there is between users of both sites. How many of Mendeley’s 100,000 users are active CiteULike users, listing the same papers on each service?

It’s also worth noting how many of CiteULike’s top papers are about doing science (how to pick a research problem, how to write a paper) rather than being actual data papers showing experimental results. Some of this is likely due to more general papers like this being popular across a wide spread of different types of scientists, but I think it’s also indicative that services like this are heavily used by the same Web 2.0 proponent type researchers I discussed here, people who are particularly interested in the way science is done, and who particularly like talking about it online.

It’s also telling how many of the top papers are in the same sorts of fields as seen in Mendeley, computational biology, systems biology, informatics and the like. Again, it shows that different fields have different cultures and are more or less comfortable with sharing/gathering information in this manner.

Kevin Emamy above lists the number of visits per month at around 300,000, and has around 10,000 articles copied back and forth between different users per month. The 10K number is hard to parse without further details, whether there are 10,000 users each copying one article from someone else, or 10 power users grabbing 1,000 articles each. But if this is a measure of how the site is being used for discovery, then at best, if each article was copied on a separate visit, then it accounts for what, 3% of the site’s use? How much of the site’s traffic is users uploading to their own account or looking up a paper they’ve already listed for themselves? This would give more of an indication whether the site is really being used for social bookmarking rather than for reference management.
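The back-of-envelope estimate above can be written out as follows, using only the two figures quoted in this thread (roughly 10,000 copied articles and roughly 300,000 multi-page visitors per month). The function name and the `copies_per_visit` parameter are hypothetical, added to show how the estimate shrinks if copies cluster among fewer visits.

```python
def discovery_share(copies: int, visits: int, copies_per_visit: float = 1.0) -> float:
    """Upper-bound share of visits that ended in copying an article.

    With copies_per_visit = 1 every copy is assumed to happen on its own
    visit, which is the most generous case for social discovery.
    """
    if visits <= 0 or copies_per_visit <= 0:
        raise ValueError("visits and copies_per_visit must be positive")
    visits_with_a_copy = copies / copies_per_visit
    return visits_with_a_copy / visits


# Figures quoted in the comments above (approximate monthly numbers).
best_case = discovery_share(copies=10_000, visits=300_000)
clustered = discovery_share(copies=10_000, visits=300_000, copies_per_visit=5)
print(f"best case: {best_case:.1%}, if 5 copies per visit: {clustered:.1%}")
```

The best case works out to about 3.3% of visits, and any clustering of copies among heavier users pushes the share lower.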

I think Euan does raise some good questions as well, particularly on the growth rate of these various services. We’ve been told that it’s early days for these types of technologies and that scientists are a conservative lot. But five years does seem like a pretty long time, particularly considering something like Myspace launching in 2003 and experiencing its rise and fall by 2008 when Facebook passed it for membership numbers. Is this all part of a slow, steady climb, is this something that people are not aware of yet (even after all this time) or is this just a niche activity that simply has limited appeal?

David

Thanks for your comments (and Euan), I hope I wasn’t overenthusiastic in my attempt at a “spirited” defense.

To address some of your points David, there is no doubt that the majority usage and benefit of the site is for personal online reference ‘Gathering’, not social or otherwise discovery. The bookmarklet makes it very quick and simple to save references online and that is why people use it.

In fact I would go so far as to say that if all it was were a bookmarklet, it would still be used by many people.

I think the top papers, particularly in the all time list undoubtedly show a bias amongst our users for the subject. I don’t find this surprising.

“How to choose a good scientific problem” has obvious broad appeal, and if you read the comments it sounds like a great read.

What I also suspect is happening is a certain popularity breeding popularity effect that having a list like this engenders. Who knows.

However as you go down the list (the monthly and weekly lists are much more fluid) you see the science papers emerging. I agree, the big subjects seem to be the ones you mention.

Just to be clearer about the user numbers, that 300k is mainly non-registered users who come to the site from the web, search, links on publisher websites etc. (and look at more than one page). The number of users who post stuff is much lower, as I said, in the 10s of thousands.

The 10k articles copied number is a count of how many times the copy?article URL is loaded (this URL is loaded when someone clicks the “copy” link on an article page) and is, I think, a good indication of a user copying an article into their library(ies). I have reasonable arguments for thinking this is an undercount:

a) It doesn’t count the non-registered citeulike users who browse, discover stuff and go elsewhere,

b) The normal behaviour for our posting users may well be to follow the link to the publishers website, view/download the article and then post from there. That’s what I would do.

The last time I gave out that number was here: http://network.nature.com/people/mfenner/blog/2009/01/30/interview-with-kevin-emamy and I note that it was 6189, roughly a year ago. That’s good growth for the social stuff and is indicative of the real usage growth of the site as a whole.

(I just had a look and it doesn’t look like just a few users copying lots of articles, by the way).

At this point I’ll trot out some anecdotal evidence of people sharing stuff elsewhere:

http://twitter.com/twarko/statuses/7369109862

http://friendfeed.com/kochlab/86405cb9/some-good-articles-in-this-person-citeulike

http://friendfeed.com/michaelnielsen/b13a98fb/citeulike-group-on-statistical-machine

http://www.slideshare.net/dullhunk/defrosting-the-digital-library-a-survey-of-bibliographic-tools-for-the-next-generation-web/53

As well as the data, by the way, one advantage of our public by default model is that you can browse the site yourself and get a feel for what is going on.

I repeat my earlier assertion that the social bookmarking, everything public by default, 100% web based model has many advantages and has a firmly established place.

Where I take some issue with the publisher bookmarking systems is I think they are poorly executed and/or left unloved over time. You can look at the PLOS article level metrics data; a reasonable side by side comparison of the usage of citeulike vs. connotea and see that citeulike (at least for a PLOS audience) is about 5 times more popular. That’s a pretty big difference between two systems that are fundamentally doing the same thing.

As you have said elsewhere, it takes a certain scale for the social stuff to work, but I think that the success of the recommendation algorithms indicates strongly that citeulike has now got to a size where the dataset is genuinely useful in this regard.

Is it a niche activity? Science is a pretty niche activity. Isn’t the number of people who read and write peer-reviewed literature around 6m worldwide?

It’s noteworthy, to me, reading some of your other blog posts, you are clearly well informed about and interested in this space, yet up to now you had fairly little exposure to citeulike. That’s my failing, but it does show that we have plenty of room to grow.

Kevin, don’t worry, your enthusiasm is appreciated. It’s why a project like CiteULike continues to grow and move forward rather than stagnating or disappearing altogether. You’re right that I haven’t spent a lot of time with CiteULike, other than some initial explorations a few years ago. I have just received access to your data sets and will dig further as time permits (sadly having a “real” job takes precedence). A couple of quick notes:

Good to know that your usage reflects what I hear from scientists, that reference management is still a higher priority for most than discovery. There’s no reason a site/software can’t serve both purposes well.

Great point about the snowball effect of a popularity breeding popularity. I touched on similar issues here.

I’d be willing to bet that any numbers from PLoS are skewed to an audience that specifically seeks out open access/open source projects like CiteULike and preferentially uses them over commercial/closed projects.

—Isn’t the number of people who read and write peer-reviewed literature around 6m worldwide?—

The NSF has numbers for the US here, their latest has something like 5.5 million working scientists in the US, with an additional 16.6 million working in “related fields”.
