If a free website claimed that you could double citations to your papers simply by uploading them to their file sharing network, would you believe it?

This claim, and the paper supporting it, is displayed prominently on the Academia.edu website, a platform for scholars to share research papers. It was also sent out to all 21 million registered members.

The paper, “Open Access Meets Discoverability: Citations to Articles Posted to Academia.edu,” was authored by six employees of Academia.edu and two members of Polynumeral, a data consultancy company.

In recent years, other large data-driven companies, like Facebook and Google, have conducted and reported on their own research, and it would be unfair to discount this paper simply because of its self-aggrandizing results. Scholars require serious scientific studies to back bold claims. To millions of potential users of Academia.edu, a scientific paper is much more effective than a glitzy vendor stall at a national conference or a glossy brochure, and the investors behind this company undoubtedly know that a return on investment requires more than false promises. It requires hard data.

There are many papers claiming a citation advantage to open access (OA) articles and most are not worth reading. This paper is an exception. The authors clearly understand the limitations of observational research and how correlations are often confused with causation. They analyze their data using three similar models to verify their results. They use covariates in their regression model and look for other explanations. They take a fair and unbiased look at the literature and don’t purposefully obscure or ignore research that contradicts their conclusions. Strangely, these are characteristics absent in most OA-advantage papers.

Compared to a control group of papers, selected at random from the same journals and same years as the Academia.edu group, their analysis finds a positive association between free access and article citations that grows over time. This association should not be surprising, given a decade and a half of similarly reported results. What IS surprising about their findings is that having one’s paper freely available from other freely accessible locations boosted a paper’s citations by just 3%. Or expressed as a comparison:

We find that a typical article posted to Academia.edu has 75% more citations than an article that is available elsewhere online through a venue: a personal homepage, departmental homepage, journal site, or any other online hosting venue.

What seems to be omitted in the above statement is that other online venues include much more than personal and departmental home pages or journal sites. They include massive open literature repositories like PubMed Central, the arXiv, bioRxiv, SSRN, not to mention their chief competitors, ResearchGate and Mendeley. That a relatively young upstart is besting them all is unexpected, except from a marketing perspective.

In an online survey of social network use by Nature in 2014, only 29% of scientists and engineers responded that they were even aware of Academia.edu and just 5% visited the site regularly, compared to 88% and 29% for ResearchGate, respectively. For social scientists, arts and humanities respondents, the results were somewhat closer. For respondents who use these two services, their main reason was simply to maintain a profile, followed by posting content.

The authors of the paper claim that it isn’t open access that is driving the results, but discoverability. Academia.edu users are actively notified when new papers are posted in their field or by authors they follow. But many other indexes, archives, social media tools, and journal websites have comparable notification services as well, so I find their explanation unsatisfactory, especially when all of these other sources of content, taken together, have just a tiny effect against the mighty power of Academia.edu to boost citations.

This paper suffers from a data problem. And that problem is their control group.

The researchers compared the performance of papers uploaded to Academia.edu to a control group, a random sample of papers selected from the same journals. If the randomization was successful, the control group should be similar in all respects to the Academia.edu group. In this way, differences observed in the data over time are likely attributable to Academia.edu and not some other cause.

If you look at their data (download the file: papers.csv.gz), you’ll notice something odd in the title of the first article: it is an erratum and it belongs to the control group. Indeed, if you search the title list, you’ll find editorials, corrections, retraction notices, letters to the editor (and their responses), commentaries, book reviews, conference program abstracts, news, and even obituaries in the control group. As a general rule, these kinds of papers receive few (if any) citations. So, it’s no surprise that papers uploaded to Academia.edu outperformed a sample that included a good proportion of non-research material.
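A spot check of this kind can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the procedure anyone in this discussion actually used: the keyword list, the function names `looks_non_research` and `flagged_share`, and the sample titles are all hypothetical, and the layout of papers.csv.gz is not assumed here, so the example works on plain (group, title) pairs.

```python
import re
from collections import Counter

# Hypothetical keyword list for spotting non-research items by title.
# It is an illustration only, not the authors' classification scheme.
NON_RESEARCH = re.compile(
    r"\b(errat(?:um|a)|corrigend(?:um|a)|correction|retract(?:ion|ed)|"
    r"editorial|letter to the editor|book review|obituary|in memoriam|"
    r"commentary|reply to)\b",
    re.IGNORECASE,
)


def looks_non_research(title: str) -> bool:
    """Return True if the title contains one of the keywords above."""
    return NON_RESEARCH.search(title) is not None


def flagged_share(rows):
    """Given (group, title) pairs, return {group: share of flagged titles}."""
    totals, flagged = Counter(), Counter()
    for group, title in rows:
        totals[group] += 1
        if looks_non_research(title):
            flagged[group] += 1
    return {group: flagged[group] / totals[group] for group in totals}


# Invented titles for illustration; none are taken from papers.csv.gz.
sample = [
    ("control", "Erratum: Phase transitions in thin films"),
    ("control", "Book review: A history of printing"),
    ("control", "Gene expression in maize roots"),
    ("control", "Obituary: J. Smith, 1930-2012"),
    ("treatment", "Gene flow in island bird populations"),
    ("treatment", "A model of urban commuting"),
]

print(flagged_share(sample))
```

A keyword match like this will over-flag (a research article on “error correction” would be caught) and under-flag (many editorials have ordinary-looking titles), so it is a way to decide where to look by hand, not a classification. A large gap in the flagged share between the two groups would be a signal that the groups differ in composition.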

There is other evidence in the paper that the treatment and control groups are not similar. The researchers reported that 45% of papers uploaded to Academia.edu were also found on other free access sites compared to 25% for the control group (Table 5). We don’t know whether this difference is the result of compositional differences between the treatment and control group, or whether important papers are more likely to be posted freely or published as open access. Academia.edu is not a publisher but a metrics and analytics company, whose value proposition is to generate valid statistics around the impact of scientific research. So, it’s difficult for me to comprehend how this paper got so far without anyone even spot-checking the article list or attempting to first check that the treatment and control group were similar before proceeding with the analysis. The huge citation boost to the Academia.edu group may have had nothing to do with open access or discovery, and may be explainable entirely by bad data.

Addendum (17 August 2015): The authors of the paper have addressed my critique of their control group by classifying each of the papers in their dataset and by limiting their reanalysis to articles reporting original research. While the main result of their reanalysis still holds, the effect size dropped from 83% to 73%. Papers posted to Academia.edu are still more likely to be found on other free access sites, however. A response to my post with a description of their classification and reanalysis can be found here. The authors’ research paper was replaced with the revised copy and the homepage for Academia.edu now claims “Boost Your Citations By 73%”

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist.


36 Thoughts on "Citation Boost or Bad Data? Academia.edu Research Under Scrutiny"

A minor remark: I think it is difficult to conclude that “there are many papers claiming a citation advantage to open access (OA) articles and most are not worth reading,” as Phil says. I would have been interested to find out more about these publications, not only for the sake of scientific transparency but also to provide students in my seminar with some concrete data, but no reference is given in this post. Moreover, I doubt in general that OA has no effect on any resource which is made openly available to the public, in comparison with content which is kept closed for a certain percentage of users. Just think about this situation: if you are a researcher who wants to refer to certain aspects of current research, how much effort would you spend to get access to a paper which is not accessible to you at your local institution or library? Or would you simply cite those papers which are immediately accessible (both licensed and OA journal content), given that you should have read them first before citing? Having talked to many researchers in my academic and publishing career, I could clearly reply to that question with the latter answer. And as a result, OA articles must be read and therefore cited more often than closed-access content. This is simple statistics from a logical point of view only. To define the ‘how much’ quantitatively, however, is a much more difficult issue, as Phil knows, but this doesn’t destroy that conclusion.
If I am wrong I would love to learn more about the reasons why, and also to get some references for the statement in the post above. Thanks, Phil!

SPARC Europe maintains a list of studies on the Open Access Citation Advantage, with comments on how reliable each one is. The paper discussed here isn’t in there at the moment. This is the best source of studies on this topic that I know of.

There is a public Mendeley group called “Open Access Compared with Subscription Articles” with a collection of relevant references.

Thanks, but can’t find that group on Mendeley, could you provide us with a short link, pls?

Kira – Thanks! Just realized that most articles in that Mendeley list are toll-access. Since I cannot use closed-access content in a seminar with my students nowadays, I will set up my own list, with open access articles only. Nevertheless, a good inspiration!

AG: What has led you to the conclusion that OA articles must be read and therefore cited more often than closed content articles? As a researcher I would spend the effort it takes to cite the best, not the cheapest. Thus, journals with high IF and known authorship are read and cited more often than others, be they closed or open. As they say in computer science: garbage in, garbage out!

Interesting thought, because you are now starting a different discussion based on attributes other than those mentioned in my approach. We both know that we are on thin ice when we introduce the IF of a journal to evaluate the relevance of an article. A high IF and high relevance are correlated, statistically speaking, but one is not a direct consequence of the other. There are statistics (e.g. from Nature) which demonstrate that a few articles per year are cited several hundred times, while the majority of papers are cited (much) less than the IF would lead you to assume…
So let us come back to the initial thought, which is a straightforward consequence of the fact that openly available resources are used more often than those which are closed.

Alexander, I do not think we know your last sentence to be a fact. My understanding is that the majority of articles are never read at all, hence not used. Moreover, most researchers have access to many subscription journals. For that matter one can cite an article based simply on its abstract, or a prior citation, which does not require access. So this is really an empirical question, not something to be deduced from first principles.

My research indicates that most citations occur in the early part of an article, when the background for the research is being explained. I am inclined to side with Harvey, in that authors are likely to try to do a good job here, rather than simply citing the most accessible articles. After all, the author is demonstrating their knowledge of the field. My anecdotal understanding is that peer reviewers take this seriously.

When analyzing the literature on this subject, it’s important to look at the quality of the studies, not the quantity. Simple number counting is not an effective approach here, as has been discussed repeatedly.

It’s important to differentiate between observational studies, which can show correlation, and experimental studies, which can show causation. In all cases, careful attention must be paid to performing accurate controls, as this seems to be rare, particularly for selection bias, in this particular subject area.

David is being polite – unless the study has randomly assigned papers to either treatment (OA vs non-OA), it’s pretty much worthless. The dozens of studies showing a citation advantage to papers where authors have chosen to make their paper OA all suffer from the same bias, in that authors may choose to make the investment in OA for their better work. The fact that Phil’s RCT finds no effect of OA suggests that this bias accounts for almost all of the OA citation effect.

Sounds good, Phil. Including a lot of items that are not likely to be cited or posted, in the control group, may well explain the big number finding. Sleuth!

However, Phil, you do need to determine that the postings on Academia.edu do not contain a similar mix of seldom cited items. I do not see where you say that. Interestingly, this is somewhat like the “citable item” issue in the impact factor calculation.

My take on this is that uploading articles to Academia.edu, ResearchGate, etc., is in some cases a problem: when authors take articles and upload them to these websites, it takes away from the download counts of the journal itself. On the one hand, open-access publishing is defined so that articles can be distributed freely; on the other hand, the download counts of the original publication are compromised. Definitions of open access (in this case meaning no author-pay or subscription fees) ought to cover situations where access is open but the article itself should not be allowed to be posted on other websites.

This is particularly important if we are going to rely on things like altmetrics for any serious use in researcher assessment. If private, for-profit sites like this aren’t going to share their data freely, then any traffic generated from such a site might harm a researcher’s score.

Um.. isn’t that imposing restrictions on use, and therefore entirely at odds with “open access”?!

Thanks! As said, the text is available from the journal to anyone with an internet connection. What open access means here is that scholarship and knowledge are openly accessible, i.e., free to read. I do not understand why current definitions and practices of open access do not cover a situation where open access is defined as no-cost access to the journal where the text is published, but not “profiting” from its content when it is taken from the journal and distributed elsewhere, resulting in lost download counts for articles published in that open-access journal.

As this discussion shows, the definition of “open access” continues to be quite variable. There are those who insist the term can only be used for very specific conditions (freely available, no restrictions on reuse) but in real world usage, the phrase is used to describe a very wide variety of things.

Just to come back to the initial question from a barely scientific point of view: what’s wrong for example with this Nature study which has been published recently and which says that “Open Access articles received significantly more usage than those published under a subscription mode”? This was exactly my point. The analysis is based on data provided by NPG on the numbers of articles published each year as they were assigned to four subject areas, and of citations to those articles as recorded in Web of Knowledge:
I am keen to follow your feedback in that discourse, thanks!

Most studies do indeed show increased usage for articles that are made freely available. This does not, however, directly translate into an increase in citations.

I agree, but I assume that if I had 10 papers which reflect some aspects to which I want to refer in my paper, I will cite those which I have read myself. (I know, however, that this assumption may not always be valid today…). And I will have read only the papers which were openly available to me. By definition, an OA paper is accessible to me without any restriction. That was simply my conclusion following a logical approach.
Another aspect is that citations are also generated for poor papers or wrong results. Therefore I doubt in general that a citation is a useful metric nowadays, because a high citation number does not necessarily reflect high scientific relevance. We should keep that in mind, too.

Citation lists are not directly quantitative based on access. Most authors are going to cite the papers most relevant to their studies, and many journals limit the lengths of citation lists. So even if you find 100 more papers that are in some way related to your work, it’s unlikely that your citation list in your next paper is going to grow by 100 citations. Many studies have shown that most researchers have access to most (if not all) of the journals relevant to their field. The notion is that OA does not provide a whole lot more access to active researchers within a given field, but expands access to a wider range of readers outside of the formal research community. This is a huge benefit of OA, getting papers into the hands of people like clinicians, policy makers, etc. But this is not reflected in citation because these people aren’t publishing papers or doing research. Any increase within the research community, which already has very good access to most journals, has so far not been enough to be measurable.

Obviously, I appear to be an exception: I have no immediate access to about 50% of publications in my current field of research and at my current institution. However, if I recall accurately, I experienced a similar situation when working as a physicist a decade ago, too. We are not discussing the issue of making research publicly available for a broader audience or non-academic professionals but for peers.

Indeed, the concept that Harvey and I began above might be stated thus: Most citation involves a high degree of relevance that makes access largely irrelevant. My point is that a lot of citations occur in the context of an historical scientific narrative, such that certain papers simply must be cited.

Thanks, Phil, this is an important thing to clarify. We understand that people are going to have questions about any study we do that seems to validate our own product. That’s why we published our data and want to be as rigorous as possible. While a quick review suggests that nearly all of the sample consists of original research papers, we’re currently running an analysis to categorize all 44,689 papers and see if that has any effect on the models. Stay tuned for updates.

Thanks, Richard. As I wrote in my post, I’m not concerned with who did the study, but with the quality of the data, the methods, and the validity of the results. I’m convinced that the analysis is rigorous; I’m less convinced that the data are unbiased. Your dataset only includes the title name and ISSN, but no DOI, so I’m interested in how you are going to look up and categorize all 44,689 papers?
