PubMed Central (PMC) costs US taxpayers about $4.45 million per year to run, according to documents recently obtained through an ongoing Freedom of Information Act (FOIA) request.
Surprisingly, most of the money is spent converting author manuscripts into online publications.
Over the past decade, speculation was the best anyone could manage, owing to a consistent lack of responses to budget inquiries sent to PMC staff and leadership. These new FOIA-obtained communications represent the first time we’ve seen actual figures for PMC’s expenditures. Judging from the emails and spreadsheets recently obtained, PMC may have been preparing to reveal its expenditure level, but it also appears to have considered low-balling the figure by roughly 7-10%.
Not surprisingly, the bulk of the PMC budget is devoted to outside contractors — this has long been believed to be the case. Of the $4.45 million budget, it appears PMC spends between $3.5 million and $4 million on outside contractors — these figures are a little hard to nail down.
As stated earlier, most of the money spent by PMC ($2.7 million of the entire $4.45 million budget) is spent converting author manuscripts into XML and providing QA for these. Put another way, the deposit of author manuscripts as a source of open access (OA) content costs US taxpayers an additional $2.7 million per year.
It is clear from the enormous effort and expense PMC puts into conversion and editing that author-deposited manuscripts are not adequate on their own.
These author manuscripts (53,818 deposited in 2012, based on parameter searching on the PMC site) accounted for less than 20% of the materials posted to PMC that year (272,409 articles found via search), yet consumed 60% of the expenses. And with a recent push for more compliance, this amount seems poised to double.
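Those proportions can be checked directly from the figures above (the search counts and the $2.7 million/$4.45 million split); a quick sketch:

```python
# Author manuscripts as a share of everything posted to PMC in 2012
# (counts from searches on the PMC site, as described above).
author_mss_2012 = 53_818
total_posted_2012 = 272_409
print(f"{author_mss_2012 / total_posted_2012:.1%}")  # 19.8% -- "less than 20%"

# Manuscript tagging and QA as a share of the overall budget.
tagging_qa_cost = 2_700_000
total_budget = 4_450_000
print(f"{tagging_qa_cost / total_budget:.1%}")       # 60.7% -- roughly 60%
```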
In an email exchange dated February 16, 2012, between Ed Sequeira and Kent Smith, these expenses were being pored over, potentially as part of preparations to finally announce PMC’s expenses. In a document labeled “FY 2012 ANNUAL COSTS FOR PUBLIC ACCESS,” the expenses were broken out to some extent. Instead of trying to reproduce the table, I’ve scanned the sheet in, and you can view it here. Essentially, it shows that personnel costs (some Federal employees, but mostly contractors) consume $1.2 million of the budget. Manuscript tagging and QA consumes $2.25 million, with about $500,000 in additional overhead. In total, manuscript tagging and QA is given a bottom-line figure of just over $2.7 million of the $4.45 million budget.
Smith is Kent A. Smith, a former Deputy Director of the NLM, who started KAS Enterprises, LLC, in 2003 and then departed NLM and NIH after 35 years to run KAS Enterprises in 2004. “KAS” apparently comes from his initials. According to Sequeira, Smith works as a part-time consultant to the National Center for Biotechnology Information (NCBI).
In the email from February 16, 2012, Smith attached the document I’ve scanned to the following message:
Ed–Take a look at this and then give me a ring. The manuscript tagging number is what we need to talk about as there is a difference between using the 40-60 formula and the actual contract costs. Under the % basis I am using here $47 per article. John and I looked at this yesterday and based the number on a sampling of a few months billings. It consists on the average of about $34-35 per tagged article plus $10-11 for Q/A plus administrative fees of $2-3, where applicable.
Then there is the issue you brought up before about some articles that are in limbo (author hasn’t responded etc.) and then there are those before 2009 and I don’t know to what extent we have included those costs appropriately.
The “John” above is John Mullican, a program analyst at NCBI.
Sequeira responded a couple of hours later, apparently after giving the calculator a little exercise (NIHMS is the NIH Manuscript Submission system; NIHPA is the NIH Public Access policy):
Kent
If you take the actual PTFS/DCL contract cost (2,711,852) and divide it by your per article rate (47), you get an estimated 57,700 manuscripts handled by the contractors.
Number I got earlier from the NIHMS system: Annual average for 2010/2011 of actual articles [fully processed (i.e., ready for PMC) + partially processed (i.e., waiting for author action)] = 56,900.
The two numbers are roughly equal, which is good. At least we’re consistent. Split the difference and you get 57,300.
So actually processed (57,300) is 9,300 (20%) more than what we estimate will go into PMC based on publication date (48,000).
The 20% extra can be attributed to backfilling for previous years and to articles that are in limbo.
Some possibilities for handling the 20% overage:
1. Include it in the baseline, in which case the total NIHPA cost is $4.45M.
2. List it as a footnote, similar to what you’ve done, with the explanation from above of what causes the overage. In this case we would report to the world a total NIHPA cost of $4M based on the number of NIH articles collected by publication year, but still be able to reconcile these figures with our contract expenses.
3. Since your per article cost of $47 is towards the lower end of your estimate, we could also use an article cost of $50, in combination with option #2. Then we would have a total reported cost of $4.15M and a smaller overage figure.
I have a meeting to go to in a few minutes. I’ll call you when I’m done, if you’re around.
Ed
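Sequeira’s back-of-the-envelope numbers check out; here is the same arithmetic as a minimal sketch, using only the figures from the email above:

```python
contract_cost = 2_711_852  # actual PTFS/DCL contract cost ($)
per_article = 47           # Smith's blended per-article rate ($)

# Estimated manuscripts handled by the contractors.
from_contract = contract_cost / per_article
print(round(from_contract))  # 57699, i.e. roughly 57,700

# Compare with the NIHMS annual average and split the difference.
from_nihms = 56_900
midpoint = (57_700 + from_nihms) // 2
print(midpoint)              # 57300

# Overage versus articles expected by publication date.
expected = 48_000
overage = midpoint - expected
print(overage, f"{overage / expected:.0%}")  # 9300 19% (rounded to 20% in the email)
```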
I confirmed via email that PTFS/DCL only deal with author manuscripts. As Sequeira wrote in his email reply to some questions I asked:
NCBI’s contracts with PTFS and DCL are for XML tagging and QA of the author manuscripts deposited in the NIHMS (NIH manuscript submission) system under the NIH public access policy. They’re not involved in QA of the XML deposited in PMC by journals with participation agreements.
It wasn’t supposed to be this way, as indicated by a budget spreadsheet from 2009. In that spreadsheet, the cost of article tagging and QA in 2009 was pegged at between $1.5 million and $2.6 million across low, middle, and high budget scenarios (actual spending seems to have tended toward the high scenario). The plan for 2010-2013 was for these costs to fall from $2.3 million in 2010 to $997,500 in 2013. As shown above, however, these cost-control plans did not come to fruition.
In fact, PMC may be about to find its expenses exploding, if a recent Nature News article is correct. The NIH’s stricter enforcement of author deposit rules has apparently increased the number of author manuscripts on deposit from what Richard Van Noorden estimates to be 5,100 per month (these emails show that it’s more like 4,800 per month) to about 10,000 per month. At $47 per article for tagging and QA, that doubles the largest part of PMC’s budget, and will cause it to balloon from $2.7 million to $5.6 million. PTFS and DCL will be thrilled, but PMC’s budget will then be nearly all devoted to managing these manuscripts.
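That projection follows directly from the per-article rate in the emails; a minimal sketch, using the monthly deposit figures cited above (4,800 per month implied by the emails, roughly 10,000 per month under stricter enforcement):

```python
per_article = 47        # tagging + QA rate per manuscript ($)
monthly_before = 4_800  # deposits per month implied by the emails
monthly_after = 10_000  # deposits per month under stricter enforcement

print(monthly_before * 12 * per_article)  # 2707200, i.e. ~$2.7M per year
print(monthly_after * 12 * per_article)   # 5640000, i.e. ~$5.6M per year
```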
This makes it clear that just posting an author’s manuscript in an open repository isn’t sufficient. Turning it into a useful resource costs money. In PMC’s case, it’s $47-50 per manuscript. We’ll have to see if the similar approach in the UK creates a similar expense problem. Will anyone tell us?
The rationale for publishing peer-reviewed author manuscripts has always been a little elusive. Now we know that doing so is also expensive.
Discussion
30 Thoughts on "The Price of Posting — PubMed Central Spends Most of Its Budget Handling Author Manuscripts"
Can we have some other numbers for comparison?
For example, it cost ArXiv $6 per article in 2010 (http://scienceblogs.com/principles/2012/01/30/the-arxiv-is-not-a-journal/ which cites http://arxiv.org/show_monthly_submissions and http://arxiv.org/help/support/2010_budget)
How much does it cost a typical commercial publisher to publish a paper (before and/or after peer review)?
And finally, how much do university libraries in the US spend on subscriptions to the journals covered by PMC?
While interesting, these numbers have no direct relation to the issue of building a cost effective Federal OA system, which is now in progress. The PMC numbers are very big. No other funding agency has this kind of money to put into OA. Budgets are being cut.
I attempted to get some comparison numbers, but comparisons are hard to come by because of competitive reluctance to release general numbers and also because some contracts are bid on a per-page basis, not a per-paper basis.
As to “how much does it cost a typical commercial publisher to publish a paper” that has been a complex question plumbed and debated for years, and is highly variable. The range seems to be $500-$35,000, with XML and QA being only a drop in the bucket of any part of this range. (Hindawi and BMC are commercial publishers, for example, with profit margins that are higher than or as high as Elsevier’s, so remember that “commercial publishers” now encompasses OA publishers.)
Subscription costs are a red herring. Author manuscripts deposited in PMC are redundant with published papers, both actually and temporally. This is why the justification for including them is so tenuous. Combine that with the relatively large expenditure in an age of government austerity, and some serious questions need to be asked.
The figure given here is for conversion of documents to XML. From what I can tell, this is a small part of a larger typesetting process for most publishers so it may be difficult to separate out and get a price value for it. I’d be willing to bet though, that someone like Elsevier, who does this on a much larger scale, is paying less per paper than PMC.
But there’s not really any relevant correlation with the other numbers asked for here. PMC is equally dependent on the journals doing the other things they do in order to get the papers to a point where PMC can convert them. So that can’t be subtracted out.
And the papers that are NIH funded make up only a subset of the articles published in the subscription journals, so again, not a meaningful comparison.
One number that might be interesting though–the Outsell 2012 STM report says that there were approximately 1.9 million STM papers published in 2011 (one assumes this number has increased in 2012 and 2013). But for 2011, if the entirety of the STM literature was deposited in PMC by authors, then the XML conversion costs alone would be $89,300,000.
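That extrapolation is a single multiplication, assuming the $47 per-article rate from the emails:

```python
stm_papers_2011 = 1_900_000  # Outsell 2012 STM report estimate
per_article = 47             # PMC's tagging + QA rate ($)
print(stm_papers_2011 * per_article)  # 89300000, i.e. $89.3M
```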
Some more recent ArXiv numbers may be useful here: 2012 saw about 84,603 submissions (http://arxiv.org/help/stats/2012_by_area/index) and the 2012 budget was for expenses of $772,499 (https://confluence.cornell.edu/download/attachments/127116484/arXiv2012budget.pdf) so that’s $9.13 per submission. It would be interesting to track this over a few years to see how things scale.
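The arXiv per-submission figure works out as follows:

```python
arxiv_budget_2012 = 772_499      # budgeted expenses ($)
arxiv_submissions_2012 = 84_603  # submissions in 2012
print(round(arxiv_budget_2012 / arxiv_submissions_2012, 2))  # 9.13 per submission
```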
These big numbers make the CHORUS approach to federal OA look very attractive compared to the PMC model. PMC is mostly replicating what the publishers already do.
The more relevant way to look at the costs of PMC is the cost per manuscript stored. Focusing just on research funded by the NIH, the cost of preparing and archiving each manuscript in PMC is minuscule compared to the cost of conducting the research the manuscript describes. John Willinsky provides a rough estimate of the NIH research cost behind each published manuscript: $60,000. I have heard estimates in the range of $50 per article for preparing and hosting a manuscript on PMC. If those numbers are accurate, archiving in PMC costs less than one tenth of one percent of the cost of conducting the research presented in the archived manuscript.
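That ratio is easy to check from the two figures cited (Willinsky’s $60,000 research estimate and the ~$50 archiving estimate):

```python
research_cost = 60_000  # Willinsky's rough NIH research cost per published manuscript ($)
archive_cost = 50       # rough cost to prepare and host a manuscript in PMC ($)
print(f"{archive_cost / research_cost:.3%}")  # 0.083%, under one tenth of one percent
```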
My concern with CHORUS is an inherent conflict of interest. Subscription publishers sell content. CHORUS will be providing a somewhat less refined (for lack of a better word) version of the same content for free. That does not exactly provide publishers with a strong incentive to do a good job. I am not saying they will sabotage the system, but what incentive do publishers have to do a good job of making sure CHORUS provides efficient, easy access to the archived manuscripts when CHORUS competes directly against the product they are selling?
The $4.45 million per year to run PMC in the context of the ~ $30 billion per year overall NIH budget seems like one hell of a great deal to me.
With any expenditure like this, there’s always the question of whether there are other ways to do it better/cheaper. Regardless, more transparency from government agencies is always welcome.
I do agree with you that the CHORUS proposal needs refinement, particularly in areas like compliance and assurance of quality. That’s an important reason why whatever the final policy will be needs to be done through negotiation by the different stakeholders involved, to make sure that each gets what it needs. I don’t think anyone sees CHORUS as cut and dried and final, more as a first step.
And for what it’s worth, many publishers (including OUP) deposit the final, published version of papers into PMC, rather than the author’s accepted manuscript. This saves PMC costs and offers the reader the full version of record. I assume at least some publishers would continue this practice in CHORUS.
Sorry David, but I fail to see the relevance of your arguments to the issue of designing the most cost-effective Federal OA system. NIH is gold plated. The NLM has a budget of $350 million a year while the other funding agencies have as little as almost no budget for dissemination. CHORUS provides tremendous efficiency.
Nor do I see any conflict of interest. The publishers are providing access in order to keep users. That is their incentive. I am sure they would rather just sell subscriptions but the government is threatening to bypass their content so this is a fair compromise.
David, I believe in your post you were referring to the PMC model, and I was responding to that. My point was that the cost of archiving a manuscript in PMC, compared to the cost of conducting the biomedical research behind it, is what an old professor of mine used to call decimal dust. The ratio of research cost to archiving cost might be different in other fields, but I doubt by orders of magnitude. Even if the cost of archiving were to come out of research funding, given the ratio, it would be relatively trivial, in my view at least, compared to the benefits.
PMC is an established model that should in most respects scale to other fields. Obviously there would be some differences due to content but those types of issues, along with a host of others would need to be addressed in CHORUS as well.
True, the incentive for publishers supporting CHORUS is as you state. The trouble is the incentive would largely go away as soon as CHORUS was institutionalized. Once in place there would be little incentive to make it work well. I don’t mean it would stop working, just no incentive for example to provide lots of bandwidth.
As David C. pointed out, there are a lot of publishers who have gone the extra mile and made their published copies of articles available through PMC. I expect there would also be lots of publishers who would go the extra mile to make CHORUS work well for their manuscripts. I am far less confident all publishers would do so when there would be very little incentive.
You say “PMC is an established model that should in most respects scale to other fields.” Yet, it’s worth underscoring that increasing compliance as they have threatens to double the expense for PMC. Scaling the model to other agencies is likely to be expensive. And unnecessarily expensive.
For a movement founded on making sure taxpayers get what they pay for, there is a great deal of comfort with having taxpayers pay again here. A main point is that these millions of dollars in expense are mostly redundant, if not entirely redundant. That makes them wasteful.
Imposing this model across the US agency system will only increase the burden on taxpayers.
The continuing motivation for publishers is the ability to publish research by federally-funded authors. As noted above, any implementation would need a system for ensuring compliance and performance levels. Any journal that failed to meet these standards would in essence be blackballed from publishing funded works, as authors would be jeopardizing their funding by submitting there.
I agree with you that publishers have a trust problem here, but it’s not insurmountable. If a level of service and/or bandwidth is important, then that should be a requirement stated by the funding agency that CHORUS has to meet. And then the system has to be built to ensure those standards be met. Trust, but verify (https://en.wikipedia.org/wiki/Trust,_but_verify) seems a reasonable approach.
David C., I know of nothing in the Federal OA program that blackballs journals for CHORUS noncompliance. If a journal does not participate the agency will have to collect the author’s AM as a last resort, that is if they want 100% coverage. CHORUS estimates that it will cover 90% of all funding related articles, which is a lot more than PMC is presently getting. How CHORUS will enforce its Service Level Agreements with its member publishers remains to be established and the agencies may or may not have their own requirements, but I do not see this as a big compliance issue. It is just articles, not drug testing.
The author may run into issues if they publish in a non-compliant journal with terms of service that do not let the author deposit a version of the manuscript in a repository. This gets kind of complicated for authors and it might be nice at some point to make it easier for them, perhaps agencies offering a list of known compliant journals or some sort of “good housekeeping” seal of approval.
But as you note, compliance isn’t a deal breaker and seems like the kind of thing that could be set up and agreed upon.
David C., when you say “trust but verify,” what is it we are trying to verify? A CHORUS publisher’s primary responsibility is to collect and submit the FundRef data. If one fails to do that it will be obvious because they will report no federally funded articles. Given that most non-medical American research is federally funded, all major journals will have some, probably quite a lot. So if someone reports none then we know they are probably wrong. Beyond that there are some administrative issues like making sure the DOI goes to the article and making sure the article is open after the embargo period expires, things publishers already do routinely. Moreover, much of this can be automated.
Note too that CHORUS is vastly superior when it comes to author compliance because that is part of the submission process, which the author cannot avoid, as they can in the PMC model. Determining author compliance in the PMC model requires tracking down every paper co-authored by every grantee and then determining in each case if that paper is related to federal funding, which requires expert judgement. No one can afford to do this.
I think there’s a big trust issue that publishers need to overcome to be fully welcome at the table here. So compliance is something that would likely need to be verified.
For example, here’s a complaint about Nature having promised to make a set of papers OA, but not having done so:
http://phylogenomics.blogspot.com/2011/03/please-help-keep-pressure-on-nature.html
Often issues like this are due more to technological snafus than some secret evil plan to lock up papers, but it’s understandable that funding agencies would want some level of assurance on compliance with any such agreement. There’s also the question of what happens if a journal later withdraws from CHORUS (and stops making previously free papers free), or goes out of business, and how perpetual access to the papers can be assured.
David, first, you seem to be thinking of CHORUS as a portal of some kind, but the agencies will have their own portals. CHORUS is basically just a tagging, linking, and embargo system based on FundRef. See my http://scholarlykitchen.sspnet.org/2013/06/17/chorus-confusions/.
Second, the fact that research is expensive is no justification for agencies to waste a lot of money duplicating the published literature. These OA systems are distinct federal programs, and they are subject to the same efficiency standards as all programs. From a program-design point of view, CHORUS renders the PMC model obsolete. There is no reason for the government to pay for what it can have for nothing, namely the published article. When PMC was established, most journals did not have appropriate embargo periods, but now they will, so the PMC model has been overtaken by events.
David, I understand how CHORUS is supposed to work, but it assumes a lot, like every publisher thinking this is a great idea. At least with grantees you know who they are and that they have signed a contract agreeing to comply with federal assurances as a condition of funding.
There are a few large publishers, but a long tail of hundreds if not thousands of other publishers who may or may not have the interest or ability to participate in CHORUS. How are you going to get publishers to comply? Are you going to have “CHORUS certified” publishers and force grantees to submit only to certified publishers? That’s not going to cost the government something to implement and enforce? That’s not intrusive and prohibitively expensive for small publishers who might publish a few articles a year based on federally funded research?
PMC works and the per manuscript cost is low as compared with the cost of doing the research. Compliance has increased from 75% to above 90% and appears to be climbing just by the threat of increased enforcement.
http://blogs.nature.com/news/2013/07/nih-sees-surge-in-open-access-manuscripts.html
As you noted, there is a lot about CHORUS that has not been worked out, yet we are to believe it will all work smoothly and not cost the federal government anything? That every publisher thinks this is a great idea?
CHORUS is offering a 90% solution at no cost. If an agency wants 100%, it will have to implement a repository search and/or a PMC-type collection process for the outliers. It is an engineering principle that perfection is expensive. But again, this is no argument against my point that CHORUS is by far the most cost-effective approach for the government.
You noted that “PMC is mostly replicating what the publishers already do.” That is not the case. PMC is providing open access to the literature, while most of the publishers are limiting that access to subscribers only. PMC is providing public access to Federally funded research.
Joe, my reference was to the redundancy of the cost drivers, where both PMC and the publishers are creating duplicative XML from the same author manuscripts. The access distinction you make was true but is no longer, because with CHORUS the publishers now agree to provide open access in accordance with federal requirements, provided the latter are reasonable.
As a production director at a small-to-middling university press that publishes no journals, I’m a bit reluctant to jump into this fray. But I must say that I am astonished at how much PMC is paying for XML tagging. Most vendors looking for the small amount of business my press can offer (say, maybe 10,000 pages a year at most) charge considerably less than $0.50 per page for XML tagging. Assuming a journal article is about 30 pages long, it should cost no more than $15 for XML tagging. Add another few bucks for quality assurance, and you might cross the $20 threshold. Does PMC have to pay a federally mandated minimum rate, like bridge construction projects? Where can I submit a bid?
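The commenter’s estimate works out like this (the per-page rate and page count are the commenter’s own assumptions, not PMC’s actual contract terms):

```python
per_page = 0.50  # commenter's quoted vendor ceiling ($/page)
pages = 30       # assumed length of a typical journal article
tagging = per_page * pages
qa = 5           # "a few bucks" for quality assurance (assumed)
print(tagging, tagging + qa)  # 15.0 20.0 -- versus PMC's ~$47 per article
```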
Do you know any details about the contractors used by PMC? If PMC’s expenses are higher than some publishers would expect, is that because the work is done in the US, contributing to jobs, or is it likely for other reasons?
Data Conversion Laboratory (DCL) performs XML conversion for NIH/NLM: http://www.dclab.com/pubmed_xml.asp .
A few other points relevant to this discussion: (1) we’re talking about author-submitted papers, which are generally not formatted like a journal article — the author’s original Word doc can easily be over 30 pages. (2) Conversion costs are much higher for figures than for text. (3) Off-the-shelf (i.e. cheaper) XML conversion has too high of an error rate for NLM standards, plus PMC has its own DTD.
Reblogged this on lab ant and commented:
PMC’s budget has been made public. This article fits nicely into a discussion I started on ResearchGate a while ago: the costs of publishing papers OA.
Simple solution: Authors deposit in their own institutional repository. IRs can then automatically export to subject-based or other central harvesters or repositories (or the harvesters can automatically harvest or import from IRs). Archiving is managed locally, by institutions, for their own output (and for multiple purposes). Its tiny costs are distributed across institutions instead of focused on PMC, which can then just be one of the (many) harvesters: http://t.co/gSHzrpBm0t
No need to turn to the Trojan Horse of CHORUS to fix this! http://t.co/GhvyHOHpOz
This sounds like the SHARE proposal:
http://scholarlykitchen.sspnet.org/2013/06/26/universities-propose-to-share-federal-funding-based-articles/.
It can hardly be cheaper to have a thousand repositories processing articles than to have one doing it. Configuration control alone will be very expensive.