2500 Creative Commons Licenses (Photo credit: qthomasbower)

Recent mandates from funding agencies, including the Wellcome Trust and the RCUK, require funded journal articles to be published using a CC-BY license. Last week, OASPA and PLoS issued articles explaining the need for such licensing terms. But both articles are based on a flawed premise, confusing the rights to reuse the data behind an article with the rights to reuse the article itself.

First, to be clear on the licensing terms being discussed:

CC-BY-NC: You are free to copy, distribute and transmit the work, and to adapt the work. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). You may not use this work for commercial purposes. Any of the above conditions can be waived if you get permission from the copyright holder.

CC-BY: You are free to copy, distribute and transmit the work, to adapt the work and to make commercial use of the work. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

In this discussion, it is vital to understand that we are talking about the licensing terms for the article — for the set of written words and images describing the research in question, not the research itself. And this distinction is where the proponents of the CC-BY license seem to be confused.

Both Claire Redhead, writing for the OASPA, and Cameron Neylon, writing for PLoS, use the same example to explain why CC-BY is necessary: the Human Genome Project.

This is somewhat confounding, as neither of the articles announcing the initial draft of the sequence (published in Science and Nature) was published using a CC-BY license. Why, then, do the authors claim that the article licensing terms matter? Neylon writes:

The human genome project generated US $141 for every dollar spent, but this immense return is widely distributed as a result of thousands of people’s work. It is rare for research teams to have both the academic and business expertise required for commercial exploitation, and they typically require a commercial partner. Open-access publishing effectively increases our chances of finding those potential partners. If we restrict their capacity to use our research by restricting commercial use, we limit the chances of partners, commercial or otherwise, finding and contacting us.

Redhead offers similar reasoning:

The human genome project is a compelling demonstration of the power of open access to research, and reflects a well-established practice within the genome community to make research data publicly available for all reuses via resources such as GenBank.

Both have made the same mistake — confusing the genome sequence, the data behind the studies, with the articles written about those studies. The $141 generated per $1 spent ratio did not emerge through the reuse of the Science and Nature articles; the effect was generated through reuse of the studies’ data.

The licensing status of the data, and the results of a study, are not governed by the copyright terms of the journal article written about the study. Both authors are essentially right — barriers to reuse of research results do block progress, a great example being the patenting of the use of the BRCA genes in detecting breast cancer.

But that had nothing to do with the copyright status of articles published on BRCA. It had everything to do with the University of Utah locking up the information behind a patent paywall.

This seems to be a key misunderstanding in the demand for the CC-BY license, and perhaps something of a hypocritical approach by many research institutions. There’s a drive toward open access for the research articles written by authors on campus, but at the same time those universities are blocking others from reusing the research itself. It’s as if they’re saying it is vitally important that you can freely read about our breakthrough in curing cancer, but if you actually want to use that breakthrough to cure cancer, you have to pay us.

Harvard University, a leader in open access mandates for faculty, made more than $13.8 million in 2011 through patent paywalls. The University of California system made over $100 million. This reeks of NIMBY thinking — advocating for progress that requires others to sacrifice, but refusing to accept any sacrifice on one’s own part.

If the goal is to promote the free reuse of research data, then the targets for change need to be the research institutions and the researchers themselves, not the journal publishers. No journal publisher claims ownership of the facts contained in a copyrighted article. That’s not how copyright works. In fact, many journals that employ standard copyright terms require authors to deposit their data in public databases, making it freely available for reuse. The CC-BY license for articles is irrelevant for this goal.

Things get even more confusing when Neylon points out how important it is for a researcher to let someone else commercially exploit their work. Isn’t this the exact opposite of the reasoning behind the intense anger toward commercial publishers like Elsevier? Aren’t they commercially exploiting research in the manner described? Is commercial exploitation of research results good or bad, or just bad when a company you don’t like does something you don’t like with it? If the latter, then is a license that allows anyone to do anything they want with your results in your best interests?

So why the push for CC-BY licenses?

We are reaching a point where technology is turning the journal article itself into a research tool. New semantic technologies and text-mining efforts represent new avenues for discovery. For a text-miner, the journal article itself becomes a data point the way a gene sequence works for a bioinformatician. Reducing restrictions on reuse of articles in this manner is thus an important goal.

It is unclear, though, what relevance the copyright status of an article has toward individual text-mining experiments. If you have free access to an article, why would copyright stop you from using it (and presumably citing it) as a data point in your study? Again, it is unclear why the CC-BY license would improve things here.

Publishing under a CC-BY license does not guarantee that the publisher will make the article available via bulk download, that useful text-mining APIs will be developed, or that the article itself will have freely available, useful, organized metadata.

The CC-BY license does open up the paper for further commercial exploitation. The Wellcome Trust presents some better examples of where the CC-BY license would actually be beneficial, allowing reuse of figures in a commercially-driven blog or letting someone charge for a translation of an article.

From the viewpoint of a funding agency, particularly a government funding agency, there’s a strong argument to be made that maximizing the economic gains from funding a research project is a worthy goal. Governments want to drive the creation of new businesses, which will result in more tax revenue and employment. The CC-BY license offers free raw material for new ventures based on reusing published papers. For example, start-up company X could download every paper published under a CC-BY license and repackage those papers based around a suite of semantic tools and re-sell them back to the research community.

That potential for economic development is increasingly important in economic downtimes. The CC-BY license, though, is something of a blunt instrument, and the approach brings up some logical conundrums and some likely unexpected consequences.

First, the logic behind much of the movement here is the idea that with the article processing charge (APC), the article has been paid for. It seems counterintuitive then to set up a system where the “already paid-for” article will then be sold back (perhaps repeatedly) to the research community. In some ways it creates a new subscription economy. The article is free, but if you want to be able to get the most out of it, you need to be able to afford a subscription to the reseller offering the tools.

There’s also the likelihood that new start-ups won’t be the result here, but instead further entrenchment of the current publishing establishment. A well-funded company like Elsevier may be better situated to invest in developing new technologies (and buying up any promising start-ups). Under CC-BY, Elsevier could potentially scoop up every paper published by everyone and put them all under its SciVerse umbrella and become the de facto monopoly source of access for scholarly literature. Every other publisher then becomes merely a feeder of content to Elsevier.

History has shown that the Internet tends toward consolidation on one monopoly source for information (see Google or Facebook as examples). Monopolies, whether startups or entrenched powers, are generally bad for progress, bad for customers, and bad for content creators. If the relatively modest efforts of PubMed Central are drawing away 14% of traffic from other publishers, imagine what could be done with the business acumen, funding base, and marketing budget of an enormous multinational corporation.

If there are specific goals the community is seeking for reuse of research papers, then a license that is clearly geared toward their achievement may be more effective than the broad strokes of the CC-BY license. Can a standard license be developed that allows certain types of reuse (particularly reuse for education and further research) but that requires fees for specific types of commercial reuse?

For example, as noted previously, the CC-BY license essentially does away with reprint revenue. If we are truly trying to move journals away from the subscription business model, particularly to other business models that lower the financial burden on the research community, it seems counterproductive to deny financial support to journals in order to provide pharmaceutical companies with free marketing opportunities.

We need to be clear about the role of CC licenses — what they do and do not provide. They are useful tools in driving business opportunities and economic development. But changing the copyright terms on an article written about an experiment does not, in any way, change the terms for use and reuse of the data behind that article.

If open availability and reuse of research results is the goal, then funders need to re-examine their position on patents, not their position on journal article copyrights. For US government funding, this is going to take an act of Congress as the Bayh-Dole Act will need to be repealed. Publishers should not be unfairly painted as scapegoats when the real targets should be institutional technology transfer offices.

These sorts of changes should be approached cautiously. The Bayh-Dole Act is generally seen as having been very successful. The Economist notes that it is:

perhaps the most inspired piece of legislation to be enacted in America over the past half-century. . . . Together with amendments in 1984 and augmentation in 1986, this unlocked all the inventions and discoveries that had been made in laboratories throughout the United States with the help of taxpayers’ money. More than anything, this single policy measure helped to reverse America’s precipitous slide into industrial irrelevance.

There are clear economic incentives to driving the exploitation of research. But removing all restrictions on that reuse removes the direct benefits the research community receives in return for discovery. Patents and financial reward provide strong incentives for researcher achievement. Technology transfer provides an enormous boost for research funding (more than $1.8 billion in 2011).

The benefits of unrestricted exploitation must be weighed against the subsequent losses in university funding and financial reward to successful researchers. Regardless, the copyright status of journal articles has no bearing on this issue. There are potential benefits to changing the copyright terms for journal articles, but let’s be clear about what they are, and let’s find licenses that are best suited to achieving those goals.

(Editor’s Note: Due to power outages from Hurricane Sandy, David Crotty can’t reply to comments. The other Chefs are going to do their best in the meantime.)

David Crotty

David Crotty is a Senior Consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services. Previously, David was the Editorial Director, Journals Policy for Oxford University Press. He oversaw journal policy across OUP’s journals program, drove technological innovation, and served as an information officer. David acquired and managed a suite of research society-owned journals with OUP, and before that was the Executive Editor for Cold Spring Harbor Laboratory Press, where he created and edited new science books and journals, along with serving as a journal Editor-in-Chief. He has served on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc., as well as The AAP-PSP Executive Council. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.

Discussion

35 Thoughts on "CC-Huh? Fundamental Confusions About the Role of Copyright and the Reuse of Data"

I don’t have the time to comment on all the confusion presented here, but a few points:

Conflating patents and copyright is muddying the waters. If copyright were like patents, it would be valid for a much more limited amount of time. And patents have to be applied for and paid for. And the information in patents is fully open (though may be intellectually inaccessible, just like publications). If indeed scientific publications were fully open, CC-BY, by default with the possibility for the author to apply (and pay the fee) for the addition of NC for a limited amount of time, and such addition would be granted, like patents are, on the basis of arguments judged to be sound, then we’d perhaps have a workable situation of sorts. Though I suspect very few, if any, scientific authors would apply for that.

There should be no objection to publishers exploiting the fruits of research. With CC-BY they can indeed freely do that. An individual publisher enriching or enhancing publications from anywhere (the SciVerse example) is not a threat, but an opportunity to combat fragmentation. That already happens with abstracts (PubMed, Scopus, WoS) and could happen with full articles, illustrations, etc. Even charging for enrichment is perfectly fine, since it is a true added value (the clue is in the word en*rich*ment). The objection is not to the exploitation of research results by publishers per se, but to the *exclusive* exploitation of the results of research that’s carried out with public money and meant to be public. It is a complete non-sequitur that such a purveyor of enrichments then becomes the de facto monopoly source of access for scholarly literature. CC-BY would still effectively prevent such exclusive appropriation that hampers further research and its application. Perhaps such a publisher could be the only source for enrichments added with their own private means, but you’d have to be an old-style communist to begrudge them that.

(Private research by private companies, where it has no public safety implications — so not some pharmaceutical research, for instance — can be published in any way the private authors like. For public research, )

But Jan, that’s what David C is noting. The arguments being used by CC-BY supporters conflate data use with research article output use. I’m sure we’d all benefit greatly from the application of your intellect to the argument that David C has so cogently laid down here. I hope you can find the time, and I look forward to your comments.

Contrary to your claim, publishers have never insisted on exclusivity with regard to the publication of results stemming from publicly funded research. In fact, the AAP has long been on record supporting the America Competes Act, which would extend to all government agencies the requirement to make freely available the final reports of all government-funded research projects. It has been the government’s choice, not publishers’, to rely instead on a mandate affecting publication of journal articles derived from those projects. Your characterization of the situation hides the real truth here.

Jan, I think you may have missed the point of this post, as much of what it says is in agreement with your comment. As you note, it is important to understand the difference between copyrights and patents, and the articles cited in my post have failed to do so. But no, I don’t think most would be satisfied if scientific journals prevented anyone from using the information they convey for 20 years as do patents.

It’s interesting that you cite PubMed and WoS as positive forces due to their near monopoly power. Most people are less happy with them: see Kent’s recent posts on PubMed’s abuse of its own system, and I’m not sure there are too many happy with the private, locked-down and completely controlled nature of the Impact Factor. Aren’t both of these groups guilty of the sort of exclusive exploitation you decry?

I wonder where the responsibility for defending misuse of material distributed under CC-BY lies. If the author chooses CC-BY then it’s up to them to bear the cost of hiring a lawyer should they come across a use of their work that infringes the terms of the license. Now I am not a lawyer and the overwhelming majority of authors will not be lawyers, so the only way they’ll be able to figure out whether somebody has used the material inappropriately will be to hire a lawyer to find out. They’ve then got to weigh up the costs of bringing a complaint, assuming the lawyer they talk to advises them that they have a case to bring.

But here in the UK, CC-BY is mandated… So presumably those doing the mandating are taking on the responsibility for vigilance and compliance enforcement? Actually, if you deposit in a repository it gets more complicated, doesn’t it? Is the responsibility with the repository owner or the funding agency? If the article is deposited in a repository and an OA journal, who is responsible then? Does the publisher of an OA journal, publishing mandated CC-BY (soon that’ll be many/most UK articles then), have any responsibility to enforce and protect the rights assigned to the article? I don’t think they do.

Over on Google+, Heather Morrison, Cameron Neylon and others are having a good discussion about related issues to do with the benefits or otherwise of CC-BY. Interested TSK readers, this link is worth your time: https://plus.google.com/u/0/107227186581068282767/posts/YXbYEWBR4mL

By the way, that endorsement bit of the CC-BY license. Whole big can of worms there methinks.

I think these points are very well put. The loss of reprint revenue is a serious one, further undermining the ability of smaller publishers like university presses to continue supporting journals, unless they increase APCs sufficiently to cover the loss. Academic authors, using CC-BY, also give up any control over the quality of reuses of their work, as with translations. The terms of the license prevent poor translations only when there is intentional malice involved, and hardly any authors will find it worthwhile even then to bring suit to stop such translations from being distributed.

I would say that your title is well-suited for this discussion – you clearly do have fundamental confusions about the role of copyright and data re-use in the context of scholarly publications. Moreover, you are either intentionally or unintentionally propagating more confusion here, as ably pointed out by Jan in your muddying of the issues of patents and copyright.

As someone who has had my text-mining research limited by traditional copyright restrictions, I will take apart a different misconception you are repeatedly stating here: the mistaken notion that text and data are different things that can easily be licensed separately. In some rare cases (e.g. large DNA sequences, microarray data), they can. But for the majority of data types, scientific facts exist only in an unstructured format in publications, and do not stand alone in a distinct form separate from the primary text itself.

You go so far as to admit that text could be used as “a data point”, but you fail to understand that it is the textual excerpts from papers that are data points, not the entire paper itself. Anyone who is reasonably familiar with the scientific literature knows that a single paper could contain dozens or hundreds of facts in the form of textual claims, none of which are contained in any database. In most cases, the entire paper cannot be abstracted as a single datapoint, and even when it can, it decouples the provenance of a fact from its source.

And here is the rub: if I wish to undertake text-mining to extract textual claims (e.g. “PC4 binds p53 and recruits it to its own promoter”) for reuse or redistribution, I directly confront copyright limitations *since I am using the text directly as data*.

Without CC-BY or other permissive re-use license, I need to negotiate with every publisher whose text I mine, or risk being sued for copyright infringement. And please don’t come back with a “fair-use” argument either, since the scientific publishing industry has made no clear and legally-binding statement about what fair use in terms of text-mining constitutes (e.g. a quoted/unquoted snippet of 10 words, 50 words, 100 words…?). I’m not going to put myself or my University at risk of a lawsuit by making the wrong call about what “fair use” might be.

What I would recommend is actually obtaining a default NDA from Elsevier or any closed access publisher about the permissions for use and redistribution of facts extracted by text-mining. You will find that these NDAs specifically forbid redistribution of text, unless negotiated otherwise. This negotiation process will take between 6 months and 2 years per publisher, in my personal experience, and often will ultimately fail because the publisher will not allow redistribution of extracted text. I can assure your readers that I have dozens of emails to back up the claim that copyrighted text prevents re-use of scientific literature by text-mining.

When you understand that *the primary, copyrighted text is the data*, you will understand why you are confused about this issue.

You’re so right, Casey. Publishers shouldn’t be spooked by data- and text-mining, but embrace it. The development of ‘nanopublications’ (http://nanopub.org) offers plenty of opportunities, both for researchers and for publishers. Though only for imaginative and forward-looking publishers, of course. The EU OpenPHACTS project (under the umbrella of the Innovative Medicine Initiative) is an early implementation of the thinking behind semantic ‘assertion-mining’ and nanopublications: http://www.openphacts.org/. Watch this space!

Most publishers allow this without a problem, you just have to ask. Sending an email describing your research interest and agreeing to treat the information responsibly and to follow up with results shouldn’t be a barrier. Free-wheeling access without any permission or consultation isn’t such a great thing. Publishers work with authors and editors, and want the record to be preserved and not distorted. A misguided text-mining initiative can distort the record, and not knowing about it isn’t smart.

Also, you want to make sure it’s not for commercial purposes. If it is, that’s a different conversation.

Just because it’s based on an algorithm or is “cool” research or “new” doesn’t mean it’s benign or worth the effort.

I’ve approved initiatives like the ones you’re describing many times in my career. Most have yielded nothing, by the way. But I’m happy to help a researcher with a clear path they wish to pursue.

As someone who has gone through the proper channels to request re-distribution permissions with a good number of publishers, I can assure you that permissions are a barrier to progress. It takes 6 months to 2 years for *non-commercial* re-distribution permissions to be granted. Often these fail, since re-distribution is ultimately not granted. It is not a simple matter of just “sending an email”. Most publishers I have dealt with do not grant permissions without customizing an NDA, which needs to be approved both by their lawyers and our University legal team. I have no intention of performing misguided, non-attributive text mining. On the contrary, I want to use and attribute text properly, but I want to bypass having to deal with bureaucrats and lawyers who know little if anything about the scientific material and technical issues concerning text mining, since they ultimately are obstructionist and slow to understand/adopt this important new technology.

Please explain what constitutes “non-commercial redistribution” by your lights.

I didn’t mean to imply that all publishers are spooked. There are exceptions, but they are rare. Indeed, the Publishing and Chemspider arms of the Royal Society of Chemistry are very forward-looking. But unfortunately they are virtually unique, so far (though Thomson-Reuters have become an associated partner of OpenPHACTS, but they are not a full-text publisher). The RSC should be an example to other publishers and deserves applause for its efforts.

(Disclosure: I’m involved in OpenPHACTS)

Redistribution for what purposes? If commercial purposes, of course another commercial operation isn’t going to allow itself to be leveraged by another commercial enterprise without getting a cut of the action.

Let’s see, research on patients requires a lot of work. Research in the lab requires a lot of work. And you expect research on text to not require a lot of work? And then blame the publishers because they’re careful with exactly what you want to research? I don’t follow how this ends up being a problem with publishers. It sounds more like someone wanting all the benefits of careful publishers without acknowledging that these benefits accrue precisely because they are careful — and their care is what makes textual research both worthwhile and marginally more difficult.

Right Casey, you’ve got some good points to make in there, but please stop with the insinuations, OK? There’s a great debate to have here but if it descends into a slanging match, none of us will be better off for it. David C isn’t misrepresenting anything, intentionally or not. You can of course disagree with his premise, and that’s where we can all learn something.

To address some of your points:

1) Access to the data you want. Your request for easier access is not uncommon amongst those of you who are pushing at the envelope of what a scholarly article can be used for. Heather Piwowar has made similar comments. I think you have a real point here. Your problem is that you are on the bleeding edge of this sort of work, and thus up against something of an understanding gap with the people you talk to. And those people have to manage risk, as any organisation of whatever sort does when it has responsibilities for particular collections of things. One typically has to ask to use the artifacts in a museum or a library, and generally there’s some checking of bona fides that will occur before permission is given. That’s not unreasonable. To echo Kent’s comments, I too have been asked for permission for material under our copyright to be used, and we say yes just about every time. And we’ve given access to some pretty major collections of stuff that way. I’m sure you can counter with plenty of times when you’ve had trouble, and I sympathise. But let’s look at how to work together to solve these problems. Make a compelling case. One that gives people the opportunity to say yes, not to default to no.

2) The primary copyrighted text is the data… Let’s have a real debate on that issue. Here’s my point of view. Articles are not designed and never were designed to share the underlying data that led to their writing. Statements and assertions, yes. Points for discussion, yes. But the data in figures, graphs, maps, images and so on cannot be very easily unlocked, if at all. I’ve just been looking at network representations in some papers and to get the data, I need to go contact the authors and get their files. Articles are designed to be read by humans. So I think you are using the articles for data extraction because, in bulk, there’s no other way to try and do what you want to do. But what I think you are actually needing is a better way to get at the underlying facts, records, observations and the rest. You also want declarative statements about what the authors think about those facts. I think you want a file that contains pointers to the data that underlies the papers. Machine-readable logical statements, generated at source, not imperfectly reconstructed by trying to teach machines to read papers like a human would. Better data, better sourced and validated by the publication process of the human-readable article. This is similar to the position held by Conrad Wolfram, who argues for the Computable Document Format. I bring that up because I attended a meeting at Wolfram Alpha on that very thing, and there were a ton of publishers there, trying to work out if it would help them do a better job of disseminating information.

I think you and your colleagues are what Tim O’Reilly would call “Alpha Geeks”. Operating, as I say, at the cutting edge. You operate way ahead of the ‘normal’ operating parameters and so you are always going to face incomprehension in many cases. Skepticism as well. Also lawyers… But that doesn’t make us publishers the enemy. Please don’t treat us as such. You want investment in your concepts and ideas. So show us the value. Prototypes and concepts. Because if you want this to be sustainable, somebody somewhere will ultimately be asking those very questions.

If I indeed have fundamental confusions, please point them out. I am not the one deliberately muddying copyright and patents. I am merely pointing out that OASPA and PLoS are doing so.

Can you better explain why copyright prevents you from doing your textmining research? What, specifically, do you need to negotiate with every publisher for? Are you republishing and redistributing the entire text of large numbers of papers? It is unclear to everyone here exactly what you are doing that copyright prevents.

As David Smith has pointed out, “the primary, copyrighted text is the data” for only a very small minority of researchers. Can you make a case why CC-BY works better for your needs? Is there a reason you need CC-BY rather than CC-BY-NC?

The reason why we need CC-BY for text- and data-mining from scientific articles, and why CC-BY-NC is not adequate, is found in the extreme ambiguity of what NC actually means or can be construed to mean. The – justified – fear of text-miners is that for data mined from material under any licence more restrictive than CC-BY, there is always a chance that the results of their mining may turn out to be commercially useful and used by someone. In litigious societies and the generally ambiguous copyright domain, the risk of being sued is real. Even if you may eventually win a lawsuit, just being forced to defend yourself can easily ruin you financially. Clearly, as a text-miner, it’s wise not to get into such a situation.

Text- and data-mining may be relevant for “only a very small minority of researchers”, but where’s the limit if you start dismissing scientists specialising in very small niches? Most articles are of interest to only a very small minority of researchers. Why should publishers care about those? Don’t forget that publishing is a service to science, and niche researchers form the vast majority of its clientele.

I agree that the NC part of CC-BY-NC is too undefined. But again, your worries here are very vague. What sort of issues are you imagining that would be stymied by copyright? Again, we’re talking about copyright, about the actual text and figures themselves, not data behind the papers and not data derived from an analysis of the papers. Unless you are planning to redistribute the papers themselves, it remains unclear what you fear here. Would it be possible to create a specific license that opens up data mining for all purposes, commercial and non-commercial, that still leaves in place the potentially useful revenue streams that help support journals without further burdening research funding?

Please don’t take this as combative; there are many here looking to learn more. As noted, these worries are only relevant for the small minority. It is unclear if this will prove a fringe activity, or if it’s the cutting edge of a widespread research movement. We need to better understand the needs of those doing this sort of work, both to provide the tools and rights that are needed and to decide whether it is worth raising the subscription rates and APC’s on all other researchers to support that currently small minority, as will have to happen if other types of commercial revenue (from things like pharma reprints) are abandoned.

First of all, it is indeed a good idea as a publisher to understand the needs of those doing work for which they need text-mining, whether or not it is, or will be, a widespread research activity. The business of a serious science publisher is built on small scientific niches, on the work of “small minorities”, if you wish.

And text-mining is not about redistributing entire articles. It wouldn’t have to be called text-mining then. Here’s an example: http://nanopub.org/wordpress/?page_id=57 . Associations between scientifically significant concepts are mined from a large number of articles and patterns are analyzed to see if there is new knowledge that can be gleaned from that. We all do that analysis on a small scale in our head, with a limited number of articles that we have read, but the information overload in science makes it impossible to do that on the scale needed to discover, for instance, novel protein-protein interactions.
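For readers who have not seen this kind of mining, the idea of counting associations between concepts across many articles can be sketched roughly as follows. This is a minimal illustration, not the method used at the link above: the concept list, the sample sentences, and the function names are all hypothetical, and the matching is a naive substring test where a real pipeline would use curated ontologies and proper entity recognition. The point is that what comes out is a table of concept pairs and counts, not the text of any article.

```python
from collections import Counter
from itertools import combinations

# Hypothetical concept vocabulary, chosen only for illustration.
CONCEPTS = {"brca1", "tp53", "breast cancer", "apoptosis"}


def concepts_in(text, vocabulary=CONCEPTS):
    """Return the vocabulary terms found in a text (naive substring match)."""
    lowered = text.lower()
    return {term for term in vocabulary if term in lowered}


def count_associations(articles, vocabulary=CONCEPTS):
    """Count, for each unordered concept pair, how many articles mention both."""
    pairs = Counter()
    for text in articles:
        found = sorted(concepts_in(text, vocabulary))
        pairs.update(combinations(found, 2))
    return pairs


# Made-up sentences standing in for a large corpus of articles.
articles = [
    "BRCA1 mutations are linked to breast cancer risk.",
    "TP53 regulates apoptosis; BRCA1 also interacts with TP53.",
    "Loss of BRCA1 function is common in hereditary breast cancer.",
]

associations = count_associations(articles)
for pair, count in associations.most_common():
    print(pair, count)
```

Pairs that recur across many independent articles are the candidates a researcher would then examine more closely.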

We don’t know, and will not know in advance, if such work might have commercial applications down the road. So if we mine from CC-BY, CC-zero or public domain content, we’re safe. But the risk of including CC-BY-NC or even more restrictive material is just too great, however vague it may seem. In fact, it might be easier if one could know concretely what to fear.

So let’s reverse the argument. What does a publisher gain from CC-BY-NC? Or fear to lose from CC-BY? Remember, we’re talking Open Access content here, otherwise CC-BY-NC isn’t even an option. The nature of CC-BY-NC is open, but with what I call a ‘profit spite’ clause. Of course, in the very short term there may still be some pharma reprints, but that’s mainly a game for Toll-Access content. Let’s be clear: the potential for scientific progress with the help of text-mining (the example above is about the important field of gene-disease associations, but text-mining could and would be very much larger and widespread if only it were legally possible with more content) is traded off against a short-term minor income stream from the pharma industry (namely the reprints of CC-BY-NC articles). Fortunately, the science publishing industry does have members with more forward-looking and imaginative ideas, but clearly, and unfortunately, they are still very much in the minority.

I understand what you’re saying Jan, but it may be something of a hard sell to both publishers and the research community. To play the Devil’s Advocate, I’d restate your argument as, “There may be some as yet unknown ways for me to make money from the research I’m doing which is completely dependent on both the publishers and the general research community for raw material. Because of this, I want the publishers to give up a valuable revenue stream, and I want the general research community to pay higher subscription and APC costs, so I won’t have to worry about incurring any costs myself when I find a way to cash in.” It absolves the textmining community from the majority of the financial burden for their own work, and instead shifts it to publishers and the general research community.

In order to speed your work, publishers need to invest heavily in new technologies. They need to set up mechanisms for bulk downloads of enormous numbers of papers (and pay for the bandwidth used), to build new API’s, to continue to improve metadata, to continue to add new improvements to new papers that are published, and to retrofit those improvements to the back content of every journal. That’s an expensive and continuous process. I can see two ways this gets paid for, if it is to happen (and I think we both want it to happen to better explore and speed progress in textmining research):

1) It becomes a paid-for product from the publishers. Articles are published, either with regular copyright, or any sort of CC license, behind a toll wall or OA. These articles are available as html or pdf downloads, but they are not made available in ways that are convenient for textminers to gather data. In order to download large quantities of papers or access API’s or metadata, the textminers are going to have to pay a subscription fee to support their research system, just as a wet bench researcher has to pay for reagents and equipment.

2) These tools are built in as standard on every platform and are made freely available. They are paid for by the entire research community through subscription prices and APC’s. With an NC license, at least some of the costs can be covered by selling secondary rights in a wide variety of manners. These go beyond mere pharma reprints and currently serve as an important revenue stream for many society-owned, not-for-profit, university press and commercial journals.

The other big problem I foresee is that to do a really thorough textmining experiment, you want to cover as much of the literature as is possible. That means that just looking at the newest stuff that may be OA and under CC-BY is not going to give you the best possible experimental results. You want to go back and look over the course of the last century of research. And that means you’re always going to be dealing with papers that are under traditional copyright, and you will always have to deal with licensing issues should you find your mysterious commercial exploitation scheme.

David, you completely miss the point with your “mysterious commercial exploitation scheme”. It isn’t at all about finding any commercial exploitation scheme, but the NC clause puts the onus on the user to prevent any possible commercial usage in the future, which is impossible. And because that is impossible, the status of CC-BY-NC to a text-miner is not in essence any different to that of traditional copyright.

And yes, you’re right that not being able to text-mine the vast body of published literature under traditional copyright is a major problem for science. To argue that we shouldn’t at least change that situation going forward because we’ll always have the legacy of the past is extraordinarily unscientific. That would throttle all kinds of progress in any circumstance.

How exactly does “traditional” copyright prevent text-mining? Copyright does not protect data as such, only expression. The 1991 Feist case essentially undercut efforts to secure copyright in databases.

Jan, I’m not suggesting we wouldn’t want to change the situation going forward, just that a CC-BY license won’t solve everything, that even if it’s used from now on, the root problem persists. There may be better, more comprehensive solutions to consider such as cooperative efforts to build coalitions of journals willing to offer waivers for some types of reuse.

And I am not a lawyer, but can you explain how you would be held responsible if someone else made commercial use of your data? Wouldn’t that be their responsibility? And again, unless they are reproducing and redistributing articles themselves, how would copyright be relevant?

David, excellent piece, congratulations. I have recently posted this argument, that copyright is not the same as patent, in several OA blogs, but it hasn’t had much effect. I hope your article finally puts this myth to rest.

Surely someone will figure out how to misuse CC-BY for their personal profit and the detriment of the researchers. CC-BY-NC is the much more reasonable way to go in this transition to OA. 99.99% of people who hit a pay-wall want to read the article, not “re-use” it. Let’s help those people first, the 0.01% can wait a while more before their needs are served.

Thanks David, for this well-balanced assessment of the current vogue for shoe-horning a single use licence for all. One point worth mentioning is that even where the literature approximates some kind of open via APCs in hybrid or full Open Access journal models, publisher policy more often than not is at variance with CC-BY because of a non-commercial element in the publishers’ licensing. I can’t see how funder mandated CC-BY will be achieved unless there’s a shift on this by publishers. That could be the blunt instrument you’ve alluded to.

Greetings all, sorry for the lack of response but we’ve had a bit of rain in NY. Will respond fully when access to electricity, Internet and phone becomes available. I want to send a special thanks to LIPA for knocking out our power while the storm was still more than 100 miles away and to AT&T for losing all phone and Internet service for 48 hours during the storm. Hey, it’s not like communication or information is useful during a disaster. Thank goodness for old media and battery powered radios. Perhaps we’re not fully in the digital age yet after all.

Cold Spring Harbor Laboratory has their generators running, so stealing as much electricity and internet as I can fit in my pockets. On to the comments!

The discussion above reveals an interesting and deep confusion due to the revolution. If text itself can be data then people should consider patenting it in addition to copyrighting it. The concept of text has changed.

Has anyone tried to patent a body of text as text mining data? I will have to look at a data patent to see how this is done. I presume the data is patented for specific uses as devices are. The patented text body might have to be large. In any case it is a fun idea.

Patents are for inventions, not for data. In order to be eligible, according to the U.S. Patent and Trademark Office, the invention must have utility; it must be novel; it must not have been obvious to a person having ordinary skill in the art at the time the invention was made, and it must be thoroughly explained and documented so that someone else would be able to make it. Besides, patents are completely open access and in the public domain. How would that relate to data?

David, I like your suggestion of “more comprehensive solutions such as cooperative efforts to build coalitions of journals willing to offer waivers for some types of [text- and data-mining] reuse”. Especially if those could be blanket waivers. That would address two major problems: 1) the fragmentary nature of the scientific literature; and 2) the need for (often lengthy) negotiations with individual publishers. Might we count on you — or even TSK — to take the initiative?

Jan, I’m currently still without power due to the hurricane, so not much immediate help I can offer. But contact me next week. I’d love to try to put together a proposal for a session on this for this year’s SSP Meeting. Think about others with reuse rights concerns and we could get you all in front of a large audience of publishers to get the ball rolling…

Jan, I’m slowly limping my way back into the communication network. Can you drop me a line at david dot crotty at oup dot com? I have a few ideas about ways we can start the ball rolling via things like terms of service and/or EULAs that might do the trick.

Comments are closed.