You would be forgiven for believing that the prevailing conversation – ubiquitous, urgent – we are now having about Artificial Intelligence (AI) was itself generated by an AI. Perhaps the prompt went like this: “AI (let’s call it “Voltaire”) – Voltaire, create a situation in which AI is made the center of virtually all human attention.” Voltaire goes off, thinks about this for a nanosecond, consulting its immense library of information about how people behave, on which it was trained, and sets a number of things in motion.
It is important to understand that these things are precedented, that is, they are based on earlier manifestations of human culture, though rearranged and developed into a new narrative or series of narratives. It is also worth noting that since the current crop of AIs uses neural networks, which are trained by ingesting and analyzing human-generated content, publishers, and scientific publishers in particular, may have a special, remunerative role to play here, as developing content is what publishing is all about.
Voltaire has a lot of material to work with. Noting that Stanley Kubrick’s HAL was a murderous AI, which has been endlessly copied in films and books around the world, it takes little for Voltaire to create a deluge of new articles, frantic podcasts, “end of the world” pronouncements, and good old-fashioned millennialism (the world is always coming to an end, somewhere, somehow). Voltaire is a clever machine and even nudges leaders in AI research to declare, self-congratulatorily, that this thing they have created could destroy us all. Shades of Mary Shelley’s Dr. Frankenstein! (Shelley wrote her excellent book when she was 19. Talk about superintelligence!) There is little purpose in pointing out that Kubrick’s masterpiece is a satire (“Dave, I’m afraid”). As a friend pointed out some time ago, when someone does not fully appreciate something, humor is the first thing to go.
With HAL threatening us in the background, Voltaire cranks up the cries of those who believe that government can solve any problem, even those that are made up. This is the view that regulation is what is missing from the tech world, AI in particular. Companies like Meta/Facebook, Microsoft, Alphabet/Google, Netflix, Amazon, and Apple were all a big mistake. What we really need are properly regulated companies like Borders, Kidder Peabody, and DEC: models of all the benefits government can bring to the economy.
Voltaire is just getting going. AI will touch every aspect of human existence, from the stock market to dry cleaners. Of course it will, because it is based on us. Consultants say, “If you don’t have an AI strategy, you’re toast – or analog.” And economic forecasters quickly roll out predictions, which are covered breathlessly in the press, about all the jobs that will be destroyed. Don’t even think of becoming an accountant or a lawyer or a doctor, and not even AI researchers will be exempt from the holocaust of machines doing jobs that people don’t want to do in the first place. Recently I read a prediction that a future with AI will be like the Pixar movie WALL-E. If only the future, or today, were as well produced as a Pixar film!
What Voltaire’s little experiment shows is not what the future of AI will look like (who knows?) but how a human population falls into predictable patterns as it contemplates any new development: we are observing not AI but ourselves observing AI. This is not surprising, since the neural networks were trained on human culture to begin with.
Perhaps we would do well to find another metaphor to describe the evolution of intelligent machines. One nominee comes from the science fiction classic Battlestar Galactica, in which the Cylons are robots indistinguishable from humans. Humans created them; they are our “children,” an extension of ourselves. As such we would expect them to be cruel, passionate, generous, duplicitous, kind, savage, innovative, and sometimes uncannily stupid. We sapiens have been around for 300,000 years, and the proof that we have done a pretty good job is that we are still here. Voltaire has much to learn from us. As for our “children” turning on us, this too has a precedent. Ask Oedipus.
I am not proposing that all the ruckus is simply hysteria. I am scared to death. I hide under my bed, where my Roomba checks in on me from time to time, winks, and seems to say, “You will do just fine if you cooperate.”
What I am ruminating on is whether it makes a difference if this blog post was written by Joe Esposito or “Joe Esposito.” An AI trained on my own output – hundreds of blog posts, clients’ reports, and gigabytes of email, and perhaps my DNA record stored at 23andMe – would, I think, be hard to distinguish from the real article. A machine could pick up on the lazy verbal tics – the literary allusions, quotations from the Beatles, an annoying tendency to self-reference – that characterize my writing style. What could a superintelligent machine – as Hamlet says, in apprehension how like a god – make of all the detritus of my personality? Let’s stop fighting about whether AI is poison that is being poured into our ears and focus on our own roles and interests in developing it. We can work it out.
Which brings us to the matter of copyright. Who owns the cultural content that AIs hoover up to build new machines, new intelligences? The debate is on. Elon Musk says he plans to sue Microsoft for training an AI on Twitter’s data. (Meanwhile, a music publishers’ group is suing Twitter for infringement.) Reddit is already attempting to charge for access to its data. In Japan, data-gobbling machines will be given free rein. Advocates of fair use argue that doing the equivalent of creating “Joe Esposito” is transformative. Maybe so; we can let the lawyers fight this one out. (See Roy Kaufman’s excellent post on The Scholarly Kitchen on this topic.)
What publishers need is more copyright protection, not less. Many people in the scholarly publishing community have set their sights on the goal of open access (OA), in an attempt to democratize scholarly communications further. This is an admirable objective, but it is a small one: to assist humans on the perimeter of the (human) research community, especially those with little or no relationship to the industry’s major institutions and most potent brands. It takes but a short survey, however, of the sheer quantity of research output to realize that the real audience for scholarship inevitably will be machines, which operate on a scale that we carbon-based life forms can barely imagine, and they do so as (trained) extensions of ourselves. As the industry is currently constituted, however, the benefits of these efforts will fall disproportionately to huge tech companies, not to the research community and the academic libraries and funding agencies that support it. The unfortunate fact of the matter is that the OA movement and the people and organizations that support it have been co-opted by the tech world as it builds content-trained AI.
It won’t be easy for publishers to recapture lost ground. OA has been hyped as a communitarian exercise destined to raise our entire species to greater heights, not a building block of a post-human technological society controlled by the likes of Mark Zuckerberg, Larry Page, and Elon Musk. But even if publishers could prevail, which is by no means certain, there is the question of which publishers. Small publishers, including many in the professional society arena, control only slivers of content in their respective areas, and even many of their data rights have been silently placed under the umbrella of the big content aggregators. There is a fundamental asymmetry in these arrangements: a society publisher signs a contract with an Elsevier or a Wiley and receives a check, but the huge aggregator gets access to the data surrounding the society’s content in addition to what it can capture from the sale or license of that content. Thus, publishers of all stripes need not only stronger, enforceable copyright protection; they also need a means to exploit their data through clever marketing models, and perhaps for the most ambitious, models that include building their own AI services. What would a bot built on ScienceDirect look like? What could it do? How could Elsevier’s shareholders profit from it?
At bottom this is a moral argument, in which I perceive that I may have an interest. If someone is going to create “Joe Esposito” based on the life and work of Joe Esposito, and may in fact derive some economic benefit in so doing, shouldn’t Joe Esposito have a say in this? Shouldn’t I own a piece?
Discussion
16 Thoughts on "Who Is Going to Make Money from Artificial Intelligence in Scholarly Communications?"
“Thus, publishers of all stripes need not only stronger, enforceable copyright protection; they also need a means to exploit their data through clever marketing models”…and clever and SECURE distribution models. Similar to how Netflix, Amazon, Spotify, Steam, Microsoft, Sony, Google and other tech companies have already protected themselves against looting and plundering by others. The latter is non-negotiable if one wants to do business in a digital world. And these companies know that.
Thanks for this great article (and shout out) Joe. On Japan, while the copyright exception is overbroad and a bit vague, it is not as broad as some in the tech community are pretending it is. There is a great post by Peter Schoppert about misinformation on the Japanese exception: https://aicopyright.substack.com/p/japan-will-not-enforce-copyright?publication_id=1101712&isFreemail=true
Now if you want an example of an unlimited copyright exception that allows use of even infringing content (as long as the user “does not know” it is infringing), look to Singapore law. I have even heard that companies are using VPNs to appear to be based in Singapore, although I don’t see how that will help in litigation. On the other hand, there are countries whose laws clearly require consent for these uses.
Thank you for this amplification, Roy.
it’s unclear to me how AI will handle attribution — for CC BY licensed OA content — let alone the more complex NC and ND requirements for the more restrictive flavors of OA
whenever i asked GPT for its sources it was all squirming and handwringing uhh umm ahh well you see that’s not really how i work. when pushed further it would inevitably invent a few citations in a futile attempt to provide me a satisfactory answer and stop the interrogation
OpenAI or whoever publishing a separate doc citing millions of sources in bulk isn’t a feasible solution to the attribution problem either i don’t think. on the other hand interspersed attribution statements within AI generated content might be technically feasible but that’ll make it a UX nightmare
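for what it's worth, here's a rough sketch of what interspersed attribution could look like as structured metadata riding alongside the generated text (the Attribution class, the field names, and the example DOI are all invented for illustration; nothing like this is actually emitted by current models):

```python
from dataclasses import dataclass

@dataclass
class Attribution:
    """Hypothetical record tying a span of generated text back to a source."""
    start: int    # character offset where the attributed span begins
    end: int      # character offset where it ends (exclusive)
    doi: str      # DOI of the (CC BY) source article; made-up example below
    license: str  # license of the source, e.g. "CC BY 4.0"

generated_text = "Prior work suggests the effect is robust across cohorts."

attributions = [
    Attribution(start=0, end=len(generated_text),
                doi="10.1234/example.5678", license="CC BY 4.0"),
]

# a reader-facing rendering would intersperse the citations in the text itself,
# which is exactly the UX problem mentioned above
for a in attributions:
    span = generated_text[a.start:a.end]
    print(f'"{span}" (source: https://doi.org/{a.doi}, {a.license})')
```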
so yeah… i doubt that OA is really in serious danger of being coopted and exploited by AI companies
and if they do figure out a way to do it while respecting all the license terms well then it’s OA working as intended isn’t it?
so ultimately i’d bet on publishers first and foremost making most of the money from AI (their own proprietary ones custom trained exclusively on content they own themselves) — at least in the scholarly communication domain
AI will handle usage restrictions by ignoring them, unless copyright law has teeth. As for co-opting OA, it is already happening. To co-opt doesn’t mean to eliminate; it means that the meaning of an activity is absorbed into a larger entity. That librarians all work for Mark Zuckerberg now does not mean that they cease to be librarians.
If AI comes up with the cure for your kid’s cancer based on open access research articles the sheer volume of which can’t be understood by the human mind, will publishers also cry foul?
I am coming from both an academic (biomedical) and an AI company background, one that is attempting to give publishers/authors helpful information on how to improve their “product,” the scientific article, so my perspective is a bit different and nuanced.
I find this perspective to be wholly unpersuasive. Science and technology have flourished under a copyright regime. There is no evidence that they would be even more robust in an OA environment.
So if I tell you that 97% of clinical trials failed over a 10-year period (stroke literature – see the MacLeod lab), is that what you call a success of the status quo? I would hate to see what failure looks like.
This mode of science – one reader – one article – one brain – is dead!
The number of papers produced keeps getting higher, but the amount of time we spend reading them is dropping precipitously. The drug industry is starting to not read and not care what is in there.
From where I sit, two key things need to happen to revitalize the scientific literature:
1. Better search across the totality of the literature. The only way to get this is to put the whole literature into one search index and allow lots of AI innovation (not one company/brain but many) so that it can be better sifted and sorted. Think Google Scholar, but a version that actually does some cool things: semantic query, direct access to figures and tables, search for entities and their neighborhoods, not just dumb string search that keeps your thinking in only one semantic neighborhood! (A rough sketch of what a semantic query could look like follows this list.)
2. Better ways to know that a paper is reliable and trustworthy. Trust has been eroded, patients have been harmed, paper mills are flourishing, and publishers need to help bring trust back instead of fighting any attempt to introduce innovation, as you are currently trying to do. How will publishers detect paper-mill papers without AI? Doing a better job of detecting the aspects of study design that are associated with overinflated effect sizes and statistical anomalies, without a helping hand from AI, is a herculean task. Good luck reading every paper and telling 95% of the authors that they need to blind their study and 97% of them that they did not perform a needed power analysis! It would make your managing editors cry. Do you check all reagent descriptions for accuracy? If not, your published study is not reproducible, and readers need to contact the authors to get key methods – see the Cancer Reproducibility Project; it’s open access, and they had to contact the original authors 100% of the time! None, and I repeat NONE, of the top 50 cancer studies could be replicated by a knowledgeable group by reading what was written! This is not a success of the status quo of publishing.
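To make the first point concrete, here is a minimal sketch of what a semantic query over the literature could look like, using an off-the-shelf sentence-embedding model and cosine similarity. The toy abstracts, the query, and the model choice are illustrative assumptions, not a description of any existing product:

```python
# Toy semantic search: rank abstracts by embedding similarity to a query,
# rather than by exact string match. Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

abstracts = [  # stand-ins for a real corpus
    "Blinded randomized trial of a neuroprotective agent in acute ischemic stroke.",
    "Power analysis and sample size planning for preclinical animal studies.",
    "A survey of transformer architectures for natural language processing.",
]
query = "study design flaws that inflate effect sizes in stroke research"

corpus_emb = model.encode(abstracts, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]  # one similarity score per abstract
for score, abstract in sorted(zip(scores.tolist(), abstracts), reverse=True):
    print(f"{score:.2f}  {abstract}")
```

A real system would of course need the totality of the literature in the index, plus entity extraction and figure/table access on top of this, but the ranking step is the part that dumb string search cannot do.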
From the AI community, I can confidently say that AI is just another tool (these tools have been around for ~30 years) that can be used to bash your head in or build a house – your choice.
” Thus, publishers of all stripes need not only stronger, enforceable copyright protection; they also need a means to exploit their data through clever marketing models, and perhaps for the most ambitious, models that include building their own AI services. What would a bot built on ScienceDirect look like? What could it do? How could Elsevier’s shareholders profit from it?”
I’m wondering what “stronger, enforceable copyright protection” you are thinking of? I often lament the lack of universal aggregation. If a Publisher is limited to aggregating only the content it owns, the result is essentially useless, as it is only a subset of the possible published, under-copyright, or open content available to data mine. It seems to me that the better target is to pursue more universal aggregation to support discovery and creative results from that discovery.
CC licensing seems a solid foundation for all creators to embrace and better manage their copyright. I guess I am still craving a future where I could search and be served a complete reference listing from some beloved major multi-volume reference work published (over many years), whether the search is AI-enhanced or not. At this moment I feel it is the creator who may require additional education and action to protect the copyright of new creations, not the Publisher, who acquires publishing rights from the creator for all known or as yet unknown means. The best pathway forward is more aggregation of source materials: articles, books, major reference works, focused monographs, data sets…. Letting the creator discover what exists, and whether or not they can access that source material, is a challenge creators face daily, but the solution should address the needs of creators, not publishers.
Most obviously, fair use should be defined clearly and narrowly. Another key point is that individuals and small organizations should not have to bear the burden of enforcement. This should be the work of enforcement agencies of the government. I laugh at the prospect of a tiny professional society attempting to sue Google.
Thanks Joe – this is a great read. The training corpus that Voltaire accesses in mid 2023 is almost exclusively created by humans. By the middle of this decade that will no longer be the case: there will be millions of images, movies, blog posts and all other types of content generated by Voltaire’s fellow generative AIs.
Applying that to scholarly publishing points to an alarming future: it’s going to become trivial to create realistic fake research, and that will slowly contaminate the literature (even more than paper mills currently do). How can we train high quality AI tools that help with research if the training articles are fabricated junk? Publishers that can ensure their published work is genuine will (in time) have an extremely valuable commodity.
There’s a preprint out about this “model collapse” where AIs will ingest more and more content created by AIs until the whole thing turns into gibberish:
https://arxiv.org/pdf/2305.17493v2.pdf
Blog post about it here:
https://www.lightbluetouchpaper.org/2023/06/06/will-gpt-models-choke-on-their-own-exhaust/
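A toy illustration of the dynamic (my own drastic simplification for intuition, not the paper's actual experiments): fit a simple Gaussian to data, then repeatedly refit it to samples drawn from the previous generation's fit instead of from the real data, and watch the learned distribution wander and narrow.

```python
# Toy "model collapse": each generation fits a Gaussian to data sampled from
# the previous generation's fitted Gaussian, never from the original data.
import numpy as np

rng = np.random.default_rng(0)

real_data = rng.normal(loc=0.0, scale=1.0, size=1000)  # "human-generated" content
mu, sigma = real_data.mean(), real_data.std()

for generation in range(1, 11):
    synthetic = rng.normal(loc=mu, scale=sigma, size=200)  # train on AI output only
    mu, sigma = synthetic.mean(), synthetic.std()
    print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")

# Over many generations the estimated parameters drift and the spread tends to
# shrink: the tails of the original distribution are gradually forgotten.
```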
This is fascinating, David—thanks for sharing. So, by this logic, the winners will be companies that own troves of copyrighted work which can be kept hidden from search engines (except for summaries and abstracts). Everyone else will start spiraling down the gibberish drain as more and more AI-generated content is born (the study mentions how the Internet Archive is getting hit pretty hard right now so tools can be trained on “pristine” data). But do you think we’ll really be able to tell the difference? In science, sure—scientists will know when an AI summary of cancer research (for example) is just garbage. But will we also recognize when history gets mangled, for example (it already is—this is the crux of Jason Steinhauer’s work), or when misinformation becomes truth because the AI sampling model eventually just succumbs to so much misinformation? Maybe we need new conventions for publishing AI-generated work—put this info on do-not-index pages, or include some kind of tag like CC-AI to let search engines know not to treat this as primary source information?
As I drafted this blog post, I wondered if anyone would respond to it in the Comments section by prompting a chatbot. As I look over the comments thus far, I cannot tell if all of them are genuine.
This is an excellent post. I think about AI (etc) like this: it’s as if the symbolic world, as represented on zillions of servers linked in a vast network, has become Wikipedia.
On Wikipedia unpaid editors gather information they don’t pay for and put it on a website that, though it is a “non-profit,” makes a small fortune and employs dozens of people. What a business model! I rather like Wikipedia and have no problem with it.
So AI (or the LLMs) do this on the scale of the internet. They use unpaid machines to gather information they don’t pay for and put it on a website that will inevitably charge someone to answer questions/produce things and thereby make a fortune.
The point is that the people who produce the “knowledge” never get paid, at least by the business at the “point of sale.” A cynical person might call this “theft.”
The question, of course, is how the “knowledge producers” (very ugly term) are going to get paid. Maybe they aren’t.
Joe, this is a great piece of writing. The conclusion drawn has me thinking that Publishing (with a capital P) – the act of acquiring content, fixing it up, and selling it for a profit – is increasingly hard to reconcile with research communication – which, at its core, is a basic model of researchers registering, validating, and disseminating their results. For that reason, I don’t track with the ‘co-opted by tech’ argument; although I do find the arc of the OA transition itself to increasingly feel like its own Voltairean social satire. (Your latest edition of The Brief notes OA activists questioning APCs, well-intended journal editorial boards longing for the good old days of low output and high rejection rates, and far-right political forces in America inadvertently lining up on the side of commercial publishers – who are basically pushing OA transitional agreements as their core idea at this point.) I don’t even know what to think anymore. My brain is spinning with literary allusions and Beatles quotes. Maybe this whole post is just the AI’s way of messing with me, “Joe Esposito” style.