Editor’s note: Today’s post is by John Frechette, co-founder and CEO of moara.io, an AI-powered research tool, and Sourced Economics, a research firm.
Is there anything more boring than reference management?
Seriously — storing and organizing citation metadata sounds about as exciting as doing your taxes by hand.
But for someone who came to the world of academia a bit late in the game, discovering tools like Mendeley, Zotero, and EndNote was a real breakthrough, and certainly a big step up from trying to manage papers in my File Explorer.
Brief History
As with most products, reference managers have gone through a few big shifts in lockstep with developments in the technologies they run on. In the 1980s and 1990s, EndNote, the leading reference manager of the era, rode the personal computer wave as a desktop application, as most software was at the time. In the mid-2000s, Zotero and Mendeley capitalized on the rise of web browsers, DOIs, more consistent metadata, and online databases that made, for instance, near-one-click saving from the web possible. And in the 2010s, browser extensions became ubiquitous, cloud tools matured, and external and internal APIs became significantly more available. As a result, reference managers were integrated into more external products for research and writing.
My Prediction
Looking ahead, we can expect that state-of-the-art reference management will be characterized by (1) more AI and (2) more workflow.
The metadata fields available for articles will become nearly endless, thanks to both AI-generated content and data from other vendors’ APIs, and users will import, manage, and export papers with a process that scales. We can expect more collaboration and larger reference lists.
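To make “nearly endless” metadata a bit more concrete, here is a minimal sketch of what an enriched reference record could look like. The field names are my own illustrative assumptions, not any vendor’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EnrichedReference:
    # Core bibliographic fields every reference manager already stores
    doi: str
    title: str
    authors: list[str]
    year: int
    venue: str

    # Enrichments pulled from external APIs (illustrative examples)
    citation_count: Optional[int] = None    # e.g. from an index such as OpenAlex
    open_access_url: Optional[str] = None   # e.g. from an open-access lookup service
    retracted: Optional[bool] = None        # e.g. from a retraction database

    # AI-generated enrichments (illustrative examples)
    one_line_summary: Optional[str] = None
    study_design: Optional[str] = None      # "RCT", "cohort study", "narrative review", ...
    reported_effect_sizes: list[str] = field(default_factory=list)
    limitations: list[str] = field(default_factory=list)
```

The specific fields matter less than the fact that the schema stays open-ended: new columns can be filled in programmatically, across an entire library at once, rather than typed in by hand.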
Inevitably, the next reference managers will improve upon many more steps in a user’s typical research process — what I’ll call the vertical integration of reference managers.
To test this prediction, I spoke with William Gunn, an expert on the reference management industry. William was an early employee at Mendeley, which was sold to Elsevier in 2013. Since then, he has been working to improve standards in the scholarly research landscape, with a focus on reproducibility.

Can you fill in the gaps a bit more on the brief history of reference management I provided? Did the industry unfold as you expected when you first joined Mendeley in 2009?
What really set Mendeley apart was the shift to “Web 2.0” — the read-write, social web. Always-on broadband suddenly made it possible for researchers to share and discover work peer-to-peer versus relying solely on publishers’ distribution channels. At the time, I wrote about the parallels to music-sharing and social bookmarking tools like last.fm, del.icio.us, and Connotea, and those ideas helped shape how people saw Mendeley’s potential. Mendeley incorporated the social discovery aspect into reference management, which incumbents had been slow to adopt.
Our broader vision was to accelerate research by connecting researchers and building a shared infrastructure. Using metadata from PDFs across our platform, we created the first open API for scholarly metadata and introduced signals like readership counts, which then helped spark the field of altmetrics. Some ambitions, like answering research questions directly rather than just surfacing papers, weren’t possible at the time but are now emerging with LLMs. Others, like atomizing publishing into claims and figures, remain only partially realized, though projects like System are working on that goal.
What (if any) parallels do you see between the reference managers of that period and the many AI research companies cropping up today?
The similarities today are more about vision than execution. SciSpace is probably the closest to the old Mendeley playbook, mainly in how they try to use a social growth loop. Many other tools take a blunter approach — they summarize the top handful of search results, which means thousands of people end up repeating the same narrow work. Without knowing what fraction of the relevant literature a summary actually covers, you can’t assess disagreements, effect sizes, or risk of bias, so the output can be misleading. I’d love to see tools that let people remix or fork a Deep Research report, and more support for sharing prompts and workflows.
At a high level, what do you envision will be the impact of AI on reference management tasks? Do you think existing vendors are adapting quickly enough to this technological cycle?
At a high level, researchers want three things: to find all the relevant literature, to understand how the papers relate to each other, and to cite the right work when writing. What they don’t want is to worry about formats or citation styles. At Mendeley, we tried to abstract that away with features like keyword search at the point of citation and by getting Elsevier journals to accept any reasonable reference style.
AI could finally deliver what we hoped to do back then — commands like “cite the first paper to make this claim” or “cite the paper that introduced this technique.” Vendors are starting to move in this direction, but established companies have large user bases they need to carry forward, and newer entrants often don’t have enough real researcher feedback. As a result, many tools focus on what’s easy with current technology (summaries) rather than what researchers actually need.
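To make that concrete, here is a rough sketch (mine, not William’s) of how a “cite the paper that introduced this technique” command could be wired up: gather candidates from the user’s own library, then ask a model to choose among them. The `llm_complete` helper and the prompt wording are assumptions for illustration, not a description of any existing product.

```python
def llm_complete(prompt: str) -> str:
    """Placeholder for a call to whichever LLM API you use (assumption)."""
    raise NotImplementedError

def cite_paper_that_introduced(technique: str, library: list[dict]) -> dict:
    """Pick the library entry that introduced a technique, using an LLM as the judge."""
    # Ground the model: it may only choose among the user's own references.
    candidates = "\n".join(
        f"{i}: {ref['authors'][0]} ({ref['year']}). {ref['title']}"
        for i, ref in enumerate(library)
    )
    prompt = (
        "From the numbered references below, which one introduced the technique "
        f"'{technique}'? Reply with only the number of that reference.\n{candidates}"
    )
    index = int(llm_complete(prompt).strip())
    return library[index]
```

A production version would also want the model to justify its choice and to abstain when no candidate fits, but the basic shape stays the same: look only at the user’s library, then ask the model to choose.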
Do you have any big concerns when it comes to managing and executing literature reviews with AI?
The biggest issue with using LLMs for systematic reviews — as opposed to narrative reviews — is that these reviews can change policy or clinical practice, affecting millions of people. Cochrane takes this responsibility seriously and has involved patients, clinicians, policymakers, and others directly in shaping their process. That engagement is fundamental to their legitimacy, and the black-box way LLMs produce reports lacks this key ingredient.
After hundreds of hours testing lit review and Deep Research tools, my view is that they can be useful — but only if you understand their limitations. Recall is still a major weakness. Retrieval bias is another big issue. Even in Deep Research mode, consumer LLMs will cite press releases or company blogs, producing lines like “[company] is the undisputed leader in mitochondrial medicine.” They also hallucinate citations or mix up study details — for example, reporting the wrong drug dose from a clinical trial. Tools like Elicit do better at sticking to primary sources and using a transparent and reproducible process.
So, while I’m optimistic about the direction of these tools and use them myself, you have to treat their output with caution. A report that looks convincing at first glance can still contain serious errors.
Much of our discussion on this topic centers on “enrichments”, the ability to source and pick from a huge number of metadata fields for a given paper. Is this the way of the future in your view?
Retrieval and relevance are still the biggest challenges. Researchers need confidence that a tool hasn’t missed important studies, and they need help understanding what the retrieved papers actually contribute. That’s tough, because retrieval often surfaces many narrative reviews whose value is hard to judge without knowing who wrote them or where they were published — a Cochrane review signals something very different than a fifth MDPI review by unknown authors, and LLMs don’t always have that context.
Where these tools do help is in taking a systematic approach. Much of a systematic review involves extracting study design, methodology, outcomes, effect sizes, and other details in a consistent way. LLMs are good at this tedious work, which lets researchers focus on interpretation instead of manual extraction. Enriching a library with this structured information makes teams faster and improves review quality. I’ve seen tools like your moara.io product start leaning into this enriched, structured approach, and that’s the direction I expect more products to go in. Some tools still miss items like confidence intervals or p-values, but careful prompting — or built-in scaffolding — can address that.
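As a small illustration of what that kind of scaffolding could look like, the sketch below (again mine, not William’s) builds an extraction prompt around a fixed schema that explicitly includes confidence intervals and p-values, so the model cannot quietly skip them. The field list and wording are illustrative assumptions, not any particular tool’s implementation.

```python
import json

# Fields a systematic-review extraction pass might require (illustrative).
EXTRACTION_FIELDS = {
    "study_design": "e.g. RCT, cohort, case-control, narrative review",
    "population": "who was studied and how many participants",
    "intervention": "treatment or exposure, including dose if reported",
    "primary_outcome": "main outcome measure",
    "effect_size": "point estimate as reported (odds ratio, mean difference, ...)",
    "confidence_interval": "95% CI for the effect size, or null if not reported",
    "p_value": "p-value for the primary outcome, or null if not reported",
}

def extraction_prompt(paper_text: str) -> str:
    """Build a prompt that forces the model to answer every field explicitly."""
    schema = json.dumps(EXTRACTION_FIELDS, indent=2)
    return (
        "Extract the following fields from the study below. "
        "Return JSON with exactly these keys; use null for anything not reported.\n\n"
        f"Fields:\n{schema}\n\nStudy:\n{paper_text}"
    )
```

Running the same schema over every paper in a library is what turns one-off summaries into the kind of consistent, comparable extraction described above.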
Lots of features exist for collaborating on a Mendeley or Zotero library. But in practice, is reference management currently a team sport or a solo activity — football or golf?
Engaging with the literature is as much about the process as it is about the outcome. You only really understand a paper once you’ve worked through the data or proofs yourself — that part can’t be outsourced. But discovery is increasingly collaborative. Even during my PhD, there was already too much literature to track alone, and the volume has only grown. Researchers already do this through journal clubs, conferences, and informal networks, and tools that make this collective sense-making easier are likely to do well.
You experienced the Mendeley-Elsevier transition firsthand. What lessons from that period still matter for companies building research tools today?
The biggest lesson is not to underestimate how hard it is to sell to institutions. Individual researchers will discover a tool, like it, and share it with their group almost instantly. Universities are the opposite — long discussions, long trials, lots of coordination. It’s extremely high-touch, and most startups can only manage a few of these cycles at once, so you need a clear strategy for how you approach institutional sales.
What advice would you give to someone building in this space today?
My main advice is to build a real moat. If what you’re working on could be shipped as a new feature in ChatGPT, Gemini, or Claude next month, it’s not enough. These companies are moving incredibly fast and will keep absorbing anything that’s easy to replicate. What they don’t have is a deep connection to academic researchers – it’s a niche market, hard to satisfy, and not a priority for them. Building trust with researchers and creating tools that genuinely fit their workflows can be a durable advantage.
But there’s a flip side: if the major AI labs secure strong licensing deals and fully leverage systems like Wiley’s AI Gateway while startups rely on open datasets like Semantic Scholar and OpenAlex, they could outperform smaller players on retrieval even if their accuracy still lags. So whatever you build, make sure it’s something the big models can’t easily replace.
Final Thoughts
My takeaway from the discussion with William is that while new technologies like AI can dramatically improve research tools, they won’t fundamentally change how research gets done. Late nights spent sifting through papers and models will not — perhaps to many students’ dismay — disappear anytime soon. As William notes, “engaging with the literature is as much about the process as it is about the outcome.” And this is where many tools in the space fall short: they overlook the grunt work worth automating, such as organizing papers, cataloguing them, annotating, logging activities for reproducibility, and creating references. They also miss opportunities to introduce social elements and support team collaboration. In the end, the next era of reference management is about supporting rather than replacing the research process.