Editor’s Note: Today’s post is by Hong Zhou and Hiba Bishtawi. Hiba is a Senior Product Manager at Simma.io, responsible for next-generation product discovery and fintech-integrated checkout solutions that are transforming cross-border e-commerce. She previously served as a Product Manager for information discovery at Atypon. Disclaimer: A proprietary AI tool assisted in polishing this post, with all facts verified by the authors.
The way researchers discover information is changing again — and this time, the shift feels more structural than incremental. For decades, discovery has revolved around keywords: carefully chosen terms, Boolean operators, and increasingly sophisticated relevance ranking. Today, generative AI systems are introducing a different interaction model altogether. Instead of asking how to search, researchers are starting to ask what they want to know — and expecting the system to figure out the rest.
This raises a set of practical questions for publishers, librarians, and tool providers. Is keyword search becoming obsolete? When does natural language outperform traditional approaches — and when does it fail? And what do current AI-powered discovery tools actually do well, as opposed to what we hope they will eventually do? These questions are not theoretical. They affect platform investment decisions, metadata strategy, interface design, licensing models, and even how success is measured. As discovery increasingly happens through AI-mediated layers rather than directly on publisher platforms, understanding where different approaches succeed or fail becomes strategically important, not just technically interesting.

To explore these questions, we conducted a comparative analysis of four widely used AI-enabled research discovery tools — Elicit, Typeset.io (SciSpace), Consensus, and Scite.ai — looking at how they perform across different types of research queries and discovery contexts. The results suggest not a clean replacement of keywords, but the emergence of a hybrid future in which precision search and AI-driven synthesis coexist, sometimes uneasily.
Why Keywords Are Not Going Away
Predictions about the “death of keywords” tend to underestimate how deeply they are embedded in research workflows. Modern keyword search is already far more semantic than its early predecessors. When a user types “best Italian restaurant nearby,” the system interprets intent, location, and preference — not just string matches. In scholarly environments, the same is increasingly true: controlled vocabularies, metadata enrichment, and semantic indexing have quietly improved keyword search for years.
More importantly, keywords still solve problems that today’s AI systems struggle with. Precision remains paramount for specific use cases such as error codes, product numbers, technical specifications, and exact phrases that require deterministic matching. Professional researchers — including lawyers, analysts, and academics — often depend on precise Boolean logic to construct comprehensive and reproducible queries (for example, “climate change” OR “global warming,” or “machine learning” AND “healthcare” NOT “image processing”). These capabilities are not simply habits; they are essential tools for transparency, recall control, and methodological rigor.
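To make the reproducibility point concrete, here is a deliberately tiny Python sketch, not any platform's actual query syntax, showing why Boolean filtering is deterministic: the same expression over the same corpus always returns the same set, and each clause can be inspected when results look wrong.

```python
# Toy corpus; documents and IDs are invented for illustration.
docs = {
    1: "climate change impacts on coastal agriculture",
    2: "global warming and machine learning for healthcare delivery",
    3: "machine learning for image processing in healthcare",
}

def matches(text: str) -> bool:
    """Encodes: "machine learning" AND "healthcare" NOT "image processing"."""
    t = text.lower()
    return ("machine learning" in t
            and "healthcare" in t
            and "image processing" not in t)

hits = sorted(doc_id for doc_id, text in docs.items() if matches(text))
print(hits)  # [2]: same result on every run, and each clause is inspectable
```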
There is also a practical consideration: keyword-based indexing remains computationally efficient and comparatively cost-effective at scale. Pure semantic or embedding-based search is resource-intensive and opaque, making it difficult to debug or optimize in large production systems. As a result, most large discovery platforms already operate as hybrids, blending lexical, semantic, and behavioral signals behind the scenes.
Rather than disappearing, keywords are increasingly becoming infrastructure — working “under the hood” even when users interact through natural language interfaces. In many systems, natural language queries are quietly decomposed into weighted keyword terms, entities, and embeddings before retrieval even begins. The interface may look conversational, but the retrieval stack remains deeply rooted in lexical logic. The real question, then, is not whether keywords survive, but whether users retain meaningful ways to invoke precision when they need it, and how well AI-driven tools complement (or obscure) that underlying infrastructure.
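As an illustration of that “under the hood” blending, the sketch below combines a lexical relevance score with embedding similarity, roughly how many production stacks fuse signals. The weighting scheme, toy vectors, and function names are our assumptions for illustration, not any vendor's implementation.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(lexical: float, query_vec: list[float],
                 doc_vec: list[float], alpha: float = 0.5) -> float:
    """Blend a pre-normalized lexical score (e.g., scaled BM25) with
    embedding similarity; alpha tunes the lexical/semantic balance."""
    return alpha * lexical + (1 - alpha) * cosine(query_vec, doc_vec)

# A conversational query is typically decomposed first: weighted
# keywords feed the lexical side, an embedding feeds the semantic side.
print(hybrid_score(0.8, [0.1, 0.9], [0.2, 0.8], alpha=0.6))  # ~0.88
```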
Four AI-powered Discovery Tools
The tools examined here represent different philosophies of AI-assisted research discovery:
- Elicit (Ought) focuses on literature review, using semantic search and structured extraction across academic databases.
- Typeset.io (SciSpace) positions itself as a general research copilot, supporting reading, writing, and analysis across a very large corpus of scientific literature.
- Consensus aims to answer research questions directly, synthesizing findings from peer-reviewed studies with a strong emphasis on empirical claims.
- Scite.ai approaches discovery through citations, analyzing how papers reference one another and classifying citations as supporting, contrasting, or mentioning.
While all four are evolving rapidly, they provide a useful snapshot of current design trade-offs in AI-enabled discovery.
How the Evaluation Was Structured
To move beyond anecdote, the experiment examined tool performance across three layers of discovery:
- Publication level — queries about known articles or specific papers
- Library level — searches within defined collections or topical corpora
- Global level — open-ended discovery across each tool’s full content base
Queries were grouped into six common research scenarios:
- Mathematical equations and scientific formulas
- Chemical compound searches
- Image- or figure-dependent discovery
- Implicit or contextual questions (e.g., emerging trends)
- Direct factual or yes/no questions
- Research analytics and bibliometric questions
Each query was evaluated for accuracy, completeness, and relevance, using both Claude and ChatGPT as independent evaluators. Using AI models as evaluators introduces its own biases, but employing two different systems helped reduce single-model blind spots and highlighted where answers were consistently strong or weak. Scores were normalized to allow comparison across tools and query types. The goal was not to rank tools competitively, but to surface structural strengths and recurring limitations across categories: consistent patterns rather than one-off successes or failures.
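For readers who want the mechanics, here is a minimal sketch of how dual-evaluator scoring and normalization of this kind can be computed. The 1-to-5 rubric, the simple averaging, and the min-max normalization are illustrative assumptions, not the exact protocol used in the study.

```python
# Illustrative scoring scheme; rubric and scale are assumptions.
from statistics import mean

CRITERIA = ("accuracy", "completeness", "relevance")

def combine_scores(claude_scores: dict, chatgpt_scores: dict) -> float:
    """Average the two evaluators' 1-to-5 ratings across all criteria."""
    per_criterion = [
        mean([claude_scores[c], chatgpt_scores[c]]) for c in CRITERIA
    ]
    return mean(per_criterion)

def normalize(raw_scores: list[float]) -> list[float]:
    """Min-max normalize raw scores to [0, 1] so tools and query
    types can be compared on a common scale."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:  # all queries scored identically
        return [1.0] * len(raw_scores)
    return [(s - lo) / (hi - lo) for s in raw_scores]

# Example: one query, rated independently by each evaluator.
claude = {"accuracy": 4, "completeness": 3, "relevance": 5}
chatgpt = {"accuracy": 4, "completeness": 4, "relevance": 4}
print(combine_scores(claude, chatgpt))   # 4.0
print(normalize([4.0, 2.5, 3.0]))        # [1.0, 0.0, 0.33...]
```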
Where Traditional Search Still Wins
The results make one thing clear: traditional keyword search remains superior for exactness. Known-item searches — specific titles, phrases, DOIs, or chemical formulas — consistently perform better with keyword-based systems. Metadata filtering (by author, year, journal, or document type) also remains a weak spot for AI-first tools.
Chemical structure queries are particularly challenging. Without descriptive context, molecular formulas and IUPAC names often produce incomplete or inconsistent results in AI systems, most acutely at the global search level. Similarly, image-based discovery remains largely unsupported. Figures, charts, and diagrams are effectively invisible unless explicitly described in accompanying text.
Perhaps most frustrating for experienced researchers is the loss of control. In traditional systems, a poorly performing query can usually be debugged: terms can be adjusted, fields constrained, logic refined. In AI-driven systems, failures are harder to diagnose, and Boolean operators and quotation marks are frequently ignored or misinterpreted. Users are often left unsure whether a missing result reflects a gap in coverage, a prompt interpretation issue, or a retrieval limitation. This opacity changes not only how people search, but also how much they trust the results. For users accustomed to building precise, reproducible queries, this represents a meaningful regression.
Where AI Tools Clearly Add Value
At the same time, the strengths of AI-powered discovery are equally clear. Natural language questions that require interpretation — such as, “What are the recent breakthroughs in renewable energy?” — are where these tools shine. AI systems are particularly effective at summarizing bodies of literature, identifying themes, and synthesizing evidence across multiple papers.
Among the tools tested:
- Elicit performed exceptionally well on factual and literature-discovery queries, particularly when logical constraints were expressed conversationally.
- Typeset.io delivered consistently strong performance across most categories, making it a reliable general-purpose research assistant.
- Consensus stood out for empirical and statistical questions, often providing clear, well-grounded answers backed by cited evidence.
- Scite.ai offered unique value through citation context, helping users assess not just what is cited, but how it is cited.
In short, AI tools excel when the task involves interpretation, synthesis, or sense-making: precisely the areas where traditional keyword search has always been weakest. They reduce the cognitive overhead of scanning dozens of abstracts, help users orient themselves in unfamiliar literatures, and lower the barrier to entry for interdisciplinary exploration. For early-stage research questions or rapid situational awareness, this represents a meaningful productivity gain.
Shared Limitations
Despite their promise, the tools also share notable limitations. None handles image-centric discovery well. All struggle with exact phrase matching and technical precision. And all impose a degree of opacity that makes it difficult for users to understand why a particular result was returned.
These are not minor issues. In research contexts, trust, reproducibility, and explainability matter. A system that produces plausible but imprecise answers may be acceptable for exploratory discovery, but it is not suitable for systematic review or regulatory work.
What This Means for Discovery Going Forward
The broader implication is that we are not moving from keywords to AI, but toward keywords plus AI. Discovery is becoming layered: conversational interfaces on top, retrieval systems beneath, and evaluation workflows alongside. Designing for this reality requires acknowledging that no single interaction model serves all research needs equally well. Flexibility, not replacement, is the defining characteristic of effective discovery going forward. The most effective discovery environments will be hybrid systems that combine the following (a rough interface sketch follows the list):
- Natural language interfaces for exploration and synthesis
- Keyword and metadata controls for precision and verification
- Transparent signals that help users assess confidence and coverage
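As an illustration of what such a hybrid environment might expose, here is a minimal sketch of a request/response shape that keeps all three layers available side by side. This is a hypothetical interface of our own devising, not any existing tool's API; every field name is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DiscoveryRequest:
    question: Optional[str] = None        # natural language exploration and synthesis
    boolean_query: Optional[str] = None   # explicit precision control
    filters: dict = field(default_factory=dict)  # author, year, journal, type...

@dataclass
class DiscoveryResponse:
    answer: str        # synthesized narrative, when requested
    sources: list      # resolvable identifiers (e.g., DOIs)
    coverage_note: str # what was and was not searched
    confidence: float  # transparent confidence signal, 0 to 1

# Example: exploration plus verifiable constraints in one request.
req = DiscoveryRequest(
    question="What are the recent breakthroughs in renewable energy?",
    boolean_query='"renewable energy" AND ("breakthrough" OR "advance")',
    filters={"year": "2022-2025", "document_type": "journal-article"},
)
print(req)
```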
At the same time, discovery itself is shifting from conversational search toward task completion. Researchers increasingly expect systems not just to retrieve information, but to help generate bibliographies, extract data, compare methods, or draft outlines. In that sense, conversation is becoming the entry point, not the end goal.
This has implications beyond tooling. As discovery becomes more AI-mediated, traditional metrics such as clicks and downloads are no longer sufficient. New indicators — AI retrieval frequency, citation surfacing, hallucination rates, and workflow impact — are beginning to matter.
For publishers and libraries, the message is familiar but increasingly urgent: content must be not only discoverable, but AI-ready. This goes beyond exposure to search engines. It includes structured abstracts, consistent section tagging, high-quality references, persistent identifiers, and clear licensing signals that enable lawful machine use. Content that cannot be reliably parsed, attributed, or reused by AI systems risks being bypassed, regardless of its scholarly value. Rich metadata, clear structure, standard ontologies, and machine-readable rights are becoming prerequisites for visibility in AI-driven workflows.
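As one concrete, hedged example of what “AI-ready” can look like in practice, the snippet below expresses article metadata as schema.org JSON-LD, combining a persistent identifier, structure, references, and a machine-readable license in a single record. All field values are invented for illustration.

```python
import json

record = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "identifier": "https://doi.org/10.xxxx/example",  # persistent identifier (placeholder)
    "name": "Example Article Title",
    "abstract": "Structured abstract text goes here...",
    "license": "https://creativecommons.org/licenses/by/4.0/",  # machine-readable rights
    "isPartOf": {"@type": "Periodical", "name": "Example Journal"},
    "citation": ["https://doi.org/10.yyyy/cited-work"],  # high-quality references
}

print(json.dumps(record, indent=2))
```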
The uncomfortable question is no longer “Can humans find this content?” but “Will AI agents surface and use it?” In an agent-mediated discovery economy, invisibility to machines increasingly means invisibility to people.
As with previous shifts in discovery — from print indexes to online databases, from databases to web search, and now toward agentic discovery — the winners are unlikely to be those who declare a single paradigm victorious. Instead, they will be those who understand the strengths and limits of each approach, and design systems that let researchers move fluidly between precision and synthesis.
Discussion
7 Thoughts on "Keywords Are Not Dead — But Discovery Is No Longer Just Search"
This was very interesting!
I do want to point out that discovery was never just about Boolean/keyword searching for advanced researchers, e.g., faculty and grad students. For over 30 years, I have been teaching such researchers two additional ways to discover more material relevant to their topics. The first involves “footnote chasing”: backwards, forwards, and co-citation analysis. Once you find a great citation, follow its footnotes back in time; use a forward tool like the Web of Science Citation Indexes (in print, then on CD, and now online), Google Scholar, or Scopus to find more recent articles citing that one; and use co-citation analysis in the Web of Science tools (I suspect Google’s “related articles” tool is doing this as well, but I don’t know for sure). Back in library school in the ’90s, we were taught the concept of the “invisible college” and that these citation tools were uncovering these colleges.
The second non-keyword method is simply to trace related authors’ work. The principle is that researchers often work in the same topic area for a good portion of their careers, so once you find one useful author, do author searches on those people to see if there’s more they’ve published that might be useful.
And of course these two techniques can be combined, with the references leading to more authors to follow, and vice versa.
This is exactly how I teach it, also! Great minds think alike. Also, one search is never enough to get the BEST answer. Sure, you will get AN answer, and that is likely where most people stop. Getting them to move on to find all this other material is excruciatingly painful. That’s why Librarians (yes, proper noun) should always have jobs and be exalted for their work. I’m manifesting that!
Love to see this. As the man said, “A New Dimension in Documentation through Association of Ideas.”
Yes, this generally matches what I’ve observed as I’ve analyzed the performance of these systems for clients. One thing that makes cross-comparison difficult, though, is that the different services use different underlying indices. Even excellent retrieval can lead to poor results if the service doesn’t have access to key studies. Summarization services are improving at capturing the main ideas and representing uncertainty faithfully, but too often they miss key insights because the service doesn’t have access to the content. So far, these specialized tools have avoided the fate of becoming just a feature of a more general AI model like Gemini, but that’s only because the consumer model developers haven’t been interested in the academic market. When they do start turning to specialized niches to continue growth, that could change in a hurry.
Just ask a marketer… it is always “AND” — “YESTERDAY” — usually coinciding with “do more… with less.” Now, as ever, collaboration for smart shifts in budgets, creativity, and speed is worth the conversation!
Hi,
Would it be possible to get more details about the study conducted by the authors?
I echo the previous comments: really interesting and timely article, and I’d also love to see the research data from the tests that were carried out. That would facilitate assessment of, and comparison with, other AI-empowered discovery platforms out there, like Zendy, which combines traditional discovery search with AI research assistance.