Today’s post is by Anita de Waard. Anita is Vice President of Research Data Collaborations at Elsevier. This post discusses her research in new forms of publishing, sponsored by the Netherlands Organisation for Scientific Research, and is a writeup of her presentation at the EuroScience Open Forum held in Toulouse this July.
To create better systems for knowledge extraction from scientific papers, it is useful to understand how humans glean knowledge from text in the first place. Studying the language of science delivers some surprising results: fairytale schemas help us understand the narrative structure of articles; the study of verb tense reveals common linguistic patterns of sense-making in science and mythology; and tracing hedging in citations shows how citations spread claims much like rumors.
Scientific Papers Are Written Like Fairytales
Ever since I joined Elsevier in 1988, my colleagues and I have worked on what we see as one of the key roles of publishers: to improve scholarly communication through the use of information technologies. Over these thirty years, we’ve witnessed a series of hypes and jumped on many bandwagons: some were much more successful than we ever imagined (er, like, the web…); some we wisely invested time and effort in (like RDF and MarkLogic); and some, which we were gunning for, never lived up to their promise (we had high hopes for XLink and SVG as carriers for scientific content, for instance).
The Semantic Web has been a particular interest since it started, offering a tantalizing idea: surely, with ‘smart content’ and clever agents, we must be able to finally let go of the centuries-old narrative structure of scientific articles and invent a format that allows computers to consume knowledge directly? And scientists won’t have to read and write all those (bloody) papers? More recently, we are told that Artificial Intelligence (AI) and Natural Language Processing (NLP) will do all the reading and understanding for us: ‘AI is going to eat the world’; Alexa will make publishers obsolete; and new software will deliver not just knowledge, but Facts. Surely the time has come to bypass this narrative nonsense, once and for all?
In 2006, I received funding from the Netherlands Organisation for Scientific Research, which allowed me to spend half of my time for four years developing ‘a semantic structure for scientific papers’. It turns out that when you actually sit down to do it, it is very difficult to represent all the knowledge contained within a research paper in a structured, computer-legible format (I challenge anyone to represent the full richness of the knowledge contained within a paper as a database entry, as triples, or as directed acyclic graphs!).
On the other hand, it is very easy to map a typical scientific paper to the schemas developed throughout the millennia for oratory and narrative text. Table 1 shows a snippet of a paper from Cell mapped to the fairytale schema developed by Rumelhart in the 1970s, next to the fairy tale ‘Goldilocks’ mapped to the same schema. In both texts, we start with a Setting/Introduction, move through a series of Episodes/Experiments, and conclude with an Ending/Conclusion. Similarly, the IMRaD structure of a research paper closely follows rhetorical schemas such as those proposed by Aristotle and the Roman orator Quintilian.
In stories and rhetoric, a statement does not exist as a separate entity; it plays a role in the overarching narrative. It only makes sense to hear about the obstacles the hero is facing once you know why she went on a quest in the first place; likewise, a description of experimental conditions only makes sense if you know why the experiment was done. A paper has a narrative structure because stories are how humans transmit complex knowledge: the narrative context provides the conditions within which the components of the story (or article) get their meaning.
Scientific Facts are Explained Like Myths
But surely we can extract the facts from an article and use those, without any context? In research on so-called ‘fact extraction’ techniques, two very different types of statements are referred to as ‘facts’: first, experimental results (e.g., ‘After treatment A, half the samples had deteriorated’), and second, interpretations of those results (e.g., ‘X is a factor in Pathway Y’). In my research, I identified seven distinct clause types in biological text, including Result and Implication, the categories into which these two example clauses would fall, respectively (see Table 2 for a summary of all seven types).
When we look at the linguistic properties of these different types of clauses, we find that each is generally written in one specific verb tense. In particular, experimental segments (such as Methods or Results) are usually given in the past tense, while interpretative statements about models (e.g., pathways or disease models) are generally written in the present tense. Authors regularly switch back and forth between tenses, even within sentences, when referring to a model within a sentence that is largely about experimental outcomes. This connection is so strong that changing the tense will lead readers to assume the clause is of a different type.
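To make the tense cue concrete, here is a minimal, hypothetical sketch (not the classifier from my research) of how verb tense could be used as a rough signal for clause type, assuming the spaCy library and its small English model are available:

```python
# Toy heuristic: past-tense clauses tend to be experimental (Result-like),
# present-tense clauses interpretative (Implication-like). A sketch only;
# a real classifier needs far richer features than tense alone.
import spacy

nlp = spacy.load("en_core_web_sm")

PAST_TAGS = {"VBD", "VBN"}     # past tense / past participle
PRESENT_TAGS = {"VBP", "VBZ"}  # present tense

def guess_clause_type(clause: str) -> str:
    """Label a clause by counting past vs. present verb forms."""
    doc = nlp(clause)
    past = sum(t.tag_ in PAST_TAGS for t in doc)
    present = sum(t.tag_ in PRESENT_TAGS for t in doc)
    return "Result (experimental)" if past > present else "Implication (interpretative)"

print(guess_clause_type("After treatment A, half the samples had deteriorated."))
print(guess_clause_type("X is a factor in Pathway Y."))
```

Counting tags over the whole clause is deliberately crude; the point is only that the tense signal described above is mechanically detectable.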
Interestingly, this is similar to the use of tense in mythological text: here, experiential discourse is most often provided in the past tense, while the mythological explanation for these experiences is written in the present tense. In cases where a further description of the reason for a human experience is needed, tense switches from clause to clause, exactly as it does in scientific text. In short, our tense use in experimental research articles mirrors that of mythological sense-making (see Table 3).
In summary: throughout the ages, our models for explaining phenomena have changed, from Greek Gods to cellular pathways, but how we speak about them has not.
Facts Are Shared Like Gossip
We find that scientific ‘facts’ are often described as an experimental outcome (a Result) followed by an explanation (an Implication). These two statements are typically connected by a specific clause of the form ‘Indicating that…’, ‘These data suggest that…’, ‘These results could imply…’, or ‘… possibly indicating…’. These connecting clauses take the form of a ‘pointing word’ (these, this result, these data) plus a hedged reporting verb (imply, suggest, indicate, etc.) or a reporting verb with a hedging phrase (possibly suggest, could imply, etc.). The author does not usually state that a given interpretation is correct: it is presented as an option, which might or might not be true. In other words, Implications are given as hedged claims.
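As a rough illustration of how mechanical these connecting clauses are, they can be spotted with simple lexical patterns; the word lists below are my own illustrative assumptions, not the lexicon from the study:

```python
# Match hedged connecting clauses of the form
# <pointing word> [+ noun] [+ hedge] + <reporting verb>,
# e.g. "These data suggest that". Word lists are illustrative only.
import re

POINTER = r"(?:these|this)\s+(?:data|results?|findings?|observations?)?"
HEDGE = r"(?:may|might|could|possibly|probably)?\s*"
REPORTING = r"(?:suggest|indicate|imply|support)\w*"

PATTERN = re.compile(rf"\b{POINTER}\s*{HEDGE}{REPORTING}\b", re.IGNORECASE)

sentences = [
    "These data suggest that the RxxL destruction box is the major motif.",
    "These results could imply a role for the APC.",
    "The APC is constitutively associated with the cyclin D1/CDK4 complex.",
]
for s in sentences:
    print("hedged connector" if PATTERN.search(s) else "no connector", "->", s)
```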
If we look at how these claims are cited, we find that, in general, the hedging is lessened by the citing author. A corpus study of 50 cited papers, with 10 citing papers each, shows that ‘These results suggest that the APC is constitutively associated with the cyclin D1/CDK4 complex’ is cited as ‘…the anaphase promoting complex (APC) is responsible for the rapid degradation of cyclin D1 in cells irradiated with ionizing radiation’: in other words, there is no suggestion and no reference to the experimental results. Similarly, ‘These data suggest that the RxxL destruction box in cyclin D1 is the major motif that renders cyclin D1 susceptible to degradation by IR’ was cited as ‘In the case of cell response to stress, cyclin D1 can be degraded through its binding to the anaphase-promoting complex…’, again omitting the data reference and the hedge. Across the whole corpus, the hedging was weakened in about half the cases, without any indication that results beyond those in the cited paper justified the citing author’s greater confidence.
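This weakening can also be quantified crudely, by counting hedge cues on either side of a citation link; the cue list below is an illustrative assumption, applied to the APC example quoted above:

```python
# Compare hedging in an original claim and in the sentence citing it,
# using a crude lexical count. The cue list is illustrative only.
HEDGE_CUES = {"suggest", "suggests", "indicate", "indicates", "imply",
              "implies", "may", "might", "could", "possibly", "probably"}

def hedge_score(sentence: str) -> int:
    """Number of hedging cues in a sentence."""
    return sum(w.strip(".,()") in HEDGE_CUES for w in sentence.lower().split())

original = ("These results suggest that the APC is constitutively "
            "associated with the cyclin D1/CDK4 complex")
citing = ("the anaphase promoting complex (APC) is responsible for the rapid "
          "degradation of cyclin D1 in cells irradiated with ionizing radiation")

print(hedge_score(original), "->", hedge_score(citing))  # 1 -> 0: hedge dropped
```

A systematic drop in such a score across a corpus would correspond to the weakening described above.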
In short, scientific facts are based on a game of telephone between cited and citing authors. Besides being pulled from their experimental context, claims are often validated (and turned into ‘known facts’) by the simple act of being cited (what Latour and Woolgar call ‘persuasion through literary inscription’).
So What Can We Do?
All these examples show that viewing papers as texts can help us understand how humans make sense of science. But can this insight also help build systems to support sensemaking? Absolutely. For one thing, these linguistic analyses can help set the direction for finding the key claims in papers. In the DARPA Big Mechanism project, the goal is ‘to develop technology to help humanity assemble its knowledge into causal, explanatory models of complicated systems’. Our group, led by Ed Hovy at Carnegie Mellon, was able to automatically classify discourse segments in these texts into seven categories, split each paper into separate experiments, and identify the type of assay for each experiment. That’s not quite the same as ‘mining facts’, perhaps, but it’s a start.
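The CMU system used task-specific features, but the flavor of the discourse-classification step can be sketched with a generic supervised pipeline; everything below (the training clauses, the two labels shown out of the seven, the choice of scikit-learn) is an illustrative assumption, not the project’s actual code:

```python
# Toy discourse-segment classifier: TF-IDF features + logistic regression.
# The training clauses are made up; only two of the seven clause types
# (Result, Implication) are named in this post, so only those appear here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clauses = [
    "After treatment A, half the samples had deteriorated.",
    "Cells were incubated for 24 hours and then lysed.",
    "Protein levels decreased after irradiation.",
    "X is a factor in Pathway Y.",
    "These data suggest that the destruction box mediates degradation.",
    "The APC regulates cyclin D1 turnover.",
]
labels = ["Result", "Result", "Result",
          "Implication", "Implication", "Implication"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(clauses, labels)

print(model.predict(["The samples were stained and counted."]))
print(model.predict(["This indicates a role for the APC in stress response."]))
```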
To improve readers’ ability to trace authors’ claims, two further developments are key. First, current efforts to support the linking of methods and data to papers (such as the TOP Guidelines, the Scholix framework for linking papers to data, and recent work on Enabling FAIR Data) can greatly improve a reader’s ability to check the veracity of the author’s claims.
Second, it is important that we collectively work on better ways to cite these claims. Authoring tools that allow citation at a finer grain than the whole paper would enable interfaces that show cited claims within the context of the original paper, letting the reader decide what level of certainty a cited claim deserves.
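Purely as an illustration of what ‘finer-grained’ could mean, a claim-level citation record might carry a locator for the exact cited sentence and preserve its original hedging; all field names and values below are hypothetical:

```python
# Hypothetical claim-level citation record: points at a specific sentence
# in the cited paper and keeps the author's original hedging visible.
from dataclasses import dataclass

@dataclass
class ClaimCitation:
    cited_doi: str         # DOI of the cited paper (hypothetical value below)
    sentence_id: str       # locator for the exact cited clause
    quoted_claim: str      # the claim as the original author hedged it
    citing_statement: str  # how the citing paper renders it

example = ClaimCitation(
    cited_doi="10.1016/example",
    sentence_id="s42",
    quoted_claim=("These results suggest that the APC is constitutively "
                  "associated with the cyclin D1/CDK4 complex"),
    citing_statement=("the APC is responsible for the rapid degradation "
                      "of cyclin D1"),
)
print(example.quoted_claim)
```

An interface built on records like this could show the reader both versions side by side, making any dropped hedge immediately visible.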
In short, we believe that understanding language can help us build better systems for scientific knowledge transfer. It helps us understand that scientific articles are not impartial observations or bags of computer-interpretable facts, but persuasive stories, written by and for humans. This shows, once again, that the study of what humans do and how they do it (in other words, the Humanities) is needed to develop better systems of scholarly communication.