In the early decades of the 20th century, there was a big problem with the universe.
Well, not so much the universe as our ability to measure it. We didn’t know how big it was. We didn’t know whether the ‘nebulae’ visible in our telescopes were part of the Milky Way, or collections of multitudinous stars and worlds far beyond our galaxy. In fact, we didn’t know if there was a “beyond our galaxy”. We had Einstein and his theory of general relativity, whose equations allowed for an expanding universe, but we lacked the means to take accurate measurements of astronomically distant objects and so figure out our place within it.
And then we discovered Cepheid variables. Henrietta Swan Leavitt figured out that stars of this class brighten and dim periodically, and that the period of that variation directly determines the star’s intrinsic luminosity. Once we knew the apparent brightness of one of these stars and its distance, we could use that information to work out the actual brightness of other stars of the same class. A quick bit of maths with the inverse square law gave us the distances to those stars, and we had the first ‘standard candles’ that allowed us to start determining the dimensions of the universe.
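To make the arithmetic concrete, here is a minimal sketch of that calculation in Python; the period-luminosity coefficients are illustrative approximations, not definitive values:

```python
import math

def cepheid_distance_parsecs(period_days, apparent_mag):
    """Estimate the distance to a Cepheid from its pulsation period
    and apparent magnitude. The period-luminosity coefficients below
    are illustrative approximations, not definitive values."""
    # Period-luminosity ('Leavitt') relation: period -> absolute magnitude.
    absolute_mag = -2.43 * (math.log10(period_days) - 1.0) - 4.05
    # Inverse square law in distance-modulus form: m - M = 5*log10(d/10pc).
    return 10 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)

# A Cepheid pulsing with a 10-day period, seen at apparent magnitude 12:
print(cepheid_distance_parsecs(10.0, 12.0))  # roughly 16,000 parsecs
```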
In 1923, Edwin Hubble used this class of stars to show that the Andromeda Galaxy lay outside the boundaries of our Milky Way. Today, we know the age of the universe to incredible precision (13.798 billion years +/- 37 million years), and we have a remarkably accurate map of where stuff is, out to a few billion light years. When one wants to measure something accurately, standards matter. But, as the constant refinement of the universe’s age since those first Cepheid-based estimates shows, it’s also important to get them right.
The other thing about standards is that they get used as the measure of achievement. Surely that’s the point, I hear you cry. Yes, but compliance with a standard can be a double-edged sword. Here’s a trenchant observation from the economist Charles Goodhart:
“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
If a standard is a measurement of a physical reality, then working to achieve the specification is a good thing. But when a standard is more arbitrary, say the statistical regularity derived by taking the number of times the citable items in a journal were cited and dividing by the number of those items, things can get rather interesting. This observed statistical regularity drives the behavior of the overwhelming majority of scholarly research. It also drives the money. The Impact Factor is regarded as a proxy metric. But a proxy for what…
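For reference, the standard two-year Impact Factor of a journal for year Y is computed as:

$$\mathrm{IF}_Y = \frac{\text{citations received in year } Y \text{ to items published in years } Y-1 \text{ and } Y-2}{\text{number of citable items published in years } Y-1 \text{ and } Y-2}$$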
Scholars want to get published in high-impact journals, and so the proxy here is that the articles that drive a journal’s Impact Factor for a given year are indicative of better-quality, higher-significance work. So if you can get into that collection, you can argue that your work is of a higher quality and significance, regardless of whether that is in fact the case. If you are the editor of one of those high-impact journals, you’ll be assessing the inbound manuscripts with one eye on whether they’ll be a net benefit or loss to your allocation of citable entities for the period in question. And I’d bet the same goes on in the mind of the referee who is selected to offer an opinion on the manuscript.
Leaving aside all the other complaints raised about the Impact Factor, looking at it logically, it’s not a very good tool for assessing scholarly achievement. But right now, it’s still the only game in town. Things don’t get much better when you look at the h-index or other citation-based measures. One is taking the complexity of the work that leads to an output (the manuscript) and defining its importance in the scholarly milieu by the number of inbound connections made to that output over some arbitrary window of time.
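For illustration, the h-index compresses an entire citation record into a single number: the largest h such that h of an author’s papers have at least h citations each. A minimal sketch:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Six papers with these citation counts yield an h-index of 4:
print(h_index([25, 8, 5, 4, 3, 1]))  # 4 papers have >= 4 citations each
```

Everything about what actually made the underlying work good or bad is discarded in that compression, which is rather the point of the complaint.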
“If only there was a better way” goes the cry. The altmetrics movement offers up alternative measures of impact predicated on other signals. For the most part, these alternative signals are being looked for in the information streams of the Internet: social network platforms, forums, blog posts and comments, places like that. The theory is that these signals will provide additional metrics of the value of a given piece of work; certainly, for things like computer code and similar outputs, the citation metric is utterly useless, and one is definitely looking at usage instead. I’m really excited by the potential of altmetrics. Some really smart people are working on it, but it’s early days, and there is a multitude of issues to understand before these metrics can become truly significant as measures of ‘impact’.
For example, none of the current social media networks supplies a signal for negative sentiment. Retweets, Likes, and +1s are all positive signals. In fact, the platforms seem to be engineered to deliberately avoid that option; not surprising, perhaps, when one considers the advertising-based business models that drive them. It’s often the case that if you use Twitter, you will have to make the point that “a retweet is not an endorsement.” Indeed, there’s data out there indicating that many people retweet things based not on an understanding of, or opinion about, the information in question, but for other reasons; reasons that are complex and poorly understood, which complicates any attempt to figure out the significance of the information’s dissemination.
It’s worth noting here that citation suffers a similar issue: we all know of papers that are cited “just because”. Current citation analysis makes no attempt to quantify the sentiment surrounding each act of citing. The other issue that affects both citation and altmetric measures of impact is the difficulty of weighting the cite or the tweet according to the significance of the person doing the citing or tweeting. Now, it’s easy to make statements about how such social network data simply provides a popularity measure, but I think there are more complex network interactions to be discovered here, and these may well have some utility in describing the fluxes of information exchange around a particular output or collection of outputs.
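As a purely illustrative sketch of the kind of weighting I have in mind (the toy graph and the use of PageRank here are my own assumptions, not an established metric), one could score each paper not by its raw citation count but by how weighty its citers are:

```python
# A toy citation graph scored with PageRank, so that a citation from a
# well-cited paper counts for more than one from an obscure paper.
# Paper names and the graph itself are invented for illustration only.
import networkx as nx

G = nx.DiGraph()
# An edge A -> B means "paper A cites paper B".
G.add_edges_from([
    ("paper_A", "paper_C"),
    ("paper_B", "paper_C"),
    ("paper_C", "paper_D"),
    ("paper_A", "paper_D"),
])

scores = nx.pagerank(G)
for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
```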
But does this get us any nearer to a ‘standard candle’ of scientific impact? Science progresses because, bluntly, ideas that match reality survive, and ones that do not, die. Epicycles, the ether, steady state theory, and countless other neat solutions to observed phenomena have all perished as the relentless tide of observation has uncovered their fatal flaws.
And in that statement, perhaps we can see something about how to generate genuine impact metrics. Ones that are resistant to Goodhart’s Law.
I’ll cut to the chase. If a paper were to contain within it a set of machine-readable statements that described the content of the paper, then one could start to collect information on how that paper sat within the wider context of research in the area. For example…
Paper 1 – Result: X causes Y.
Paper 2 – Result (contradiction): X does not cause Y, but Z does.
          Speculation: Because Z causes Y, it could also cause A.
Paper 3 – Result: Z causes A in [Species 1].
          Speculation: Z may cause A in [Species 2].
Paper 4 – Result: Z does not cause A in [Species 2].
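To make that concrete, here is a minimal sketch of what such machine-readable statements might look like as structured data; the schema, field names, and identifiers are all my invention, purely for illustration:

```python
# Hypothetical machine-readable assertions for the four 'papers' above.
# The schema and field names are invented for illustration only.
assertions = [
    {"paper": 1, "kind": "result",      "claim": ("X", "causes", "Y")},
    {"paper": 2, "kind": "result",      "claim": ("X", "does_not_cause", "Y"),
     "contradicts": 1},
    {"paper": 2, "kind": "result",      "claim": ("Z", "causes", "Y")},
    {"paper": 2, "kind": "speculation", "claim": ("Z", "causes", "A")},
    {"paper": 3, "kind": "result",      "claim": ("Z", "causes", "A"),
     "context": "Species 1"},
    {"paper": 3, "kind": "speculation", "claim": ("Z", "causes", "A"),
     "context": "Species 2"},
    {"paper": 4, "kind": "result",      "claim": ("Z", "does_not_cause", "A"),
     "context": "Species 2"},
]
```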
Here our four ‘papers’ have encapsulated the debate and the evidence about whether X or Z causes Y, and perhaps A. As subsequent evidence built up, credit would accrue to the authors of the earlier papers, and that would include the speculations and soft assertions as well as the hard results. Now, at this point I’d like to acknowledge some of the inspiration for this thinking, namely this presentation by Anita de Waard, which contains much of interest in the area of machine-readable scientific papers.
A world in which papers contain both human-readable and machine-readable content could provide some truly informative metrics. As new papers came in, either supporting or refuting the assertions made previously, one could have metrics that propagate both forwards and backwards through the research stream. The original paper(s) could carry constantly updated metrics of support for or against their results, along with who is producing those results. Related results (observations made in analogous systems, for example) could be displayed alongside the research. Later research could be crosslinked, via the assertions, directly to the work that led to those results. Again, these would be dynamic links, where the strength of the linkage is predicated upon the totality of the research in that particular area. Even the good old citation could be folded in, providing further interesting context as to which papers get cited in downstream work and which do not. And so could the altmetrics measurements.
So our successful authors would start to build up a batting average, based upon the number of times their conclusions match a peer-validated reality. A validation that doesn’t require that the paper be cited, only that it be discoverable via a common machine-readable ontology for the subject in question. It would be inherently resistant to Goodhart’s law, because the metrics would be directly tied to an emerging reality as the science coalesces around the observable facts.
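A minimal sketch of how such a batting average might be computed over the hypothetical assertions defined earlier; the scoring rule is entirely my own invention, and a real system would need weighting, time decay, and far more nuance:

```python
# Score each author by the fraction of their claims that later papers
# confirm rather than refute. Relies on the 'assertions' list from the
# earlier sketch; the whole scoring rule is invented for illustration.
from collections import defaultdict

def negate(claim):
    subject, verb, obj = claim
    flipped = {"causes": "does_not_cause", "does_not_cause": "causes"}
    return (subject, flipped[verb], obj)

def batting_average(assertions, authors):
    """authors maps a paper id to an author name."""
    tallies = defaultdict(lambda: [0, 0])  # author -> [confirmed, judged]
    for a in assertions:
        later = [b for b in assertions
                 if b["paper"] > a["paper"]
                 and b.get("context") == a.get("context")]
        confirmed = any(b["claim"] == a["claim"] for b in later)
        refuted = any(b["claim"] == negate(a["claim"]) for b in later)
        if confirmed or refuted:  # only score claims reality has judged
            tally = tallies[authors[a["paper"]]]
            tally[1] += 1
            if confirmed and not refuted:
                tally[0] += 1
    return {author: hits / judged
            for author, (hits, judged) in tallies.items()}

authors = {1: "Author W", 2: "Author X", 3: "Author Y", 4: "Author Z"}
print(batting_average(assertions, authors))
```

Note that nothing here depends on citation at all; a claim only has to be discoverable and comparable via the shared ontology.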
Of course, there is one big issue with all of this: how to get the statements into a machine-readable format in the first place. Scientific papers are complex things, and teaching machines to read them is right up there in the really hard category of natural language processing. I’m told that authors are very unlikely to want to put the time in to do this themselves. I’m inclined to believe that, though it was once said that authors wouldn’t deposit DNA or protein sequences in international databases either.
If only there were people out there who could take the author’s output and derive a set of declarative statements as part of, say, an editorial process. This is something that publishers could do. Some publishers already spend considerable time and money on maintaining ontologies that could be used to standardize the machine-readable output, and perhaps scholarly societies could play a role here as well. I’m constantly struck by the fact that, for all the noise about making research output available for mining, the current article format is written for humans, by humans. Surely it would be better to generate the computable elements of a given paper, minus all the cruft humans need, and simply supply that, rather than try to retrospectively extract statements and sentiment from the complexity of a modern research article. Machine-readable articles, built on author/editor/publisher-curated declarative statements and the associated data (or links thereto), could be a way of generating metrics that get us nearer to a ‘standard candle’ of scientific research output.