In the early decades of the 20th Century, there was a big problem with the Universe.

Well, not so much the universe, as our ability to measure it. We didn’t know how big it was. In fact we didn’t know whether the ‘nebulae’ visible in our telescopes were of the Milky Way, or collections of multitudinous stars and worlds way beyond our galaxy. In fact, we didn’t know if there was a “beyond our galaxy”. We had Einstein and his theory of general relativity, which predicted an expanding universe, but we lacked the means to take accurate measurements of astronomically distant objects and so figure out our place within it.

And then we discovered the Cepheid Variable. Henrietta Swan Leavitt figured out that this particular class of stars had a luminosity or brightness that changed periodically, and that the period of that variation was directly related to the star’s luminosity. Once we knew the apparent brightness of one of these stars and its distance, we could use that information to work out the actual brightness of other stars of the same class. A quick bit of maths with the inverse square law gave us the distances to those stars and we had the first ‘standard candles’ that allowed us to start determining the dimensions of the universe.
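As a rough sketch of that quick bit of maths (the function name and the numbers are illustrative assumptions, not real observations): if the period tells us two Cepheids share the same luminosity, the inverse square law turns the ratio of their apparent brightnesses into a ratio of distances.

```python
import math


def distance_from_standard_candle(ref_distance, ref_flux, target_flux):
    """Distance to a star with the same luminosity as a reference star.

    Inverse square law: flux = L / (4 * pi * d**2). For equal L this gives
    d_target = d_ref * sqrt(ref_flux / target_flux).
    """
    if ref_distance <= 0 or ref_flux <= 0 or target_flux <= 0:
        raise ValueError("distance and fluxes must be positive")
    return ref_distance * math.sqrt(ref_flux / target_flux)


# Hypothetical numbers: a reference Cepheid at 1,000 light years, and a
# second Cepheid of the same period that appears 100 times fainter.
print(distance_from_standard_candle(1000.0, 100.0, 1.0))
```

A star that looks 100 times fainter than its twin is ten times farther away. Everything hangs on knowing the reference distance in the first place, which is why getting the standard right matters so much.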

In 1923, Edwin Hubble used this class of stars to show that the Andromeda Galaxy lay outside the boundaries of our Milky Way. Today, we know the age of the universe to an incredible precision (13.798 billion years +/- 37 million years) and we have a remarkably accurate map of where stuff is, out to a few billion light years. When one wants to accurately measure something, standards matter. But, as the constant refinement of the age of the universe since those first Cepheid-based estimates shows, it’s important to get them right.

The other thing about standards is that they get used as the measure of achievement. Surely that’s the point, I hear you cry. Yes, but compliance with a standard can be a double edged sword. Here’s a trenchant observation by the economist Charles Goodhart:

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.

If a standard is a measurement of a physical reality, then working to achieve the specification is a good thing. But when a standard is more arbitrary, say the statistical regularity derived from the number of times the items in a journal were cited, divided by the number of citable items in that journal, things can get rather interesting. This observed statistical regularity drives the behavior of the overwhelming majority of scholarly research. It also drives the money. Impact factor is regarded as a proxy metric. But a proxy for what…
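For the avoidance of doubt about which way round that ratio goes, here is a minimal sketch of the usual two-year calculation. The function name and the citation counts are made up for illustration; real figures depend on what the index provider decides to count as a citable item.

```python
def two_year_impact_factor(citations_per_item):
    """Impact factor sketch for one journal and one census year.

    citations_per_item: citations received this year by each citable item
    the journal published in the previous two years.
    Result: total citations divided by the number of citable items.
    """
    if not citations_per_item:
        raise ValueError("need at least one citable item")
    return sum(citations_per_item) / len(citations_per_item)


# Hypothetical journal with ten citable items over the two-year window.
counts = [120, 40, 5, 3, 2, 1, 1, 0, 0, 0]
print(two_year_impact_factor(counts))
```

Note that the number is a simple average over the whole collection, so it says nothing about any individual item in it.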

Scholars want to get published in high impact journals and so the proxy here is that the clustered articles that drive a given Impact Factor for a given year are indicative of better quality, higher significance work. So if you can get into that collection, you can argue that your work is of a higher quality and significance, regardless of whether that is in fact the case. If you are the editor of one of those high impact journals, you’ll be assessing the inbound manuscripts with one eye on whether they’ll be a net benefit or loss to your allocation of citable entities for the period in question. And I bet the same goes on in the mind of the referee who is selected to offer up an opinion on the manuscript.

Leaving aside all the other complaints that are raised about the Impact Factor, looking at it logically, it’s not a very good tool for assessing scholarly achievement. But right now, it’s still the only game in town. It doesn’t get much better when you take a look at the H-index or other measures based on citation. One is taking the complexity of the work that leads to an output – the manuscript, and defining its importance in the scholarly milieu based on the number of inbound connections made to that output over the course of some arbitrary temporal component.

“If only there was a better way” goes the cry. The altmetrics movement offers up alternative measures of impact predicated on other signals. For the most part, these alternative signals are being looked for in the information streams of the Internet. Social network platforms, forums, blog posts and comments, places like that. The theory is that these signals will provide additional metrics of the value of a given piece of work, and certainly, for things like computer code and similar outputs, the citation metric is utterly useless. Here, one is definitely looking at usage. I’m really excited by the potential of altmetrics. Some really smart people are working on it, but it’s early days. There are a multitude of issues to understand before these metrics can become truly significant as measures of ‘impact’.

For example, none of the current social media networks supply signals for negative sentiment. Retweets, Likes, +1s are all positive signals. In fact, the platforms seem to be engineered to deliberately avoid that option; not surprising perhaps when one considers the advertising based business models that drive the platforms. It’s often the case that if you use Twitter, you will have to make the point that “a retweet is not an endorsement.” Indeed, there’s data out there that indicates that many people retweet things based not on an understanding and opinion of the information in question, but for other reasons, reasons that are complex and poorly understood in the business of figuring out the significance of the dissemination of the information.

It’s worth noting here that citation suffers a similar issue, we all know of papers that are cited “just because”. Current citation analysis makes no attempt to quantify the sentiment that surrounds each act of citing. The other issue that affects both citation and altmetric measures of impact is the difficulty in quantifying the magnitude of the cite or the tweet in accordance with the significance of the person who is doing it. Now it’s easy to make statements about how such social network data simply provides a popularity measure, but I think there are more complex network interactions to be discovered here, and these may well have some utility in describing the fluxes of information exchange around a particular output or collection of outputs.

But does this get us any nearer to a ‘standard candle’ of scientific impact? Science progresses because, bluntly, ideas that match reality survive, and ones that do not, die. Epicycles, The Ether, Steady State Theory, countless other neat solutions to observed phenomena, all have perished as the relentless tide of observation has uncovered their fatal flaws.

And in that statement, perhaps we can see something about how to generate genuine impact metrics. Ones that are resistant to Goodhart’s Law.

I’ll cut to the chase. If a paper were to contain within it a set of machine readable statements that described the content of the paper, then one could start to collect information on how that paper sat within the wider context of research in the area. For example…

Paper 1 – Result: X causes Y

Paper 2 – Result (contradiction): X does not cause Y, but Z does

Speculation: Because Z causes Y, it could also cause A

Paper 3 – Result: Z causes A in [Species 1]

Speculation: Z may cause A in [Species 2]

Paper 4 – Result: Z does not cause A in [Species 2]

Here our four ‘papers’ have encapsulated the debate and the evidence about whether X or Z causes Y, and perhaps A. As the subsequent evidence builds up, the credit would accrue to the authors of the papers. And that would include speculation and soft assertions as well as the hard results. Now at this point, I’d like to acknowledge some of the inspiration for this thinking, namely this presentation by Anita de Waard which contains much of interest in the area of machine readable scientific papers.
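To make the four-paper example a little more concrete, here is one way such statements might be encoded and tallied. This is only a sketch: the `Statement` fields and the `evidence_for` helper are hypothetical names invented for illustration, not an existing schema or ontology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    paper: str
    kind: str          # "result" or "speculation"
    subject: str
    obj: str
    holds: bool        # True for "causes", False for "does not cause"
    context: str = ""  # e.g. the species the claim applies to


# The four hypothetical papers from the example above.
STATEMENTS = [
    Statement("Paper 1", "result", "X", "Y", True),
    Statement("Paper 2", "result", "X", "Y", False),
    Statement("Paper 2", "result", "Z", "Y", True),
    Statement("Paper 2", "speculation", "Z", "A", True),
    Statement("Paper 3", "result", "Z", "A", True, "Species 1"),
    Statement("Paper 3", "speculation", "Z", "A", True, "Species 2"),
    Statement("Paper 4", "result", "Z", "A", False, "Species 2"),
]


def evidence_for(statements):
    """Group hard results by claim, listing supporting and refuting papers."""
    ledger = defaultdict(lambda: {"supports": [], "refutes": []})
    for s in statements:
        if s.kind != "result":
            continue  # speculation is recorded but not tallied here
        side = "supports" if s.holds else "refutes"
        ledger[(s.subject, s.obj, s.context)][side].append(s.paper)
    return dict(ledger)


for claim, papers in evidence_for(STATEMENTS).items():
    print(claim, papers)
```

Nothing in the tally depends on one paper citing another; the claims line up simply because they are expressed in the same vocabulary.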

A world in which papers contain both human readable context and machine readable context could provide some truly informative metrics. As new papers come in either supporting or refuting the assertions made previously, one could have metrics that propagate both forwards and backwards through the research stream. The original paper(s) could have constantly updated metrics of support for or against the results, and who is producing those results. Related results from similar work (observations made in analogous systems, for example) could be seen alongside the research. Later research could be crosslinked directly to the work that led to these results, via the assertions. Again, these would be dynamic links, where the strength of the linkage is predicated upon the totality of the research in that particular area. Even the good old citation could be folded in, providing some further interesting context as to which papers get cited in downstream work, and which do not. And so could the altmetrics measurements.

So our successful authors start to build up a batting average, based upon the number of times their conclusions match a peer validated reality. A validation that doesn’t require that the paper be cited, only that it be discoverable via a common machine readable ontology for the subject in question. It would be inherently resistant to Goodhart’s law as the metrics would be directly related to an emerging reality as the science coalesces around the observable facts.
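A batting average of this kind could be computed along the following lines. Treat it as a sketch under stated assumptions: the `consensus` table, standing in for whatever the field eventually settles on, is entirely hypothetical, and deciding what goes into it is of course the hard part.

```python
from collections import defaultdict


def batting_averages(assertions, consensus):
    """Fraction of each author's settled claims that matched the consensus.

    assertions: iterable of (author, claim, asserted_truth) tuples.
    consensus: dict mapping claim -> truth value the field has settled on.
    Claims that are not yet settled (absent from consensus) are skipped.
    """
    hits = defaultdict(int)
    at_bats = defaultdict(int)
    for author, claim, asserted in assertions:
        if claim not in consensus:
            continue
        at_bats[author] += 1
        if asserted == consensus[claim]:
            hits[author] += 1
    return {author: hits[author] / at_bats[author] for author in at_bats}


# Hypothetical assertions drawn from the four-paper example.
assertions = [
    ("Author 1", "X causes Y", True),
    ("Author 2", "X causes Y", False),
    ("Author 2", "Z causes Y", True),
    ("Author 3", "Z causes A in Species 2", True),
    ("Author 4", "Z causes A in Species 2", False),
]
# Assumed for illustration: the first two questions are settled, the third is not.
consensus = {"X causes Y": False, "Z causes Y": True}

print(batting_averages(assertions, consensus))
```

Authors 3 and 4 get no average at all until the Species 2 question is resolved, which is the point: the score follows the evidence rather than the attention.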

Of course, there is one big issue with all of this; how to get the statements, in a machine readable format. Scientific papers are complex things, and teaching machines to read them is right up there in the really hard category of natural language processing. I’m told that authors are very unlikely to want to put the time in to do this. I’m inclined to believe that, though it was once said that authors wouldn’t deposit DNA or protein sequences in international databases.

If only there were people out there who could take the author output and derive a set of declarative statements as part of, say, an editorial process. This is something that publishers could do. Some publishers already spend considerable time and money on maintaining ontologies that could be used for the purposes of standardizing the machine readable output. Perhaps scholarly societies could play a role here as well. I’m constantly struck by the fact that, for all the noise about making the research output available for mining, the current article format is written for humans, by humans. Surely, it would be better to generate the computable elements of a given paper, minus all the human needed cruft, and simply supply that rather than try to retrospectively extract statements and sentiment from the complexity of a modern research article. Machine readable articles, built on author/editor/publisher curated declarative statements and the associated data (or links thereto), could be a way of generating metrics that get us nearer to a ‘standard candle’ of scientific research output.

David Smith

David Smith is a frood who knows where his towel is, more or less. He’s also the Head of Product Solutions for The IET. Previously he has held jobs with ‘innovation’ in the title and he is a lapsed (some would say failed) scientist with a publication or two to his name.

Discussion

28 Thoughts on "The Measurement of the Thing: Thinking About Metrics, Altmetrics and How to Beat Goodhart's Law"

“It’s worth noting here that citation suffers a similar issue, we all know of papers that are cited “just because”. Current citation analysis makes no attempt to quantify the sentiment that surrounds each act of citing.”

By coincidence, I wrote about precisely this two days ago, and proposed a simple approach to fixing this. (As always, the hard parts aren’t technical, but social). For anyone who’s interested, it’s here.

There is a long history in the literature on the meaning of the citation and whether it is possible to create meaning out of a classification system.

First, like negative signals on Facebook and Google+, negative citations are rare in the scientific literature. It is easier to ignore a bad piece of research than cite it. While attempting to overturn an established dogma certainly requires negative citation, few texts take the time to point out all the weaknesses in other texts.

In the 1960s and 1970s, a number of classification systems were published in an attempt to better understand the meaning of citation. For example, Weinstock (1971) provides 15 motivations, from “paying homage to pioneers,” to “identifying methodology, equipment, etc,” to “disputing priority claims of others.” Even without large computers, these classifications could be recorded on index cards.

During this time, researchers attempted to see if the meaning of citations could be discerned from their context. Using wet computers (i.e. brains), readers attempted to classify the meaning of each citation in a paper and came to the conclusion that many citations are ambiguous and perfunctory (see Chubin and Moitra, 1975; Moravcsik and Murugesan, 1975).

Terrence Brooks (1985, 1986) took recently published papers back to their authors to see if the authors themselves could classify their own citations. He discovered that authors are also not so clear on why they cite, have multiple motivations, and that persuasion (even for scientists) was a principal function of citation. Based on theory and practice, getting a classification system off the ground looked much harder than initially conceived.

This did not stop computer scientists in the 1990s and 2000s from trying to see if computer-based semantic analysis could do the work. Not surprisingly, computer programs did not perform any better than humans, and indeed they performed much worse.

In sum, citations are private acts that hide complex author intentions. Like Google+ and Facebook, they do not completely reveal how scientists feel about each other’s work. New machine readable technology, sadly, cannot solve this problem.

You might want to take into account the issue tree structure of science, as described briefly in my two articles — http://scholarlykitchen.sspnet.org/2012/07/17/how-does-science-progress-by-branching-and-leaping-perhaps/ and http://scholarlykitchen.sspnet.org/2013/07/10/the-issue-tree-structure-of-expressed-thought/.

This structure implies that most articles are articulating prior concepts rather than seeking to prove or disprove prior hypotheses, as your model seems to assume (but perhaps I am wrong about this). In this regard I have long thought that a simple impact metric would be to map and measure the spread of the new language introduced by important articles. Machine readability is certainly applicable here but the machine does not have to understand the science, as it were. The impact of an article lies in its helping people think (and therefore write) about the world in new ways.

+1 to Phil’s point about negative citations being the exception to the rule.

Computers haven’t been taught (yet) to sum up poz/neg citations/tweets/etc, but we _are_ able, with existing technologies, to much more easily find the citations and go read them for ourselves, using our wet computers (+1 to Phil for that term, as well).

Publisher websites (the larger ones, anyway) often include a “Citing papers” section, where you can find the papers that have cited the one you are currently reading. All it takes is a few clicks and Ctrl+F to find the context of the citations to the original paper.

Similarly, Altmetric has done a great job of displaying both the raw counts for social media attention, and enabling readers to click through to read the tweets (which in some cases can help get at sentiments surrounding a paper). We here at Impactstory have done the same for altmetrics related to products other than journal articles, and are daily finding ways to enhance a) what metrics we offer and b) what accompanying qualitative information we can display to give context to the numbers.

That said, if we want to avoid a Goodhart’s Law scenario, perhaps we should only use our wet computers for sentiment analysis.

Hi all, Just to be clear – if the article wasn’t… I’m proposing that Machine readable statements are generated that can then be interrogated by machines to generate a variety of useful metrics. I’m NOT saying that we have to teach machines to read the papers first. In fact, because that bit is so darned hard, I’m suggesting that there are human powered ways to generate the machine readable data. Or Wet Computers to use Phil’s nifty phrase!

David W – I think you raise a good point about the spread of new language – no reason that couldn’t be captured – possibly it’s easier to do this bit with machines rather than wet computers.

Either way is fine: once an appropriate metadata scheme is in place, the corpus of existing untagged articles becomes a playground for researchers who want to try out ways to automatically induce appropriate tagging. Our goal should not be to build such tools, but to remove barriers from those who can.

Perhaps leveraging some expertise in the TEI/digital humanities community (as to how to easily encode entire texts, not the specifics of devising metadata schema for citations) might be a good place to start.

So, some of this is making my “wet computer” hurt, but I’ve a couple brief comments.

Perhaps I’m oversimplifying, but it seems it would actually be really easy to create a set of tags for results, causes and so on which are machine readable. We already do this in scholarly publishing and these tags are used for many things.

Also, as far as the discussion around citations, it seems that Social Cite, which I think we will be hearing a bit more about soon, addresses some of these questions.

Unfortunately tagging articles and measuring impact are very different challenges. The latter involves seeing the relational structure among the articles in the context of emerging knowledge (a problem I have done a lot of work on). Thus the question is how the various articles are related, preferably in real time, not what a given article says, which is what tagging is about. Just as blog comments build upon one another into a body of reasoning, so does science, but the articles are published in such a way that this underlying reasoning structure is largely hidden.

Machine readable summaries of results, using simple sentences might be helpful, or it might not. It is an intriguing concept. But most of a typical research article is not about the results. First there is a review of the issue, followed by a description of what was done, finally followed by an explanation of the results, with perhaps some speculation at the end. How tagging might tell us where the ideas came from (or even harder, where they are going) is difficult to see.

I think so… It needs the network effect for it to be useful of course. The citation thing – perhaps I didn’t express myself so well – but in a mature system… I think one can arrive at a given paper and, irrespective of what citations accrue to that paper, start to see indications of where that paper sits with respect to what it is about. What papers are related via similar declarations of cause and effect, and whether there might be contra-indicators and so on. In fact, one could envision a pre-publication service… “how original is my paper…”

The fundamental problem here and one of the great conundrums of the social sciences is that the object of study is aware of being studied and is often sufficiently interested in the outcomes of such study as to attempt to influence it. Imagine cell biology being confounded in this way.

Indeed Frank. In fact making a prediction known to the group whose behavior is being predicted may be sufficient to make the prediction fail.

Interesting… But what I’m proposing is essentially a distillation of a given paper down to a set of axioms and similar that can be added to the metadata of a given article. There’s no novel information being added here, so it’s no more susceptible to gaming/influence than the paper is – possibly less as in theory (and this is the point) papers will be easier to measure via their match to an eventually derived reality, regardless of whether they are cited, or perhaps even read by another human.

On reflection I am inclined to think that there is an implicit model of science here that may not be correct for the general case. That is, science is a system building process, not a simple confirm or disconfirm process, except in cases of controversy. If so then your axiomatic approach might be best reserved for controversies. In that regard your central concept of an “eventually derived reality” needs to be explained. Mind you my team did some work on this and discovered what they think is a topological transition in the co-authorship network that occurs when a community goes from considering multiple hypotheses to picking one as confirmed. That may be close to what you are talking about.

This proposal appears to rely on the ability of machines to determine what results a scientific experiment has produced that lend support to, or tend to disconfirm, a hypothesis. But unless I’m mistaken, this process makes no allowance for whether the methodology used to obtain the results was sound or not, right?

Hi Sandy,
It relies on humans being able to turn the verbiage of a given paper into a set of ‘triples’ and other associated tags (see Adam’s comment above). And then have machines do neat stuff with the various algorithms that are out there to analyse this stuff. Your point about a methodology is well taken – for the sake of some brevity my description above was something of a starting point for what we might want to describe this way. So methods could well be important to include. Or one might choose to focus on the results and the speculation, because reality will bring out which ones survive and which ones don’t regardless of whether the methodology was sound. Mind you, presumably ALL incorrect results are a result of a flawed methodology (tongue somewhat in cheek)…

Actually semantic web triples are typically logically equivalent to simple sentences, but your concept is intriguing. The deeper challenge is to discern the relational structure among the papers, not just to understand the central claims of each alone. As I like to put it, science is a system of knowledge, not a pile of knowledge.

Triples are only equivalent to extremely simple sentences, but even a low tech parser can get very good semantics from Simplified Natural Language (hereafter SNL). There are a few rules to follow, for example, you have to unpack ambiguous anaphoric references (“The lizard’s tail has fallen off! Don’t worry, it will grow back.” 🙂) and things like that, but the SNL approach has been used for decades by companies and the military to write stuff to be automatically translated into many languages. (And it works WAY better than Google’s mostly terrible auto-translation systems!)

Regarding meta knowledge, I agree, but that is an ever-changing domain, and needs to be updated regularly. Moreover, this is often expressly expressed in review papers, so if I were writing the grant proposal, I’d only bother encoding review articles, which would provide continuous updating of both the first order and meta knowledge.

Thanks for this! Actually my method would be to ask authors to submit a set of declarative statements and speculations written in good old plain english. A human would then basically manually turn those into triples. It would be part of the publication process. One might be able as you suggest to turn a machine onto simplified text – an interesting idea. Of course if this idea works, one builds up a corpus of material that one can use to train a machine.

But really, my point is this – if we want machine readable papers… we should do it at creation time, not post publication. Mining papers post publication, shouldn’t be for the purpose of extracting basic facts about what the paper is about.

I mostly disagree. First, you need high quality standards of translation. A human manually encoding triples from JPE (just plain English) isn’t going to be consistent enough. A human translating into SNL and then a machine translation from SNL into RDF is, and you get to keep the SNL along the way in case there was a mis-xlation. Moreover, as the algorithms improve, the SNL can be re-xlated. Moreover, the meaning of papers changes with time, and it will be important to be able to go back and interpret the important ones. The SNL mostly won’t change, but the xlation from SNL to RDF will as ontologies float. So, the part I agree with is getting the editors to do the initial SNL, but let the authors q/a it, and let a machine xlate it…and keep everything around.

“Of course if this idea works, one builds up a corpus of material that one can use to train a machine.”

This was the basic idea of Yale-school case-based reasoning. Its main fruition was automated phone trees.

Thanks a lot for the informative post!

If I haven’t misunderstood your last paragraph, //”… If only there were people out there who could take the author output and derive a set of declarative statements as part of say, an editorial process…”//, I guess, I have an answer:

There are BIOCURATORS already (http://www.biocurator.org/ & https://tinyurl.com/po57fhv – E-PAPER), who do many such things: (including, but not only) text mining, annotating, and helping scientists in “writing” Structured Digital Abstracts (SDAs). I think many – in fact, all – publishers of scholarly materials must have a handful of biocurators – who are essentially bioinformaticians – in their editorial team. This would also create more jobs for those churned-out PhDs who have a background in bioinformatics, or are quick to learn the tips & tricks of biocuration, so as to smooth the path of scholarly publishing further!

And the machine-readable publishing, called “Semantic Publishing” (https://tinyurl.com/pjwo8oo) is already in practice, for instance in FEBS Letters, and many are likely in pipeline.

Indeed, it’s a lot more work for the scientists who might have one manuscript per finger to handle. But once the publishers make it mandatory for the authors, and when people get used to the practice, it will all be good. I think this is only teething trouble, just as many would have felt until a few years ago that it was difficult to move ahead to online articles from big, hard-bound ledgers of journals.

Altmetrics (https://tinyurl.com/puw8ae5) indeed works well in this regard in analysing the publication bookmarking, accession demographics, and news-worthiness, besides others. So, my theory is: as soon as biocuration and semantic publishing move ahead in full-swing, the modern metrics as well would co-evolve and shine!

Here we go again. The huge mistake made by all bibliometricians is that they fail to consider individual papers. As soon as you do that, you find that altmetrics promote, totally uncritically, the trivial and trendy. We show that in “Why you should ignore altmetrics and other bibliometric nightmares” http://www.dcscience.net/?p=6369

Altmetrics are numbers generated by people who don’t understand research, for people who don’t understand research. They are the alternative medicine of science.

Given that there are over a million papers published each year it is hard to read them all. Some of us are trying to understand what is going on beyond what we can read. Science is important so it deserves study. There are in fact trends in science, even what I would call fads. It is useful to know this.

David Smith: The term machine readable is pretty vague. Even simple “related article” algorithms read the articles so we have lots of machine reading already. It sounds to me like you are talking about a heavy duty artificial intelligence algorithm that analyzes the logical relations between articles, especially the inductive relations of confirmation and disconfirmation, based on axiomatized summaries. Whether this can even be done is open to question, but hey that is what makes a problem interesting. The thing is that machine readability per se is the least of it.

“We had Einstein and his theory of general relativity, which predicted an expanding universe….”

GR itself predicts nothing of the sort. The cosmological constant was only introduced to preserve staticness; expansion and contraction were (and are) valid solutions. It took Friedmann, Lemaître, and Hubble to cement the former.

I would further note that mentioning the dispute over the calibration of the Cepheid distance scale would have been a salient preface to the rest of the discussion. Without a zero point, the period-luminosity relation is of very limited value.