I took advantage of a special opportunity to study altmetrics recently. My kid was graduating from college in upstate New York, and I built into the trip an extra day to visit Cooperstown, home of the baseball Hall of Fame and the spiritual center of the altmetrics movement. Baseball, as fans and dispassionate observers know, is a game of statistics, and statistics are the national pastime.
As we in scholarly communications struggle with how to measure the merit of different publications and the researchers behind them, I thought it useful to study how the baseball community does it. What is impact factor for one community is batting average for another, though far, far more work has been put into the calculation of batting averages over the years. Indeed, baseball may be the most studied phenomenon of all, putting aside the analysis of the pages of Facebook. Perhaps the umpires of one discipline can assist the referees of another. The questions are analogous: How valuable is a particular article? Who is the best baseball player of all time?
Part of the reason that statistics have become such a big part of baseball is that there is some growing discomfort–call it post-modern unease–that perhaps the greatest of the greats were propped up by myths and folklore, stories of charisma instead of the hard data of performance on the field. Often human judgment and other superstitions were deployed in the place of statistical correlations, and out of this subjective stew came a list of the usual suspects: Ty Cobb, Babe Ruth, Warren Spahn, Ted Williams, Mickey Mantle–Mickey Mantle!–Willie Mays, Henry Aaron, Ichiro Suzuki, and a small number of other familiar names. If we could get our hands on the right measurements, we would assess baseball greatness differently, and the legendary heroes of yesteryear would be shown to be just that, legends.
For many decades the standard measure of a ballplayer’s impact was batting average. No one really disputes the value of this metric; what is in dispute is whether it tells the entire story. A .300 hitter is, well, a .300 hitter, someone who reliably manages to hit his way on base almost a third of the time, either driving in runs or putting himself in a position to score. Had the community stopped there, however, we would be stuck with the uncomfortable fact that no one had hit .400 in many decades (Ted Williams hit .406 in 1941), which could lead us to the unlikely but oddly satisfying myth that baseball players were better once upon a time. No, batting average tells a story, but it does not tell the entire story.
To batting average we have to add other ways a player gets on base, which gets us to the on-base percentage, a measure that makes the true but unromantic point that a walk is as good as a base hit. But even there we know there is more: it’s not just a matter of getting on base but also of how many bases a player crosses. This then allows us to bring in the hard numbers for extra-base hits and home runs, through the slugging average. Suddenly someone like Mantle, who only (!) batted .298 over his career looms larger, as his impact is now lifted by the achievement of 536 lifetime home runs. We are comfortably in the world of altmetrics now, where we find ourselves measuring more things and establishing new sets of correlations.
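For readers who want the arithmetic behind that progression, here is a minimal sketch in Python of the three measures, using their standard definitions. The class name, field names, and the sample season are invented for illustration; they are not any real player's numbers.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """A hypothetical season's counting stats for one batter."""
    at_bats: int
    singles: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    hit_by_pitch: int = 0
    sacrifice_flies: int = 0

    @property
    def hits(self) -> int:
        return self.singles + self.doubles + self.triples + self.home_runs

    def batting_average(self) -> float:
        # Hits per at-bat; walks do not count as at-bats.
        return self.hits / self.at_bats if self.at_bats else 0.0

    def on_base_percentage(self) -> float:
        # Counts every way of reaching base: a walk is as good as a hit.
        reached = self.hits + self.walks + self.hit_by_pitch
        chances = (self.at_bats + self.walks
                   + self.hit_by_pitch + self.sacrifice_flies)
        return reached / chances if chances else 0.0

    def slugging_average(self) -> float:
        # Total bases per at-bat, so extra-base hits weigh more.
        total_bases = (self.singles + 2 * self.doubles
                       + 3 * self.triples + 4 * self.home_runs)
        return total_bases / self.at_bats if self.at_bats else 0.0


# A made-up season, purely for illustration.
season = BattingLine(at_bats=500, singles=100, doubles=30, triples=5,
                     home_runs=15, walks=60, hit_by_pitch=5,
                     sacrifice_flies=5)
print(f"AVG {season.batting_average():.3f}")
print(f"OBP {season.on_base_percentage():.3f}")
print(f"SLG {season.slugging_average():.3f}")
```

The same .300 hitter looks rather different once walks and extra-base hits are counted, which is the whole point of adding measures.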
To be the best player means living and playing in context; a batter hits not only against a pitcher but against a particular pitcher. To the remarkable record of Carl Yastrzemski or the now-dominant Miguel Cabrera, we have to factor in whether those performances would vary if pitted against a Bob Feller or Cy Young. It takes nothing away from Joe DiMaggio–or does it?–to say that he never had to bat against Mariano Rivera. So as we begin our analysis of Stan Musial, we must take each pitcher he faced into account; and from there we have to determine where that pitcher was in his career: the young fastballer? The canny veteran? And how much of the average for Pete Rose can be accounted for by looking at the weary efforts of players in the second half of a doubleheader?
This prompts an entirely new set of questions that have to be built into our model. It’s one thing to get a base hit, quite another to get a hit when someone is on base. How do we factor in the “clutch” factor? And what about the inning of a game when a batter successfully connects with the ball, the ability of a pitcher to strike someone out late in the game, the vexed question of games called on account of rain, and the moral ambiguity posed by switch hitters? Did a particular player steal a base on a full count or earlier in the at-bat, giving the batter more time to get a hit and drive the base runner home? It’s not enough to say that Jackie Robinson stole bases. We have to know when, how many, and at what time of day.
A great hitter, in other words, is not an abstraction but a set of quantifiable performances that take place in specific contexts. A right-fielder may lead all other right-fielders in sacrifice flies, but the even more important statistic is to lead in sacrifice flies in a particular inning, with the score tied, against a specific pitcher, on a balmy day in July. When we look at the data closely we see the deconstruction of our heroes into the sets of measures and exploited opportunities that they are.
The irrefutable evidence of this is given pride of place at Cooperstown, where the baseball museum includes an exhibit of Roger Maris. Maris notoriously hit 61 home runs in 1961, breaking the beloved Babe Ruth’s record of 60. Maris’s record, however, is marked with an asterisk to denote that the 1961 season had 162 games, whereas 1927, the year Ruth set the old record, had only 154. Well, who is the greater home-run hitter? Does it matter that Ruth in 1927 had more at-bats than Maris in 1961? Is the number of games the right metric? The number of at-bats? The number of pitches thrown? And do we factor in that Ruth had Lou Gehrig batting behind him while Maris had Mantle? What pitcher would be willing to pitch around Ruth or Maris knowing that Gehrig and Mantle were lying in ambush? That asterisk is an expression of humility, as it marks the point where one statistical calculation runs afoul of another.
I have barely scratched the surface of the statistical analysis of baseball, but professionals have drilled into this in depth, beginning with Nobel Laureate Bill James. This mode of analysis was brought to a wide audience with the publication of Moneyball; a movie version followed. Moneyball shows how the incorporation of statistics into the management of a team could lead to success. Before Moneyball who knew that Kevin Youkilis could make the outstanding contribution he did or that the otherwise hapless Oakland Athletics could win a pennant? Statistics could be brought to bear on the real world and change the world. The lesson is clear: quantify everything; whenever you see two numbers, add them together; never judge, only deduce.
The success of Moneyball has led other teams to work with the statistical method. Now we find that what was once the competitive advantage of the Oakland A’s is the norm everywhere. Statistical analysis of baseball, in other words, is no longer something that sets one team apart but is simply the cost of doing business; it’s part of the overhead. But this foray into quantitative analysis has taught us something. It has broken down the legends of baseball into their component parts; it has substituted analysis for intuition. And now, after this analysis, we can see what the numbers tell us so clearly: the great players were those with names like Cobb, Ruth, Spahn, Mantle, Mays, Williams, and Suzuki. We never would have known this otherwise.
18 Thoughts on "Cooperstown, Ground Zero for Altmetrics"
Stephen Jay Gould examined both the .400 hitter (regression to the mean, he argued) and DiMaggio’s streak (helpful umpiring on the judgement of errors, in his view). Well worth a look.
But the analogy with scholarly publishing doesn’t quite work. Once a baseball play is over, it’s done with and the nerds can do their work. But once a paper is published, when is the citation analysis complete? When do we know it will no longer be used?
And what do we mean by, and what do we record for, ‘used’?
That said, it would be fascinating to somehow grade a citation. Is a cite by a Nobel Prize winner worth more than one by a novice? What happens when the novice becomes a Nobelist – do we reset the records? We could keep a good bar in profit for years with this.
Now we need a similar analysis for cricket so at least parts of the rest of the world can follow this…
A framing problem with this is that scholarly communications are currently measured by the impact factor, which is akin to measuring the effectiveness of a team, not individual players. Cooperstown’s statistics, and baseball’s statistics, are largely about measuring individual player performance, not team performance. (For instance, the Oakland A’s used sabermetrics (alt-metrics) to optimize their OBP:$ ratio, but they never won a World Series with it.)
Should we be developing more author-based metrics (in addition to h-index and m-index) to predict author effectiveness? What would be our “accepted paper percentage (APP)” for authors? Given the way journals don’t share rejection data, would this even be possible?
Baseball’s statistics work because the game is available for all to see. You can score a game as effectively as any other observer, and derive the same statistics. Journal publication is full of opaque silos of rejection, and authors want to keep it this way, for obvious reasons. If they’re going to strike out, why show everyone?
Although it’s an interesting metaphor, I agree with Martin, and then some, that scientific publication would appear to be way more complex than baseball. Add to his analysis multiple authorship, peer review, the difficulty of putting together the appropriate data (if we even knew what it was), the sheer size of science, and who and/or what is actually being ‘opposed’ (if you will)? Other scientists? The reviewers? The domain itself??? None of this is to say that we shouldn’t address science statistically – of course we should, and certainly not enough of this is done – but it’s pretty unsurprising that it hasn’t got anywhere near the state of Moneyball. One of the most important missing factors, to my mind, is the personal histories of scientists. We have this only for the elder statesfolk, but in baseball we know everything about the up-and-coming players, which is where it matters. But that is, as they call it “The Show”, and scientists have a greater expectation of, and desire for, privacy. So it’s not clear to me that one will ever have the data one needs to do this right.
Why make the comparison between baseball and scientific publications at all? Baseball is a sport. After every game, there is a winning team and a losing team. Do we really want science to become a sport? What purpose does it serve to develop ranking lists for scientists? It is certainly not good for science, or for the diffusion of science. As interesting as the metrics for baseball are, let’s keep them for baseball.
Whereas I agree emotionally with this sentiment, in practice there are real decisions that have to be made regarding grants and tenure and the direction of research. Better these should be based on data than sentiment.
The choice is not between data and sentiment. The choice is between the misapplication of data and judgment.
Firstly, that’s not what you said in the post I was replying to, and if you had, I’d have disagreed with you even more strongly. Whereas data ought not to be misapplied, posing the misapplication of data against judgement seems to me to be irrelevant. What about the correct application of data? The “Moneyball Falacy” is that judgement (whatever that is) trumps the correct application of data.
(And the “Apple Fallacy” is depending upon spelling correctors rather than re-reading one’s own writing! 🙂
I think most of the comments so far are on the right track. In baseball, there is a clear and obvious goal: winning games. This makes it much more straightforward to design metrics. For offensive players, the point is to avoid making outs and to score runs. For defensive players, the point is to record outs and prevent runs. Scholarly research, however, has no such clear set of goals. Is the point of the research to add to the world’s knowledge? To inspire further research? To drive economic development? To cure disease? The answer is all of the above and more.
Regardless, there are still lessons that can be learned from baseball as Joe points out above. First is the near-constant evolution of metrics in baseball. Rather than sitting on one metric for decades (the Impact Factor), baseball has, at least in recent years, moved from the batting average to OBP and Slugging Average, and from there to OPS as the measure of an offensive player (https://en.wikipedia.org/wiki/On-base_plus_slugging). That became obsolete and was replaced by Adjusted OPS. Even that is less used these days, and the more common stat is WAR (Wins Above Replacement https://en.wikipedia.org/wiki/Wins_above_replacement), which takes fielding and baserunning into account, though I’m told even that is becoming old hat.
Similarly, baseball offers us extremely tailored statistics for specific needs. Rather than one blunt IF to cover everything, pitchers for example can be measured by Wins, Saves, ERA, WHIP, K/9, FIP, xFIP, BABIP, LOB%, HR/FB, and so on, and all can be adjusted for individual parks. The promise of altmetrics is in offering custom statistics for different use cases.
But perhaps the key lesson above, as stated in the closing paragraph, is that advanced statistics often end up telling us what we already know. Calculating Willie Mays’ WAR confirms that he was a great player, but if you ever saw him on the field, you already knew that. There are some levels where the “eyeball test” is good enough. Advanced statistics can separate out minutiae, and let us know that Mike Trout is a better overall player than Miguel Cabrera, but they’re not needed for us to determine that they are both great players. The question raised is what level of detail we need for researcher assessment. Do we really need an IF that goes out to three digits past the decimal point to understand the value of a journal, or are we better off with our innate sense of the rankings of the journals in our field? Does having a numerical altmetric score tell us more about a paper than a detailed reading by an expert in the field? Does knowing that paper X scored .002 higher on some chosen scale than paper Y really tell us that much about the relative merit of each researcher’s work?
I was excited to see the Moneyball comparison in your post–it’s a favorite movie for all of us at Impactstory, for obvious reasons. 🙂
I think you don’t take the analogy far enough, though. To me, Moneyball’s takeaway message is that not everyone has to be amazing at the One Thing (be it batting average or h-index). Rather, there are going to be some players that are great at the One Thing and others who are good at Other Important Things (getting on base, etc/teaching or getting grant money or community outreach etc). When we can devise metrics to measure the impact of those other things, and when we use a more complete suite of metrics to judge our team/department/institution, we can start to fill in the gaps left by years of trying to hire star players who do the One Thing well, but not a lot else, and make a team/department/etc that is well-rounded and helps us achieve our larger goals as an institution.
It is disheartening to see that most commenters on this post fail to see that this is satirical.
Huh. It doesn’t read as satire at all to me. Instead, it reads as though you used a different lens to look at the issue of metrics–and their uses–in academia.
Bummer to hear that your intent was a satire. I was hoping that SK folks were starting to see the light w/r/t altmetrics. 🙂
Does anyone anywhere on the internet care what the OP literally wrote?
(A joke, slightly on point: You are driving along in your Google self-driving car, or perhaps one should say that it is driving you along. Anyway… Suddenly, a child darts out from behind a parked car chasing his dog into the street. The car can’t possibly stop in time, but it can swerve to hit one or the other, the child or the dog. In the blink of an eye the car “phones home” to the GooglePlex to ask Ma Goog what to do. She looks in her deep and broad knowledge of everything, and noticing that you buy lots of dog toys, and search for dog parks, and are generally fond of dogs, she makes the obvious choice.)
Satire depends heavily on what one believes, Joe, so I too missed it. Were you making fun of altmetrics or baseball? It may not work if you were making fun of both. Perhaps you would be kind enough to explain the joke, heading for home as it were.
I am reminded of the story that the bagpipe was a joke played on the Scots by the French, but the Scots have yet to get it.
Good story, and I enjoyed the satire. Sometimes we take ourselves too seriously. By the way, what is the impact of steroids on the statistics? Not sure there is a comparison in scientific publishing. Maybe we can consider Elsevier to be on steroids. See you at SSP.