When metrics are adopted as evaluative tools, there is always a temptation to game them. Without rules and sanctions to prevent widespread manipulation, metrics lose their relevance, become meaningless, and are quickly disregarded by those who once believed that they stood for something important.
Why count Facebook Likes and Tweets when you can purchase thousands of them for just a few dollars? For these metrics to remain robust indicators of something meaningful, it is important to keep the cheats out of the system.
Each year Thomson Reuters, producer of the Journal Impact Factor, puts dozens of journals in time-out for manipulating their numbers through self-citation. This year, for the first time, it delisted several titles for engaging in a citation cartel.
Thomson Reuters has a vested interest in keeping its citation database clean for a simple reason: it profits by selling its data and services to universities, publishers, governments, and funding agencies. Removing those who game the system from its database puts Thomson Reuters in a position of authority, and not all researchers are pleased with its monopoly of power. Some researchers and organizations support Google Scholar Citations as a free alternative.
According to a recent paper uploaded to the arXiv, “Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting,” free may come with some serious data quality issues. Indeed, as the researchers write, radically altering the citation counts of one’s papers (and thus increasing one’s h-index) takes little effort and is open to anyone who can cut, paste, and post:
It is not necessary to use any type of software for creating faked documents: you only need to copy and paste the same text over and over again and upload the resulting documents in a webpage under an institutional domain.
For the purposes of their experiment, the researchers created a fake researcher (Marco Alberto Pantani-Contador — a reference to two infamous cyclists, Marco Pantani and Alberto Contador, each of whom was accused of blood doping). Copying and pasting text from a website, adding a few figures and graphs and lots and lots of self-citations (774 citations to 129 papers), the researchers created six fake documents, translated them into English using Google Translate, and uploaded them to a new webpage under their university’s domain. It was a process, the authors explain, that took less than half a day’s work.
All the authors needed to do was to sit back and wait for Google Scholar to index them.
Less than a month later, the citations came back, boosting the citation profiles of the cited authors and of several journals. After demonstrating that Google Scholar could be gamed, the researchers took down the faked documents and waited for the citations to return to their previous numbers. Unfortunately, they are still waiting. Versions of their documents are still available in Google as cached documents. This suggests that the system is both vulnerable to gaming and resistant to correction. Google Scholar is on a trajectory toward chaos.
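To see why a handful of fake documents can move an h-index so much, here is a minimal sketch in Python. It uses the standard definition of the h-index (the largest h such that h papers have at least h citations each). The citation counts are hypothetical, not data from the study, and the sketch assumes each of the six fake documents cites every paper once, which is consistent with the reported 774 citations to 129 papers (6 × 129 = 774).

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for one author's papers (NOT data from the study).
before = [12, 9, 7, 4, 3, 2, 1, 0]

# Assumption: each of the six fake documents cites every paper exactly once,
# so every paper gains six citations.
FAKE_DOCUMENTS = 6
after = [count + FAKE_DOCUMENTS for count in before]

print(h_index(before), h_index(after))  # prints "4 7"
```

Because the fake citations are spread across every paper, they lift the long tail of rarely cited work, which is exactly what the h-index rewards.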
While the authors do concede that a free tool for evaluating the impact of research is empowering to those traditionally disenfranchised by commercial products, there is a tradeoff between data integrity and free. They write:
Switching form a controlled environment where the production, dissemination and evaluation of scientific knowledge is monitored (even accepting all the shortcomings of peer review) to a environment that lacks of any kind of control rather than researchers’ consciousness is a radical novelty that encounters many dangers.
Pursuing an open and unregulated evaluation system means having all of the filters at the production end. Yet without any central authority, the authors of this paper can only appeal to researchers’ good “ethical values” and guidelines for acceptable behavior. Unfortunately, the authors’ own experiment demonstrates how easy it is to game the system and how impervious the system is to self-correction.
The barrier to entry for indexing in Google Scholar is an academic domain, which, as many in academia and publishing understand very well, is largely unregulated. Students are often given webspace for their projects; departments, labs, and individuals are allocated space or are permitted to run their own servers. Most institutional repositories require that those submitting new documents merely click through a generic copyright page. Subject-based repositories, like the arXiv, provide only cursory review of submitted documents. The approach for these spaces is to intervene only when someone complains. While uploading documents into these spaces is considered “publishing” in the broadest interpretation possible, these spaces lack the filtering that goes on in the journal space. Combining citations from a largely unregulated space with those from a tightly regulated space is not just problematic; it corrupts the citation as an evaluative metric.
Calling on Google to tightly regulate its citation index is a call to deaf ears. Google prefers algorithms over humans, and at this time, it is still very easy to trick indexing software into thinking you’ve created an original scholarly document. Moreover, there is no reason why Google, unlike Thomson Reuters, would want to invest a huge amount of human resources into fixing its citation indexing problem. Google is in the business of selling advertisements to companies, not metrics to scientific organizations.