A few years ago, I helped institute a journal’s first “Most and Top” lists — Most Read, Top Searches, Most Blogged, Most Cited. They seemed to work pretty well, but the Top Searches list had an anomaly — the term “biological” was coming in ranked in the Top 5 week after week. It didn’t fit. Every other term was a medical condition — diabetes, sepsis, hyponatremia, and so forth. Why would “biological” rank so high? Was there something about our audience we didn’t understand?
Because it was a raw count, it took the team a while to really question its presence. I fell into the same trap. We’d worked hard to create the counters and lists and categories and interface, and the data were the data. Who were we to question what was a pure reflection of search engine behavior?
Finally, after weeks of strangely consistent results, we started digging. We entered “biological” into our search engine, and only one article showed up as a likely result, an article with “biological” in the title. Some IP address detective work ensued. Fast-forward, and it turns out that the author of the article in question had once hired an industrious assistant who had set up an automated search for the term “biological” in the belief that this would help him know when the paper was cited so he could impress his boss. The assistant? He’d moved on a few years earlier, but the automated search was still running. It took about a week for the new staff to find it and shut it down. “Biological” immediately dropped out of our Top Searches list.
The edge of the network can have unexpected power.
Recently, Google learned about the power of the Internet’s Oort cloud, that seemingly endless cluster of small sites orbiting the major players like the outer comet cloud slowly orbiting our solar system, launching the stray comet our way every now and again. As explained in a New York Times article entitled “Search Optimization and Its Dirty Little Secrets,” J.C. Penney’s search engine optimization (SEO) firm probably indulged in “black hat” optimization: tricks used to deceive Google by promulgating links at the fringes of the Internet so intensely that they overwhelm the PageRank algorithms.
Where did the black hat SEO firm sprinkle J.C. Penney links? Some were those bastions of affordable household goods and practical family fashions like nuclear.engineeringaddict.com, casino-focus.com, bulgariapropertyportal.com, usclettermen.org, and elistofbanks.com. (Notice, I’m not linking to those — no need to perpetuate any SEO problems for Google.)
Google responded by instituting correctives that sank J.C. Penney’s rankings well below the radar. It was a less severe reaction than in 2006, when Google delisted BMW for a short period after that company was caught spamming for links.
Of course, scholarly publishers should watch any malfeasance with Google rankings with some interest. After all, our practice of citation is what Google’s PageRank is expressly built upon. And with new approaches like the Eigenfactor being built to make scholarly citation take on some of the network effects Google has achieved, the relevance only increases.
The Eigenfactor boasts of its approach’s breadth and inclusiveness while denigrating the non-networked approach to calculating impact:
Our algorithms use the structure of the entire network (instead of purely local citation information) to evaluate the importance of each journal.
As the Google/J.C. Penney story reminds us, relying on the entire network can be a double-edged sword. With journals proliferating in number and with more journals launching in remote locations, the network is becoming a potential liability.
We’ve talked here before about the Eigenfactor, going all the way back to a 2008 post by Phil Davis summarizing a paper of his in which the Eigenfactor and the impact factor mapped nearly identically, raising the question of whether popularity and prestige are purely reflective in a relatively closed system like scientific communication. As more online-only or online-mainly journals are developed, the power of the network effect on citation systems will grow.
The argument of increased virtue based on network reliance is dubious at best. It’s entirely possible that the errors of the old approach remain while the network adds novel weaknesses of its own.
(Hat-tip to Marie McVeigh for the pointer.)
2 Thoughts on "J.C. Penney's Black Hat SEO and Google — Why the Network Doesn't Justify Impact Proxies"
If JC Penney had asked for a monthly update on the linkbuilding campaign their company was conducting, it wouldn’t have been an issue. Of course, I believe JC Penney knew exactly what was going on.
The JC Penney story that broke last week was a fascinating one indeed.
But I’m afraid you’re drawing exactly the wrong message about network rankings and vulnerability to SEO or to mere noise in the periphery of networks. The key point is this: while Google’s PageRank is a network-based ranking and while people do use SEO against it, PageRank is far less vulnerable to this sort of exploit than are raw counts that do not take network structure into account.
The reason is straightforward. The SEO company working for JC Penney had to place many thousands of links to sway the rankings, because Google’s PageRank algorithm assigns only a very low weight to each of those links from sites of minor importance. If Google relied on raw counts instead of network rankings, it would be far easier for an SEO company to boost the rankings with this sort of trick, because a link from each inconsequential site would have as much influence as a link from any major website does.
The same principle underlies the advantage of network-based rankings in bibliometrics. Such rankings are hard to manipulate by the analogous method of placing numerous citations in third-tier journals, because while this practice increases total citation counts greatly, the citations from those third-tier journals are almost entirely ignored in network-based rankings.
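The contrast between raw counts and network-weighted rankings can be sketched on a toy graph. What follows is only an illustration under stated assumptions: the page names (hub1, established, spammed, fringe0 and so on), the graph shape, and the damping value of 0.85 are all invented for the example, and the ranking function is a simplified textbook PageRank computed by power iteration, not Google's production algorithm or the Eigenfactor method.

```python
def raw_inlink_counts(links):
    """Count inbound links per page, treating every link as equal."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts


def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration over a dict of out-links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page splits its rank equally among the pages it links to.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy web: three well-connected hubs, one page the hubs link to,
# and one page promoted only by twenty fringe pages that nobody links to.
hubs = ["hub1", "hub2", "hub3"]
fringe = [f"fringe{i}" for i in range(20)]

links = {}
for hub in hubs:
    links[hub] = [h for h in hubs if h != hub] + ["established"]
links["established"] = list(hubs)
links["spammed"] = list(hubs)
for page in fringe:
    links[page] = ["spammed"]

raw = raw_inlink_counts(links)
scores = pagerank(links)

for page in ("established", "spammed"):
    print(f"{page:12s} raw in-links = {raw[page]:2d}  pagerank = {scores[page]:.3f}")
```

In this toy graph the spammed page wins easily on raw in-links (20 versus 3) but ends up with the lower network-weighted score, because each fringe page has very little rank of its own to pass along. A real web or citation graph is vastly larger and messier, so this shows only the weighting principle, not how much spam a production ranking can absorb.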
In bibliometrics, noise is still probably a greater concern than the scholarly equivalent of black hat SEO. Network-based measures help a great deal in this regard as well. Raw citation counts are highly susceptible to the choice of what journals are indexed. Imagine that the Thomson-Reuters JCR were to lower its threshold for inclusion and thereby double the number of journals that it indexes. As a result, impact factors might shift considerably due to chance fluctuations and/or differences in general citation trends coming from these additional (and largely lower-tier) journals. But adding even a large number of these smaller peripheral journals will have only a minor effect on Eigenfactor scores, because each citation from these peripheral journals will be weighted much less than are the citations from the core journals already indexed in the current JCR.
As a final note, your readers might enjoy reading a further discussion of Phil Davis’s Eigenfactor correlation analysis to which you linked above. Last year we published a paper in JASIST that explains the statistical fallacies that tripped up Phil in his analysis: Big Macs and Eigenfactor Scores: Don’t Let Correlation Coefficients Fool You, available at http://octavia.zoology.washington.edu/publications/WestEtAl10.pdf