Study: Recommender Systems May Increase Citations

Recommender systems that promote related articles across publisher platforms increase citations according to a new study.

The paper, “The Citation Advantage of Promoted Articles in a Cross-Publisher Distribution Platform,” was published online in JASIST on 23 Dec 2019 by Paul Kudlow, co-founder of TrendMD, and colleagues.

TrendMD is an automated system that operates directly from journal websites to recommend relevant content across the TrendMD network. Recommended links are based on the content of the paper, the reader’s history, and the history of other readers. It is not unlike the recommendations one would see when online shopping, streaming movies or music. Recommendations appear at the bottom of an online paper following the authors’ references.

Before I start my review, I need to note that four of the five authors of this paper are founders or employees of TrendMD; the fifth was the lead author’s faculty advisor when he was a graduate student. These conflicts of interest do not invalidate the study; they mean only that this paper deserves a closer and more critical evaluation.

In other posts, I’ve critiqued research reporting that posting one’s papers to Academia.edu boosts citation performance, that AI software was better at selecting manuscripts than editors, or that Google has resulted in authors citing older papers. I also took issue with a paper written by one of Kudlow’s co-authors, for his study on article tweets and citations. Anyone worried that I won’t take a fair, but critical, approach to this paper should stop reading this post and focus instead on the company’s marketing literature. I’m an analyst, not a promoter.

Remarkable methods

The TrendMD paper is remarkable for several reasons: First, the authors set up their study as a randomized controlled trial with foresight into what they wanted to measure (citations and Mendeley saves), for how long (6 and 12 months), and the necessary size of their dataset to find a difference (sample size calculations). This was not a case of indiscriminately digging up data and mining for significant results.

Like other quality medical research, they followed the CONSORT guidelines for reporting their results. Because of this structure, their results come with greater strength of evidence and a better indication that TrendMD is a cause of, and not merely correlated with, better article performance. This paper is valuable for its methods alone and should be read by other companies interested in researching the efficacy of their products and services.

Unfortunately, the way Kudlow and others analyze their data overstates the benefits of TrendMD and, in some ways, obscures their findings.

Unequal comparisons?

A properly conducted randomization will result in groups that are similar to each other in all respects. This is important because differences in the baseline characteristics of each group could result in performance differences by the end of the study. Without baseline similarity, you can’t be sure whether those differences were the result of the treatment or the groups themselves.
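To make the idea concrete, here is a minimal sketch of random assignment. It is not the authors' procedure: the article IDs, the skewed baseline citation counts, and the `randomize` helper are all invented for illustration.

```python
import random
import statistics

def randomize(article_ids, seed=0):
    """Shuffle the IDs and split them into two equal-sized arms."""
    rng = random.Random(seed)
    shuffled = list(article_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical baseline citation counts drawn from a skewed distribution.
rng = random.Random(42)
baseline = {i: int(rng.lognormvariate(0.5, 1.0)) for i in range(2000)}

treatment, control = randomize(baseline.keys(), seed=1)

# Compare the arms on a baseline characteristic before any intervention.
mean_treatment = statistics.mean(baseline[i] for i in treatment)
mean_control = statistics.mean(baseline[i] for i in control)
print(f"baseline mean, treatment: {mean_treatment:.2f}")
print(f"baseline mean, control:   {mean_control:.2f}")
```

With a large sample the two baseline means should usually land close together, but chance imbalances can still occur, which is why the baseline table is worth inspecting.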

There is some evidence that the control and intervention groups were not equal at the beginning of the study. Papers randomized to the TrendMD arm were published in higher impact journals than those in the control arm and had accrued more citations by the start of the study (shown in Table 1 of the paper). This difference was large for some subject categories. For example, TrendMD papers in the Health and Medical Sciences started with an average of 3.1 cites per paper compared to just 1.9 for the control. Kudlow told me by email that he did conduct a baseline analysis, but it didn’t make it into the paper.

Good reporting or strategic marketing?

In this study, outcomes were reported as mean (average) performance, which I found odd, given that most distributions in science communication are highly skewed. As with household income, it makes little sense to report averages, because a very small number of super-rich families greatly distort the picture of the rest of the population. This is why the US Census Bureau is adamant about reporting median income in its reports. If the distribution of citations were more like, say, the distribution of systolic blood pressure, I would be much more comfortable accepting average citation performance as a comparison metric. By choosing mean performance to report their results, Kudlow and others may have greatly exaggerated the effect of TrendMD. If the authors were not employees of or financially tied to the company, I would consider this oversight to be the result of inadequate statistical training and not strategic marketing.
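A small sketch shows how a single outlier pulls the mean away from the typical value in a skewed sample. The citation counts below are invented for illustration and are not taken from the paper.

```python
import statistics

# Hypothetical citation counts for ten papers: most are cited a few times,
# one is cited heavily. These numbers are invented for illustration.
citations = [0, 1, 1, 2, 2, 3, 3, 4, 5, 120]

mean_citations = statistics.mean(citations)      # pulled upward by the outlier
median_citations = statistics.median(citations)  # the middle of the distribution

print(f"mean:   {mean_citations}")
print(f"median: {median_citations}")
```

Here the mean is 14.1 while the median is 2.5; nine of the ten papers sit well below the mean.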

Overstated results

To his benefit, Kudlow has been open and transparent with his analysis and provided me with additional details. Unfortunately, these details underscore how selective reporting leads to radically different performance measures.

For example, as reported in their paper, mean citations for control and TrendMD papers at 12 months were 10.1 vs. 15.2, a difference of 5.1 citations, on average, or a 50% benefit. Yet median citations were 5 and 6, a difference of just 1 citation, or a 20% benefit. You don’t need to guess which metric made it into the paper’s Abstract.
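The arithmetic behind those two percentages can be checked in a few lines, taking 10.1 and 5 as the control values and 15.2 and 6 as the TrendMD values. The `relative_benefit` helper is a name made up for this sketch.

```python
def relative_benefit(control, treatment):
    """Difference between arms as a percentage of the control value."""
    return (treatment - control) / control * 100

# Figures quoted in the text: means of 10.1 vs. 15.2, medians of 5 vs. 6.
mean_benefit = relative_benefit(control=10.1, treatment=15.2)
median_benefit = relative_benefit(control=5, treatment=6)

print(f"benefit measured on means:   {mean_benefit:.0f}%")
print(f"benefit measured on medians: {median_benefit:.0f}%")
```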

The authors of the TrendMD study also included a regression analysis, which should have provided the reader with a more appropriate estimate of treatment effect (see Multivariate regression model, Table 6). However, it is not clear from their paper (or from my correspondence with Kudlow) that the authors understood how to properly construct their regression model or how to report and interpret its results. TrendMD clearly has a positive effect on citations; we just don’t know how much.

Lacking context: what this paper adds to what we know

There is a long history to the study of how scientific results are distributed and promoted to other researchers and to the lay public. There has also been much work on the information seeking behaviors of readers. Less is known about why researchers cite one paper over another. Clearly, this is a very complex system that involves information science, sociology, economics, communication, government policy, technology, and infrastructure.

While I’m convinced that TrendMD has some beneficial effect on the dissemination of related research (as measured by Mendeley saves and Scopus citations), I have a hard time putting this paper in sufficient context to understand these benefits. For instance, we don’t know how TrendMD compares to other article recommender systems. Does TrendMD’s recommender algorithm do a better job (click-through rate) than other algorithms? Are TrendMD links more likely to be selected than the author’s own reference links? Does placement matter? Does the relevancy of recommendations change as the TrendMD network expands, and is their algorithm biased to give preferential treatment to some journals, publishers, or authors over others? Without attempting to answer any of these questions, we are left with the efficacy results of a single commercial service taken at a single point in time. Should these papers be considered sound science or just another form of marketing?

Conclusion

The TrendMD paper provides some evidence that TrendMD is getting readers to related papers, although its effect may have been greatly overstated. It is a methodologically strong paper, but weak on statistical reporting. And while it was published in a peer-reviewed journal, readers should understand that these vendor-promoted studies may contain reporting bias and lack appropriate context.

Phil Davis

@ScholarlyChickn

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist. https://phil-davis.com/

Discussion

10 Thoughts on "Study: Recommender Systems May Increase Citations"

Phil,

Thank you very much for your critical appraisal of our study, and acknowledging the methodological rigour we took. I know I addressed these comments in our recent correspondence, but since the majority of my answers did not make it into your piece, I will restate here for the readers of TSK:

(1) Unequal comparisons:
There is no evidence to suggest that the control and intervention groups were different at baseline.

We went back and forth on whether or not to include statistical tests for baseline variables in our analysis, and in the end we opted to remove the analysis based on both reviewer comments and the CONSORT guidelines cited in your piece. In case readers are interested, I’ve provided Tables 1 and 2 with p-values comparing citations, Mendeley saves, JIF, and OA at baseline; for Table 2, I just reported the p-values for the categories that had significant results at 12 months for citations. We also need to use the Bonferroni correction for multiple comparisons for Table 2; a two-tailed p < 0.00625 should be considered statistically significant for Table 2 (p=0.05/8).
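For readers unfamiliar with the Bonferroni correction, the threshold quoted above is just the overall significance level divided by the number of comparisons. A minimal sketch follows; the function names are invented here and this is not the authors' analysis code.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / n_tests

def is_significant(p_value, alpha=0.05, n_tests=8):
    """True if p_value clears the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)

# Eight comparisons at an overall alpha of 0.05, as described for Table 2.
threshold = bonferroni_threshold(alpha=0.05, n_tests=8)
print(f"corrected threshold: {threshold:.5f}")

# A p-value of 0.01 would pass an uncorrected 0.05 test but not the corrected one.
print(is_significant(0.01))
```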

See Tables 1 and 2 here: https://share.getcloudapp.com/YEudBJZj

As you can see, there were no statistically significant differences at baseline for the groups in terms of JIF, OA, citations, or Mendeley saves. Again, this result is not surprising because we used a random number generator to randomize the articles to intervention or control.

Many readers of TSK may not be familiar with CONSORT as their guidelines are more applicable to the world of medicine; RCTs really originated from clinical medicine, so naturally there are guidelines on best practices.
To quote from the CONSORT guidelines: "unfortunately significance tests of baseline differences are still common(23) (32) (210); they were reported in half of 50 RCTs trials published in leading general journals in 1997.(183) Such significance tests assess the probability that observed baseline differences could have occurred by chance; however, we already know that any differences are caused by chance. Tests of baseline differences are not necessarily wrong, just illogical.(211) Such hypothesis testing is superfluous and can mislead investigators and their readers. Rather, comparisons at baseline should be based on consideration of the prognostic strength of the variables measured and the size of any chance imbalances that have occurred.(211)" http://www.consort-statement.org/checklists/view/32–consort-2010/510-baseline-data

Bottom line, there are no group differences at baseline, and the reason why we didn’t include statistical tests to prove that there were no differences was in accordance with guidelines that are endorsed by over 50% of the core medical journals listed in the Abridged Index Medicus on PubMed. http://www.consort-statement.org/about-consort/endorsers1

(2) Good reporting or strategic marketing?
We were very careful not to over-interpret or overstate our results. I will quote directly from our paper on this: “The effect size of TrendMD on citations at 12 months was small (Cohen's d = 0.16).”
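As background on the effect-size measure quoted above, Cohen's d is the difference between two group means divided by a pooled standard deviation. The sketch below uses invented numbers and does not attempt to reproduce the paper's calculation, which may have been run on transformed data.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Invented values for illustration only.
treatment = [2, 4, 6, 8, 10]
control = [1, 3, 5, 7, 9]

d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

By the usual rule of thumb, values around 0.2 are described as small, so a d of 0.16 indicates heavily overlapping distributions.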

In our initial draft, we included medians in all of the tables. We subsequently removed the median values during the peer-review process. Instead, we provided cumulative distribution graphs, which show the distributions of citations and Mendeley saves nicely. For example, looking at figure 6, 75% of articles in control had 13 or fewer citations, while 75% of articles in TrendMD had 18 or fewer citations. I drew some lines on the figure to help show you how to properly read the cumulative distribution graphs: https://cl.ly/2c2f304086c8 A right shift in these distribution curves indicates that the differences are across the entire distribution, rather than just being pulled up by outliers.
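Reading a value off a cumulative distribution graph amounts to finding the citation count at or below which a given share of articles fall. The sketch below uses two small invented samples, chosen only so that the 75% reading mirrors the 13 and 18 mentioned above; it is not the study data, and `ecdf_value_at` is a hypothetical helper.

```python
import math

def ecdf_value_at(data, proportion):
    """Smallest observed value v such that at least `proportion` of data is <= v."""
    ordered = sorted(data)
    # Index of the first observation at which the cumulative share reaches `proportion`.
    k = math.ceil(proportion * len(ordered)) - 1
    return ordered[max(k, 0)]

# Invented citation counts for illustration only.
control = [1, 2, 3, 5, 8, 13, 20, 40]
treatment = [2, 3, 5, 8, 12, 18, 30, 55]

for share in (0.25, 0.50, 0.75):
    c = ecdf_value_at(control, share)
    t = ecdf_value_at(treatment, share)
    print(f"{share:.0%} of articles: control <= {c}, treatment <= {t}")
```

If the treatment value is higher at every share, and not only near the top, the curve is shifted to the right across the whole distribution.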

In case readers are interested, I've also included the original tables that have the medians in them (tables 3 and 4 in this download link: https://share.getcloudapp.com/YEudBJZj )

As an aside, we actually completed non-parametric tests (Mann-Whitney) comparing medians prior to submitting the manuscript — they were also statistically significant for citations (6 and 12 months) and Mendeley saves at 6 months. However, we decided to remove this analysis and opt to log-transform to normalize the data because we found that most readers understand means better than medians. But as I say, if readers want to see the difference in medians, I am more than happy to share and have provided in the link above.

(3) Conclusions
No study is perfect; our study is no exception. When you boil everything off though, you are still left with firm conclusions that I will re-state here. We completed an RCT that shows that after 1-year there is a small, but statistically significant difference between citations in the intervention group versus the control group. There were no statistically significant differences between the control and intervention groups at baseline that would explain the difference in citations at 1 year. Overall, TrendMD drove a 50% difference in citation means and a 20% difference in citation medians at 1 year.

By Paul Kudlow
Jan 8, 2020, 7:34 AM

“When you boil everything off though, you are still left with firm conclusions that I will re-state here. We completed an RCT that shows that after 1-year there is a small, but statistically significant difference between citations in the intervention group versus the control group.”

I completely agree. But given that your treatment was likely going to result in positive results, what is the contribution of this article?

Imagine that the American Bowling Association (ABA) decides to run and report its own study claiming that bowling leads to better health. The study design compares the health of a group of adults assigned at random to either a bowling league (treatment) or no bowling at all (control). We shouldn’t be surprised that the treatment group shows some positive effects against the control group, which is doing nothing. You could imagine that the Canadian Curling Commission also funds its own study reporting that curling is good for one’s health. Again, not surprising and not very helpful.

Now, is bowling better for one’s health than curling? This would be useful to know. Even better would be a comparison of the health benefits of various exercises, ranked in order of benefit. A cost benefit analysis would also be insightful, as the costs of bowling (balls, shoes, lane rentals) may be lower than purchasing stones, brooms, and ice time. A more useful study would compare the costs (time, expense, risks) of many sports and may conclude that jogging ranks number one and bowling, unfortunately, is at the very bottom, right below golf. Bowling still has “a small, but statistical difference” just not a lot compared to other sports.

Given that there is now a huge industry hoping to sell publishers products and services to increase the impact of their content, I don’t think that asking for context is too much.

By Phil Davis
Jan 8, 2020, 10:23 AM

Phil,

“But given that your treatment was likely going to result in positive results, what is the contribution of this article?”

I wish I had your confidence about our product! 😉

In all seriousness, we certainly had no a priori knowledge that this trial would result in a positive outcome on citations. We had a hypothesis that it may work to increase citations based on our earlier RCT study – https://link.springer.com/article/10.1007/s11192-017-2438-3 which was also covered in TSK (https://scholarlykitchen.sspnet.org/2018/04/09/guest-post-journal-article-recommendation-features-change-reader-behavior/), but the whole purpose of this study was to determine if TrendMD has any impact on citations relative to no treatment at all.

To be frank, the fact that we found an effect on citations within 1 year was shocking.

Your point and analogy about bowling and its effects on health vs the need to compare to other interventions is a very good one in general, but not at all relevant to this case. I will explain.

Our study is no different than the gold-standard placebo controlled trials that the FDA uses to determine the efficacy of new drugs. I assume you agree that placebo-controlled trials contribute to our understanding of the efficacy of new drugs?

Continuing with the analogy to trials in medicine, drugs are only compared to other drugs in RCTs when there is a gold standard treatment/medication for a condition. In the case of interventions to increase the rate of citations, there are currently no evidence-based strategies that have been shown to work; that is, there was no active gold standard for our intervention to compare to. In fact, the passive gold standard used by both scholarly publishers and authors is not to promote papers at all, but just publish their articles online and hope they receive attention and impact. Therefore, in this case, a placebo-controlled trial was absolutely warranted and required as a starting point in the literature, because the placebo of doing nothing is the current gold-standard adopted by stakeholders in the scholarly ecosystem.

By Paul Kudlow
Jan 8, 2020, 1:09 PM

TrendMD does not increase citations–at least not directly. TrendMD provides recommended links. To that end, it is not unique. Google does it; PubMed does it; many content aggregators (EBSCO, ProQuest) do it, as do some publishers (SpringerNature). While I agree that TrendMD operates differently, it is not so different that it lacks comparison. We’ve also had cross-publisher article-level linking for about 20 years now (the Digital Object Identifier), which may have come shortly after HighWire’s toll-free linking of article references.

If boosting readership through discovery is the primary function of TrendMD, then you indeed have lots of adequate controls without resorting to a placebo.

By Phil Davis
Jan 8, 2020, 6:02 PM

Presumably all of the articles included in our study were indexed on Google and PubMed, as well as other content aggregators that you mention. This was considered our ‘placebo’ or control. I should mention — none of these indexing strategies you mention have been shown to lead to an increase in citations of articles.

I am very curious though. What do you suggest as an active comparator for a control group in our study? If not placebo, then what exactly? Your suggestions here can certainly help to inform future study…

By Paul Kudlow
Jan 8, 2020, 6:49 PM

Paul, when you wrote in your last comment that no other indexing strategy has been shown to increase citations, it forced me to re-read your methods. Based on TrendMD’s recommendation system, I am not convinced that your randomization was sufficient to isolate TrendMD as a cause from TrendMD as a confounding variable. Here’s why:

While your randomization may have been sufficient to have created two similar groups (intervention vs. control), you are not measuring the performance of these two groups of papers, but the performance of the target papers selected by TrendMD’s recommendation system. If your recommendation system was selecting relevant papers solely based on keywords, I would assume that recommended papers would not be biased toward higher quality papers. However, according to your paper, the TrendMD algorithm selects which links to display based on three criteria:

1. keyword overlap
2. collaborative filtering (“people who bought this item also bought that item”)
3. clickstream analysis (readers’ online history)

Now, while criterion #1 would have resulted in an unbiased selection of links, #2 and #3 would bias the selection of links toward more popular papers. In other words, TrendMD may simply be selecting papers that are more likely to be saved in Mendeley and cited more frequently in the future.

Proper experimental controls
I can imagine two types of controls that would help you distinguish real TrendMD effects from TrendMD as a predictor of better papers:

1. A control arm working with an algorithm that functions by keyword relevancy only.
2. A control arm that selects papers based on the full TrendMD algorithm but does not display them on the bottom of articles.

Both controls would help you distinguish whether TrendMD is a cause of higher performance papers or just a predictor of higher performance papers.

By Phil Davis
Jan 9, 2020, 10:44 AM

Phil –

I re-read your comment several times, but I am not sure I understand your point.

Even if TrendMD somehow were recommending articles that are going to be more highly used on Mendeley or cited, how does that confound the results in an RCT?

“While your randomization may have been sufficient to have created two similar groups (intervention vs. control), you are not measuring the performance of these two groups of papers, but the performance of the target papers selected by TrendMD’s recommendation system.”

This is not correct. All we do is measure the performance of the two groups at 6 and 12 months.

What am I missing?

By Paul Kudlow
Jan 9, 2020, 11:52 AM

Aha! Now I get it!
You have randomly selected two groups of papers: A (treatment) and B (control). You don’t do anything to these papers except measure their performance at 6mo. and 12mo. Your treatment is turning on the TrendMD network such that papers in the treatment group A get added as recommended links to TrendMD papers in the network while papers in the control group B get no recommended links. You are trying to measure whether TrendMD made a difference in the performance of group A papers against group B papers. You found and reported such a difference. Sorry it took me so long to get to this point 🙂

As we discussed, it would be good to better understand how the intervention varies across articles. For example, does the placement of the recommended links matter (at the top of the paper, on the side, at the bottom)? Do highly cited papers receive a disproportionate share of the benefit (rich-get-richer)? How does the size of the network (or subnetwork) affect the results? Answering these questions would not only add to the literature but also lead to a better TrendMD product.

Thanks again for the constructive and patient communication.