One of the benefits of hypertext in a connected digital environment is the ability to interlink documents. This was part of the hypertext-focused vision of the internet that Tim Berners-Lee was trying to create in the 1990s when he developed the World Wide Web. At the time, there were other prototypes and products that offered more robust visions of what hypertext could do, including Apple’s HyperCard product, the Microcosm hypermedia system developed by Wendy Hall and a team at the University of Southampton, and Ben Shneiderman’s HyperTIES system. There was a great deal of excitement around these ideas, which originated from Ted Nelson’s (unrealized) vision of digital communications, proposed in the 1960s.
Initial experiments and products built in the late 1980s and 1990s around hypertext philosophies allowed anyone to create a link from a document. Some at the time thought that HTML, and the WWW that developed based on its principles, were a step backward in some respects, because the links that were created were only unidirectional and other hypertext features such as annotation were not included (among other criticisms). On the web, only a site’s administrators (or eventually those with write access, in the case of wikis and other “Web 2.0” tools that developed later) could insert a link onto a page. It was the sole responsibility of the author to curate and maintain the content. In fact, this curation role, and the principle of the primacy of the author/publisher, created a number of subsequent problems on the Internet that we’re dealing with today, such as link rot, website preservation, and retractions.
Apart from having to deal with the unfortunate implications of these infrastructure decisions, much of this is today historical, but important, context. A few weeks ago, Ross Mounce posted the following complaint on Twitter:
Today in things I don’t like about Elsevier…
Has anyone noticed how they’ve been adding embedded hyperlinks to their ‘Topic Pages’ within HTML pages of peer-reviewed articles? The authors surely didn’t put them in.
Do paying subscribers get to say “no” to this? pic.twitter.com/nnNSOS9BdQ
— R⓪ss Mounce (@rmounce) August 20, 2021
the example given in the screenshot is here: https://t.co/j6oTibDl0k
They are trying to get people to use their failed ‘Topic Pages’ project. It’s not going to work. It’ll only annoy people (like me!). But sure, make it an option to ‘turn ON’ if and only if users want it.
— R⓪ss Mounce (@rmounce) August 20, 2021
These tweets raise some important questions worthy of thoughtful consideration. Let’s start with the fact that more and more publishers and site administrators are doing this same thing with content all the time. Probably every large news site you visit has these embedded links that were not inserted by the author, as do a variety of scholarly publishers’ sites. A number of semantic enrichment and natural-language-processing entity extraction and linking tools are available and widely deployed by companies and organizations around the web. Because much of this work can be highly automated and is generally quite accurate (though by no means perfect), publishers see this as a useful value-added service for readers that is easily deployed. One might expect that most of these same publishers have done an ROI analysis of these tools’ effectiveness, which likely shows that the enrichment increases users’ time on the site, click-through counts to related pages, and likely overall user satisfaction. If this weren’t the case, I expect that publishers would halt this investment.
Reacting to the tweet, the NISO team had a lively discussion about the purposes and value of linking versus citation. We discussed whether this enrichment was equivalent to citation or, as Mounce claimed, a type of editorializing on an author’s work. While the two are related, links are not citations, particularly in the way citations are used in scholarly literature. The strongest evidence of this is that researchers have long been citing websites, and every citation style manual includes a model for referencing a website in a bibliography or reference list in a formal publication. A paper might include both links and citations, and most readers would very easily distinguish between the two. Furthermore, many of the organizations embedding these links format them differently than they do either author-inserted links or citations. In the example noted by Mounce, the embedded links are displayed as links (underlined, but in grey type) while the citations are differentiated according to Elsevier’s house style (in parentheses and in blue type). Other publishers do similar things to distinguish this semantic linking. Ensuring that there is a clear distinction in the display between embedded links and citations is likely the best practice when applying this technology.
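To make the idea of visibly distinct, optional enrichment concrete, here is a minimal sketch in Python. It is not how Elsevier or any other publisher implements this. The glossary, the example.org URLs, the `auto-topic-link` class name, and the `enabled` flag are all hypothetical, and a real system would use NLP entity linking rather than a fixed term list. The sketch links only the first mention of each term, gives those links their own class so a stylesheet can render them differently from author links and citations, and lets a reader switch the enrichment off.

```python
import html
import re

# Hypothetical glossary: term -> topic-page URL (illustrative values only).
GLOSSARY = {
    "hypertext": "https://example.org/topics/hypertext",
    "link rot": "https://example.org/topics/link-rot",
}


def enrich(text, glossary, enabled=True):
    """Return HTML for plain `text`, linking the first mention of each term."""
    escaped = html.escape(text)
    if not enabled or not glossary:
        # Reader opted out (or nothing to link): return the text unenriched.
        return escaped

    # Try longer terms first so a multi-word term wins over a shorter one.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(html.escape(t)) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    urls = {html.escape(t).lower(): url for t, url in glossary.items()}
    seen = set()

    def replace(match):
        key = match.group(1).lower()
        if key in seen:
            # Later mentions stay as plain text to keep the page readable.
            return match.group(0)
        seen.add(key)
        # A dedicated class marks the link as publisher-added, not author-added.
        return '<a class="auto-topic-link" href="{}">{}</a>'.format(
            html.escape(urls[key], quote=True), match.group(0)
        )

    return pattern.sub(replace, escaped)


sample = "Hypertext and link rot. More hypertext."
enriched = enrich(sample, GLOSSARY)
plain = enrich(sample, GLOSSARY, enabled=False)
```

Here `enriched` carries two machine-added links, each tagged with the dedicated class, while `plain` is the untouched text a reader would see after turning the feature off.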
In part, Mounce’s critique seems a knee-jerk reaction to something a large publisher is doing. Let’s start with the premise that Elsevier is seeking to get ancillary subscription revenue by linking to related content that it is selling. Perhaps this is useful to the reader, perhaps not. However, is this so problematic? I expect that a large company like Elsevier has done sufficient user testing on this presentation to establish that it is not intrusive to the majority of readers and that it provides a useful service to some subset of readers, which is why Elsevier continues to deploy it. Beyond this, a process to semantically enrich content is being advanced by NISO to support even more of this type of document linking. Other publishers are certainly headed in this direction. However, let’s not simply define what is acceptable by what everyone is doing.
Let’s also consider for a moment the licensing question at play here. This particular article is published under a CC-BY license. That being the case, even if the article had not been published by Elsevier in an Elsevier journal, by agreeing to distribute their content under that license, the authors are explicitly agreeing to allow this type of enrichment. In fact, ANY enrichment of any kind is acceptable, so long as the original work is attributed. That is exactly what the CC-BY license was designed to allow. As strong a proponent of open access (OA) publishing as Ross is, I’m surprised he gets this aspect of OA publishing so wrong. Creative Commons licenses that are not explicitly “Non-Commercial” (CC-BY-NC, for example) foster this kind of commercial reuse by removing the license barriers to repurposing content for commercial use, such as selling other products. There has been a robust argument that education can be a commercial use, which led to the broad advocacy for CC-BY for academic materials to remove even this potential barrier to educational or scholarly reuse.
Mounce’s problem seems to be that Elsevier is trying to sell published content. Yet if the authors objected to this type of reuse, or if the OA movement in general objected to commercial reuse, they could have thrown their considerable advocacy weight behind a different version of the CC license. I have long argued that authors should exert more control over their copyright and apply more nuance to the rights that they give away or retain. In practice, authors will generally follow whatever policy is set by their institution, funding body, or publisher. Most of us follow the path of least resistance, but I digress.
More importantly, since an increasing number of content publishers and distributors are providing this semantic enrichment, it is probably valuable to consider when such enrichment is fine and when it crosses a line. We can all agree that linking terminology to resources that provide more detailed information, like GenBank sequences or Oxford English Dictionary definitions, is acceptable. Linking to curated resources (as Elsevier has done), public information pages, or Wikipedia pages might be a next level of curation, which readers may find more or less objectionable. If there were an established resource where protocols were described in detail (something like what protocols.io is building) and a publisher were to link to a protocol hosted there that was not embedded in the paper by the author, that might well be crossing a line. How would the publisher know that the process the author undertook is exactly the one identified by an NLP reading of the paper and linked? One could take this application even further into troubling territory. There is a line that needs to be defined somewhere. It will probably be best if we collectively don’t leave the definition of that line up to the machines.