Guest Post: Time to Rethink Usage Analytics

Editor’s Note: Tim Lloyd is founder and CEO of LibLynx, a company providing Identity, Access & Analytics solutions for online resources. His career spans several decades in a variety of product development and operational roles in online publishing, with a particular focus on developing innovative products and services to support online learning and research. Tim is a member of the Governance committee of SeamlessAccess.org and co-chair of the Outreach committee, a member of STM’s Researcher Identity working group, and volunteers regularly to support a variety of industry initiatives.

Usage analytics in scholarly publishing is undergoing profound change. Powerful industry trends over the last five years have resulted in a new set of challenges for usage analytics that our current reporting infrastructure is poorly equipped to address. As the impact of these changes may not be apparent to the casual observer, this post explores their implications.

A good place to start when envisioning the future is to understand where we’ve come from. In our community, the bedrock of usage reporting is the COUNTER Code of Practice. Since the Code’s first release just over 20 years ago, COUNTER reports have rightly become the gold standard for publishers and other service providers to report usage to libraries. These monthly reports provide librarians and publishers with the core metrics they need to understand usage of their resources, and inform decision making on renewals and acquisitions. This primary use case continues to be essential for our community, and the Code of Practice continues to evolve to meet emerging needs (with a new version 5.1 coming into effect in January 2025).

What’s changed?

What’s changed is the environment, which is adding complexity and new requirements beyond this original use case. Here are five examples of those environmental changes.

Innovation In Business Models

The growing diversity in Open Access business models, nicely illustrated in Tasha Mellins-Cohen’s Scholarly Kitchen post from March, means that a given organization may be consuming publisher content (both open and controlled) in many different ways. For example, in addition to traditional outright purchase and subscription models, institutions may also be paying transactional fees (APCs, peer review fees, etc.), participating in collaborative funding models (such as Subscribe to Open and the like), and signing bundled agreements that combine reading and publishing budgets.

This diversity in models can make it significantly harder for publishers participating in several models to provide institutions with a single, unified understanding of their usage. The broad re-use rights associated with open access content make it more likely that usage arises in multiple siloed and third party platforms (more below). In contrast, publisher reporting systems are generally engineered around library reporting of controlled access content.

New Use Cases

This diversity is also driving new, broader use cases for usage analytics beyond the traditional COUNTER metrics designed for library comparison of vendor services. A variety of new stakeholders are interested in using usage analytics to inform decision making, such as:

Editors and product managers interested in analyzing usage by subject and reader segment, to assess impact as well as identify new fields of research and study. This includes understanding usage across publisher and third party platforms, and across open and paywalled business models.
Sales and business development staff seeking to identify the organizations getting value from their content, to inform strategy and illuminate new opportunities.
Research officers and others involved in institutional funding wanting to understand the impact of funding decisions, to demonstrate the benefit of funded research.
Authors looking for new opportunities for research and collaboration in the communities engaging with their publications, and to demonstrate their productivity.

These new use cases are shaping the future of usage analytics by extending metadata into new areas (to meet emerging analytics needs) and increasing the focus on reporting tools that are intuitive and accessible for audiences with less data analysis expertise.

Fragmentation Of Usage

The traditional model of scholarly users coming to a single publisher-owned platform to access content began to fragment many years ago when ebooks started being distributed on third party platforms, such as JSTOR, Muse, and Amazon. The growth of open access book repositories such as OAPEN and the Open Research Library (ORL) added further sources of usage for OA content. On the journals side, syndication deals with platforms like ResearchGate and ScienceDirect are further fragmenting scholarly usage. The future will increasingly see scholarly content distributed through more platforms, each (ideally) targeting a valuable additional audience that isn’t effectively addressed by the others.

The incorporation of AI into content discovery further weakens the relationship between content and the publisher platform. If large language models (LLMs) are able to present coherent answers based on publisher content, then users may not feel the need to click on a citation link to view the full version (and therefore won’t generate a usage event that can be tied back to that content). A fascinating session at Silverchair’s recent Platform Strategies meeting explored this issue. Most attendees felt that this shift was inevitable, but these capsule answers may attract a different audience segment, with traditional scholars still wanting to click through to the full content.

This fragmentation significantly increases the challenge of building a comprehensive understanding of usage. Despite community efforts to build standards and best practices in this area, such as the Distributed Usage Logging (DUL) initiative led by COUNTER and Crossref, building data pipelines to combine usage files from multiple platforms into single, consistent and standardized feeds remains a significant technical challenge. A lot of processing still relies heavily on manual intervention, or simply omits data that’s too difficult to integrate.

Societal Benefit

Innovation in publishing models is also throwing dust into the air when funders (and to a lesser extent, authors) make choices about where to publish — with more models to choose from, competition for publishing funds is becoming tighter. The importance of understanding publishing impact is something we increasingly hear from funders, but traditional measures of impact, such as the Journal Impact Factor, have long been recognized as flawed. A fascinating Clarivate study of Research Officers and Researchers, presented at April’s STM Conference, indicated that societal benefit was expected to become the most important impact measure for research offices in five years, as well as the most difficult to measure.

This evolving understanding of what is meant by “publishing impact” requires more creative approaches to usage analytics. You can’t measure societal benefit in terms of download numbers, but you can start to estimate it by understanding which communities are engaging with which scholarly content. For example, was a paper on tropical disease accessed by research institutes in West Africa? Did citizens of a US state engage with publications funded by state agencies?
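To make the idea concrete, here is a minimal Python sketch of counting engagement by community rather than by raw downloads. It assumes usage events have already been enriched with subject, region, and organization type; the field names (`subjects`, `region`, `org_type`) and the function `engagement_by_community` are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def engagement_by_community(events, subject):
    """Count usage events for one subject, grouped by (region, organization type).

    Each event is assumed to be a dict with hypothetical keys:
    'subjects' (list of str), 'region' (str) and 'org_type' (str).
    """
    return Counter(
        (event["region"], event["org_type"])
        for event in events
        if subject in event["subjects"]
    )
```

A question such as "was this tropical disease paper accessed by research institutes in West Africa?" then becomes a lookup on the resulting counter, for example `counts[("West Africa", "research institute")]`.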

Bot Pollution

Robotic activity has always been part and parcel of scholarly usage logging, and COUNTER’s Code of Practice includes a longstanding requirement to ‘exclude robots and crawlers’. However, in recent years the level and sophistication of robotic access appears to have grown sharply. While there are many factors driving this, presumably including amassing content to feed AI models and paper mills, the result is that open access and free content usage logs are increasingly polluted by robotic activity. This can significantly reduce the value of, and confidence in, traditional usage measures, such as views and downloads.
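As a rough illustration of what excluding robots and crawlers can involve, the Python sketch below drops events whose user agent matches a known pattern, and also drops sessions with implausibly high request counts. The patterns, the threshold, and the field names (`session_id`, `user_agent`) are assumptions made for this example. They are not COUNTER's official robot list or rules, and a real pipeline would likely need much more sophisticated detection.

```python
import re
from collections import Counter

# Hypothetical user-agent patterns; a real pipeline would load a maintained list.
BOT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"bot\b", r"crawler", r"spider", r"python-requests")
]

# Hypothetical threshold: sessions with more requests than this are treated as robotic.
MAX_REQUESTS_PER_SESSION = 100


def is_declared_bot(user_agent):
    """Return True if the user agent matches any known robot pattern."""
    return any(pattern.search(user_agent) for pattern in BOT_PATTERNS)


def filter_robotic(events):
    """Keep only events that look human under the two simple rules above."""
    requests_per_session = Counter(event["session_id"] for event in events)
    return [
        event
        for event in events
        if not is_declared_bot(event["user_agent"])
        and requests_per_session[event["session_id"]] <= MAX_REQUESTS_PER_SESSION
    ]
```

Simple rules like these catch robots that identify themselves or behave crudely. They will miss the more sophisticated robotic access described above, which is part of why confidence in raw views and downloads is eroding.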

And what’s the impact on usage analytics?

The simple answer is that usage analytics need to cope with a lot more complexity.

One way to think about analytics pipelines is to consider their component parts — data ingestion, data processing, and data export.

Data Ingestion

Solutions need to be able to ingest data from multiple sources, in diverse formats, at different cadences (daily? monthly?), and at varying levels of quality. Whereas traditional COUNTER reporting is a monthly event, future analytics pipelines will be more akin to a river that is constantly in a state of flux. Platform A may provide a server-driven monthly export of aggregate COUNTER metrics, platform B may provide real-time COUNTER-compliant events, and platform C may provide a manually created and home-brewed set of metrics in a .csv file that generally arrives around the same day of the month depending on staff availability and time of year. And yet all three sources are important because they address different segments of the user community, and therefore build up our understanding of publishing impact.
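One way to picture this is a small adapter per source, each converting its own format into a shared record. The Python sketch below does this for the three hypothetical platforms described above. The record shape (`UsageRecord`), the input field names, and the CSV column layout are all assumptions made for illustration, not a description of any real platform's export.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UsageRecord:
    """A hypothetical common shape for usage data from any source."""
    source: str
    item_id: str
    month: str  # "YYYY-MM"
    metric: str
    count: int


def from_monthly_aggregate(rows):
    """Platform A: rows that are already aggregated by month."""
    for row in rows:
        yield UsageRecord(
            "platform_a", row["item_id"], row["month"], row["metric"], int(row["count"])
        )


def from_events(events):
    """Platform B: one record per event, with an ISO 8601 timestamp."""
    for event in events:
        timestamp = datetime.fromisoformat(event["timestamp"])
        yield UsageRecord(
            "platform_b", event["item_id"], timestamp.strftime("%Y-%m"), event["metric"], 1
        )


def from_csv(text):
    """Platform C: home-brewed CSV, assumed columns id, date (DD/MM/YYYY), downloads."""
    for row in csv.DictReader(io.StringIO(text)):
        date = datetime.strptime(row["date"].strip(), "%d/%m/%Y")
        yield UsageRecord(
            "platform_c", row["id"].strip(), date.strftime("%Y-%m"), "downloads",
            int(row["downloads"]),
        )
```

Keeping each adapter small means a new source, or a change in one platform's format, can be handled without touching the rest of the pipeline.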

Data Processing

Solutions need to be able to normalize data so that these diverse inputs can be aggregated and compared. Using the example above, this could include:

Cross-walking several different flavors of organization identifiers (a publisher taxonomy; uncontrolled text strings; proprietary IDs) to a standard identifier such as ROR
Standardizing timestamps to a consistent format
Converting individual event metrics to a set of consistent monthly aggregate totals
Mapping COUNTER and non-COUNTER event types to equivalents (where they exist)
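The four normalization steps above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the crosswalk table, the placeholder ROR values (which are not real identifiers), the event type mapping, the field names, and the choice to treat timestamps without a timezone as UTC.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical crosswalk from (identifier scheme, value) to a ROR ID.
# The ROR values are placeholders, not real identifiers.
ORG_TO_ROR = {
    ("publisher_id", "INST-0042"): "https://ror.org/placeholder1",
    ("text", "univ. of example"): "https://ror.org/placeholder1",
    ("proprietary", "998877"): "https://ror.org/placeholder2",
}

# Illustrative mapping of platform event types to a common metric name.
# Event types with no equivalent are simply absent from the table.
EVENT_TYPE_MAP = {
    "Total_Item_Requests": "Total_Item_Requests",
    "pdf_download": "Total_Item_Requests",
    "abstract_view": "Total_Item_Investigations",
}


def to_ror(scheme, value):
    """Look up a ROR ID; free-text names are trimmed and lower-cased first."""
    key = value.strip().lower() if scheme == "text" else value
    return ORG_TO_ROR.get((scheme, key))


def to_utc_month(timestamp):
    """Convert an ISO 8601 timestamp to a UTC 'YYYY-MM' string."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Assumption: timestamps without a timezone are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m")


def monthly_totals(events):
    """Aggregate events into monthly totals; return (totals, unmapped events)."""
    totals = Counter()
    unmapped = []
    for event in events:
        ror = to_ror(event["org_scheme"], event["org"])
        metric = EVENT_TYPE_MAP.get(event["event_type"])
        if ror is None or metric is None:
            unmapped.append(event)  # set aside for review rather than dropped silently
            continue
        totals[(ror, to_utc_month(event["timestamp"]), metric)] += 1
    return totals, unmapped
```

Returning the unmapped events separately is a deliberate choice in this sketch: it makes visible the data that would otherwise be omitted because it is too difficult to integrate.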

In addition, the volume of data that needs processing, and reprocessing, can be very significant — especially when open access content is included, which can generate usage an order of magnitude greater than paywalled content. Traditional database architectures can struggle to cope with these requirements. As an example, we’ve spent much of the last three years at LibLynx replacing our legacy COUNTER processing infrastructure — which was designed around a monthly build of COUNTER reports — with a completely new architecture designed to cope with processing of billions of usage events each year, and on-demand generation of reports.

Data Export

COUNTER reports are traditionally consumed as spreadsheets by librarians familiar with the format, or in machine readable formats for automated harvesting. In contrast, the new use cases for usage reports include stakeholders that are less familiar with these industry standards, and need more flexible ways to incorporate analytics into their existing workflows.

Analytics need to become more inclusive — more intuitive, more visual, and more context-sensitive in order to be impactful for these new audiences. They need to cater for a wider range of automated access scenarios, including bulk access to the underlying metrics and more flexible querying of data sets outside of templated reports. They also need to incorporate qualitative data that can add valuable, rich depth to understanding impact, but is often lost when pipelines are designed around scale and numbers alone.

My colleague, Lettie Conrad, has been working on a project over the last 12 months to re-imagine the user experience for usage analytics. She’s been talking to various community stakeholders to understand their current and emerging needs, and working with a team of designers and software developers to create a new framework for exploring analytics that provides the flexibility to support these new use cases.

New challenges and opportunities mean new tools for measuring impact and value. Usage analytics will be a critical component of how our community communicates publishing impact in the future. It’s important that we pay attention to its supporting infrastructure, and make the investments needed to keep pace with emerging stakeholder needs.

Tim Lloyd


Discussion

9 Thoughts on "Guest Post: Time to Rethink Usage Analytics"

Tim: This is a terrifically important contribution, especially for small journals and small libraries. Thanks. I recently looked at usage stats for a journal generated by Literatum software. I can’t locate the exact number, but the percentage of usage generated through university servers was very low, on the order of 15%. At the same time, we have often received requests from librarians wanting to know how much their students and faculty use the journal, and they have seemed unsympathetic to our inability to give them an accurate figure.

By Rowland Lorimer
Oct 2, 2024, 6:43 PM

I think this is going to become an increasingly familiar scenario – more community stakeholders want to understand usage (librarians, publishers, funders, authors), at the same time that our ability to present a coherent and unified picture of usage is becoming harder and harder. As a lot of these trends are at the infrastructure level, they’re much less visible to lay users, and so I suspect the frustrations will increase until our solutions catch up with the complexity.

By Tim Lloyd
Oct 3, 2024, 10:29 AM

Excellent article Tim, thank you. At CELUS, we observe the same environmental changes surrounding e-content usage analytics that you masterfully articulated. As a result, we’ve put adaptability at the center of our organizational culture; it informs every aspect of our platform development, standards compliance, community outreach, and customer support. One example is our recent collaboration with Tasha Mellins-Cohen at COUNTER Metrics to develop their new Registry (https://registry.countermetrics.org/). The CELUS team constructed the Registry to streamline the process of communicating COUNTER-compliance for publishers, libraries, consortia, and intermediaries; we built an application programming interface to the Registry as well, which allows external applications to pull data directly. Thanks again Tim, we hope that e-resource librarians, and those “casual observers” of usage analytics, read your piece. We’ll be sure to share it!

Tomas Novotny,
Founder & CEO
CELUS

By Tomas Novotny
Oct 3, 2024, 3:55 AM

Thanks for an interesting read, Tim. There’s no doubt in anyone’s mind that the way we measure usage needs to change to reflect shifts in the way content is delivered, shared and consumed by both people and machines. One of the reasons COUNTER was developed back in 2002/3 was to ensure that when such changes happen, the new measurements are done in a consistent, comparable fashion, so that everyone can be sure usage metrics from platform A are reporting on the same thing as those from platform B. We’ve found over the years that ensuring this consistency means taking time to carefully consider any change that’s being made to the Code of Practice, and allowing time for publishers and libraries to update their systems after a change is published before we require compliance. For example, work started on Release 5.1 in late 2021. We ran a lengthy public consultation in 2022 and made changes based on that consultation before publishing R5.1 in May 2023. Development time means we’re requiring compliance in January 2025 – something I know LibLynx is working on achieving, alongside the many other data processors in our Registry.
That measured pace is what’s needed for an open standard which requires widespread compliance, but we know that it can present issues for fast-moving technical changes. That’s precisely why COUNTER offers best practice guidance (countermetrics.org/code-of-practice/best-practice/). We recently published guidance on managing syndicated usage, and have two best practice investigations open at present, one on metrics for GenAI and another on how to report usage when a user is simultaneously authenticated against multiple institutions. More will be opening over the coming months. I encourage everyone who’s interested in usage metrics to get involved!

By Tasha Mellins-Cohen
Oct 4, 2024, 3:53 AM

“They also need to incorporate qualitative data that can add valuable, rich depth to understanding impact, but is often lost when pipelines are designed around scale and numbers alone.” Can you say more? Any examples?

By David Parker
Oct 15, 2024, 7:18 AM

A good example is DASH Stories, which is a tool that Harvard’s Office for Scholarly Communication uses to capture feedback from people accessing open content in their institutional repository DASH (Digital Access to Scholarship at Harvard). Users accessing content are invited to share stories on the value they got from the content, providing rich, additional context for the impact of open access, alongside more traditional volume-driven metrics.

Here’s an example from a university faculty member in Brazil:
– “Sharing the product of your research is extremely valuable; in my own experience, I teach graduate level business courses in Brazil and use your materials frequently. Having the opportunity to share knowledge and divulge results found in other countries / cultures is a great opportunity to conduct similar research locally and discuss these possibilities in class.”

See examples at https://dash.harvard.edu/stories.

By Tim Lloyd
Oct 15, 2024, 9:56 AM

Great article that summarizes many of the issues we’re working to address in the OA Book Usage Data Trust effort (www.oabookusage.org). We often hear that to scale, scholarly communication as a whole needs to make it easier to share and ingest usage data from multiple platforms. Stay tuned, as in 2025 we hope to provide a trusted, community-led data exchange infrastructure to do just this, along with data governance tools and disclosures that address some of the issues you raise, while enhancing usage data discoverability and providing trust indicators to improve transparency. We’re also open to applying insights and infrastructure related to multi-platform OA book usage exchange to other types of open scholarship. If anyone wants to join us on this journey, they can reach out to me (https://www.oabookusage.org/staff-and-project-leads).

By Ursula Rabar
Oct 15, 2024, 10:06 AM

Absolutely, Ursula. Community-led infrastructure like the OA Book Usage Data Trust can play an important role in reducing the cost/risk of managing many-to-many data sharing.

By Tim Lloyd
Oct 15, 2024, 1:36 PM

Thanks, Tim, for the good work. We share the same issue. Only about 15% of our subscribed databases are COUNTER compliant, which means we cannot get access statistics for the other 85%. We require our users to go through EZProxy to access our subscribed e-resources, which lets us do better access analytics, such as understanding how much our students and faculty use the e-resources. However, as more publishers transition to Single Sign-On (SSO), we are losing control over the collection of access data for library resource planning.

By Kam Ming Ku
Oct 18, 2024, 2:03 AM

The Scholarly Kitchen
