Start-up Stories: Bringing DataSeer, A New Data-sharing Toolkit, From Idea to Launch

Part 1: the tool — making it more efficient to enforce data sharing

DataSeer is a newly launched tool, developed by Scholarly Kitchen writer Tim Vines and colleagues. It scans through articles and other texts to look for mentions of related research data; it then annotates the text with suggestions for sharing that data, for example, which repositories might be appropriate. The idea is to substantially improve levels of data sharing by providing much more specific, easy-to-follow guidance about what should be shared, where, and how (e.g., formats). It should help researchers to comply better with data policies, and help editors, publishers, and funders drive compliance more efficiently. I spoke to Tim about it recently and found it eye-opening to hear about the level of resource that currently goes into managing data policies:

“Publishers with strict policies currently use human curators to promote compliance. I’ve been a human curator myself. It’s a difficult, slow, error-prone process — going through authors’ articles, finding the datasets they *should* have shared, and telling them where to put them. Human curators have to be at least at a PhD level to understand the content, and it takes 20-40 minutes for them to process each article, including the back and forth with authors. That average hides the ones that take a REALLY long time — for example, in biochemistry, the authors might have collected up to 30 datasets per article, so identifying all of those is extremely hard work.”

While DataSeer has the potential to do pretty much all of this by itself, Tim guesses that in the first instance it will be used by human curators to speed up their processing of articles:

“One of the most time-consuming steps in the process – generating the list of datasets mentioned in the article – is reduced [by DataSeer] from about 20 minutes to approximately 5 seconds. Once the curator is satisfied that what DataSeer has found is aligned with what the journal wants, they can pass it back to the authors to complete the necessary data sharing. All of that communication can also be managed through the DataSeer interface, through a shared interface where actions are passed back and forth, which also speeds things up compared with composing, sending and reading lots of email messages.”
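As a rough illustration of the figures Tim quotes, the sketch below works out the time saved on the dataset-listing step alone. The 20-minute and 5-second values come from the quote; the annual article count is a hypothetical input chosen purely for illustration, not a DataSeer figure.

```python
# Time saved on the dataset-listing step, using the figures quoted above.
MANUAL_SECONDS = 20 * 60   # "about 20 minutes" per article (from the quote)
DATASEER_SECONDS = 5       # "approximately 5 seconds" per article (from the quote)


def listing_time_saved_hours(articles_per_year: int) -> float:
    """Hours saved per year on generating dataset lists."""
    saved_per_article = MANUAL_SECONDS - DATASEER_SECONDS
    return articles_per_year * saved_per_article / 3600


# Hypothetical journal volume, for illustration only.
example_articles = 1_000
print(f"{listing_time_saved_hours(example_articles):.0f} hours saved per year")
```

For a journal handling 1,000 articles a year, that single step would account for roughly 330 curator hours, before counting any of the other steps in the process.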

So it’s a “marginal gains” approach — a series of 3-5 minute steps can each be reduced or removed, achieving cumulative savings per article and thus increasingly substantial savings at the journal or publisher level. How else does it improve matters?

1. Compliance:

Authors are more likely to respond to and comply with data policies, because “it’s very granular — they’re being guided on how to provide each specific dataset, not just given a vague command to ‘share more data’. For each annotation, they can either provide an identifier for where they have now shared the data, or an explanation as to why they can’t share it.”

2. Discovery:

When DataSeer is involved in the curation and sharing of data, it is not only recommending that data is shared – it is also present when the sharing happens. This has two-way benefits — DataSeer “ensures that all sorts of useful metadata are passed to the repository, and ensures (drum roll) that a citation to the dataset is added to the article”.

3. Meta-analysis:

The current approach to meta-analyses is to extract summary statistics from published articles and try to draw conclusions about a treatment or intervention from those. Like many things, the accepted approach is a legacy from the print era that, when looked at afresh, is laughably reverse-engineered and alarmingly error-prone. Our inability to do powerful meta-analyses is a handicap at the best of times, but in the pandemic it’s actively costing lives. Tim hopes that DataSeer will move the needle on data sharing to the extent that much more medical data will become available for such analyses.

Part 2: the how-to — tips for entrepreneurs in the research sector

As well as being interested in DataSeer itself, I’m also interested in the story behind it. How did the team get it funded and developed? What lessons did they learn that might be useful to other entrepreneurs in our sector?

Be realistic about how long it takes to be successful

The short version of the DataSeer development story is that Tim had the initial idea in late 2017 (writing about it here in The Scholarly Kitchen), managed to win some grant funding (from the Sloan Foundation) in 2018, got the first demo version live in 2019, and refined that (and continued building the training data) for the current launch (2020). The next goal is to achieve break-even, which means scaling up awareness and adoption. Realistically that’s probably 1-2 years away, i.e., idea-to-sustainability will likely take about 5 years. “Remember, your early adopters may not be representative of your entire user base,” says Tim, in terms of the pace at which initial users will find out about, use and license a new product.

Focus on the benefits — and the differences between customers and users

DataSeer has two core user types:

Authors, for whom data sharing is time-consuming, with a bewildering range of expectations in terms of what data should be shared, and how / where.
Editors, for whom enforcing data sharing policies is difficult, slow and error-prone — requiring close reading of each article, and lots of back-and-forth with authors.

The product must be attractive, intuitive, and friction-free for both audiences. But it must also have a clear business case for the paying customer, who will probably not be either of those audiences and thus will not have the “empathetic” buy-in that comes from a personal experience of the problem being solved. In DataSeer’s case, the buyers (in the first instance) are likely the directors of editorial / publishing operations. Tim talks about having learned to have a “laser focus” on who the product was for, and quantifying the time savings (and thus cost savings) on human curation to be able to show that the tool would pay for itself.
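One way to frame the “pays for itself” argument is as a simple break-even calculation. The sketch below is only an illustration of that reasoning: the function name and both inputs (minutes saved per article and the hourly cost of a curator) are hypothetical and are not figures from DataSeer.

```python
def break_even_price_per_article(minutes_saved: float, curator_hourly_cost: float) -> float:
    """Highest per-article price at which the tool still pays for itself.

    Both arguments are assumptions supplied by whoever builds the business
    case; the currency is whatever the hourly cost is expressed in.
    """
    return minutes_saved / 60 * curator_hourly_cost


# Hypothetical inputs: 15 minutes saved per article, curator costing 40 per hour.
print(break_even_price_per_article(minutes_saved=15, curator_hourly_cost=40))
```

Any per-article price below the returned value leaves the buyer better off, which is the kind of number a director of editorial operations can act on without having experienced the curation problem personally.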

Don’t ask the driver to build the car

It’s really common in the research sector to see researchers themselves developing solutions to the problems they encounter. I don’t know if this is common in other sectors, too, but I could believe it is a quirk of the proximity between research and innovation, meaning researchers have the right sort of mindset to identify a problem and envision a solution. What researchers may not have is the range of other skills involved in entrepreneurship. For me, much of it comes down to marketing (I know — you knew I’d say that) — in the broad, professional sense of the word. Understanding markets, as above: having a clear understanding of who your users are and the problem you are trying to solve. Note the singular “problem” there — you need to get it down to ONE problem, at least at the beginning, or you will be trying to be all things to all people, stretch your development resource too thinly, and never be able to explain clearly and simply what it is you do.

For many researcher-innovators, being your own user starts as an advantage but can become a constraint: you are too close to your own use case, and may not recognize (for example) that it is not sufficiently universal to require a solution at scale. Or you may develop for too niche a use case, too specific to a single field where a little product expertise might have been able to broaden it out to support a bigger market. And getting the product right is only half the battle, if that, because there is no “build it and they will come”. Of all the researcher-innovators I’ve met over the years, all have lamented their inability to reach enough potential users and customers. Building awareness and uptake is incredibly hard and likely to be quite expensive. Which brings us on to the third reason why it is hard to make it as a researcher-innovator — attracting investment. Given the right product focus / benefits and some knowledge of the philanthropic funding sector, you can get quite far on the sort of funding that academics are very familiar with and successful at winning. But that will only ever get you so far — usually, those kinds of funders are more interested in helping you get off the ground than helping you grow.

Be aware of adjacent markets

For growth, or scale-up funding, you’re going to need investors. And every investor you meet will want to know the size of your “total addressable market” (known as the TAM). How many possible buyers are there of your new product or service? The TAM for innovations aimed at the scholarly / research sector is not big enough to attract the interest of most investors. As Tim puts it: “If you can charge $10 an article, your maximum revenue if you take over the whole world is $25m a year. The VCs yawn and walk away. If your TAM is less than $1bn they don’t want to know.” (As an aside, this is why so many start-ups in our sector get bought by the big players — those ‘strategic investors’ are often the only interested parties at the point when an innovation is trying to raise money for growth.) One way through is to be aware of “adjacent markets”, i.e., other audiences who have similar needs and could benefit from the product, or from a slightly tweaked version of it. Don’t let this distract from your focus on the benefits for your core markets in the short term — but do factor these adjacent markets in when you are setting out your TAM for potential investors. For DataSeer, there are obvious areas to expand into, for example helping government or commercial entities make datasets more discoverable (within or beyond the organization). Tim’s final tip for other entrepreneurs is to “think hard about your TAM, and how you can make it bigger. It will be time well spent, because it will give you more space to grow and more reason for others to believe in you.”
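The arithmetic behind Tim’s TAM example can be made explicit. The sketch below uses only the three numbers in his quote ($10 an article, $25m a year, a $1bn threshold); the function names are illustrative, and treating maximum revenue as the TAM follows the quote rather than any formal definition.

```python
PRICE_PER_ARTICLE = 10            # dollars per article, from the quote
MAX_ANNUAL_REVENUE = 25_000_000   # dollars per year, from the quote
VC_THRESHOLD = 1_000_000_000      # dollars, the TAM below which VCs lose interest


def implied_articles_per_year(revenue: int, price: int) -> int:
    """Article volume implied by a revenue ceiling and a per-article price."""
    return revenue // price


def growth_needed(threshold: int, revenue: int) -> float:
    """How many times larger the market must be to reach the threshold."""
    return threshold / revenue


print(implied_articles_per_year(MAX_ANNUAL_REVENUE, PRICE_PER_ARTICLE))  # 2500000
print(growth_needed(VC_THRESHOLD, MAX_ANNUAL_REVENUE))                   # 40.0
```

In other words, the quote assumes about 2.5 million articles a year, and the addressable market would need to be 40 times larger to clear the $1bn bar, which is why adjacent markets matter so much to the pitch.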

Are you an innovator in the scholarly / research sector? Let us know if you have a new tool that you think the Scholarly Kitchen’s audience would be interested to know about.

Charlie Rapple

Charlie Rapple is co-founder of Kudos, which showcases research to accelerate and broaden its reach and impact. She is also Vice Chair of UKSG and serves on the Editorial Board of UKSG Insights. Find her at @charlierapple.bsky.social, x.com/charlierapple and linkedin.com/in/charlierapple. In past lives, Charlie has been an electronic publisher at CatchWord, a marketer at Ingenta, a scholarly comms consultant at TBI Communications, and associate editor of Learned Publishing.