Bear with me while I torture an analogy between an actual lake and the training corpus (“data lake”) of an artificial intelligence (AI) system.

You are hiking on a hot day in an area neither remote nor urban. You come upon a lake that has a self-service canoe rental. You are hot, thirsty, and enjoy canoeing. You cannot see into the lake so you do not know if it contains plants, fish, chemicals, or something worse. Still, the water looks clean. There is no one to ask and no way to test the water chemically. Are you willing to (a) drink the water, (b) swim in the water, (c) canoe on the water, or (d) none of the above?

The water might be perfect, pristine, and chemical free, but you need to know this for certain before you will drink from it. You might risk swimming if the water looks clear, smells OK, and is not near an industrial plant. You may keep your mouth closed — and maybe your eyes. And if you think it may be filled with toxic sludge you might not even risk getting splashed while in your canoe.


What does this have to do with AI? While generative AI such as ChatGPT gets all the buzz, AI in other, typically more targeted, forms has been used for years to solve business and research challenges. The utility of AI, and its value to businesses, is directly related both to the quality of the inputs and to the information available about which inputs were used. Even when an AI service is trained on high-quality content, without proper documentation and an audit trail users cannot be sure what they are using and may stay away.

When thinking about AI, I like to borrow a rubric from academic assessment: high stakes vs. low stakes. In that context, high stakes means that the outcome itself will be used for an important decision (e.g., an Advanced Placement test), while low stakes means that the outcome has value as part of something else (e.g., a practice test that helps a teacher know what to emphasize in class).

When using AI in high stakes decision making, you want to know that your training corpus (i.e., the “lake” in our analogy) is pristine, and you need to know what is in it. For a pharmaceutical company using AI in decision-making research, the training corpus should be composed of final Versions of Record (VoR). The researcher needs to know that the corpus excludes unwanted content, such as material sourced from predatory journals or “junk” science.

In a low stakes environment there can be a higher tolerance for ambiguity. The same pharmaceutical company researcher may need to simply identify potential experts in a field, which would require a less pristine training corpus; preprints can be included and perhaps even a little “junk” science may be acceptable. But the key point is this: unless the AI service provider has disclosed in writing what is included in the training corpus, that corpus will never be able to make the jump to high stakes applications. It’s OK for swimming, but not drinking.

While there can be some value even in polluted lakes, there is a point at which there is too much pollution for most uses. Bias in AI, including racial bias, is well documented. In order to combat this, responsible governments and governmental organizations are moving to regulate AI, focusing on issues such as ethical use and transparency. For example, the OECD’s ethical AI principles include: “AI Actors should commit to transparency and responsible disclosure regarding AI systems. To this end, they should provide meaningful information, appropriate to the context, and consistent with the state of art….”

To some degree, the market should solve for transparency. A company using AI in hiring decisions risks lawsuits if that AI was trained on racially biased data. A company using AI for autonomous flight… well, I shouldn’t have to explain what could go wrong there. Many AI systems, such as ChatGPT, are based on large language models and can be unreliable due to “hallucinations”; as such, they are not fit for purpose for high stakes use. And while in-context learning currently shows promise for reducing the amount of data needed for retraining, high stakes uses will still need higher value, transparently documented content for training the large language models themselves.

While large data sets scraped from the web (often without consent) are all the rage, the use of high quality, documented data will be important in advancing science. The best decisions are made on the best (possible) data. Check for posted notices about water quality.

Roy Kaufman

Roy Kaufman is Managing Director of both Business Development and Government Relations for the Copyright Clearance Center (CCC). Prior to CCC, Kaufman served as Legal Director, John Wiley and Sons, Inc. He is a member of, among other things, the Bar of the State of New York, the Authors Guild, and the editorial board of UKSG Insights. Kaufman also advises the US Government on international trade matters through membership in the Industry Trade Advisory Committee (ITAC) 13 – Intellectual Property and the Library of Congress’s Copyright Public Modernization Committee, in addition to serving on the Board of the United States Intellectual Property Alliance (USIPA).
