Identifiers play a critical role in our online environment, even if we don’t see them or acknowledge that they are operating behind the scenes. Formal identification isn’t nearly as simple as it might appear at first glance. Everyone in the publishing community is familiar with ISBNs and ISSNs, and, in scholarly communications, likely the DOI as well. But the rules and minutiae of assignment are something few focus on. These questions become especially important as new formats of exchange and new methods of working develop in our community. Research data is one such format now gaining prominence in scholarly communications.
Distinguishing one thing from another is not always a simple prospect, especially in the domain of digital information. What are the critical aspects that make one thing significantly different from another? As digital books have rapidly expanded, the traditional distinguishing characteristics of books are no longer as obvious as paperback versus hardcover. By my count, there are at least 14 ways in which digital books can be distinguished from other manifestations. Some of these might be critical to readers, some less so. Where we as a community draw those distinguishing lines becomes important when we talk about the supply chain and identifiers.
Last year, the Book Industry Study Group and the International ISBN Agency published separate but similar recommendations on the assignment of ISBNs to e-books [BISG: Best Practices for Identifying Digital Products; International ISBN Agency: Guidelines for the assignment of ISBNs to e-books]. Both follow a fairly simple guideline: the point at which a different file is traded is the point at which a new ISBN should be assigned. (Many people are unaware that an ISBN identifies a “product” and not the intellectual “content” of a book.) Of course, this creates a metadata management issue if there are dozens of different ISBNs for the same book in its variety of forms. One solution to this proliferation is an overarching work identifier that can collocate information about a work. Discussions are underway within BISG, NISO, and EDItEUR to promote the ISO International Standard Text Code (ISTC) for this purpose.
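As a rough illustration of the product-versus-work distinction, the sketch below maps several product-level ISBNs onto a single work-level identifier so that information about the work can be collocated. All of the identifier strings are invented placeholders, not real ISBN or ISTC assignments, and the structure is hand-rolled for the example rather than drawn from any registry's actual data model.

```python
# Hypothetical mapping from one work-level identifier (ISTC-style) to the many
# product-level ISBNs that embody it. All identifier strings are placeholders.
work_registry = {
    "WORK-PLACEHOLDER-0001": {
        "title": "An Example Monograph",
        "products": {
            "ISBN-PLACEHOLDER-HB": "hardcover",
            "ISBN-PLACEHOLDER-EPUB": "EPUB, no DRM",
            "ISBN-PLACEHOLDER-EPUB-DRM": "EPUB, vendor-specific DRM",
            "ISBN-PLACEHOLDER-PDF": "PDF",
        },
    }
}

def products_for_work(work_id: str) -> list:
    """Collocate every product ISBN recorded for a given work identifier."""
    return sorted(work_registry[work_id]["products"])

print(products_for_work("WORK-PLACEHOLDER-0001"))
```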
Two weeks ago, I posed the question on this blog of whether each data point should be preserved. In that article, I mentioned the need to consider the identification of data sets online. Over the past two years, the DataCite initiative has developed significant momentum. Beginning in 2009 with seven members, DataCite now boasts 19 members, and collectively they have assigned over a million DOIs to data sets. One issue that DataCite has not resolved, leaving the decision to its respective members and registration authorities, is when to assign a new DOI to something, a question that is more problematic for datasets than for something like a journal article. No hard rules exist within DataCite about when a new DOI should be applied, at what granularity, or how freely available the data will be. DataCite does have a metadata structure tied to DOI assignment, but even among DataCite members it is not yet fully adopted, and its adoption outside the DataCite community is certainly limited. It is early days yet for DataCite and for the scholarly communications community’s practices around the management of data. Things will develop, probably slowly, over the next decade as data sharing and data citation become more recognized and accepted practices in scientific publishing.
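To make the metadata point concrete, here is a minimal sketch of the kind of record a registrant might pair with a dataset DOI. The field names loosely follow the small set of mandatory properties in DataCite's metadata schema (identifier, creator, title, publisher, publication year), but they are not the schema's exact element names, and the DOI and values are invented for the example.

```python
# Illustrative only: field names approximate DataCite's mandatory properties,
# and the DOI below is a made-up string, not a real registration.
dataset_record = {
    "identifier": "10.9999/example.dataset.2011.001",
    "creators": ["Example Research Group"],
    "title": "Surface temperature observations, Station X, 1990-2010",
    "publisher": "Example Data Repository",
    "publication_year": 2011,
}

def data_citation(record: dict) -> str:
    """Render a simple citation string from the record."""
    creators = "; ".join(record["creators"])
    return (f"{creators} ({record['publication_year']}): {record['title']}. "
            f"{record['publisher']}. https://doi.org/{record['identifier']}")

print(data_citation(dataset_record))
```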
This brings us to a fairly critical set of questions: when identifiers are assigned, at what level of granularity they are assigned, and whether the DOI is the most appropriate identifier for the purpose. While there are conceptual models that describe the landscape and transformations of cultural content (books, films, music, etc.), these models don’t adequately describe the lifecycle of data or the complexity of transformations possible. Take the most referenced conceptual model for cultural works, Functional Requirements for Bibliographic Records (FRBR). This model describes the major concepts of works, expressions, manifestations, and items. It also describes persons, concepts, objects, events, and places. However, even within the FRBR context, many questions arise when it is applied to datasets. Does an item consist of the individual data elements or the set comprising those elements? If a dataset is extended by more recent data collection, does the resulting dataset constitute a new work, a derivative work, a new expression, or something else? Using a simple example, if the data is contained in a Microsoft Excel file and is migrated from one version of Excel to a newer version, is this a new expression? Would simply changing the font style constitute a variation? Does one type of file serialization compared with another constitute a new version?
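One way to see why these questions are slippery is to write the FRBR chain down. The toy sketch below uses FRBR's own entity names; how a dataset migration maps onto them is exactly the open question, so the comments record only one possible reading, not a settled answer.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:                 # a concrete copy, e.g. one file on one server
    location: str

@dataclass
class Manifestation:        # a particular embodiment, e.g. a specific file format
    media_type: str
    items: List[Item] = field(default_factory=list)

@dataclass
class Expression:           # a particular realization of the content
    description: str
    manifestations: List[Manifestation] = field(default_factory=list)

@dataclass
class Work:                 # the abstract intellectual creation
    title: str
    expressions: List[Expression] = field(default_factory=list)

# One possible reading: migrating the spreadsheet from an older Excel format
# to a newer one yields a new manifestation of the same expression, while
# appending another year of observations arguably yields a new expression
# (or, on a stricter reading, a new work).
survey = Work("Station X temperature observations")
through_2010 = Expression("observations through 2010")
through_2010.manifestations.append(Manifestation("application/vnd.ms-excel"))
through_2010.manifestations.append(Manifestation(
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"))
survey.expressions.append(through_2010)
```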
These variations are not trivial, especially when dealing with data, nor are the implications for how one identifies them. If a video file is gathered and preserved, but the file is migrated from the original MTS format produced by the camera to the MP4 encoding format, do you still have the same content? This problem could be viewed through the lens of scientific equivalence; some work on this has been done, notably by Allen Renear, David Dubin, Simone Sacchi, and Peter Buneman/Wang-Chiew Tan. If, for a variety of reasons, data must be compressed or the resolution changes, how might that affect the assignment of identifiers and the identification of versions? The question could also be viewed from the perspective of preservation. Several people in the archives and preservation community have explored these issues, notably Martin Doerr and Patrick LeBoeuf, and Luciana Duranti. Some fields, such as atmospheric studies and astronomy, have been more deeply involved in these questions because of their history of dealing with large-scale datasets. However, as yet no suitable model exists at a sufficiently abstract level that it can be broadly applied across the growing number of domains that are rapidly incorporating data into their scholarly communications.
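A small example makes the problem concrete: the most mechanical test of sameness we have, a cryptographic checksum of the file, fails as soon as a format migration occurs, even when we would say the content is unchanged. The file names below are placeholders.

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical files: the original camera output and its preservation copy.
original = sha256_of("fieldwork-recording.mts")
migrated = sha256_of("fieldwork-recording.mp4")

# The digests will differ, so bit-level identity cannot be the test of
# equivalence; any judgment of "same content" has to happen at a higher
# level (the decoded frames, or the extracted data values) than the
# bitstream an identifier typically points to.
print(original == migrated)  # False, despite the "same" recording
```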
Granularity is another complex issue. Just as with books, it probably doesn’t make sense to identify every component element. Our community doesn’t identify each word or paragraph within a book. And we are only testing the waters with the assignment of ISBNs to chapters, and only where they are individually sold; take-up has been slow, in large part because that is not how content has been consumed. A single data point might be valuable, but there are likely better citation methods to identify a single measure (although it should be noted that, in a digital environment, citation of specific points in digital texts is not as simple as one might first think). Even if we move identification to the dataset level, there are questions about what the set is, especially as datasets change over time. Might the thing being identified be a cover page to the dataset that provides access as well as metadata describing the object?
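One way such a "cover page" is often realized in practice is a landing page: an identified record that carries the descriptive metadata and points out to the data files and their versions, with finer-grained references expressed relative to it. The sketch below only illustrates that indirection; the structure, identifier, and values are all invented for the example.

```python
# A hypothetical landing-page record: the identifier points here, and the
# record in turn points to the files, versions, and any finer-grained slices.
landing_page = {
    "identifier": "10.9999/example.dataset.2011.001",   # made-up DOI
    "title": "Surface temperature observations, Station X",
    "versions": {
        "v1": {"released": "2010-06-01", "files": ["obs_1990_2009.csv"]},
        "v2": {"released": "2011-06-01", "files": ["obs_1990_2010.csv"]},
    },
}

def reference(dataset_id: str, version: str, fragment: str = "") -> str:
    """Build a citation-style reference to a version, or to a slice within it."""
    ref = f"{dataset_id} (version {version})"
    return f"{ref}, {fragment}" if fragment else ref

# Citing the whole 2011 snapshot versus a single measurement within it:
print(reference(landing_page["identifier"], "v2"))
print(reference(landing_page["identifier"], "v2", "row 40213 (1998-07-14 reading)"))
```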
It is still an open question which identifier(s) might be best suited for research data, or whether a wholly new identifier is required. DOIs are wonderful identifiers that have proven extremely useful in linking citations in the scholarly literature, and their use is growing rapidly outside of that traditional space. DOIs are being used to add resolution functionality to ISBNs and ISSNs, and within the film and TV industry as part of the EIDR system. And DataCite is using them for data as well. This raises the question of whether the DOI is the appropriate identifier for every object potentially contained or described within a digital system. Since the discovery and management of nearly every object is moving toward digital form, does our industry need to set aside the question of which identifier to use and simply apply the DOI to everything? I’m not certain that’s the best approach, as useful as the resolution functionality provided by the DOI is. The California Digital Library is assigning a different kind of identifier, the Archival Resource Key (ARK), to datasets in its collection. But it is also assigning DOIs. It is instructive to consider CDL’s rationale and position on this. Joan Starr, CDL’s Strategic & Project Planning Manager, spoke at a NISO event in September about the value of different ID systems and the differences in their needs and approaches. You can view her presentation slides.
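For readers unfamiliar with the two schemes, the sketch below shows only the surface structure of each identifier and the usual way each resolves; the identifier strings themselves are invented, not real registrations.

```python
# Both identifier strings are made up and used only to show the syntax.
doi = "10.9999/example.dataset.42"   # DOI: a "10.<registrant>" prefix, "/", then a suffix
ark = "ark:/99999/fk4example42"      # ARK: "ark:/", a naming authority number, "/", a name

def doi_url(identifier: str) -> str:
    # DOIs resolve through the central Handle-based proxy at doi.org.
    return f"https://doi.org/{identifier}"

def ark_url(identifier: str, resolver: str = "https://n2t.net") -> str:
    # ARKs resolve through whichever resolver the naming authority supports;
    # n2t.net, operated by the California Digital Library, is a common choice.
    return f"{resolver}/{identifier}"

print(doi_url(doi))
print(ark_url(ark))
```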
The work of Hedstrom, Doerr, Europeana, and Renear provides particularly good examples of efforts on these questions. NISO and several of these thought leaders are exploring the development of a high-level conceptual model that describes data and its transformations. Our present goal is to seek funding to organize an initiative to develop this conceptual model. The first step in this process will be to gather experts from the scientific community, leaders in repository management, and people expert in the creation of information management models to frame the issue. Just as the existence of other models hasn’t answered every question about the relationships between cultural content items, we don’t expect this project to provide the definitive description of data relationships. This initiative will, however, provide a launching point for the ensuing robust discussions on identifiers, metadata, preservation, and data management practices. This is a conversation the scientific community will need to have as the use of data expands in scholarly communications.
Discussion
7 Thoughts on "These Data Are Different from Those — Data Equivalence and Identification Issues"
About assigning ISBNs to book chapters, you say “that is not how content has been consumed.” But hasn’t the CCC been doing business licensing use of book chapters (as well as journal articles) for nearly 35 years?
This is a fascinating and important set of issues, but I suspect they will be resolved at different levels of granularity depending on the issue and the discipline, possibly even at the problem level. In some cases standards may not be feasible or even appropriate, except locally. Thus I am especially skeptical about what you call a high-level conceptual model of data. Still, it will be fun and useful to try.
I studied this problem when I did staff work for the US IWGDD. Here is an example of a worst-case data scenario from climate science, though I imagine every discipline has them. It is global temperature for a given year. At its simplest this is a single number as estimated by a major statistical model. But it is based on a complex set of layered computer manipulations, each of which has its inputs, outputs, software, assumptions, simplifications, and rationales. All are potentially data or metadata. Beneath this lie millions of thermometer readings, which themselves have attendant notes and related information.
In this case, what we may call the data ranges over many orders of magnitude, from a single number to millions of readings. Different pieces and levels are available in many different forms and locations. So while it is all there, it is far from clear what the data even means, much less how we would label it, especially since much of it has other uses as well.