Thunderstorm in Kansas (Photo credit: Wikipedia)

At the end of August, the U.S. National Academies Board on Research Data and Information (BRDI) met in Washington, DC. The Board is primarily a group of scientists who monitor the scientific landscape and propose initiatives to the National Research Council (NRC) related to data collection, management of data systems, and data sharing within the research community.

One of the speakers that day was Kelvin Droegemeier, Professor at the University of Oklahoma. His presentation, A Strategy for Dynamically Adaptive Weather Prediction: Cyberinfrastructure Reacting to the Atmosphere, was fascinating from several perspectives. First, the simple notion that our weather prediction system — as good as it is — takes a one-size-fits-all approach that does not adapt to conditions and storm development was quite interesting. The research Droegemeier is leading to make the system adaptive, so that it could react both locally and quickly to focus attention on specific weather events (say, an approaching thunderstorm with tornado potential), was even more fascinating. Finally, the implications for large-scale data analysis and storage (where massive amounts of data are generated daily) are particularly relevant to the research and scholarly communications community.

Dr. Droegemeier made the point that there was little value in storing all the fine-grained data for replicability purposes, since the scale of data collection was too massive, and the window in which the data would be used so narrow, that keeping it all would be both unnecessary and irrelevant. Even at very low resolution, the volume of data gathered to measure the atmosphere is tremendous. One related fact he pointed out: most commercial airliners collect and transmit weather data every six seconds at low altitudes and every five to six minutes at higher altitudes while en route. This data stream alone accounts for over 140,000 observations per day. Combined with other data streams, such as meteorological stations, weather radar, weather balloons, and other collection methods, the amount of weather data is massive even at a macro level.

Just as we all do (or should) pay attention when important things happen, it makes sense to collect finer-grained data around specific events that are predicted by lower-resolution monitoring. There are times when one will want to pay attention to every available detail, but that level of data collection, monitoring, and analysis isn’t always appropriate. And for every reduction in the grid area being sampled, as in Droegemeier’s work, the scale of the data management problem grows accordingly. At the finest level of data preservation, we simply cannot manage data at that scale over the long term. This has implications for data stewardship and data preservation, which obviously vary by use case.

One can certainly retain the finest-grained data if, in retrospect, it captured an extraordinary discovery or event. However, if fine-grained detail was collected and nothing of consequence occurred, does that detail need to be preserved? Probably not, absent some other specific reason to do so. Obviously, this is a simplification, since one will want to retain some version of the data for re-analysis, but the raw data at its original resolution need not be preserved on an ongoing basis.

The same is true, perhaps at an even greater scale, for astronomical data. A group of organizations led by the NSF is planning to build the Large Synoptic Survey Telescope (LSST) facility in Chile. This single telescope will collect as much as 30 terabytes of data per night. Data at that scale could barely be transmitted, much less stored for any length of time in its raw form. Jeff Kantor, LSST data management project manager, described the problem:

At 1 Gbps, 30TB would take 67 hours to download (without overhead).

Even with a network operating at 2.5 times that speed, it would still take significantly longer to transmit the data than to gather or process it. Of course, we could build tremendous storage facilities to keep everything that the LSST creates, but this quickly runs into both cost and systems management issues. A more realistic approach is to process the data, transform it, and save specific elements, sections, or characteristics of the original data stream.
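Kantor's figure is easy to check with a back-of-the-envelope sketch in Python. The 30 TB and 1 Gbps numbers come from the quote above; the 2.5× line rate is the hypothetical faster network mentioned in the discussion:

```python
# Back-of-the-envelope transfer times for the LSST's nightly data volume.
# Ignores protocol overhead, as the original quote does.

def transfer_hours(terabytes: float, gbps: float) -> float:
    """Hours to move `terabytes` of data over a `gbps` gigabit-per-second link
    (decimal units: 1 TB = 1,000 GB = 8,000 gigabits)."""
    gigabits = terabytes * 8_000
    seconds = gigabits / gbps
    return seconds / 3600

print(f"30 TB at 1.0 Gbps: {transfer_hours(30, 1.0):.1f} hours")  # ~66.7, i.e. the "67 hours" quoted
print(f"30 TB at 2.5 Gbps: {transfer_hours(30, 2.5):.1f} hours")  # ~26.7, still longer than one night
```

Even at 2.5 Gbps, moving a single night's raw output takes more than a full day, which is why processing and reducing the stream beats shipping it whole.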

At issue is the granularity of a dataset and the processing power necessary to analyze it. If the resource is sufficiently large and the needed processing sufficiently complex or time-consuming, will anyone realistically ever reprocess it? One might process today’s data, or compare the results of processing yesterday’s data with what actually happened, but that comparison wouldn’t likely need the full dataset, only the outputs.

Inherent in this question is the replicability of large-scale data science and whether it is worth it, both from a large-scale data preservation perspective and from a science-needs perspective. Now, not every scientific domain is the same. Most do not operate at the same computational data level as, say, astronomy or physics. However, many fields are rapidly growing in the sophistication of their data analysis and the size of the datasets they analyze. For linguistic analysis, the text of all the books in the Library of Congress might amount to only a few terabytes of data. However, if one were to analyze all of the text on the Web, or text that also includes audio transcriptions, the amount of data would be considerably more massive.

There are at present few best practices for managing and curating data. Libraries have developed, over the decades, processes and plans for how to curate an information collection and to “de-accession” (i.e., discard) unwanted or unnecessary content. At this stage in the development of an infrastructure for data management, there is no good understanding of how to curate a data collection. This problem is compounded by the fact that we are generating far more data than we have capacity to store or analyze effectively.

This raises another question of data equivalence. I will discuss this at length in a future post, but essentially the issue of data equivalence is this — if one takes an original file or data stream and then transforms it in some way, does one still have the same data? There is a conceptual model for describing cultural content (books, movies, etc.) called the Functional Requirements for Bibliographic Records (FRBR, pronounced “fur-burr”). The FRBR model describes works, expressions, manifestations, and items, and how they relate to one another. No similar model exists for datasets. This gap will become particularly important as we begin to transform data from one form to another. As a simple example, suppose you have data in an Excel spreadsheet and save it as a text file. Do you still have the same thing, or is it no longer “equivalent” to the original? Possibly you do have the same data, but you have potentially lost important information: the formatting, the underlying formulas used to make calculations, or inserted comments that make the data meaningful.
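The Excel-to-text example can be made concrete with a toy sketch using only the Python standard library. The cell structure below is invented for illustration (it is not any real spreadsheet library's API); the point is what survives the flattening:

```python
import csv
import io

# A toy "spreadsheet cell": a value plus the formula and comment that give
# it meaning. (Illustrative structure only, not a real spreadsheet API.)
cells = [
    {"value": 300, "formula": "=SUM(A1:A2)", "comment": "Q3 + Q4 totals"},
    {"value": 100, "formula": None,          "comment": None},
]

# Saving as plain text (CSV) keeps only the computed values...
buf = io.StringIO()
csv.writer(buf).writerows([[c["value"]] for c in cells])
flat = buf.getvalue()

# ...so the formula and the explanatory comment are gone. Same numbers,
# but arguably no longer "equivalent" data.
print(flat)
```

The round trip is lossy by construction: nothing in the CSV records that 300 was derived rather than observed, which is exactly the kind of context a data-equivalence model would need to capture.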

Many publishers have thrown up their hands when looking at the data management problem and have responded that it is not the publishers’ problem to solve. Many publishers view the data management problem as something that the academy needs to address. Some libraries have embraced this notion and are looking to new ways that the library can serve the data curation needs of their community. One thing is clear, however: wherever this data is stored, it will need to be permanently linked to the scholarly communications chain. By what mechanism that linking occurs is certainly something that still requires a great deal of work and agreement among the affected parties.

In the very near future, the scientific community is going to have to come to grips with this data deluge. Publishers, too, will have to find a way to fit non-traditional content forms into their publications or the websites that contain supporting materials. Some work is already underway to address this. A NISO/NFAIS project is developing recommended practices for the handling of supplemental materials for journal articles, but this project excludes large-scale datasets. Other organizations, such as BRDI, DataCite, and CODATA, are exploring related issues. However, the much deeper questions of large datasets — what to preserve, at what level of detail and granularity, and whether all data is equally important to preserve — have yet to be fully addressed.

Todd A Carpenter

Todd Carpenter is Executive Director of the National Information Standards Organization (NISO). He additionally serves in a number of leadership roles in a variety of organizations, including as Chair of the ISO Technical Subcommittee on Identification & Description (ISO TC46/SC9), founding partner of the Coalition for Seamless Access, Past President of FORCE11, Treasurer of the Book Industry Study Group (BISG), and a Director of the Foundation of the Baltimore County Public Library. He also previously served as Treasurer of SSP.


8 Thoughts on "Does All Science Need to be Preserved? Do We Need to Save Every Last Data Point?"

Looking forward to hearing more from you on this matter. A few thoughts:

In some fields, we’re reaching a point where re-generating the dataset anew is cheaper and faster than organizing, annotating, storing, maintaining and downloading that dataset, which may make the questions of storage moot.

I’m always drawn back to a piece by Steven Wiley, where he suggests the above and also notes that, “Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes.” It speaks to the notion of different standards of retention for different types of data.

“In the very near future, the scientific community is going to have to come to grips with this data deluge.”

Excellent piece, Todd. I would emphasize that given the massive differences in the way scientists use, communicate, and value raw data, it makes little sense to create unified policies and practices. More likely, these policies and practices will need to be developed at the community level–by scientific societies and funding organizations.

Phil, I tend to agree with you. Data is but a tool within the research toolbox; like an odd-sized wrench, it can just lie there, but when it is needed it becomes the most important tool.

I tend to look at data from a business perspective and the closest analogy I can think of is marketing. It is said that 80% of all marketing dollars are wasted. Now if I only knew which 80%.

Lying in the heap of data may be a gem that is just waiting to be used. Also, we really are just beginning to collect data and figure out how to store it in a usable fashion. We are using rather crude computers and programs — after all, they too are really pretty new.

Thanks, Harvey. I don’t think the real issue is how to preserve and deliver data just in case it is needed in the future: it is about defining what data is. And this definition is discipline-centric. To a historian, an original audio recording preserved on a wax cylinder is raw data. To a computational linguist, it may be a dataset that breaks the recorded conversation down into word frequencies. To a cultural archaeologist, it may be a point on a geographic map that shows the placement of such artifacts. To a social scientist, it might be a value statement expressed in the recording that is plotted with other such statements over time… okay, you get the point.

Ultimately, the definition of data comes down to what the community of practice considers data. And often this definition is based on what is considered reasonable proof of a valid observation.

I tend to agree with Phil’s remarks above. Ultimately the data preservation issue is a funding issue, and the funders are addressing it at the grant level. In the US, the funding agencies several years ago formed the Interagency Working Group on Digital Data (IWGDD). I was lucky to do staff work for them from their first meeting until their first report in January 2009.

The upshot was that each agency should consider having its own policy. At NSF, this meant requiring a data management plan with every proposal. Today the general policy is that if someone thinks certain data is worth saving, and they can get the money to save it, so be it. It is not clear that anything more is required, so the deluge is being addressed.

Part of the problem is that the concept of data is so vague and ambiguous that there can be no general policy about it. What counts as a project’s data can range over many orders of magnitude.

Moreover, preservation and access are an ongoing expense, so it is not possible to add new data every year without diminishing research funding accordingly. The popular claim that all data should be preserved is simply nonsense. We need to discard data as quickly as we save it. Thus funding is the decision mechanism, and that mechanism is in place today. It might be better organized, but that is about all that can be done.
