At the end of August, the U.S. National Academies Board on Research Data and Information (BRDI) met in Washington, DC. The Board is primarily a group of scientists who monitor the scientific landscape and propose initiatives to the National Research Council (NRC) related to data collection, management of data systems, and data sharing within the research community.
One of the speakers that day was Kelvin Droegemeier, Professor at the University of Oklahoma. His presentation, A Strategy for Dynamically Adaptive Weather Prediction: Cyberinfrastructure Reacting to the Atmosphere, was fascinating from several perspectives. First, the simple notion that our weather prediction system, as good as it is, takes a one-size-fits-all approach that has not been at all adaptive to conditions and storm development was quite interesting. The research Droegemeier is leading toward making the system adaptive, so that it could react both locally and quickly to focus attention on specific weather events (say, an approaching thunderstorm with tornado potential), was even more fascinating. Finally, the implications for large-scale data analysis and storage (where massive amounts of data are generated daily) are particularly interesting to the research and scholarly communications community.
Dr. Droegemeier made the point that there is little value in storing all of the fine-grained data for replicability purposes: the scale of data collection is too massive, and the window in which that data is useful is so narrow that preserving it would be both unnecessary and irrelevant. The volume of data being gathered to measure the atmosphere, even at very low resolution, is tremendous. One fact he cited: most commercial airliners collect and transmit weather data every six seconds at low altitudes and every five to six minutes at higher altitudes while en route. This data stream alone is responsible for over 140,000 observations per day. Combined with other data streams, such as meteorological stations, weather radar, weather balloons, and other collection methods, the amount of weather data is massive even at a macro level.
Just as we all do (or should) pay attention when important things happen, it makes sense to collect finer-grained data related to specific events that are predicted by lower-resolution monitoring. There are times when one will want to pay attention to every available detail, but that level of data collection, monitoring, and analysis isn't always appropriate. And every refinement of the sampling grid, as in Droegemeier's work, multiplies the scale of the data management problem: halving the grid spacing alone quadruples the number of sample points in each layer. If we tried to preserve everything at full resolution, we simply could not manage data at that scale over the long term. This has implications for data stewardship and data preservation, which obviously vary based on the use case.
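The way refinement compounds can be made concrete with a back-of-envelope calculation. The region size and grid spacings below are hypothetical, not drawn from Droegemeier's work; the point is only the scaling behavior:

```python
# Illustrative back-of-envelope: how data volume grows as a sampling
# grid is refined. All figures are hypothetical.

def grid_points(width_km, height_km, spacing_km):
    """Number of sample points covering a region at a given grid spacing."""
    return (width_km // spacing_km) * (height_km // spacing_km)

region = (1000, 1000)  # a hypothetical 1,000 km x 1,000 km region

coarse = grid_points(*region, spacing_km=10)  # 10 km grid spacing
fine = grid_points(*region, spacing_km=5)     # refined to 5 km

print(coarse, fine, fine / coarse)  # halving the spacing quadruples the points
# Halving the sampling interval in time as well would multiply volume by 8.
```

Refine in three spatial dimensions plus time and the multiplier per halving is 16, which is why storage demands outrun capacity so quickly.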
One can certainly retain the finest-grained data when, in retrospect, it captured an extraordinary discovery or event. However, if fine-grained detail was collected and nothing of consequence occurred, does that detail need to be preserved? Probably not, absent some other specific reason to do so. Obviously, this is a simplification, since one will want to retain some version of the data collected for re-analysis, but the raw data, at its original resolution, need not be preserved on an ongoing basis.
The same is true, perhaps at an even greater scale, of astronomical data. A group of organizations led by the NSF is planning to build the Large Synoptic Survey Telescope (LSST) facility in Chile. This single telescope will collect as much as 30 terabytes of data per night. Data at that scale could barely be transmitted, much less stored for any length of time, in its raw form. Jeff Kantor, LSST data management project manager, described the problem:
At 1 Gbps, 30TB would take 67 hours to download (without overhead).
Even with a network operating at 2.5 times that speed, it would still take significantly longer to transmit the data than to gather or process it. Of course, we could build tremendous storage facilities to keep everything that the LSST creates, but this quickly runs into both cost and systems management issues. A more realistic approach is to process the data, transform it, and save specific elements, sections, or characteristics of the original data stream.
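Kantor's figure is easy to verify. A minimal sketch of the arithmetic, using decimal units and ignoring protocol overhead as the quote does:

```python
# Check the quoted transfer time: 30 TB over a 1 Gbps link,
# ignoring protocol overhead (decimal units: 1 TB = 1e12 bytes).

def transfer_hours(terabytes, gbps):
    bits = terabytes * 1e12 * 8      # payload size in bits
    seconds = bits / (gbps * 1e9)    # link rate in bits per second
    return seconds / 3600

print(round(transfer_hours(30, 1.0)))     # ~67 hours, matching the quote
print(round(transfer_hours(30, 2.5), 1))  # ~26.7 hours even at 2.5x the speed
```

Even the faster link leaves the transfer running far longer than the single night it took to gather the data, which is the crux of the problem.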
At issue is the granularity of a dataset and the processing power necessary to analyze it. If the resource is sufficiently large and the needed processing sufficiently complex or time consuming, will anyone realistically ever reprocess it? One might process today's data, or compare the results of processing yesterday's data with what actually happened, but that comparison wouldn't likely need the full dataset, simply the outputs.
Inherent in this question is the replicability of large-scale data science and whether it is worth it, both from a data preservation perspective and from a science-needs perspective. Now, not every scientific domain is the same. Most do not operate at the computational scale of, say, astronomy or physics. However, many fields are rapidly growing in the sophistication of their data analysis and the size of the datasets they analyze. For linguistic analysis, the text of all the books in the Library of Congress might equate to only a few terabytes of data. However, analyzing all of the text on the Web, or text that also includes audio transcriptions, would involve considerably more.
There are at present few best practices for managing and curating data. Libraries have developed, over the decades, processes and plans for how to curate an information collection and to “de-accession” (i.e., discard) unwanted or unnecessary content. At this stage in the development of an infrastructure for data management, there is no good understanding of how to curate a data collection. This problem is compounded by the fact that we are generating far more data than we have capacity to store or analyze effectively.
This raises another question: data equivalence. I will discuss this at length in a future post, but essentially the issue is this: if one takes an original file or data stream and transforms it in some way, does one still have the same data? There is a conceptual model for describing cultural content (books, movies, etc.) called the Functional Requirements for Bibliographic Records (FRBR, pronounced "fur-burr"). The FRBR model describes works, the expressions and manifestations of those works, and individual items, and how they are related. No similar model exists for datasets. This gap will become particularly important as we begin to transition from one data form to another. As a simple example, consider saving data from an Excel spreadsheet as a text file. Do you still have the same thing, or is it no longer "equivalent" to the original? Possibly you do have the same data, but you have potentially lost important information: formatting, the underlying formulas used to make calculations, or inserted comments that make the data meaningful.
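The spreadsheet example can be made concrete. In the hypothetical sketch below, a cell is modeled as its visible value plus the formula and comment that give it meaning; a save-as-text export keeps only the values, and the rest is silently discarded:

```python
import csv
import io

# Hypothetical model of spreadsheet cells: the visible value plus the
# formula and comment that make the value meaningful.
cells = [
    {"value": 42, "formula": "=SUM(A1:A6)", "comment": "Q3 totals, revised"},
    {"value": 17, "formula": None, "comment": None},
]

# "Save as text": only the visible values survive the export.
out = io.StringIO()
csv.writer(out).writerow([c["value"] for c in cells])

exported = out.getvalue().strip()
print(exported)  # "42,17" -- the formula and the comment are gone
```

Whether `42,17` is "the same data" as the original spreadsheet is exactly the equivalence question: the values round-trip, but the context that explains them does not.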
Many publishers have thrown up their hands at the data management problem, responding that it is not theirs to solve but something the academy needs to address. Some libraries have embraced this notion and are exploring new ways to serve the data curation needs of their communities. One thing is clear, however: wherever this data is stored, it will need to be permanently linked to the scholarly communications chain. The mechanism by which that linking occurs still requires a great deal of work and agreement among the affected parties.
In the very near future, the scientific community is going to have to come to grips with this data deluge. Publishers, too, will have to find a way to fit non-traditional content forms into their publications or the websites that contain supporting materials. Some work is already underway to address this. A NISO/NFAIS project is developing recommended practices for handling supplemental materials for journal articles, though that project excludes large-scale datasets. Other organizations, such as BRDI, DataCite, and CODATA, are exploring related issues. However, the deeper questions of what to preserve from large datasets, at what level of detail and granularity, and whether all data is equally important to preserve have yet to be fully addressed.