Engineers, being analytical by nature, love data. They also depend on data from a multitude of sources to conduct their research. At the American Society of Civil Engineers (ASCE), we have been rolling out Data Availability Statements to enable data reusability. With each distinct community within civil engineering, we bump into new caveats, complications, and considerations around sharing data.
Within the broad field of civil engineering, there are sub-disciplines, some of which share characteristics with (and collaborate with) chemistry, physics, materials science, and biology, to name a few. Because of these differences in practices, nailing down one policy on data availability statement requirements can be tough.
Ithaka S+R recently completed a report “Supporting the Changing Research Practices of Civil and Environmental Engineering Scholars.” This report is part of a series in which university libraries interview faculty and students in specific departments to ascertain where the pain points are and what might be done to assist researchers in the discipline. To help this project come into being, ASCE provided sponsorship and I served as an advisor. Danielle Cooper and Rebecca Springer were the principal authors, along with research teams from eleven research universities in the United States and Canada.
The Ithaka Study spent considerable time talking to researchers about data: how they find it, what they do with it, where they keep it, and how comfortable they are with sharing it. Before I dig into some of those details, it’s important to note where funding for research comes from in civil engineering.
Funding for civil engineering-related research comes from government agencies in the US and other countries. A large percentage of funding also comes directly from industry or industry groups. In many cases, the industry data, either provided to a researcher or procured by the researcher in the course of the industry-funded work, is considered proprietary and/or confidential.
Expectations of confidentiality were the first hurdle to requiring data availability statements. As such, we allow authors to state that the data is proprietary in the statements; however, these data sharing limitations will slow the discovery of data and replication of works.
Data is often procured from the government, both federal and state.
How Do Researchers Find the Data?
The Ithaka Study confirmed what we had been hearing. Much data discovery is through personal contacts, journal articles, or conference presentations. Repositories, with a few exceptions, are not proving useful, though I will be interested to see how the new Google Dataset Search changes these practices.
Civil engineers interviewed for the Ithaka Study explained that some datasets need to be purchased from local governments or companies. These are additional expenses for which funding is not always available. Accessibility of government data is certainly unstable. Shortly after the Trump Administration was installed, the EPA removed lots of climate data.
State resources can be even tighter as seen in this example from the study:
“Graduate students are sometimes responsible for the labor of collecting difficult-to-access data: an interviewee reported that one of their students has assembled data from hundreds of water utilities by submitting Freedom of Information Act requests.”
Where Is Data Kept?
Perhaps the most concerning part of the study is about how data is being stored.
“Most research groups maintain a rough-and-ready system whereby students and postdocs keep data on their computers and periodically upload it to shared Box, Dropbox, or Google Drive folders. Group leaders make a particular effort to collect data from students before they graduate, but may struggle to navigate or understand that data later on.”
This is obviously not ideal. Other anecdotes included difficulties in transferring large datasets and having to resort to mailing hard drives to each other or providing login credentials to access the servers hosting the data.
Researchers receiving federal grant funding have been mandated to have data management plans for several years. I assumed that this would make instituting a data availability statement pretty easy. That assumption was wrong.
Several researchers interviewed stressed concerns about the resources required to support data management efforts and the lack of funding available to do so.
The use of data repositories is uneven. The Natural Hazards Engineering Research Infrastructure (NHERI)’s DesignSafe repository is often held up as the “gold standard”; however, it is a curated database with adequate technical support, a luxury not afforded to all repositories.
Feedback we recently received while rolling out our policy for the transportation engineering journals is that the researcher may not be able to guarantee that the data they used will be available in perpetuity. Some of the data is regularly destroyed after a set period of time.
Researchers also shared concerns in the Ithaka Study about the durability of government repositories: “…a repository funded by the National Science Foundation (NSF) [was described] as ‘constantly functioning on a shoestring budget…nothing NSF is forever’.”
Will Civil Engineers Share Their Data?
This was the first question we asked when the editor of the Journal of Construction Engineering and Management first started asking about availability statements several years ago. To accommodate the need for proprietary data, we allowed authors to declare that the data is available by request. Realizing that this has the potential to be the ultimate cop-out, we monitored the use of this option, and it was, in fact, the most common statement made in the first year of statements being required.
Since launching statements for that journal, we have learned to refine the statements to ensure that authors must tell readers why exactly data is not available. Some of what the Ithaka Study heard explains the high frequency of those statements. Interviewees reported that “trust” was very important in getting others to share data:
- “Sometimes we’ll contact the authors [of journal articles] and ask them for the raw data. It’s really rare that they respond, or give us anything, unless I know them personally.”
- “Since we’re usually trying to get other people’s data rather than the other way around, we have to first show them, usually with a little subset of the data, that we could do something very interesting with it, and they can be either lead authors or joint authors with us,” one researcher explained.
- “Unless it’s been really well documented, how their test set up was constructed and what types of data they’re using, it can be really difficult to interpret the data correctly.”
For some, the work of preparing the data to be shared is a burden they may put off until someone asks them for the data. It was seen as a waste of time to do all that work, just to have the data sit in a repository unused.
This is, at its core, the point of data availability statements and even public data sharing mandates. As research practices adjust to the requirements of making data public, the work process will need to shift so that data is carefully maintained and documented as the work is being done. That shift would alleviate some of the slog at the end of the process of making the data useful to others.
ASCE authors and researchers from the Ithaka Study report challenges with contextualizing data. They need to know how it was procured and what was done with it. To address this need for information, ASCE will be publishing Data Papers that can cite the dataset and provide the information users need to confidently re-use data.
Since rolling out requirements for data availability statements, we have learned a lot about what caveats researchers deal with in sharing data — confidentiality, industry considerations, file size, perpetual access, board approval, etc. None of these were insurmountable and we have been able to craft a policy that addresses the concerns or practices for different sub-disciplines.
The feedback from authors has been overwhelmingly supportive of data sharing expectations. The field in general depends on re-using loads of data, and yet the Ithaka Study uncovered lots of obstacles to doing so. Journals could help in removing some of those obstacles:
- Require statements and expect authors to be explicit about the availability of the data
- Understand that there may be reasons data can’t be shared, but require authors to disclose those reasons (not just “because I don’t want to”)
- Provide resources for authors on best practices for citing data — the concern about data papers is that they won’t count toward tenure and promotion. Proper citation practices should alleviate those concerns.
Clearly federal agency funders and institutions should play a role in helping smooth out the bumps in the road to data transparency for civil engineers. Unfunded mandates, for example, don’t appear to be very helpful when it comes to complicated data that needs lots of management. At the same time, it seems clear that building a system that can benefit from government funding when it is available, but at the same time endure resiliently when such funding is interrupted, is a key principle and an important challenge.