Data Transparency and Civil Engineers

Engineers, being analytical by nature, love data. They also depend on data from a multitude of sources to conduct their research. At the American Society of Civil Engineers (ASCE), we have been rolling out Data Availability Statements to enable data reusability. With each distinct community within civil engineering, we bump into new caveats, complications, and considerations around sharing data.

Within the broad field of civil engineering, there are sub-disciplines, some of which share characteristics with (and collaborate with) chemistry, physics, materials science, and biology, to name a few. Because of these differences in practices, nailing down one policy on data availability statement requirements can be tough.

Ithaka S+R recently completed a report, “Supporting the Changing Research Practices of Civil and Environmental Engineering Scholars.” This report is part of a series in which university libraries interview faculty and students in specific departments to ascertain where the pain points are and what might be done to assist researchers in the discipline. To help this project come into being, ASCE provided sponsorship and I served as an advisor. Danielle Cooper and Rebecca Springer were the principal authors, along with research teams from eleven research universities in the United States and Canada.

The Ithaka Study spent considerable time talking to researchers about data: how they find it, what they do with it, where they keep it, and how comfortable they are with sharing it. Before I dig into some of those details, it’s important to note where funding for research comes from in civil engineering.

Funding for civil engineering-related research comes from government agencies in the US and other countries. A large percentage of funding also comes directly from industry or industry groups. In many cases, the industry data, either provided to a researcher or procured by the researcher in the course of the industry-funded work, is considered proprietary and/or confidential.

Expectations of confidentiality were the first hurdle to requiring data availability statements. As such, we allow authors to state that the data is proprietary in the statements; however, these data sharing limitations will slow the discovery of data and replication of works.

Data is often procured from the government, both federal and state.

How Do Researchers Find the Data?

The Ithaka Study confirmed what we had been hearing. Much data discovery is through personal contacts, journal articles, or conference presentations. Repositories, with a few exceptions, are not proving useful, though I will be interested to see how the new Google Dataset Search changes these practices.

Civil engineers interviewed for the Ithaka Study explained that some datasets need to be purchased from local governments or companies. These are additional expenses for which funding is not always available. Accessibility of government data is certainly unstable. Shortly after the Trump Administration was installed, the EPA removed lots of climate data.

State resources can be even tighter as seen in this example from the study:

“Graduate students are sometimes responsible for the labor of collecting difficult-to-access data: an interviewee reported that one of their students has assembled data from hundreds of water utilities by submitting Freedom of Information Act requests.”

Where Is Data Kept?

Perhaps the most concerning part of the study is about how data is being stored.

“Most research groups maintain a rough-and-ready system whereby students and postdocs keep data on their computers and periodically upload it to shared Box, Dropbox, or Google Drive folders. Group leaders make a particular effort to collect data from students before they graduate, but may struggle to navigate or understand that data later on.”

This is obviously not ideal. Other anecdotes included difficulties in transferring large datasets and having to resort to mailing hard drives to each other or providing login credentials to access the servers hosting the data.

Researchers receiving federal grant funding have been mandated to have data management plans for several years. I assumed that this would make instituting a data availability statement pretty easy. That assumption would be wrong.

Several researchers interviewed stressed concerns about the resources required to support data management efforts and the lack of funding available to do so.

The use of data repositories is uneven. The Natural Hazards Engineering Research Infrastructure (NHERI)’s DesignSafe repository is often held up as the “gold standard”; however, it is a curated database with adequate technical support, a luxury not afforded to all repositories.

Feedback we recently received while rolling out our policy for the transportation engineering journals is that the researcher may not be able to guarantee that the data they used will be available in perpetuity. Some of the data is regularly destroyed after a set period of time.

Researchers also shared concerns in the Ithaka Study about the durability of government repositories: “…a repository funded by the National Science Foundation (NSF) [was described] as ‘constantly functioning on a shoestring budget…nothing NSF is forever’.”

Will Civil Engineers Share Their Data?

This was the first question we asked when the editor of the Journal of Construction Engineering and Management started asking about availability statements several years ago. To accommodate the need for proprietary data, we allowed authors to declare that the data is available by request. Realizing that this has the potential to be the ultimate cop-out, we monitored the use of this option and it was, in fact, the most common statement made in the first year of statements being required.

Since launching statements for that journal, we have learned to refine the statements to ensure that authors must tell readers why exactly data is not available. Some of what the Ithaka Study heard explains the high frequency of those statements. Interviewees reported that “trust” was very important in getting others to share data:

“Sometimes we’ll contact the authors [of journal articles] and ask them for the raw data. It’s really rare that they respond, or give us anything, unless I know them personally.”
“Since we’re usually trying to get other people’s data rather than the other way around, we have to first show them, usually with a little subset of the data, that we could do something very interesting with it, and they can be either lead authors or joint authors with us,” one researcher explained.
“Unless it’s been really well documented, how their test set up was constructed and what types of data they’re using, it can be really difficult to interpret the data correctly.”

For some, the work of preparing the data to be shared is a burden they may put off until someone asks them for the data. It was seen as a waste of time to do all that work, just to have the data sit in a repository unused.

This is, at its core, the point of data availability statements and even public data sharing mandates. As research practices adjust to the requirements of making data public, the work process will need to shift so that data is better cared for and documented while the work is being done. That should alleviate some of the slog at the end of the process of making the data useful to others.

ASCE authors and researchers from the Ithaka Study report challenges with contextualizing data. They need to know how it was procured and what was done with it. To address this need for information, ASCE will be publishing Data Papers that can cite the dataset and provide the information users need to confidently re-use data.

Since rolling out requirements for data availability statements, we have learned a lot about what caveats researchers deal with in sharing data — confidentiality, industry considerations, file size, perpetual access, board approval, etc. None of these were insurmountable and we have been able to craft a policy that addresses the concerns or practices for different sub-disciplines.

The feedback from authors has been overwhelmingly supportive of data sharing expectations. The field in general depends on re-using loads of data, and yet the Ithaka Study uncovered lots of obstacles to doing so. Journals could help in removing some of those obstacles:

Require statements and expect authors to be explicit about the availability of the data
Understand that there may be reasons data can’t be shared, but require authors to disclose those reasons (not just “because I don’t want to”)
Provide resources for authors on best practices for citing data — the concern about data papers is that they won’t count toward tenure and promotion. Proper citation practices should alleviate those concerns.

Clearly federal agency funders and institutions should play a role in helping smooth out the bumps in the road to data transparency for civil engineers. Unfunded mandates, for example, don’t appear to be very helpful when it comes to complicated data that needs lots of management. At the same time, it seems clear that building a system that can benefit from government funding when it is available, but at the same time endure resiliently when such funding is interrupted, is a key principle and an important challenge.

Angela Cochran

@acochran12733

Angela Cochran is Vice President of Publishing at the American Society of Clinical Oncology. She is past president of the Society for Scholarly Publishing and of the Council of Science Editors. Views on TSK are her own.

Discussion

6 Thoughts on "Data Transparency and Civil Engineers"

This is an extremely useful piece of work and certainly helps me understand more about the data discovery, use, and sharing practices of the small number of ECR engineering researchers I interviewed in the Harbingers project. I would be interested in knowing of any similar recent project finished or under way in other STM disciplines. Thank you, Angela.

By Anthony Watkinson
Feb 6, 2019, 7:34 AM

Thank you for sharing the civil engineering perspective, Angela. I was surprised to see how much it echoes the clinical data availability experience. ASCO convened a wide range of stakeholders last November to discuss clinical trial data sharing to inform a harmonized data availability policy and statement across leading oncology journals. We are having a follow-up meeting tomorrow for publishers to try to reach consensus on a policy and data availability statement (DAS) building on the ICMJE clinical trial data sharing policy and experience of ICMJE journals.

Your observation, “Because of these differences in practices, nailing down one policy on data availability statement requirements can be tough.” is especially important. I posted a rather emotional comment on the January 30th Scholarly Kitchen Guest Post, Encouraging Data Sharing: A Small Investment for Large Potential Gain, calling for publishers and other data stakeholders to work together on an upstream solution for researchers to register their data and sharing plans instead of creating individual publisher and journal data availability statements (DASs) of varying levels of specificity and duration.

I’m not calling on publishers to halt DAS initiatives, but to convene data stakeholders to explore the development of a single data registry or discipline-specific, but interconnected, data registries in which researchers input the who, what, when, where, why, and how about their data in one place. Publishers, funders, institutions, and other stakeholders would pull the data needed to populate grants, DASs, CVs, etc. instead of requiring authors to fill out different DAS questions for each journal or publisher as we do for COI disclosures. The data registries would connect data to articles and other relevant research outputs to show the impact of a particular data set.

Let’s not repeat for DASs the mess we’ve made of COI disclosures, which btw stakeholders are working to fix. And let’s remember that our authors participate in activities beyond publishing in journals as we think about DASs and policies.

By David Sampson
Feb 6, 2019, 3:38 PM

David, this is an interesting proposal. Whom do you envision as the owner/builder of this database?

By Angela Cochran
Feb 7, 2019, 8:50 AM

Ahhh…the devil is in the details isn’t it? Several of our publishing colleagues have mentioned a role for CrossRef. For biomedical data I will be speaking with NLM. Maybe figshare? I welcome suggestions!

By David Sampson
Feb 8, 2019, 4:01 PM

It sounds to me like the first step would be a set of standards for Data Availability Statements. You’ll certainly need those in place if you’re going to create any sort of repository for the information. So possible starting points could be Force11, NISO, or the STM Standards and Technology Committee. I know the NLM is very focused on open data at the moment, so they may be a good partner as well. As for figshare, I feel like we’re seeing more of a movement toward open, community-owned infrastructure rather than leaving these things to proprietary, privately-owned companies. Maybe drop me an email and I can connect you to some of the groups above.

By David Crotty
Feb 9, 2019, 6:51 AM

Thank you for an excellent post and for highlighting the Ithaka S+R report (in a welcome contrast to the post on SK from a couple of days ago, equating difficulty in data sharing to how much time it took authors and editorial staff to write and check data availability statements). The insights with civil engineers match my own observations in pollution studies.

One of the overlooked benefits to authors of taking the effort to curate and publish their own data is that it keeps the data accessible and organized for themselves in years hence. Authors move between institutions, offices get cleaned, servers dumped, software subscriptions lapse or lose compatibility, analyses for papers are produced by teams, and teams disband when the project/degree/funding ends. Sure, the authors are intimately familiar with their data when the paper is submitted, but 5, 10 and more years out?

For publishers, I’m waiting to hear of one that has the gumption to state and follow through with a policy that says ‘sure authors, you can say data are available upon request, but if we learn that when called upon, you were unable or unwilling to provide the data supporting your article, you should expect a retraction.’

But data curation and publishing do take work. For example, even funder demands for metadata creation and review are daunting, require skill, and take a noticeable chunk of the labor budget to comply with.

One potential unintended consequence of enforcing a strong data policy is increased publication bias. With publishing becoming an increased lift with more steps and signoffs and a backlog of worthy data and results, authors may have to triage what they are going to take through the process and what to leave behind in the files. I’ve a few where an idea didn’t pan out once the data were in, and I really should write up a short communication on the negative results, but there’s only so much time and there are higher priority, “better” stories also waiting. Not admirable behavior and this is how publication bias develops, but something has to give.