Data.gov: Selling the Government and Democratization of Information

Last Friday marked the one-year anniversary of the Obama Administration’s Open Government Initiative (OGI). The occasion was honored with a cupcake and candle on the landing page of the newly re-designed Data.gov site and a widely disseminated announcement from the White House.

For global publishers who have generated a significant portion of revenue building and selling databases, a requirement to make their data freely available is a mixed blessing. Despite the fact that global access and use of the data are expected to rise exponentially, balance sheets will take a hit.

Databases are not just part of a publisher’s portfolio. Done right, they can be the most profitable part, and they have sometimes carried the less profitable and declining parts of the publishing lineup, namely books. Presses affected by this change must quickly seek new ways to recapture publishing expense and reinvent the services they provide.

Conversely, if a business has retooled to conceive of and build data services, it’s a golden egg. For publishers in adjacent spaces — CQ Press, Bloomberg, LexisNexis, Thomson Reuters, National Journal, CQ-Roll Call, the Washington Post — access to troves of free, authoritative, updated data presents a significant opportunity to create new revenue streams by developing bespoke products and services that monetize free content.

What’s It All About?

For readers unfamiliar with the OGI, an excellent summary of the initiative, and of the role the Office of Scientific and Technical Information (OSTI) has played in it, can be found on the OSTIblog in an article written by Walt Warnick, Director of OSTI, and Peter Lincoln, co-author of the Department of Energy (DOE) Open Government Plan:

On January 21, 2009, his first full day in office, President Barack Obama signed the Memorandum on Transparency and Open Government. The memo was addressed to the heads of all Cabinet departments and agencies, and in it, the President called for “an unprecedented level of openness in Government” and instructed the Director of the Office of Management and Budget (OMB) to prepare a directive that would serve “to ensure the public trust and establish a system of transparency, public participation and collaboration” throughout the Federal Government.

On December 8, 2009, OMB Director Peter Orszag issued the Administration’s Open Government Directive, which required agencies to take a number of steps to advance the principles of transparency, participation and collaboration, including preparation and publication of an Open Government Plan by April 7, 2010.

The Department of Energy was one of 29 agencies that has posted its Open Government Plan online, and OSTI’s contributions appeared throughout the 30-page DOE document.

Data.gov includes more than 250,000 datasets, up from 47 made available at launch. The impact of the OGI is not confined to the United States. At present, six nations outside the U.S. are also developing open repositories of government data.

And these datasets are being accessed for all sorts of things, according to the White House:

To date, the site has received 97.6 million hits, and following the Obama Administration’s lead, governments and institutions of all sizes are unlocking the value of data for their constituents. . . . From these datasets, citizens have developed hundreds of applications that help parents keep their children safe, let travelers find the fastest route to their destinations, and inform home buyers about the safety of their new neighborhood.

In the area of semantic Web innovation, a proposal is also in the works with Rensselaer Polytechnic Institute to provide a “new encoding of datasets converted from CSV (and other formats) to RDF.”
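
To make the idea of re-encoding CSV as RDF concrete, here is a minimal Python sketch that turns each CSV cell into one N-Triples statement. It is not the Rensselaer proposal's actual method: the `http://example.org/dataset/` namespace, the column names, and the sample rows are all hypothetical, and every value is emitted as a plain string literal with no datatypes.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace, chosen only for illustration.
BASE = "http://example.org/dataset/"


def escape_literal(value: str) -> str:
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str, id_column: str) -> list[str]:
    """Emit one triple per non-empty cell: <row> <column> "value" ."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode the key so it is safe inside an IRI.
        subject = f"<{BASE}row/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the key column and empty cells
            predicate = f"<{BASE}column/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up sample data; the second row deliberately has no name.
SAMPLE = "id,name,pupils\n1,Hill School,240\n2,,310\n"

for line in csv_to_ntriples(SAMPLE, "id"):
    print(line)
```

Note that the empty cell simply produces no triple, so a gap in the CSV becomes a missing statement in the RDF rather than an explicit blank.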

The message from the Obama Administration is that the OGI signals a sea change for government information that will:

Spawn a global movement to democratize access
Enable global linking of data
Foster innovation and transparency via the creation of “community developed” applications

Who is Paying Attention?

This sweeping initiative presents an enticing opportunity for the technology community. The Gov 2.0 Expo crowd is already descending on Washington for a meeting this week. Referred to as “THE IT event for 21st Century Government” by UBM TechWeb and O’Reilly Conferences (the organizers), the Gov 2.0 Expo will include keynotes from Sir Tim Berners-Lee, Danah Boyd, Dave Girouard, Tim O’Reilly, and others. The organizers frame the meeting’s premise, question, and objective as follows:

The rise of Government 2.0 signals the emergence of IT innovation and the Web as a platform for fostering efficiencies within government and citizen participation. How can we harness these innovations to decrease waste and increase productivity? Gov 2.0 Expo brings stakeholders together to explore transformative technologies and discover new solutions.

Sunlight Labs, an extension of the Sunlight Foundation and member of the data.gov community (featured previously on the Scholarly Kitchen), will announce the winners of its “Design for America” contest during the event. Sunlight developed the contest with the purpose of inspiring the design community to create and share applications using the data.gov resources.

Nancy Scola, a NY-based writer with the Personal Democracy Forum, has followed the site since its launch last February. In “The New Data.gov Sells the Idea of Gov Data,” Scola notes some interesting differences in the way that the project is being presented today compared to 2009.

[L]ast February, Data.gov had data itself front in center. . . . The new version of the flagship site of the Obama Administration’s open government push seems to have an increased interested in selling the very concept of open government data. [T]he Obama White House and CIO Vivek Kundra have a lot riding on Data.gov. There seems to be a renewed acknowledgment in the new site that the vast majority of us have a very tough time wrapping our minds around the import of raw data sets.

Will It Work?

The quasi-evangelical enthusiasm coming from fans of the program tends to steer the conversation toward future opportunities and away from present-day challenges. Stripped of the rhetoric, what data.gov and its international counterparts deliver is profoundly complex sets of expert research data via API.

APIs and data are only part of a larger equation.

In a discussion on TechCrunch, Vivek Wadhwa makes a convincing case for OGI’s necessity, based on deficits in the government’s own technology infrastructure:

While grandma flips through photo albums on her sleek iPad, government agencies (and most corporations) process mission-critical transactions on cumbersome web-based front ends that function by tricking mainframes into thinking that they are connected to CRT terminals. These systems are written in computer languages like Assembler and COBOL, and cost a fortune to maintain. . . . [OGI] provides entrepreneurs with the data and with the APIs they need to solve problems themselves. They don’t need to wait for the government to modernize its legacy systems; they can simply build their own apps.

A post on NextGov says “the Obama administration still has its work cut out for it” and goes on to discuss potential weaknesses and areas for improvement, noting that the academic research sector can help:

[T]he information portal now needs to focus on data context and integrity to achieve true transparency …. Data.gov must do a better job of disclosing the methodology agencies and the White House use to collect and process the underlying information. . . . Academic research has well-established protocols and expectations for how data should be revealed in order to permit others to replicate reported results.

Even professionals face challenges.

In a first installment in the Guardian, “Making things with data.gov.uk – Part 1,” a staff developer presents a play-by-play of what it takes to create an application from data in the beta release of the UK Government Data platform, data.gov.uk. (The post also includes a useful summary of the UK counterpart to the United States’ Open Government Initiative and Government 2.0. The UK effort is led by Sir Tim Berners-Lee and is described in his TED2009 talk focusing on the “next Web”.)

The obstacles that the developer, Thorpe, describes indicate a steep learning curve, and a need for support, in building even a basic app:

One of the challenges of making official government data driven apps is that only a small percentage of the people already making things in this space are fluent in SPARQL, the query language used to retrieve data from RDF stores.

SPARQL 1.0 has no support for aggregate queries such as COUNT but fortunately SPARQL 1.1 which many of the data.gov.uk stores support does.

Not all records are created equal . . . not all of the schools have triples corresponding to these objects. For example at the time of writing 3 schools didn’t have a name.

And, in response to Thorpe’s post:

[I]t’s interesting that the Guardian are offering an introduction to SPARQL before anyone has published a dedicated handbook on the subject. I know the data.gov site has had a bit of a bashing in newspaper comments, but that’s at least partly due to the lack of a guide like this one.
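
Thorpe’s points about aggregate queries and incomplete records can be illustrated with a small Python sketch. The query string shows the shape of a SPARQL 1.1 `COUNT`, and the functions emulate the same count, plus a check for schools with no name, over a toy in-memory list of triples. The `ex:` prefix, the identifiers, and the sample data are hypothetical and are not taken from data.gov.uk. No real SPARQL endpoint is contacted.

```python
# A SPARQL 1.1 aggregate query, shown as text only. The ex: prefix and the
# ex:School class are hypothetical names used for illustration.
COUNT_QUERY = """
PREFIX ex: <http://example.org/schema/>
SELECT (COUNT(?school) AS ?total)
WHERE { ?school a ex:School . }
"""

RDF_TYPE = "rdf:type"

# Toy in-memory triple store: (subject, predicate, object).
# school:2 deliberately has no ex:name triple.
TRIPLES = [
    ("school:1", RDF_TYPE, "ex:School"),
    ("school:1", "ex:name", "Hill School"),
    ("school:2", RDF_TYPE, "ex:School"),
    ("school:3", RDF_TYPE, "ex:School"),
    ("school:3", "ex:name", "Riverside Primary"),
]


def subjects_of_type(triples, rdf_class):
    """Return the set of subjects declared to be of the given class."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == rdf_class}


def count_of_type(triples, rdf_class):
    """Client-side stand-in for the COUNT above (one type triple per subject)."""
    return len(subjects_of_type(triples, rdf_class))


def missing_predicate(triples, rdf_class, predicate):
    """List subjects of a class that have no triple with the given predicate."""
    have = {s for s, p, _ in triples if p == predicate}
    return sorted(subjects_of_type(triples, rdf_class) - have)


print(count_of_type(TRIPLES, "ex:School"))
print(missing_predicate(TRIPLES, "ex:School", "ex:name"))
```

Under SPARQL 1.0, which has no aggregates, a client would have to fetch every matching row and count locally, much as `count_of_type` does here. With 1.1, the store can return the single number.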

The End Game?

Professionals will find or create the means to build utilities from these emerging global repositories of government data that will:

Enable comparisons of data that has historically been unavailable, siloed, and non-standardized
Deliver tools that surface previously hidden relationships between data points and suggest relational meanings
Help users develop new hypotheses and research entry points

Whether this translates into empowerment of the general public, or merely adds to the charts and graphs that researchers and the media use in presentations and articles, which pass by the general citizenry, is an open question.

The Obama Administration has presented its lofty vision. However, the locus of control for productizing the data currently lies outside government. As the data.gov website states, innovation will be driven by the “community.” This means that significant responsibility rests with technology professionals and businesses who are equipped to deliver tools, applications, visualizations, and services from the data.

As we have seen recently in the Web 2.0 space, businesses that begin with noncommercial “do no harm” doctrines may ultimately be won over by forces pulling them in other directions.

Pending questions:

Will the technology community remain fiercely committed to using open data to serve the public good?
Will commercial interests predominate?
Will the level of commitment and interest in the objectives of a global data program continue without institutional incentives?
Does the Administration have its own plans for making this type of information digestible for the general public?

An articulated strategy for harnessing resources to continue the process will be a primary determinant of outcome. Otherwise, it is up to independent business and nonprofit interests to embrace and expand upon the mission.

Alix Vance

Discussion

5 Thoughts on "Data.gov: Selling the Government and Democratization of Information"

This is indeed a big deal Alix. But 250,000+ datasets is an unmanageable (opaque) number so the first service scholarly folk should think about is simply how to help your users find what they can use. By the same token, while OGI does not deprive anyone of their proprietary, money making data, your customers are bound to want to be sure they can’t get something almost as good as your product for free. You had best help them determine that it is not so.

In the short run data.gov is a prescription for vast confusion. Since my field is technological confusion I love it. There is big money in confusion, especially when it is driven by a new need to know. In a way that is what scholarly publishing is all about, alleviating technical confusion.

As for mashing up diverse datasets to create even more data, that is certainly an opportunity but it will be a long, slow process. The number of pair-wise combinations of 250,000 datasets is astronomical. Early returns suggest that mapping numerical data to geography is the low hanging fruit.

Enjoy the evangelists but do not believe that this is going to be simple, easy, quick or cheap. I am sure there is a herd of ponies in there somewhere.

Also, this huge aggregation of federal spending is bound to attract the OMB budget cutters, saying surely this number can be made smaller. So if you plan to build a product out of some of these data streams, choose carefully.

Enjoy the confusion.

By David Wojick
May 25, 2010, 11:15 AM

As an honorific aside, one of DOE’s three OGI flagship initiatives, http://www.scienceeducation.gov (SE.gov), includes a project of mine. My SBIR teacher team is developing the semantic search algorithm that estimates the grade level of web content based on the language used. College level instruction is of no use to 4th graders, and vice versa. The prototype, while not complete, is fully operational on SE.gov now. See http://www.stemed.info for more information on this project.

By David Wojick
May 25, 2010, 11:30 AM

No one else seems interested in this topic but Alix’s article got me to go play with Data.gov, where I found a huge goofiness! My search results showed a severe duplication problem, as follows.

My 2 search terms were science education. The engine said I had around 40,000 hits and gave them to me in pages of 10. (Searching on the term “science education” yielded just one hit, and that was a mistake, as the two words were just part of a list, separated by commas. Apparently we have no data on science education per se.)

The first page of hits looked okay, mostly data about education, especially grants and test scores, that includes science stuff. OSTI’s Science Accelerator, another project of mine, was number 3. So far so good. But the second page of 10 included 5 hits that duplicated the first page. This is not good.

Each successive page of 10 hits, up to and including page 5, included 4 to 6 duplicates, some occurring up to 4 times. At page 5 the “next” button disappeared and I could go no further. In all I got just 28 unique hits. Moreover, the names are so similar in many cases that one needs to write down the numbers just to see what is new.

Maybe it is just me, or my topic, but this is awful. Has no one noticed? I will send them a trouble report.

By David Wojick
May 27, 2010, 4:08 PM

I have also noticed that we don’t seem to have a great number of other government data junkies in the audience, David. 250,000+ datasets is not only an unmanageable number but also a possibly intimidating topic.

It does not surprise me whatsoever that the delivery of the datasets is muddled. I don’t think that any attention has been paid to services. I agree with your assertion that the next responsibility is to “help your users find what they can use”. What has troubled me is the possible interpretation by some that the publishers’ responsibility ends when they drop the data at the doorstep, leaving it to “the community” to make something of it. While third parties may be able to deliver interesting technology tools, the publishers/researchers have necessary and unique subject area expertise. Without the context that these folks can provide, all the technology in the world will not generate a useful outcome.

By Alix Vance
May 27, 2010, 4:42 PM

The Scholarly Kitchen
