I recently read a paper from Los Alamos National Labs (LANL), “Using Architectures for Semantic Interoperability to Create Journal Clubs for Emergency Response.” Without diving too deeply into the technical weeds, what the paper describes is:
[A] process for leveraging emerging semantic web and digital library architectures and standards to (1) create a focused collection of bibliographic metadata, (2) extract semantic information, (3) convert it to the Resource Description Framework/Extensible Markup Language (RDF/XML), and (4) integrate it so that scientific and technical responders can share and explore critical information in the collections.
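Translated into code, steps (2) and (3) amount to pulling fields out of a bibliographic record and emitting them as triples serialized in RDF/XML. Here is a minimal, dependency-free Python sketch; the record, URIs, and Dublin Core property choices are illustrative assumptions, not the LANL implementation.

```python
# Minimal sketch of steps (2)-(3): map one bibliographic record to
# subject-predicate-object triples, then serialize them as a simple
# RDF/XML document. Record, URIs, and properties are invented examples.
from xml.sax.saxutils import escape

record = {
    "uri": "http://example.org/articles/sars-survey",
    "title": "A Survey of SARS Coronavirus Research",
    "creator": "Jane Doe",
    "subject": "SARS",
}

# Step 2: extract semantic information as triples
triples = [
    (record["uri"], "dc:title", record["title"]),
    (record["uri"], "dc:creator", record["creator"]),
    (record["uri"], "dc:subject", record["subject"]),
]

# Step 3: convert the triples to RDF/XML
def to_rdf_xml(triples):
    lines = ['<?xml version="1.0"?>',
             '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
             '         xmlns:dc="http://purl.org/dc/elements/1.1/">']
    subjects = {}
    for s, p, o in triples:  # group triples by subject
        subjects.setdefault(s, []).append((p, o))
    for s, props in subjects.items():
        lines.append('  <rdf:Description rdf:about="%s">' % escape(s))
        for p, o in props:
            lines.append('    <%s>%s</%s>' % (p, escape(o), p))
        lines.append('  </rdf:Description>')
    lines.append('</rdf:RDF>')
    return "\n".join(lines)

rdf_xml = to_rdf_xml(triples)  # the normalized form step (4) integrates
print(rdf_xml)
```

A production pipeline would of course lean on an RDF library rather than hand-building XML, but the shape of the transformation is the same: records in, triples out.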
Why recommend creating a semantic research repository in RDF/XML? Let’s step back and take a look at this interesting use case:
Problem (specific to the LANL paper): Prevent bioterrorist incidents and the spread of viruses that have the potential to develop into a catastrophic pandemic.
Desired outcome: Assemble an appropriate group of experts who quickly receive access to customized research and tools, which enable them to collaboratively head off large-scale crises.
To respond proactively to an emerging emergency, in this case the potential threat of a SARS pandemic, the authors outline a process that leverages both Web 2.0 (social networking) and Web 3.0 (semantic) capabilities and uses RDF/XML to normalize data without confining its meaning or foreclosing future expansion. The process mobilizes expert research groups in alignment with situational specifics and provides them with customized research information, plus visualization and analytical tools, enabling them to quickly and collaboratively generate solutions and curtail the impact of the biological threat.
Extrapolating from the article—and expanding a bit on what the authors have proposed—a generalized process outline would look something like this:
Condition: Urgent need for expert research response
Content: Using a semantic repository of technical articles (developed via the harvesting, augmenting, and mapping processes, which the article describes)
Activity 1: Dynamically assemble specialist expert “journal clubs” or researcher networks based on biographical metadata — such as expertise, affiliation, publication history, relationships, geography — to quickly form a collaborative emergency response team
Activity 2: Facilitate the equally dynamic creation of custom knowledge collections, driven by semantic search that is supported by enriched metadata contained in normalized RDF/XML
Activity 3: Provide visualization tools and other analytical capabilities to support collaborative problem-solving by the expert group
Close the loop: Capture process and outcome information, scenarios considered, and implementation recommendations, and provide routes for republication and sharing — with or without further peer review
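Activity 1, for instance, reduces to scoring researcher profiles against the situational metadata and selecting the best matches. A toy sketch of that matching step, with invented profiles, fields, and weights:

```python
# Hypothetical sketch of Activity 1: score researcher profiles against the
# situational specifics of an emergency and assemble a small response team.
# The profiles, metadata fields, and scoring weights are all invented.
situation = {"topics": {"coronavirus", "epidemiology", "aerosols"},
             "region": "Asia-Pacific"}

researchers = [
    {"name": "A. Chen", "expertise": {"coronavirus", "virology"}, "region": "Asia-Pacific"},
    {"name": "B. Ortiz", "expertise": {"epidemiology", "modeling"}, "region": "Americas"},
    {"name": "C. Patel", "expertise": {"materials science"}, "region": "Europe"},
]

def relevance(researcher):
    """Overlap of expertise with the situation, plus a bonus for proximity."""
    topic_score = len(researcher["expertise"] & situation["topics"])
    region_bonus = 1 if researcher["region"] == situation["region"] else 0
    return topic_score * 2 + region_bonus

# Rank all candidates and take the top two as the "journal club"
team = sorted(researchers, key=relevance, reverse=True)[:2]
print([r["name"] for r in team])  # -> ['A. Chen', 'B. Ortiz']
```

A real system would draw these profiles from the enriched bibliographic metadata described above (publication history, co-authorship relationships, and so on) rather than hand-entered fields, but the ranking logic is the same in spirit.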
The LANL author team is not alone in exploring this terrain.
Collexis is another highly visible proponent of semantic technologies in the scholarly research industry. Their BioMedExperts, for example, accomplishes a number of the functions proposed by the LANL group:
BioMedExperts contains the research profiles of more than 1.8 million life science researchers, representing over 24 million connections from over 3,500 institutions in more than 190 countries. . . . profiles were generated from author and co-author information from 18 million publications published in over 20,000 journals.
BioMedExperts includes visualization and linear/hierarchical tools for browsing and refining result sets, authority metrics, and networking tools to facilitate conversations among geographically dispersed researchers. For the curious, free access is available on the site. Collexis has also recently announced a project with Elsevier SCOPUS, University of North Carolina, and North Carolina State to create a statewide expert network. From the press release:
Once implemented, it will be the largest statewide research community of its kind. The web community created will have fully populated information on publications, grant data, and citations from over 15,000 researchers across all research disciplines.
There is active debate on the Web about the potential for Web 3.0 technologies and the standards that will be adopted to support them. Writing for O’Reilly Community, Kurt Cagle has remarked:
My central problem with RDF is that it is a brilliant technology that tried to solve too big a problem too early on by establishing itself as a way of building “dynamic” ontologies. Most ontologies are ultimately dynamic, changing and shifting as the requirements for their use change, but at the same time such ontologies change relatively slowly over time.
This means that the benefit of specifying a complex RDF Schema on an ontology — which can be a major exercise in hair pulling — is typically only advantageous in the very long term for most ontologies, and that in general the flexibility offered by RDF in that regard is much like trying to build a skyscraper out of silly putty.
As of January 2009, when Cagle wrote this, RDF had failed to garner widespread support from the Web community — but it has gained significant traction during the past year, including incorporation in the Drupal 7 Core.
Last year, Gigaom spotlighted Tim Berners-Lee speaking about the need for linked data standards at TED2009:
Berners-Lee wants raw data to come online so that it can be related to each other and applied together for multidisciplinary purposes, like combining genomics data and protein data to try to cure Alzheimer’s. He urged “raw data now,” and an end to “hugging your data” — i.e. keeping it private — until you can make a beautiful web site for it.
Berners-Lee said his dream is already on its way to becoming a reality, but that it will require a format for tagging data and understanding relationships between different pieces of it in order for a search to turn up something meaningful. Some current efforts are dbpedia, a project aimed at extracting structured information from Wikipedia, and OpenStreetMap, an editable map of the world.
The promise within this alphabet soup of technologies is that semantic Web standards will support the development of utilities that:
- Provide access to large repositories of information that would otherwise be unwieldy to search quickly
- Surface relationships within complex data sets that would otherwise be obscured
- Are highly transferable
- Deliver democratized access to research information
But there are risks. Building sites that depend on semantic technologies and RDF/XML can take longer and cost more up front. In a stalled economy, long-term financial vision is harder to come by, but organizations that have it may truly leapfrog. In addition, there are concerns about accuracy, authority, and security within these systems, concerns the architects must address in order for them to reach the mainstream.
In our industry, which depends on research authority, one may wonder whether this is an all-or-nothing proposition. Without speed and consistent delivery of reliable results, projects such as these may fail to meet user expectations and be dead in the water. On the flip side, if RDF/XML and its successors can accomplish what they purport to, they will drive significant advances in research by providing the capacity to dynamically derive rich meaning from relationships as well as content.
Thanks to David Wojick for sharing the LANL paper that contributed to this post.
7 Thoughts on "Can New XML Technologies and the Semantic Web Deliver on Their Promises?"
Basically this LANL Library technology takes conventional bibliographic data (which many of us have) and converts some of it into RDF triples. The question then becomes what are these triples good for? This is the question everyone is struggling with.
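As a small illustration of what the triples can be good for: once bibliographic data is in subject-predicate-object form, questions become pattern matches and joins over one uniform structure. A toy example with invented data and predicates (a real system would run SPARQL against a triple store):

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# The data and predicate names are invented for illustration only.
triples = {
    ("paper:1", "dc:subject", "SARS"),
    ("paper:1", "dc:creator", "author:chen"),
    ("paper:2", "dc:subject", "SARS"),
    ("paper:2", "dc:creator", "author:ortiz"),
    ("paper:3", "dc:subject", "dams"),
    ("author:chen", "foaf:based_near", "Beijing"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Join two patterns: who has written about SARS?
sars_papers = [s for s, _, _ in match(p="dc:subject", o="SARS")]
authors = {o for paper in sars_papers
           for _, _, o in match(s=paper, p="dc:creator")}
print(sorted(authors))  # -> ['author:chen', 'author:ortiz']
```

The point is that adding a new kind of fact (an affiliation, a location, a citation) requires no schema change at all; it is just more triples, and existing queries keep working.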
How I found this article may be more immediately useful. It was in our database of full text DOE research reports (http://www.osti.gov/bridge/). But I never heard of a “journal club” and do not follow emergency response technologies. However, we just implemented an old timey technology called “more-like-this” term vectors, or simply MLT. Lucene has vector MLT built in.
MLT is mathematically powerful stuff because it uses an entire document as a search term. I found a report I liked then did an MLT search and up popped the LANL report. MLT is fun because you begin to see hidden patterns that do not depend on citation, co-authorship, or other social structures. It reveals communities whose members may not even know about each other.
To me this is better than waiting for ontologies that may never work.
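For readers unfamiliar with MLT, the idea is easy to sketch: build a TF-IDF vector for the entire source document and rank the rest of the corpus by cosine similarity against it. A dependency-free toy version follows; the corpus is invented, and Lucene's MoreLikeThis does the same thing far more efficiently, with term pruning.

```python
# Toy "more-like-this": the whole source document becomes the query.
# Each document is reduced to a TF-IDF vector, and candidates are
# ranked by cosine similarity. The corpus below is invented.
import math
from collections import Counter

corpus = {
    "dam-report": "dam break failure investigation flood damage",
    "sars-paper": "sars outbreak response virus emergency research",
    "flu-paper": "influenza outbreak virus vaccination response",
}

def tfidf(text, docs):
    """Term frequency weighted by log inverse document frequency."""
    tf = Counter(text.split())
    n = len(docs)
    return {t: f * math.log(n / sum(t in d.split() for d in docs.values()))
            for t, f in tf.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def more_like_this(doc_id):
    """Rank every other document by similarity to the given one."""
    query = tfidf(corpus[doc_id], corpus)
    return sorted((d for d in corpus if d != doc_id),
                  key=lambda d: cosine(query, tfidf(corpus[d], corpus)),
                  reverse=True)

print(more_like_this("sars-paper"))  # -> ['flu-paper', 'dam-report']
```

Note that no citation, co-authorship, or subject-heading data is consulted at all, which is exactly why MLT surfaces communities whose members may not know about each other.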
What a fascinating application of semantic web capabilities. I’m speaking on this topic (broadly) at the SSP meeting in June and will use this as an example of the potential value of semantic analysis of RDF article repositories.
One of the other projects I plan to talk about is “Breathing Space” (http://respiratory.publishingtechnology.com), which is using RDF / semantic capabilities to mine pooled data from two respiratory / thoracic societies, in the hope of (for example) revealing hitherto unnoticed connections between content that might suggest new research paths. The project is just launching after addressing a number of issues that you describe – such as the accuracy of the semantic data extraction, tagging, etc. The idea is to see what the research community makes of it – does it add value to their research processes, is it in line with their expectations, is it accurate enough, etc. – to see, as you say, whether it’s an all-or-nothing proposition, and whether semantic capabilities can drive real advances.
All that is a long-winded way of getting to the point that I’d love to hear about other examples of societies (in particular) and scholarly publishers (more generally) using semantic technologies to enhance publications and data, and create new value for members / readers.
Many thanks to Alix for this article about semantic technology.
In my opinion, getting experience with many applications utilizing semantic technology, whether you are using an ontology, taxonomy, or thesaurus, is very important, as we will see which applications gain traction versus those that fail. Sometimes within those failures there will be little wins that can be combined with other successful applications.
As we learned from our experience at Johns Hopkins, the Expert Profile application was born from our Knowledge Dashboard application. Listening to our customers is essential for us to modify our approach and applications to provide them with tools that genuinely solve a problem, improve productivity, or achieve a new objective.
When I joined Collexis back in April 2007, the STM industry was not aware of, and certainly not focused on, semantic technology. As I spoke with different industry associations about presenting at their conferences back then, it was difficult to get a speaking slot; over the next two years, however, the industry began to recognize the importance of semantic technology, and it is now featured at most conferences. Now that we are embracing semantic technology, the time for action is here to gain that much-needed experience. Debating this topic is fine, but please let’s make sure that we have two tracks to achieve our objective of providing researchers not only with the best research but also with the tools that allow them to achieve true knowledge discovery in a more efficient and effective manner.
Track A – provide the market with a number of semantic applications and let them tell us with their budget dollars if we are on track or not.
Track B – The debate! Yes, let’s continue the discussion while we experiment. Ultimately, tracks A and B will work together to ensure that we are providing the market with the most cost-effective tools.
The state of North Carolina will have the first statewide expert network for its research community. We should expect to see many benefits from this network as administrators and researchers create and find new uses for it. The lessons learned will benefit the entire world research community.
So let’s get started!
In the case of a crisis, we often look to human experts for their knowledge and guidance on how to deal with the situation.
Is the prospect of automatically generating “journal clubs” really going to fill a gap in how we approach disasters?
Put another way, in the case of crises, is access to relevant information really the problem? Or is the problem access to qualified experts and individuals willing to take responsibility and accountability?
In going over a mental laundry list of human and natural disasters over the last decade, I’m coming up with a lot of the latter and none of the former.
I don’t think this is an either-or situation. Experts need information in order to be responsible and accountable (and knowledgeable). At NEJM, we created a center for the H1N1 outbreak. It was both useful and popular. Experts used it, as did non-experts. Did it help? Probably. It helped experts by making information acquisition more efficient and by making non-experts more aware, which makes it easier to implement things like vaccination programs, handwashing protocols, etc.
Crises are often driven by things we haven’t seen before — volcanoes in Iceland, new diseases, oil spills. Information helps everyone deal with these things. If we can find ways to pull it together more quickly, it should help everyone.
I find it hard to imagine a disaster that does not create an urgent need for information. The Gulf oil leak is a case in point, where no one can figure out how to plug it, plus there is what to do to minimize damage, prevent future occurrences, etc. Then too finding information often leads to finding experts. It is basically the same search problem.
Note also that the LANL folks emphasize “slow burn” crises that take months or years to unfold.
I myself have done a number of post-disaster investigations, specifically dam breaks, and they too are research intensive.