Big Data
Big Data (Photo credit: JD Hancock)

I recently attended a fascinating day of discussion and talks in Washington DC at the Wiley Executive Seminar, held at the National Press Club.

One talk that caught my attention was presented by Scott Lachut, entitled Data, Tech and the Rise of Digital Scholarship. Scott is Director of Research and Strategy at PSFK Labs, a brand and innovation strategy consultancy. Scott gave us an elegant overview of new tools available, to help us consider how publishing may better incorporate technology in the context of a connected society. It is sobering to realize how accustomed we have become to being socially connected.

A few fun facts:

500 million tweets are sent every day

100 hours of video are uploaded onto YouTube every day

2.5 billion pieces of content are shared on Facebook every day

There is an increasing connection between content and people. Business Insider reports a significant increase in the market for wearables – gadgets like smartwatches, connected fitness bands, and smart eyewear. The wearables company, FitBit recently raised $43 million in financing. Business Insider speculates that global annual wearable device unit shipments will cross the 100 million mark in 2014, reaching 300 million units five years from now. Gartner reports that while there were 2.5 billion connected devices in 2005, by 2020 there will be over 30 billion.

Wearable Device market Size Forecast

According to The Guardian, 90% of all the data in the world has been generated over the last two years, but less than 1% of this information has been analyzed. The question for academic publishers and societies is one of comprehension. How do we assimilate these data? It is tempting to disregard them as irrelevant – to write off the social world at least as not being not concerned with academia. This is all very well, but open communication is part of the fuel that drives the open access debate. When everyone and everything appears connected, why should academics be shielded behind the castle walls? There has been much written in the Scholarly Kitchen on OA, and I am not going to address any of the issues for, or against, here. However, it is important to note that one of the drivers of OA is the notion that the more content and ideas that are shared, the more likely breakthroughs will materialize; OA is associated with innovation. An interesting venture in this space is API Commons where developers are encouraged to share their APIs (Application Programming Interface) under creative commons licenses.

Another fascinating venture, still in its early stages of development, is Thingful, a search engine for the Public Internet of Things, which provides a geographical index of where things are, who owns them, and how and why they are used. If an academic chooses to make their data available to third parties – either directly as a public resource or channeled through apps and analytical tools, then Thingful organizes ‘things’ around locations and categories and structures ownership around Twitter profiles (which can be either people, or organizations), enabling discussion.

As Scott Lachut indicated in his talk, there is so much data and yet so little analysis. Perhaps the next step is to organize and visualize the data in an effort to discern patterns and meaning. Useful tools that have emerged here are seen in Openrefine and Kinetica. Openrefine (formerly supported by Google and named Googlerefine) is an open source means of working with publicly available, potentially messy data, cleaning it; transforming it from one format into another; extending it with web services; and linking it to databases like Freebase, a community-curated database of people, places, and things.

Kinetica is an app for exploring data on your iPad. Instead of forcing you to use a spreadsheet. Thistool allows you to see, touch, sift, and play with your data in a physical environment.

I would say that one of the key paths forward for publishers and libraries to work together is in the area of data mining, extracting intelligence from the sea of information. Libraries can do this on behalf of their constituents, or users may run the analyses themselves. There are a number of tools emerging in this area, allowing researchers to scrape and analyze data more effectively, providing predictive insights from existing data.

Take a look at Premise. Premise is an economic monitoring platform, enabling users to track global business and economic conditions in real-time across thousands of online and real-world locations. Premise monitors the price, quality, availability, and other metrics of a range of goods and services, from online and on-the-ground sources. With Premise data, a person may observe such things as a change in price of a product in several countries simultaneously, how much a product is being discounted, how often a product is out of stock.

In the cyber security world there is Recorded Future. Their mission is to organize open source information for analysis. There are predictive analysis tools for conducting intelligence research, competing in business, or monitoring the horizon for situational awareness.

With many start-ups and large scale players dipping their toes into social sharing and data analysis tools, where does this leave the researcher and their publishing output? What I think would be wonderful is a way to combine tools that provide predictive data, the value of recommendation engines, and a social ecosystem around journal articles and authors – would this be the best of what we know and what we may want, but don’t yet know it?

Robert Harington

Robert Harington

Robert Harington is Associate Executive Director, Publishing at the American Mathematical Society (AMS). Robert has the overall responsibility for publishing at the AMS, including books, journals and electronic products.

View All Posts by Robert Harington

Discussion

10 Thoughts on "Technology and Digital Scholarship"

Actually 100 hours of video are uploaded onto YouTube every minute…

Thanks for the shout out for our seminar, Robert! Another new service Scott mentioned which I thought had potential in our space is circa (http://cir.ca/), which its CEO describes as being like Cliff Notes for news, ie, rather than attempting to summarize a story they break it into its component parts. A future alternative or addition to the abstract, perhaps?

Circa’s home page is strange. There seems to be no way to get any information, no About, or product or company info, etc. All one can do is sign up to get their news or something. It just says “Enter your email or phone number to have a link to Circa sent directly to your phone.”

I have never seen a home page like this.

Very interesting brief tour of new data tools. Some of this flood will mean that known unknowns can become known–we can “see” more than ever before. Who will be watching my refrigerator, my car, my phone etc. does concern me.

“According to The Guardian, 90% of all the data in the world has been generated over the last two years, but less than 1% of this information has been analyzed.”

This is a very strange thing to say. (Do you have a permalink to the actual article?) “Analyzed” is a weasel word. Does “analyzed” mean “any algorithm has done anything at all besides store the data”? If so, this is almost certainly false. Let’s say 25% of this data is collected by google, twitter, and fb combined. But their algorithms clearly look at this data, if for no other reason than to advertise on it. And there is an infinitude of potential analyses that can be done with even a small amount of data. NASA, which probably collects 1/4 of the rest, has been analyzing the same data for many decades and is still drawing new science from it. Ditto the financial industry — probably another 1/4… etc.

Back when it was hard to gather data, you would formulate a research question first and then gather just the information you needed to answer the question. Nowadays, data is cheap. Say for the sake of argument that any data you need is already out there. So what do you want to know? I think people get overwhelmed by the bigness of big data, but the heart of the matter is still asking interesting questions. And maybe secondarily, finding novel ways of answering those questions.

Interesting philosophical question, Amanda. Let’s say that I do have such a question, with today’s smart search engines that can parse, analyze and even synthesize from large text data bases, it seems that the digital world is like my office desk with stacks of “random access” materials. it would seem, with alternatives to “validating” and the push for citizen scholars (cf the European Commission on Science 2.0) that the idea of separate “academic” journals becomes less important (except with the academic obsession with pub/perish)? Watson’s children, it would seem, could just ignore the rather primitive efforts of parsing information into small chunks called journals? We do this now with our rather limited “biocomputers”.

Good stuff. Can’t wait to check out some of the tools you reference. There is a massive amount of content being published in all forms by all kinds of people and machinery. We’re all trying to be seen and heard at the same time. It is noise until someone or something can make sense of any of it. Back in the day, not that long ago, we had the sense of community because we all got essentially the same information from the local and national news sources. Now we all get mini executive summaries of disparate information. I guess you consume what you like and there are some interesting tools to assist. Metadata and it’s slicing and dicing is a great way to get a grip on what’s happening in all realms and it will be fun to see how it plays out over the years to come. The question of how to monetize academic content is a great one as the thinkers of the world need not only to be heard but need to be encouraged to publish!

Comments are closed.