At the opening of the Frankfurt Book Fair this year, a pre-meeting session was held called CONTEC. This follow-up to the much beloved, but now defunct, O’Reilly Tools of Change conference brought together an interesting mix of leadership from traditional publishers, new start-ups, more traditional supply chain management, and a mix of speakers from outside the conventional publishing realm. One of those speakers was Viktor Mayer-Schönberger of the Oxford Internet Institute who discussed trends in big data. His talk was based on some of the topics discussed in his best-selling book, Big Data, co-authored with Kenneth Neil Cukier.

Data has been the topic of many a conversation in the past several years, particularly in the scientific community, which is awash in it and has undertaken many efforts to better manage it and better incorporate it into the scientific publication workflow. Data analysis and process management in business have similarly been in use for decades, and the incorporation of sensors in just about everything, including on people (consider wearable monitoring devices) is pushing the boundaries of what can be quantified and analyzed to improve outcomes. Everything from health treatments, to appliance performance and maintenance, to media creation and consumption has been the focus of data analytics and this trend has led to advances in decision making in many industries in recent years.

Simultaneously, many companies have built very lucrative businesses based on trading off the secondary data generated by online activities. One needn’t look any further than the household giants of Facebook, Google, and Twitter as examples of companies that leverage user data as a business model, but there are many, many more. In fact, there’s a very robust industry trading in these data, which is starting to see push back from regulators.

While the vast majority of content in the STM publishing world has been available and distributed digitally for nearly two decades, the application and deep analysis of the data being thrown off by the use of these systems has been modest at best. A lot of effort has gone into assessing and quantifying downloads, via COUNTER, but the granularity and richness of data analysis possible is woefully underappreciated and as yet unrealized. This is one of the potential growth areas for the cadre of altmetrics service providers and one of the drivers of interest in that space.

Perhaps two of the most significant barriers to this utilization are the issues of trust and privacy within the library and scholarly communities. Few issues raise the hackles of librarians more than privacy and the notion that someone (companies, the government, etc.) is abusing that privacy by tracking the reading and information use behavior of the public. In an era when most of us are knowingly or unknowingly tracked financially, physically, and online, the respect accorded to privacy by the library community is refreshing.

But even within the context of a bias toward privacy in our community, it seems that there are missed opportunities to leverage usage data to improve services. Where can cultural institutions and scholarly publishers leverage this information and what are the boundaries? What is the necessary balance between using data to improve services, enhance products, or effectively manage resources and the more creepy sales, marketing, and profiling of users’ activities? There are a variety of perspectives on how best to handle privacy. For example, this week, Tim Berners-Lee told the IPExpo Europe that data should be held by, owned by and managed by users, not companies or advertisers. Others have advocated this, but they are challenged by the financial and adoption success of those who benefit from leveraging user data.

One obvious step over the boundaries came to light this week, when it was discovered that Adobe Digital Editions software was collecting data on e-book usage and sending it back to Adobe without any encryption. Nate Hoffelder of The Digital Reader first reported on the issue, followed quickly by Ars Technica. Adobe has subsequently confirmed the practice to Ars Technica, stating that it would be working on an update. Likely, the update will address the security flaw of transmitting these data in the clear, “allowing anyone who can monitor network traffic (such as the National Security Agency, Internet service providers and cable companies, or others sharing a public Wi-Fi network) to follow along over readers’ shoulders.” But it is unlikely that it will remove the data collection from the service, since for rights management purposes it is a core functionality. However, the anonymity of that user data, the length of time that data is stored, and any limitations on what the data can be used for are very open questions. In this particular situation, the Digital Editions team didn’t live up to the basics of Adobe’s privacy policy related to data security.

This points to a larger question: how do any of us, excluding a small set of techies, really understand what data is being gathered, shared, and stored from our devices? While we all, knowingly or not, consent to these data-gathering activities when we blithely agree to the incomprehensible terms of service and EULAs for most software and content, do we really understand what is being done and for what purpose? We are generally relying on service providers and software engineers to be good actors regarding our data. Often, the problems are more attributable to lax programming practice than to malicious intent.

One of Mayer-Schönberger’s points during his talk was that the Amazon Kindle platform is as much a data ingest tool for providing end-user behavior data to Amazon as it is a sales platform for digital media content, books, software, and audio-visual content. Amazon is notoriously silent about its activities, but it is well known that its use of big data gathering and analytics is profound. In this context, it seems odd that Adobe should get raked over the coals from a privacy perspective about gathering data on users’ reading behavior, while Amazon within its own proprietary service is collecting the same, if not more, data specifically to serve its business ends. I’m not excusing either practice, but on the same day that one service was held up as the shining example of the future of business among publishers, the other was being pilloried for its lack of security and invasion of privacy. It seems that the only significant difference is that one has a direct relationship with the reader, while the other is mediated through the library. Perhaps we should all consider working with trusted third parties, such as libraries, for more of our data management and privacy activities online.

Ultimately, this comes down to trust in the service providers: trust that the data will be kept secure (if kept at all), anonymized if it is used, and limited in its use. While Mayer-Schönberger and Cukier discuss the ethical challenges and potential solutions of big data in their book, it seems that in deference to the significant money flowing into big data analytics, little will be done about solving the core issues. There’s too much to be gained, competitively and financially, by industry in exploiting these data for any serious curbs to be put on the industry, either internally or via regulation. It seems that, counter to this trend, libraries and scholarly publishers are the exception to the rule. Whether our community will remain outliers, and whether this status is a good thing over the long run, remains to be seen.

Todd A Carpenter

Todd Carpenter is Executive Director of the National Information Standards Organization (NISO). He additionally serves in a variety of leadership roles of a variety of organizations, including the ISO Technical Subcommittee on Identification & Description (ISO TC46/SC9), the Linked Content Coalition, and the Foundation of the Baltimore County Public Library.


7 Thoughts on "Trust, Privacy, Big Data, and e-Book Readers"

Adobe was also collecting data on eBooks that had no DRM to verify. That and sending data as clear text was very clumsy. Adobe is repentant, not for collecting more data than needed but for doing it so clumsily that they got caught in the act. Lack of discretion, then, is the sin to avoid committing. There’s no real discussion as to the propriety of collecting whatever data one can.

An editorial nitpick: you say “these data” but also “what data is being gathered.” So, which is it: “data” is a plural or singular noun? Or does no one care anymore?

Or maybe his copyeditor just flew back in from the west coast and has a sick kid. This one is on me.

Language is as language does. “Data” is now both singular and plural. Let’s focus on the kind of data.

We have improved products and services before big data. When has customer service become a fundamental human right? With the pressures of competition and profit in an economically driven world, are we really to trust those that collect those data? Did we trust the banks to create mortgage products that “served” the customer? The opportunity to abuse data is extremely high and those collecting those data would certainly need oversight. I would hope that data collection becomes an opt-in system where permission needs to be granted by the individual, just like in medical and psychological research. And we need a real choice, not just the blanket terms and conditions given when we purchase a product or service: the take-it-or-leave-it contract. Do corporations have a right to my private behavior, no matter how insignificant?

As a side note, what kind of data are being collected by requiring a Facebook login to post here and who is collecting it?

No Facebook login is required to comment here. The site is hosted on WordPress and a variety of means of commenting are provided.

However, WordPress does collect a lot of interesting data on usage, but nothing personal that I know of. A real problem with the data issues is that “data” is an incredibly vague concept. I did staff work for the US Interagency Working Group on Digital Data (IWGDD) and they ultimately punted data policy for this reason. Data sharing efforts also suffer from the difficulty of saying just what the data is in any given case, much less having rules for different kinds of data. There are established practices in certain different specific cases but that is about it.

We need a good policy taxonomy of data types and some standards for handling each. Perhaps NISO can look at this. Or are they already?
