There is a persistent conceit stemming from the IT arrogance we continue to see around us, but it’s one that most IT professionals are finding real problems with — the notion that storing and distributing digital goods is a trivial, simple matter, adds nothing to their cost, and can be effectively done by amateurs.
This notion of free data emerged recently in a comment thread here, but has been a consistent theme among dewy-eyed idealists about publishing: that digital goods are infinitely reproducible at no marginal cost, and therefore can be priced at the rock-bottom price of “free.” Of course, this argument is implicitly cost-based, while the information economy works more rationally if it’s value-based, so the argument is fundamentally flawed at its outset. But, even if taken at face value, the argument doesn’t align with reality.
Digital goods have costs.
I’m not talking here about just things like the cost of electricity, which should be enough on its own to disabuse idealists of their vacuous notions of what makes the world go around. I analyzed this at length in another post earlier this year. Even beyond just their power requirements, digital goods have particular traits that make them difficult to store effectively, challenging to distribute well, and much more effective when handled by paid professionals.
First, digital goods are not intangible. They occupy physical space, be that on a hard drive, on flash memory, or during transmission. A Kindle weighs an attogram more when fully loaded with digital goods, and there are hundreds of thousands of Kindles in the field. So while digital goods are very small, they require physical management and exist in physical space and time. In fact, their small size makes managing them unusual and in some ways quite difficult.
Because digital products are information products, they need more data than just their inherent data to be managed and used. This is metadata — descriptions of what’s in the tiny packet, where it resides, what forebears it has, what dependencies it has, and how it can be used. Any MP3, HTML page, JPEG, or EPUB file can exist without metadata, but it will be much harder to use and manage. Creating, updating, and tracking the metadata is a chore for owners of digital goods. Poor metadata — like a photo name off your digital camera of DX0023 — can make the photo hard to find or use. Better metadata — usually applied by humans, like “Rose in bloom, August 2006” for that elusive photo — makes more sense.
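To make the photo example concrete, here is a minimal sketch in Python of what a metadata record might look like and why it matters for finding things. The `PhotoRecord` class, its fields, and the `search` helper are all hypothetical names invented for illustration; real photo libraries use richer standards and far more fields.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhotoRecord:
    """A digital photo plus the descriptive metadata that makes it findable."""
    filename: str                 # name assigned by the camera
    title: str = ""               # description written by a human
    taken: str = ""               # year and month, e.g. "2006-08"
    keywords: List[str] = field(default_factory=list)


def search(records, term):
    """Return records whose title or keywords mention the search term."""
    term = term.lower()
    return [
        r for r in records
        if term in r.title.lower() or any(term in k.lower() for k in r.keywords)
    ]


library = [
    # Camera default: the file exists, but there is nothing to search on.
    PhotoRecord("DX0023.jpg"),
    # Same kind of file after a human has described it (hypothetical values).
    PhotoRecord("DX0024.jpg", "Rose in bloom, August 2006", "2006-08",
                ["rose", "garden"]),
]

matches = search(library, "rose")
print([r.filename for r in matches])
```

Only the described photo comes back from the search. The undescribed one is still on disk, but nobody looking for a rose will ever find it, and writing that description for every file is exactly the chore the paragraph above describes.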
Then we get into the pesky problem that digital goods are very small and very easy to create. The proliferation of digital goods (photos, music, Web pages, blog posts, social media shares, tweets, ratings, movies and videos, and so much more) puts incredible and growing pressure on metadata management techniques and layers. This means building more and larger warehouses, which adds to both ongoing costs for current users and migration costs as older warehouses are outstripped by new demands. Megabytes become gigabytes become terabytes become zettabytes and beyond. Where will they all fit?
Building more and larger digital warehouses, along with the information that makes those digital goods accessible, entails major work from highly technical people using sophisticated equipment and engineering. These people are expensive, and work at multiple levels in the digital economy. Let’s not forget the human effort involved in distributing and storing digital goods.
Take for example the Library of Congress’ effort to archive all the Twitter messages since 2006. How easy this should be, right? Twitter’s free! Digital is free! It’s all endlessly reproducible at no marginal cost! Not so fast:
What makes the endeavor challenging, if not the size of the archive, is its composition: billions and billions and billions of tweets. When the donation was announced last year, users were creating about 50 million tweets per day. As of Twitter’s fifth anniversary several months ago, that number has increased to about 140 million tweets per day. The data keeps coming too, and the Library of Congress has access to the Twitter stream via Gnip for both real-time and historical tweet data.
Each tweet is a JSON file, containing an immense amount of metadata in addition to the contents of the tweet itself: date and time, number of followers, account creation date, geodata, and so on. To add another layer of complexity, many tweets contain shortened URLs, and the Library of Congress is in discussions with many of these providers as well as with the Internet Archive and its 301works project to help resolve and map the links.
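A rough sketch in Python shows how the metadata can dwarf the message itself, and what that does at 140 million tweets per day. The record below is hypothetical: the field names and values are made up for illustration and are not Twitter's actual schema, so the sizes it prints are only indicative.

```python
import json

# Hypothetical tweet-like record. Field names and values are illustrative
# only and do not reflect Twitter's real JSON format.
record = {
    "text": "Rose in bloom this morning",
    "created_at": "2011-06-01T09:30:00Z",
    "user": {
        "followers": 120,
        "account_created": "2008-03-15",
    },
    "geo": {"lat": 38.8887, "lon": -77.0047},
    "links": ["http://bit.ly/example"],
}

# Size of the whole serialized record versus the message text alone.
encoded = json.dumps(record).encode("utf-8")
text_bytes = len(record["text"].encode("utf-8"))
total_bytes = len(encoded)
overhead_bytes = total_bytes - text_bytes

# Daily volume figure taken from the passage quoted above.
TWEETS_PER_DAY = 140_000_000
daily_bytes = total_bytes * TWEETS_PER_DAY

print(f"message text: {text_bytes} bytes")
print(f"whole record: {total_bytes} bytes")
print(f"metadata share: {overhead_bytes / total_bytes:.0%}")
print(f"one day at this size: {daily_bytes / 1e9:.1f} GB")
```

Even with this stripped-down record, most of the bytes are metadata rather than message, and a real record carries far more fields. Multiply by the daily volume and then by every day since 2006, and "free" starts to look like a lot of disk.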
Well-stored digital goods need a superordinate data structure to organize them, a structure that becomes more complex and expensive to maintain the more robust, commercial, and valuable the underlying digital assets become. And this structure can change. For instance, bit.ly has been the major URL-shortener for years, but t.co has become a major one thanks to Twitter, and others are in ascendancy. They probably each move data in a slightly different manner. Bit.ly in particular existed before Twitter, so what additional metadata came into being once it modified itself for the Twitter era? How much did you spend to get your XML to comply with the NLM DTD?
Digital goods also need to be backed up. Because they’re small, they are fragile. Because they are digital, they are all-or-nothing (e.g., a scratched analog record can sound mostly fine, but a damaged digital file is most often rendered useless by a short run of corrupted bits). Cold, warm, and hot backups usually exist for the most valuable digital goods, and all of these have costs.
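One common way to catch this kind of damage is to record a checksum when a file is stored and compare it later. The sketch below, in Python, is an illustration under my own assumptions rather than a description of any particular backup system: the `checksum` and `flip_bit` helpers are hypothetical, and the "file" is just a repeated string standing in for real content.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, index: int, bit: int = 0) -> bytes:
    """Return a copy of data with one bit flipped in the byte at index."""
    damaged = bytearray(data)
    damaged[index] ^= 1 << bit
    return bytes(damaged)


# Stand-in for a stored file: 26,000 bytes of repeated text.
original = b"Rose in bloom, August 2006" * 1000
stored_digest = checksum(original)  # recorded when the backup is made

# Simulate storage damage: a single flipped bit somewhere in the middle.
damaged = flip_bit(original, index=12345)

print("intact copy verifies:", checksum(original) == stored_digest)
print("damaged copy verifies:", checksum(damaged) == stored_digest)
```

The intact copy verifies and the damaged one does not, even though only one bit out of 208,000 changed. Note what the check does and does not buy you: it tells you the file is bad, but it cannot repair it. For that you need another copy, which is where the cold, warm, and hot backups and their costs come in.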
Digital goods need to be secure. Do you want your credit card number, government ID numbers, or bank accounts to be easily transportable? Security is a major cost to the ongoing storage of digital goods, from articles to financial information. From firewalls to anti-virus software to cryptographic keys, security around data elements is vital to their integrity and viability. Governments, companies, and individuals spend billions each year protecting their digital goods.
Digital goods have owners. The legal barrier for protection of databases is fairly low — “the author has to make choices about the selection, coordination, or arrangement of the facts or data” — and some databases are very valuable. Your customer database is but one example.
In fact, while PubMed Central is free to access, PubMed itself charges a licensing fee, often into the tens of thousands of dollars. Free abstracts compiled with taxpayer funds, and the government charges for them? Well, OA zealots, I’ll leave you to ponder that one. (Correction: PubMed has been free via license since 2000. Sorry.) And this isn’t the only case. There are many cases of the government hoarding public data, often because it’s so expensive to purvey and keep well. In any event, determining, protecting, and ensuring the legal status of databases and digital goods is difficult and costly. Even Creative Commons, which trades in nothing tangible, spends $2.5 million a year doing what it does, which is basically distributing labels it defines.
Digital goods have to be available all the time. This requires a lot of infrastructure, more than a typical physical warehouse does. For instance, you can turn out the lights on a physical warehouse, lock the doors, and leave it for a week or more. With a digital warehouse available online, you need to staff it around the clock, provide power 24/7, and monitor the warehouse for problems and errors. Digital goods require much more expensive warehouses.
Digital warehouses are more expensive to build. Site planning is a major undertaking. A physical warehouse is something a small business owner can buy and construct with relative ease. It isn’t expensive (a concrete pad, a sheet metal structure, some crude HVAC, and a security system are usually all it takes). A digital warehouse is expensive to construct: servers, site planning, redundant power requirements, high-grade HVAC, earthquake-proofing, and so forth. This means that digital goods have to work off a much higher fixed warehouse cost.
Digital goods also vary in quality, and some of the quality has to do with the infrastructure in which they exist. You can pay for faster downloads at some sites, because bandwidth is an expensive variable in the purveyance of digital goods. Do you pay more for a robust data plan? Do you pay more for a bigger pipe?
The costs associated with data provision and storage can shock IT managers because they reveal all the overheads of digital goods. After all, in standard IT budgets, things like electricity, heat, and rent are handled in other internal budgets. Once data move to the cloud, the IT budget is paying for those things directly via the vendor, and these additional expenses can be sizable — or so says Jonathan Alboum, CIO at the U.S. Department of Agriculture’s Food and Nutrition Service (FNS):
With the cloud, these basic infrastructure charges are baked into the overall cost, so I’m now paying for some things that previously didn’t come out of my IT budget.
So, are digital goods infinitely reproducible? Not practically. Information has an energy cost, even in biological systems. This is why our brains are only so big and so busy. Biology strikes a balance in energy needs and information processing abilities.
In the realm of digital goods, we’re reaching a point at which we’re facing trade-offs. Already, some data sets are propagating at a rate that exceeds Moore’s Law, which may still accurately predict our ability to expand capacity. And these are purposeful data sets. As data becomes an effect of just living — traffic monitoring software, GPS outputs, tweets, reviews, star ratings, emails, blog posts, song recommendations, text messages — we as a collective will easily outstrip Moore’s Law with our data. If there’s no place to put it, and nobody to manage it, does it exist? Quick, find me all your five-year-old emails.
I think the sooner we come to grips with the fact that digital goods are real, expensive in their own way, not intangible, and not infinitely reproducible, and that they require management, warehousing, maintenance, and space, the sooner we’ll be able to have more rational discussions about the future of scholarly publishing, online commerce, data storage initiatives, and multimedia.
Unless we fully realize the costs and obligations of being digital, we’re likely to mistakenly believe it can be free.