
There is a persistent conceit, one that most IT professionals have found real problems with in practice: the notion that storing and distributing digital goods is a trivial, simple matter, adds nothing to their cost, and can be done effectively by amateurs.

In fact, a study done last year found that initiatives to move to cloud-based computing stalled most often because of higher-than-expected costs.

This notion of free data emerged recently in a comment thread here, but has been a consistent theme among dewy-eyed idealists about publishing — that digital goods are infinitely reproducible at no marginal cost, and therefore can be priced at the rock-bottom price of “free.” Of course, this argument is implicitly cost-based, while the information economy works more rationally if it’s value-based, so the argument is fundamentally flawed at its outset. But, even if taken at face value, the argument doesn’t align with reality.

Digital goods have costs.

I’m not talking here about just things like the cost of electricity, which should be enough on its own to disabuse idealists of their vacuous notions of what makes the world go around. I analyzed this at length in another post earlier this year. Even beyond just their power requirements, digital goods have particular traits that make them difficult to store effectively, challenging to distribute well, and much more effective when handled by paid professionals.

First, digital goods are not intangible. They occupy physical space, be that on a hard drive, on flash memory, or during transmission. A full Kindle weighs an attogram more when fully loaded with digital goods, and there are hundreds of thousands of Kindles in the field. So while digital goods are very small, they require physical management and exist in physical space and time. In fact, their small size makes managing them somewhat unique and in some ways quite difficult.

Because digital products are information products, they need more data than just their inherent data to be managed and used. This is metadata — descriptions of what’s in the tiny packet, where it resides, what forebears it has, what dependencies it has, and how it can be used. Any MP3, HTML page, JPEG, or EPUB file can exist without metadata, but it will be much harder to use and manage. Creating, updating, and tracking the metadata is a chore for owners of digital goods. Poor metadata — like the camera-assigned file name DX0023 — can make a photo hard to find or use. Better metadata — usually applied by humans, like “Rose in bloom, August 2006” for that elusive photo — makes the photo far easier to find.
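As a rough sketch of the point (the field names here are hypothetical, not any particular metadata standard), a minimal metadata record for that photo might look like this, and it is the human-applied fields that make the file findable:

```python
# A minimal, hypothetical metadata record for a photo file.
# Field names are illustrative only, not a real standard.
photo_metadata = {
    "filename": "DX0023.jpg",    # the opaque camera-assigned name
    "title": "Rose in bloom",    # human-applied, searchable
    "created": "2006-08-14",
    "format": "JPEG",
    "rights": "All rights reserved",
}

def searchable(record, term):
    """Return True if a search term appears in any metadata value."""
    return any(term.lower() in str(v).lower() for v in record.values())

print(searchable(photo_metadata, "rose"))    # True: the human-applied title is findable
print(searchable(photo_metadata, "sunset"))  # False: nothing to match against
```

Someone has to create and maintain those human-applied fields, which is exactly the chore (and cost) described above.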

Then we get into the pesky problem that digital goods are very small and very easy to create. The proliferation of digital goods — photos, music, Web pages, blog posts, social media shares, tweets, ratings, movies and videos, and so much more — puts incredible and growing pressure on metadata management techniques and layers. This means building more and larger warehouses, which adds to both ongoing costs for current users and migration costs as older warehouses are outstripped by new demands. Megabytes become gigabytes become terabytes become zettabytes and beyond. Where will they all fit?

Building more and larger digital warehouses, and creating the information that makes those digital goods accessible, entails major work from highly technical people using sophisticated equipment and engineering. These people are expensive, and they work at multiple levels in the digital economy. Let’s not forget the human effort involved in distributing and storing digital goods.

Take for example the Library of Congress’ effort to archive all the Twitter messages since 2006. How easy this should be, right? Twitter’s free! Digital is free! It’s all endlessly reproducible at no marginal cost! Not so fast:

What makes the endeavor challenging, if not the size of the archive, is its composition: billions and billions and billions of tweets. When the donation was announced last year, users were creating about 50 million tweets per day. As of Twitter’s fifth anniversary several months ago, that number has increased to about 140 million tweets per day. The data keeps coming too, and the Library of Congress has access to the Twitter stream via Gnip for both real-time and historical tweet data.

Each tweet is a JSON file, containing an immense amount of metadata in addition to the contents of the tweet itself: date and time, number of followers, account creation date, geodata, and so on. To add another layer of complexity, many tweets contain shortened URLs, and the Library of Congress is in discussions with many of these providers as well as with the Internet Archive and its 301works project to help resolve and map the links.
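The structure described in the excerpt can be sketched roughly as follows. This is a simplified, illustrative record — real Twitter JSON has far more fields and a different exact schema — but it shows how the metadata can dwarf the 140-character payload:

```python
import json

# A simplified, illustrative tweet record; not the exact Twitter schema.
raw = """{
  "text": "Reading about the costs of digital goods",
  "created_at": "2011-09-15T14:02:00Z",
  "user": {"followers_count": 120, "account_created": "2008-03-01"},
  "entities": {"urls": ["http://bit.ly/example"]}
}"""

tweet = json.loads(raw)

# Even in this stripped-down sketch, the tweet text is a minority of the
# bytes stored; the rest is metadata that must be parsed, indexed, and kept.
payload = len(tweet["text"])
total = len(raw)
print(f"text: {payload} bytes of {total} total")
```

Multiply that overhead by 140 million tweets a day and the Library of Congress’s archiving problem stops looking trivial.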

Well-stored digital goods need a superordinate data structure to organize them, a structure that becomes more complex and expensive to maintain the more robust, commercial, and valuable the underlying digital assets become. And this structure can change. For instance, bit.ly has been the major URL-shortener for years, but t.co has become a major one thanks to Twitter, and others are in ascendancy. They probably each move data in a slightly different manner. Bit.ly in particular existed before Twitter, so what additional metadata came into being once it modified itself for the Twitter era? How much did you spend to get your XML to comply with the NLM DTD?

Digital goods also need to be backed up. Because they’re small, they are fragile. Because they are digital, they are all-or-nothing (i.e., a scratched analog record can sound mostly fine, but a damaged digital file is most often rendered useless by a small run of corrupted bits). Cold, warm, and hot backups usually exist for the most valuable digital goods, and all of these have costs.
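That all-or-nothing fragility is easy to demonstrate: flip a single bit and a file’s checksum changes completely, which is exactly why backup systems verify digests before trusting a copy. A minimal sketch:

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
damaged = bytearray(original)
damaged[10] ^= 0x01  # flip a single bit

h1 = hashlib.sha256(original).hexdigest()
h2 = hashlib.sha256(bytes(damaged)).hexdigest()

# One flipped bit yields an entirely different digest, so the damaged
# copy fails verification and must be restored from backup.
print(h1 == h2)
```

Running those verification passes across cold, warm, and hot copies is itself an ongoing cost of ownership.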

Digital goods need to be secure. Do you want your credit card number, government ID numbers, or bank accounts to be easily transportable? Security is a major cost to the ongoing storage of digital goods, from articles to financial information. From firewalls to anti-virus software to cryptographic keys, security around data elements is vital to their integrity and viability. Governments, companies, and individuals spend billions each year protecting their digital goods.

Digital goods have owners. The legal barrier for protection of databases is fairly low — “the author has to make choices about the selection, coordination, or arrangement of the facts or data” — and some databases are very valuable. Your customer database is but one example. In fact, while PubMed Central is free to access, PubMed itself charges a licensing fee, often into the tens of thousands of dollars. Free abstracts compiled with taxpayer funds, and the government charges for them? Well, OA zealots, I’ll leave you to ponder that one. (Correction: PubMed has been free via license since 2000. Sorry.) And this isn’t the only case. There are many cases of the government hoarding public data, often because it’s so expensive to purvey and keep well. In any event, determining, protecting, and ensuring the legal status of databases and digital goods is difficult and costly. Even Creative Commons — which trades in nothing tangible — spends $2.5 million a year doing what it does, which is basically distributing labels it defines.

Digital goods have to be available all the time. This requires a lot of infrastructure, more than a typical physical warehouse does. For instance, you can turn out the lights on a physical warehouse, lock the doors, and leave it for a week or more. With a digital warehouse available online, you need to staff it around the clock, provide power 24/7, and monitor the warehouse for problems and errors. Digital goods require much more expensive warehouses.

Digital warehouses are more expensive to build. A physical warehouse is something a small business owner can buy and construct with relative ease. They aren’t expensive (a concrete pad, a sheet-metal structure, some crude HVAC, and a security system is usually all it takes). A digital warehouse is expensive to construct — site planning is a major undertaking, and then come servers, redundant power requirements, high-grade HVAC, earthquake-proofing, and so forth. This means that digital goods have to work off a much higher fixed warehouse cost.

Digital goods also vary in quality, and some of the quality has to do with the infrastructure in which they exist. You can pay for faster downloads at some sites, because bandwidth is an expensive variable in the purveyance of digital goods. Do you pay more for a robust data plan? Do you pay more for a bigger pipe?

The costs associated with data provision and storage can shock IT managers because they reveal all the overheads of digital goods. After all, in standard IT budgets, things like electricity, heat, and rent are handled in other internal budgets. Once data move to the cloud, the IT budget is paying for those things directly via the vendor, and these additional expenses can be sizable — or so says Jonathan Alboum, CIO at the U.S. Department of Agriculture’s Food and Nutrition Service (FNS):

With the cloud, these basic infrastructure charges are baked into the overall cost, so I’m now paying for some things that previously didn’t come out of my IT budget.

So, are digital goods infinitely reproducible? Not practically. Information has an energy cost, even in biological systems. This is why our brains are only so big and so busy. Biology strikes a balance in energy needs and information processing abilities.

In the realm of digital goods, we’re reaching a point at which we’re facing trade-offs. Already, some data sets are propagating at a rate that exceeds Moore’s Law, which may still accurately predict our ability to expand capacity. And these are purposeful data sets. As data becomes an effect of just living — traffic monitoring software, GPS outputs, tweets, reviews, star ratings, emails, blog posts, song recommendations, text messages — we as a collective will easily outstrip Moore’s Law with our data. If there’s no place to put it, and nobody to manage it, does it exist? Quick, find me all your five-year-old emails.

I think the sooner we come to grips with the fact that digital goods are real, expensive in their own way, not intangible, not infinitely reproducible, and in need of management, warehousing, maintenance, and space, the sooner we’ll be able to have more rational discussions about the future of scholarly publishing, online commerce, data storage initiatives, and multimedia.

Unless we fully realize the costs and obligations of being digital, we’re likely to mistakenly believe it can be free.

Kent Anderson

Kent Anderson is the CEO of RedLink and RedLink Network, a past-President of SSP, and the founder of the Scholarly Kitchen. He has worked as Publisher at AAAS/Science, CEO/Publisher of JBJS, Inc., a publishing executive at the Massachusetts Medical Society, Publishing Director of the New England Journal of Medicine, and Director of Medical Journals at the American Academy of Pediatrics. Opinions on social media or blogs are his own.


69 Thoughts on "Not Free, Not Easy, Not Trivial — The Warehousing and Delivery of Digital Goods"

This has to be the best article I have ever read explaining all the details of digital goods that many people aren’t even aware of. Wonderful bit of work!

You are absolutely right. Digital goods are not cost free. BUT …

I recall discussions more than a decade ago (2 decades?) about the cost of communicating to your customer becoming “free”. Instead of a mail piece costing one to two dollars each you will be able to send an email for zero cost.

The critical question that should then have been discussed was “What is the consequence of the cost of communicating to your customer becoming free?”

But mostly that wasn’t the discussion. It was “no it’s not really free” … emails have to be composed, email address lists have to be maintained and backed up, records have to be kept, bandwidth and storage isn’t actually free.

And even now that’s right — it does cost $26 to send out a billion emails. The cost to send one spam email isn’t zero. It is $0.000000001 but that’s not zero. That’s why spam never happened.

It is not free to send an email to practically every person in the country – thereby guaranteeing you hit a large number of customers of any given business without knowing who they are. It is not zero – that’s why phishing never happened.

But spam and phishing did happen. Spam and phishing are bad things. They happened because bad people were thinking “what if it were free?” without quibbling over the difference between “free” and “negligible,” while the good people were doing the quibbling instead of the thinking.

Without a reference to the $26 for a billion emails, I can’t comment on the cost numbers, except to say I’ve seen a lot of calculations done that simply miss out large chunks of cost, for whatever reasons. Here’s the thing about spam though. It’s not free. Not free at all. The spammer pays to use a botnet and the botnet has been constructed by hijacking the computers of unsuspecting folk. The people with the compromised computers are paying via the bandwidth being used in their name. And the money they paid out for the machine in the first place.

You want to look at costs? Amazon’s content delivery network runs about 12 cents a GB for the first 10 TB of data and then scales up from there. You first have to get it to their servers (so double the cost for the upload and first download). Then the user has paid for the ability to download that chunk of data, so the costs need to be added for that, and that cost varies depending on the network, the package the user is on, and the location they are getting it from. And that’s a real can of worms. Over here in the UK, the pricing model for domestic broadband users doesn’t effectively cover heavy users of bandwidth — effectively the light users are subsidising the big users in a number of offerings, at least until the ‘fair usage’ policy kicks in.

So what’s the cost to download a DVD-quality video via your mobile phone? In the UK it’s about £5 ($7.77ish) for the user to download that data, plus the upload and download costs: ($0.12 × 2) × 6 GB = $1.44. All of a sudden it’s $9 to download that there zero-marginal-cost digital object. The next download would still cost about $8, by the way. And that’s why HD Netflix streaming is a very low-quality version of a Blu-ray even when the headline resolution is ‘the same.’ Netflix pay a ton of money to the CDN networks to deliver those movie streams. Even if you look at Comcast’s cheapest offer (about $30, I think?) you as a consumer are still paying $0.60 to download a DVD worth of data (plus the $1.44 or $0.72 for the total cost) — until you hit the limit, when the costs go up dramatically. If you are doing this by mobile phone, you’ll have to put in the contract and subsidised phone cost as well — £30 per month for 24 months plus £100–200 for your shiny iOS or Android device gives you an annual cost of about £400. When you look at it that way, it’s probably cheaper to buy a physical copy and a DVD player than to use your data allowance on your mobile device if you are going to do so regularly.
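The commenter’s back-of-envelope arithmetic can be reproduced in a few lines. All figures below are the comment’s own assumptions (circa-2011 pricing: a 6 GB DVD-quality file, $0.12/GB CDN rates, ~£5 of UK mobile data), not current rates:

```python
# Reproducing the comment's back-of-envelope cost model.
# All constants are the commenter's assumptions, not current pricing.
GB_PER_DVD = 6          # assumed size of a DVD-quality video
CDN_RATE = 0.12         # CDN charge, $/GB, first pricing tier
USER_DATA_COST = 7.77   # ~£5 the UK mobile user pays for the data

cdn_cost = CDN_RATE * 2 * GB_PER_DVD   # upload to CDN + first download
total = cdn_cost + USER_DATA_COST
print(f"CDN: ${cdn_cost:.2f}, total: ${total:.2f}")  # CDN: $1.44, total: $9.21
```

Which is how a “zero marginal cost” object ends up costing roughly $9 per delivery under these assumptions.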

If you want to get a better grip on the costs for enterprise class things, go look at the difference between the price of a consumer level solid state drive, and one designed for enterprise use. It’s about a 10X price difference.

So what are the costs? Well, I know they are not zero. After that, calculating the total costs to deliver a digital object becomes a complicated business. And I think that was Kent’s point, was it not?

Spam and phishing are good analogies, precisely because no one has introduced the concept of value (which is why I, and 99% of other people, never open them). A targeted TOC produced by a publisher has value added, value to me, and that’s why I open it. The fact that it cost money to produce and mediate is why it is a product I’m interested in paying for (i.e. one with value) and not just background noise (which is what most low-to-no cost digital communication comprises).

It is weird to read you pretend that the “OA zealots” you so often try to make look like fools believe that digital goods have no cost. Need I recall that many of these “OA zealots” strongly support *gold* OA? Would you pretend that PLoS, which *charges* authors for the publishing service it provides, is based on the belief in the absence of cost of digital goods? Seriously? Even mathematicians and physicists, who use arXiv a lot, know it costs something to run it.

While your description of the costs of digital goods is pretty interesting, the way you try to use it to dismiss OA is quite offensive. I know that Schopenhauer, in “The Art of Being Right,” explains that attacking a simplified variation of your opponent’s argument is a good strategy, but you should probably do it in a less obvious way. Reading you, it seems impossible to believe that I can access the laws of my country online without paying a fee. Free for the reader does not mean without cost to anyone, and that’s something I did not assume would need to be said, especially to you.

I don’t try to make them look like fools, and you can drop the “OA prima donna” attitude, please. If you assert something in the world, be ready to back it up. Sorry if you find reality offensive.

As for some OA zealots looking like fools, they don’t need my help. Nor do they need help pretending at the absence of costs for digital goods. They tend to do those things on their own. Here is a quote from a very recent article, in which an editorial director at PLoS says, “But, in an online world, the costs don’t come every time someone new looks at it, they come at the point where you publish the work.”

So, bam-bam, there go your two points — PLoS person saying there are no costs after publication, and OA person looking like a fool. Sorry. Reality has high standards.

Clearly, I’m not attacking a simplified version of anyone’s argument. I’m pointing out that their worldview is too simplistic, which is entirely different. And I went to great pains to do so.

As for your ability to access the laws of your country, you are paying a fee to access them — taxes.

I hope you don’t blame me for making you look bad. I just held up a mirror.

I don’t see how this makes the PLoS guy look foolish. First of all, in a major newspaper, I would expect that his perhaps nuanced opinion would get simplified (that’s journalism, after all). Second of all, he’s just comparing relative costs.

For PLoS, or for any other website, the bulk of the cost is making the stuff available in the first place, but each individual page view only adds a teeny tiny extra cost over the big initial fixed cost. This is clearly what he is talking about. I don’t think he’s saying that PLoS doesn’t pay anything for hosting, but rather that the hosting cost per page view (on a linear graph) has a very shallow slope and a very high y-axis intercept, as opposed to a print journal, which has a steeper slope and also a high y-axis intercept. (I actually suspect that the cost-per-page-view graph for PLoS and similar operations is stair-steppy, but it still has a rather low average slope.)

So, the defense of the statement is, “Certainly, they were misquoted”?

No, just that a journalist took the person’s perhaps much more nuanced view of economics and flattened it into something less nuanced. You didn’t respond to my second paragraph about what that more nuanced view might be. If anyone here looks foolish, Kent….

Your point is based on downloads or, as you put it, “page views.” That’s a very limited point of view.

What is a page view in your mind? PDF downloads are much more expensive than HTML page views, because PDFs are often much bigger than a full HTML page, depending on what’s on the page (ads and widgets can add a lot). Since PDFs are what users download instead of just viewing, users traverse both an HTML page and a PDF in most cases where they find relevant content, so you get both costs of download. Or is a page view also the supplemental data? The comments? The reference links? Where is all this stored? Are you talking storage costs, upload costs, transmission costs, or the whole set? It usually costs about $10 per user per year just for this kind of stuff if you run a robust site, and that’s mainly variable costs. PLoS has fewer costs because they don’t have access controls and don’t enforce copyright on behalf of their authors, but they have plenty of costs.

But there’s more to it than that. Articles have to be submitted to indexing services and third parties. What happens when Google changes its rules about how PDFs are displayed on archival content, which they did about a year ago? How does the expense of that get covered? Or when a new social search engine instigates another round of SEO work? Or when their fiber optic provider is acquired and the contract needs to be renegotiated? Or when a disk array goes down? Who’s paying for the site administrator? Who paid the CLOCKSS fees (both annual and per-upload)?

“Hosting” and “page views” are one step up from “free” when it comes to understanding what goes into digital publishing expense, but there’s so much going on in digital publishing that these concepts only scratch the surface.

I don’t see how any of this contradicts the comment from the PLoS person; she isn’t saying that there aren’t costs (where else would the huge OA fees they charge go?); she’s saying that the costs for the most part aren’t accumulated per article view/page view/download/etc., but rather they are all large, mostly fixed costs.

The things you list above are all costs that cannot easily be associated with individual “copies” (i.e., downloads, page views, or however we want to measure that) of the articles. Rather, they are associated with the site/platform in general or they are associated with the master copy of the article in the database (e.g., if you have to pay $10 to reformat each of the PDFs on your site for some reason, you still pay that $10 regardless of whether the article is downloaded zero times or a thousand times) and so the cost can’t be related directly to individual downloads.

Of course, those costs do need to be spread across all downloads (etc.) to come up with a unit cost, which may in fact be large. (This, in particular, is where I feel that Chris Anderson’s model of FREE doesn’t work for large segments of academic publishing.)

I want to note that I’m not an OA acolyte. I am in favor of it only as far as someone can make a workable business model for it, and I know that that’s not possible in all segments of the industry.

That’s fine, but the costs are real, and they are associated with purveying digital goods. PLoS likes to say most of their costs are fixed costs because it jibes with their business model (upfront author charges), but they have a lot of distribution, storage, technology, IT salary, and backup costs, all of which scale with the size of the endeavor, making them variable costs, not fixed costs. They can’t be associated with individual copies, so they apply equally to all copies for accounting simplicity. But they are real, and because authors ultimately want their materials to be read, those costs are intrinsic to publication.

I think we agree more than we disagree. The point of the post wasn’t to pick on PLoS (although they and their attack dogs do tend to minimize or even ignore variable and technology costs). The point was to think through something we all believe we know the answer to, but which turns out to be much more complicated and “real” than many think. Good comments.

There is an interesting, understandable, and not uncommon confusion here between “cost” and “price”.

” (e.g., if you have to pay $10 to reformat each of the pdf on your site for some reason, you still pay that $10 regardless of whether the article is downloaded zero times or a thousand times)”

If you have to pay $10 … then that’s the price, and it is a cost to you, but it’s not the cost of doing it. Most likely the cost of doing it was the cost of running some software for a few seconds, which is close to a cost of zero. Except for the fixed cost of paying someone to write the software. The owner of the software may be able to price a run of the software at $10 per run, or per page reformatted, or whatever, but that has little to do with his cost. It has only to do with some mechanism that prevents the price being zero (such as a mechanism to prevent you copying the code and running it yourself “for free”).

What is often called “costs” in this discussion is often the accumulation of other people’s prices — prices which reflect their ability to extract a price rather than their actual cost (in person-hours or raw materials, etc.) of what is done.

When someone says “the cost should be zero,” they are generally confusing cost and price. Maybe the price should be zero, maybe not, but the cost is a matter of physics, not a matter of “should.”

And in our digital world we are dealing with real costs that are close to zero – the cost of processing, storage, bandwidth, even programming (for any high volume usage) is negligible.

My off-the-cuff “$26 for a billion emails” is wildly off (as Kevin said). It’s actually taken from a price quotation from a “mass-marketeer” (aka spammer) — $26 for a MILLION (my mistake) emails. That’s his price, so you know his cost is less, because he makes a profit. So the cost is a thousand times greater than I said, and it’s still NEGLIGIBLE!

My point is mainly about thinking through this problem thoroughly. We are in a world dominated by near-zero-marginal-cost goods. It’s a very unfamiliar world, and it’s going to bite you.

Wow, circular, man. So, someone else’s prices are my costs which makes them just prices so they aren’t costs but prices? I love it. Actually, not even circular, but sort of self-canceling. If only money worked that way, like a magic trick on the street corner where if I fool you enough, it just disappears up my sleeve and you applaud the trick. Unfortunately, it doesn’t work that way.

“$26 for a million emails” — let’s analyze that, since I seem to be your outsourced consultant on thinking things through.

This cost is wildly low for modern emails that comply with CAN-SPAM and have metrics on them, things publishers want. Emails can run from $0.025 (2.5 cents) to $0.0085 (0.85 cents) and anywhere in between for emails that actually allow you to have unsubscribe features, bad address and bounce detection, list hygiene, resend features (in case someone is away or the network is down where they are), open rates, click-through rates, HTML vs. text options and rendering, and so forth. Then, you have to build the landing pages, compose the emails, proof them, store all the related graphics on a server and make sure they’ll load when the HTML version is loaded, have your database ingest and wash any bounces resulting from the send, monitor response rates to see if they matched your expectations, deal with any unsubscribes, etc. And for a publisher, you have to do it all again, and are usually supporting multiple emails and types of emails at once — so, marketing emails, content emails, alerts emails, and customer service emails.
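Putting the thread’s own numbers side by side makes the gap concrete (figures are from these comments; a sketch, not authoritative pricing):

```python
# Comparing the "$26 per million" spam quote against the compliant-email
# cost range cited above. Figures are from this comment thread only.
EMAILS = 1_000_000
spam_quote = 26.00            # the corrected "mass-marketeer" figure
low, high = 0.0085, 0.025     # per-email cost for compliant, full-featured email

print(f"spam quote: ${spam_quote:,.0f} per million")
print(f"compliant:  ${EMAILS * low:,.0f} to ${EMAILS * high:,.0f} per million")
# Compliant email runs roughly $8,500 to $25,000 per million sends,
# hundreds of times above the spam quote.
```

The point of the comparison is that the quoted spam price omits nearly everything a legitimate publisher must pay for.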

As a business owner, I have to absorb all the costs related to this, from the salaries of the customer service people dealing with in-bound emails (not mentioned above), to the fulfillment system costs, to the list rental and scrubbing, to the email creative, to the hosting of assets, to the marketing managers and staff, to the technology contracts, to the domain name registrations, to the trademarks and their costs, to the email consultants, to the publication events, and so on. Those aren’t just “other people’s prices.” Those are real costs, and all related to the provision of digital goods.

You keep wanting digital to be negligible, but those days are over. Digital is the main way publishers, marketers, advertisers, authors, and editors are reaching people. It has become sophisticated. So, Rip Van Winkle, I’d suggest you realize it’s not 1998 anymore, and accept that sophisticated, stable, reliable, strategic, profitable, useful, and interesting digital products are complex, robust, and rather expensive to do well.

Actually I’m trying to be helpful, not antagonistic.

Think through how many of the costs you describe are fixed costs, not marginal costs, when you consider the physical costs rather than people’s prices.

You can pay for a service that provides “unsubscribe” at a per-email cost. But “unsubscribe” is done by some software that took a human programmer a fixed time to write. It is executed on a computer that is a fixed cost. It runs over copper wires that are a fixed cost — even if you pay by the GB of traffic. Yes, there are some costs that increase with scale — like administrators — but they get more automated with more scale because they are using the same technology (electronics and programming), so their costs don’t grow linearly.

The difference between “your costs” and “other people’s prices” doesn’t matter until something changes. You pay $1 per email for a service that is really done by software whose fixed cost is amortized at $0.90 per email, so you pass on your cost (his price) as a variable price to your customer. But your competition buys the software outright, and can now adjust his price structure to beat you because his marginal costs are $1 less than yours.

All of this cost/price stuff worked fine in a competitive world of physical goods where the major portion of the real costs are unit costs. We now have a world in which the fixed costs are very high (which is really the point you are making — the REAL COST OF DIGITAL GOODS IS NOT ZERO — they are HIGH (I completely agree), but they are FIXED costs, not marginal costs).

They are a mix of both, and often a tumbling mix. Add a new issue? A mix of fixed and variable costs. Email people? A mix of fixed and variable costs. Each email costs something to send, the bounces cost something to process, the database costs something to update, etc. I mean, really, what are we arguing here? Your religious views vs. reality? Join the atheists of digital — they have better accounting practices.

Dave, I have to disagree with you yet again on the subject of the cost of high-volume email. No respectable organization uses spammers, so what they charge is entirely irrelevant. You wouldn’t argue that the cost of a Rolex is $20 because that’s what a counterfeiter charges, would you? Email sent by spammers is no different, and it winds up in spam filters (or not being delivered at all) with good reason. Even still, spammers don’t support CAN-SPAM (which Kent pointed out), which is not a nice-to-have but rather a matter of compliance with the law. Are you suggesting that publishers start breaking the law to lower their costs?

A real Rolex COSTS more to make than a fake because there’s more material, better material, more labor, and better labor in the real one. The PRICE is higher because it costs more to make, and because Rolex can extract a higher rent for its brand if the buyer thinks it’s a real Rolex (because the buyer ‘knows’ it is made with better material, etc.).

CAN-SPAM is implemented by software not human labor. It costs real resources to do but they are fixed costs (programming labor) not marginal costs (there’s no labor per email).

Underlying the entire digital world is a high-fixed-cost, zero-marginal-cost model that we don’t have an economic system to handle. The “zero marginal cost” leads to cries for the price to be zero, or marketing tactics result in the price being zero (as in browsers, for example), leaving the suppliers wondering how to cover their fixed costs and make a sustainable profit. So far the only model is some form of “trick” in which the price is a function of the strength of the “trick,” and is unrelated to cost, and related to value only to the extent that the price can’t exceed value if the customer can walk away. Therefore your “costs” are built on a rickety framework of other people’s “tricks.”

Sorry, but you’re wrong. I’ve tried to shed light where you only want to impose darkness, but I have to move on now.

Kent, this is an excellent post.

The belief that digital goods are inherently less expensive to deliver than physical ones (even cost-free, perhaps) is not limited to the debate over open access. I’m sure you’ve encountered librarians and individual subscribers who are not advocating open access but don’t think they should be paying as much for your online journal as they did (or still do) for your print journal.

Dave commented on the cost of email, which emphasized your point (this digital “stuff” is all free, right?—wrong), although the cost figure he provided for sending one billion emails is wildly inaccurate. High-volume email (e.g., journal TOC alerts) costs hundreds of thousands of dollars, and I can assure you it does not approach one billion emails.

People who believe that the delivery of digital goods should be cost-free or nearly cost-free tend to argue that “creature comforts” like journal TOC alerts can be done away with in order to lower or eliminate the cost, because the “community” will develop a free alternative. I have seen no evidence of this; in fact, journal TOC alert subscriptions are steadily rising.

I’m sure you’ve encountered librarians who are advocating open access but also do understand that online journals and other electronic resources sometimes have more value than the print version and thus are justified in having a greater price than the print version. Just a couple of hours ago I participated in a group at my academic library that makes decisions regarding e-resource acquisitions and we decided that a doubling of price was reasonable for switching from print to online for a reference series. No, this is not technically a journal, but everyone seemed to accept that a doubling in price was a reasonable cost increase for the advantages offered by the online version for this particular resource.

Kent, can you expound on what you mean here:

Of course, this argument is implicitly cost-based, while the information economy works more rationally if it’s value-based, so the argument is fundamentally flawed at its outset.

Information is valuable because of what you get from it. Therefore, it should be based on the value the user ascribes to it, not on the cost of production. Information is about receipt and utility, not delivery and FOB.

The belief that digital costs almost nothing and therefore ebooks should be priced cheaply is pervasive, and I have written to try and rebut this assumption in numerous places on numerous occasions. Your summary here, Kent, will be an invaluable resource to me in the future to get the point across that digital costs are mostly substitutive for print costs, not conducive to huge savings for publishers.

Unfortunately (as has been alluded to in some previous posts at TSK), publishers are doing a poor job of explaining to the outside community why some digital costs are so darned high. A recent article quotes Elsevier as saying average per-download costs for articles are around $1.10 (not sure how accurate the number is or how it was calculated, and I am happy to hear corrections; I also understand that different publishers may have different numbers). Assuming 1,000 downloads for an article, this is not a trivial cost, and I fully accept that. The hangup for many of us is how this then translates into $37/download for non-subscribers who are outside of a major research library system and choose to buy their access legally. If publishers want readers to pay a premium for information access, publishers need to do a much better job of making these costs transparent.

Andy, I was thinking about this at a recent meeting watching several technical presentations from Elsevier showing the really interesting things they’re doing through SciVerse to engage with the developer community and build apps to add new functionality to journals, along with an overview of their implementation of semantic technologies. These are the sorts of things that researchers want to see from publishers, but Elsevier seems to do a poor job of connecting the investments they’re making to benefit the researcher to the prices of their journals and the earnings everyone is so upset about.

Not to be pedantic, but the article you reference states that “fees” have come down to an average of $1.10 per download, not costs. As I understand the usage of “fees” in the cited article, this is what Elsevier charges consumers for articles, not the costs it incurs for publishing, warehousing, and distributing them.

Thank you for this extremely helpful piece. Your research will take many readers behind the curtain of this utopian Oz where lunch and data are supposedly free. I offer one elaboration on your point about varying quality. I have been involved in the current struggle to defend the U of Missouri Press, which the new president of the University (a former software company president with no background in academics) just said he is closing. The reaction to his announcement has been swift and massive — 1700+ following a Facebook page and 2400+ signatures on a petition in a couple of weeks — and the president is already talking about “reimagining the press,” using a faculty adviser or two, student interns, and increased reliance on digital technologies (never mind that the laid-off staff of ten had already trained interns and was moving rapidly into new media). What he overlooks are these statements from the American Association of University Presses task force on new business models: “The financial investment of printing and physical distribution typically comprises about a fifth of the costs; far more is invested in the time of acquiring and developmental editors, copyeditors, project managers, proofreaders, and indexers, as well as lights, copiers, office space, and other overhead costs involved in publishing that book” and that less than 10% of revenue for university presses comes from electronic editions (probably less for those, like Missouri, which don’t publish or at least emphasize journals). Here’s a link to the petition:


Kent, can you point me to your evidence for the statement, “PubMed itself charges a licensing fee, often into the tens of thousands of dollars”? I did not think this was the case. The license page (http://www.nlm.nih.gov/databases/leased.html) says “There is no charge for leasing NLM data or licensing the UMLS from NLM,” and the license itself states “Currently NLM does not impose charges of any kind for data licensed under this License.” I have heard of charges for licensing MEDLINE from other vendors, but that has to do with the vendors not the government.

There is no charge for NLM data or UMLS, but PubMed has historically had a relatively large licensing fee, which you don’t find out about until you go through the contracting process. They don’t publish it online.

If this has changed, I’m happy to learn.

And, yes, this has changed, as my long-time friends at the NLM were nice enough to inform me today. PubMed has been free via license since 2000. The license is required because NLM wants the ability to revoke rights if the data are mishandled or poorly monitored. In my defense, I checked around, and apparently some people of my vintage are under the same impression — that you have to pay for PubMed. Oh well. Live and learn. My apologies.

PubMed is free to use online, but the downloaded version is a license, and it still costs money. To be clear, I used PubMed as shorthand for the Medline database, since that’s what it’s become known as. But it’s still costly, and there are still errors in it (I had a doozy pointed out to me yesterday).

This is an excellent article. Reading some of the follow-up comments makes me realize yet again that there is no arguing with someone who expects a free lunch.

Case in point: the desire by some to create the Public Library of Science is a predictable one. It’s one thing to persuade some people to PAY to give their research away (why anyone would do that is utterly beyond me), but let’s face it: if you are a freelance researcher and you can persuade legislators to make all government research free to everyone, why wouldn’t you do that?

This is kind of like standing up at a football game and saying “Who thinks the hot dogs should be free?” Getting a show of hands is not going to be hard. If we stretch the analogy further and assume that the stadium is built with taxpayer money and the cows are fed by the taxpayers and the vendor is paid by the taxpayers does it now make sense to give away the hot dogs? Worse, should the government now employ people to stand around and give away the hot dogs? Even the Soviets didn’t go that far.

So if the US government legislates in favour of PLoS and gives away all the scientific research for free, because “hey, it’s just bits”, is the OA community going to be happy to see hostile governments then benefit from that research? Does anyone seriously think those countries are going to put their secret state-sponsored data up for everyone to see?

What about last year, when someone turned up data on the bird flu that was incredibly dangerous? Should we just repeat the OA mantra and put it up for all to see because we can and it’s cheap to do so? Do these people seriously want to roll the dice with everyone’s lives under the assumption that “Oh, the OA community will be able to find a cure WAY faster than some moron can weaponize it”?

This attitude of entitlement that comes from the OA community is not only anti-business it is dangerous. Just because it has recently become easier to get a hot dog, doesn’t mean you should get unlimited hot dogs for nothing. Not only is that clearly unsustainable, it’s incredibly bad for you.

Even though my street-meat analogy kind of slides into peripheral territory, rock on Kent, this is a GREAT article. I feel your frustration.

“Your Excellency, we were all ready to start our multi-million dollar program to turn the latest US government secret research into a weapon. But the article was behind a paywall and we couldn’t afford the pay-per-view fee! Curses, foiled again!”

Not sure if this is an elaborate troll or if you’re just very confused. There are some well-reasoned arguments against funder OA mandates. This is not one of them.

David, I was trying to make the point that if the US government MANDATES that government research be put up for all to see, where does that leave the bright spark in some government lab who figures out how to weaponize something nasty? Or is PLoS advocating only the liberation of the nice stuff?

Ah, I see where perhaps you’re confused. I know of no mandate requiring all government research, particularly classified or dangerous research to be made public. The mandates in question here come from funding agencies (both public and private). The idea is that the funding agency can require that any published research that results from that funding be made freely available (immediately or after an embargo period), rather than being available in forms that are only accessible to paying subscribers. It’s not about making secret research public, it’s about making publicly available research more broadly available.

Oh I see. I had read elsewhere that the OA community was advocating that ALL government funded research be made freely available. Although this seemed utterly daft to me it was consistent with many of the people who are complaining about the price of ebooks and saying that they should be the same price as music i.e. 99 cents.

Maybe a romance novel churned out in 48 hours might be sold for 99 cents but non-fiction publishers and writers have got a real problem on their hands. You can spend 20 years writing a work of non-fiction, then you have to pay for things like copy-editing, proofing, layout/coding, foreign rights negotiating, image rights, translation costs, marketing, sales, legal costs, artwork etc etc. Some people can try doing all this themselves but combined with Kent’s clearly articulated explanation of all the other costs it adds up to a very tangible string of expenses. There are a lot of people making a lot of noise supporting the drive by Amazon to run ebook prices down to nothing. Amazon themselves are literally now talking about FREE ebooks by running adware inside the book.

This can only lead to one thing: experts not getting paid adequately for their expertise. Not everyone is a tenured professor. There are many untenured experts who make a living from what they know. If we don’t pay them, why share? Thus my comment about open access being dangerous. It contributes to the dumbing down we’re seeing all around us, whether it be librarians, journalists, teachers, or just retired engineers with a story to tell. There are a lot of people out there who see no point in sharing what they’ve spent a lifetime learning because there’s no incentive anymore.

where does that leave the bright spark in some government lab who figures out how to weaponize something nasty?

Probably highly classified/restricted and certainly not submitted for publication in a journal (OA or not OA).

I actually used to work with a computer security journal, and it was often hard to get submissions because so many of the researchers in the community weren’t allowed to publicly publish their work.

Reblogged this on Oden Konsult and commented:
This is coming from a scholarly publication perspective, but the problems the author describes are almost identical to what we’re facing in broadcasting. In a nutshell, we are generating data at a rate faster than Moore’s Law can keep up with. Unless that rate changes, it’s just going to keep getting more expensive to store our media no matter how much the price of the hardware goes down.

So the question is: do we just accept that as the price of doing business, or do we take a hard look at our policies and exactly what we are trying to accomplish?

Indeed, digital storage and distribution is not free. The ArXiv has an annual budget of close to a million dollars. This comes out to around $7 per paper uploaded, or less than 2 cents per download. I’d be more than happy to have a discussion of scholarly publishing that starts with the agreement that digital storage and distribution has real costs on the order of magnitude of cents per download.
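The round numbers in this comment imply some instructive annual volumes. A quick back-of-the-envelope check, using only the commenter’s own figures (a roughly $1M budget, $7 per upload, 2 cents per download), not audited arXiv data:

```python
# Sanity-check the per-unit figures quoted in the comment above.
# All inputs are the commenter's round numbers, not audited arXiv data.
annual_budget = 1_000_000        # dollars, "close to a million"
cost_per_upload = 7.0            # dollars per paper uploaded
cost_per_download = 0.02         # "less than 2 cents per download"

# The implied annual volumes these unit costs would require:
implied_uploads = annual_budget / cost_per_upload      # roughly 143,000 papers/year
implied_downloads = annual_budget / cost_per_download  # roughly 50 million downloads/year

print(f"implied uploads/year:   {implied_uploads:,.0f}")
print(f"implied downloads/year: {implied_downloads:,.0f}")
```

The point survives the arithmetic either way: even at pennies per download, a repository at this scale carries a real six-to-seven-figure annual cost that someone has to fund.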

What a fantastic article and comment stream! Thank you Kent! A voice of reason.

The discussion reminds me of a paper Ronald Coase wrote over 60 years ago: “The Marginal Cost Controversy.”

I feel sure these are the exact same issues… Duffy also tackled it more recently I believe.


I find your critique of the “digital = free” school of thought to be valid and well placed. That having been said, I still find it difficult to believe that the per-unit cost of production and distribution of digital works is anywhere near the per-unit cost of production and distribution of physical works. (I have yet to see a scholarly comparison of these costs.) Please note: I am by no means positing that digital works are cost-free. I simply am interested in a realistic comparison of the costs.

I also find the argument that the increased volume of the data we are creating increases either the marginal or the absolute cost of a particular work of authorship to be entirely unpersuasive. While more data may mean more costs, it is the choice of the content provider as to whether to produce and distribute additional content and incur the costs associated therewith. Any additional costs are then properly attributable to the additional content.

Of course, cost does not necessarily dictate either price or value. That having been said, as a matter of consumer psychology, knowledge (or even belief) that the cost of one of two comparable (but not identical) products is radically different is capable of impacting a consumer’s perception of what constitutes a “fair” or “reasonable” price from the standpoint of the consumer. There is, of course, far more to creation of perceived value on the part of consumers. (In fact, arguably, this is where publishers have truly fallen down on the job: the creation of perceived value in digital works.)

Whether you find it persuasive or not, people are discovering it’s real.

Kent, I am not sure I understand your argument on this particular point. I would certainly agree that more data = greater total costs (in storage, bandwidth, management, etc.). I am not certain, however, that I see (or have seen economic data that would support) how adding more content necessarily raises the costs associated with data that is already there.

As an example, suppose a data set (say, an e-book repository) has 1,000 works in it. There is a marginal cost associated with warehousing and distributing each of those works. If another 1,000 works are added to the repository, additional storage space, management, and I/O bandwidth may be required to accommodate the additional works. These additional costs for adding the new works do not, however, increase the cost of handling the original 1,000 works. In fact, increased volumes often (but not always) create economies of scale, which have the net result of reducing the per-unit cost (spreading costs over a greater number of units).

Again, I am by no means arguing that there is no cost (or even that there is only minimal cost) involved in the distribution of digital works. I just have not seen an adequate demonstration of the validity of the argument that greater volumes of works increase the cost per work. Perhaps there is data to validate this argument. I’ve just never seen it put forth in a convincing manner.
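The repository example above can be sketched as a simple fixed-plus-marginal cost model. The dollar figures here are purely illustrative assumptions, not real publishing data:

```python
# Toy model of the repository example: one fixed platform cost plus a
# per-work marginal cost. Both numbers are hypothetical, for illustration only.
FIXED_COST = 50_000.0      # assumed annual platform/overhead cost, dollars
MARGINAL_COST = 2.0        # assumed per-work storage/handling cost, dollars

def cost_per_work(n_works: int) -> float:
    """Average annual cost per work when fixed costs are spread over n_works."""
    return FIXED_COST / n_works + MARGINAL_COST

print(cost_per_work(1000))  # 52.0 dollars per work
print(cost_per_work(2000))  # 27.0 dollars per work
```

Under these assumptions, doubling the collection lowers the average cost per work (the fixed cost is spread thinner) even though the total cost still rises with every work added, which is exactly the distinction the comment is drawing.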

Sandy, the U Mich article is interesting, but not 100% apropos.

First, it is dated 1998. As such, its cost assumptions are rather out of date. Second, the article cited is not really a “study” and does not contain any real cost data, beyond some very broad anecdotal assumptions. Rather, the article is more of a broad “topic of discussion” piece: it’s great for fostering dialog, but is pretty thin on empirical, actionable information.

Comments are closed.