At the recent Society for Scholarly Publishing (SSP) Annual Meeting, the matter of the thing that begins with a B that nobody likes to talk about these days came up. No, not that. Blockchain.
Candidly – I’ve been struggling to work out exactly what it is, and what it is for, and why I should care about it. And I don’t think I’m alone. So here then is a guide as to what it’s about and why and when you should care about it. And by care about it, I mean consider spending money on technology that features it as a component. I’m going to attempt to do this without mentioning one other B word.
There is a LOT to try and get to grips with, so I’m going to start by talking about (hopefully) the simpler bits, and then build from there. Let’s go. Note to tech wizards and other similar types – it’s a guide for the perplexed and confused, okay?
1. Blockchain is a type of database.
To go right back to basics, a database is a way of storing lots of information in a structured way, with each bit of information categorized and labeled, for easy searching and access. Blockchain is a lot more than that, but let’s start here. It has certain features that differentiate it from other data storage and manipulation technologies. Let’s take a look at what those are:
2. A blockchain database holds the data it contains as a linked list.
Basically, this means that all the pieces of information in the database are ordered linearly with each piece of information (or ‘data element’) labeled with a link to the next data element in the list. Let me give you an example: A Spotify playlist is an example of a linked list. Linked list data structures aren’t particularly special (though they can get quite complex) and you can store linked lists in many databases. But, linked list data in a Blockchain database is special. And the reason is that (unlike in other data systems) data in a blockchain database can only be added to.
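For anyone who finds code easier to follow than prose, here is a minimal sketch of a playlist held as a linked list, written in Python. It is an illustration only: the names Track, build_playlist and play_order are invented for this example, and this is not how Spotify (or any blockchain) actually stores things.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Track:
    """One data element: a title plus a link to the next element."""
    title: str
    next: Optional["Track"] = None


def build_playlist(titles: List[str]) -> Optional[Track]:
    """Link the titles together in order and return the first track."""
    head: Optional[Track] = None
    # Walk backwards so each new track can point at the one after it.
    for title in reversed(titles):
        head = Track(title, head)
    return head


def play_order(head: Optional[Track]) -> List[str]:
    """Follow the links from the first track to the last."""
    order = []
    node = head
    while node is not None:
        order.append(node.title)
        node = node.next
    return order


playlist = build_playlist(["Track A", "Track B", "Track C"])
print(play_order(playlist))
```

The only way to reach the third track is by following the links from the first, which is the property that matters for what comes next.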
3. Data cannot be deleted from a blockchain database.
Blockchain is fundamentally designed so that data can only be added. Data cannot be deleted or altered. Thus, a blockchain database is said to be ‘immutable’ (unchanging). So you would only use it for information that is best looked after in this way. Each data element is encrypted (strictly speaking it is ‘hashed’, a one-way cryptographic process described further down). Each time data is added, the encryption works by taking the new data and encrypting it together with the encrypted data of the preceding data element.
Because you are combining both the current item and the previous item into each encryption, if you wanted to delete or alter any single item, you would have to decrypt the item you want to change, and then re-encrypt everything that comes after it in the list. (So, if you anticipated wanting to change items in your database, you would not choose to use blockchain.) But that’s only part of the reason why it’s immutable. We’ll take a look at the other reasons below.
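The chaining idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses a SHA-256 hash as the ‘fingerprint’ that ties each element to the one before it, and the names GENESIS, fingerprint, append and first_broken_link are made up for the example rather than taken from any real blockchain software.

```python
import hashlib
from typing import List, Tuple

GENESIS = "0" * 64  # assumed placeholder "previous fingerprint" for the first element


def fingerprint(data: str, previous: str) -> str:
    """Hash the new data together with the previous element's fingerprint."""
    return hashlib.sha256((previous + data).encode("utf-8")).hexdigest()


def append(chain: List[Tuple[str, str]], data: str) -> None:
    """Add a (data, fingerprint) pair to the end of the chain."""
    previous = chain[-1][1] if chain else GENESIS
    chain.append((data, fingerprint(data, previous)))


def first_broken_link(chain: List[Tuple[str, str]]) -> int:
    """Return the index of the first element whose stored fingerprint no
    longer matches its data, or -1 if the whole chain checks out."""
    previous = GENESIS
    for index, (data, stored) in enumerate(chain):
        if fingerprint(data, previous) != stored:
            return index
        previous = stored
    return -1


chain: List[Tuple[str, str]] = []
for entry in ["first entry", "second entry", "third entry"]:
    append(chain, entry)

print(first_broken_link(chain))  # -1: nothing has been altered

# Quietly alter the middle entry without redoing any fingerprints.
chain[1] = ("second entry (edited)", chain[1][1])
print(first_broken_link(chain))  # 1: the edit is detected
```

To get away with the edit, you would have to recompute the fingerprint of the altered element and of every element after it, which is the re-doing of ‘everything that comes after’ described above.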
4. Blockchain databases are decentralized.
This one is crucial to understand. So let’s take it from the top. For all sorts of reasons, all databases, regardless of technology, consist of more than one copy. Those reasons include:
- Performance (speed of searching and accessing the data in the database).
- Resilience (being able to get your database working again after something like a power cut)
- Different access rights (being able to keep one copy of master data, with highly restricted access, while working on and potentially mucking up other versions).
With multiple copies, we need to figure out a) how they are connected to each other and b) how we make sure the data they contain is correct in every instance. We call this latter point ‘consensus’. Consensus is the term for an agreed view of what a given set of data contains and how it is structured and labeled. It could be your customer relationship data, your content management system, or the data from a research experiment. All versions of a database need to achieve consensus.
So-called centralized databases need a master or ‘parent’ to keep everything in order with multiple ‘child’ databases that are used in anger. And this master database is the one true source of truth. NONE of the others are. If there’s a discrepancy – the master database version is used to sort that. Every time. In our centralized set-up, the master database is not only the one source of truth, it is also, by virtue of that fact, TRUSTED. And by trusted, I mean that it is the master database that does all the checking to make sure that the data inside it is just so and correct and valid. We call this ‘validation’. There is another approach to database management, and that is ‘decentralized’.
5. Decentralized databases are radically different. There is no master source of truth.
The truth is held and agreed on by all (or sufficient) participants in the network. You’ll see I introduced the idea of network there. A decentralized network means that there is no master database, and there are no parent–child relationships between the databases. All the different databases can talk to each other directly, share updates about what has changed, and accept / action changes from each other. Now you may well be reading that with a furrowed brow and wondering just how on earth that happens. Well let’s see if any of these things ring any bells: Napster, BitTorrent, SETI@home, the Tor network.
All of these are examples of Peer2Peer networks and all lack a central organizing and controlling authority, and Blockchain too comprises multiple databases without a central authority.
6. A blockchain database is a Peer2Peer network where all participants can see and communicate with all the other participants.
Peer2Peer technology has been around a good while now and is really very mature. The ways of putting a robust Peer2Peer network together are well understood. You have probably used a Peer2Peer network (legitimately!) at some point even if you didn’t realize it. Some readers may recall that Pennsylvania State University, MIT, and Simon Fraser University built one called LionShare to exchange research information.
The key difference between a general Peer2Peer network and a blockchain network is that in the blockchain network ALL participants must arrive at an agreed version of what the data they hold looks like. They must become consistent. It takes time for a consensus to be agreed upon, and then propagated around the network.
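To make ‘arriving at an agreed version’ a little more concrete, here is a toy sketch in Python in which each node reports a fingerprint of the data it holds and a version counts as agreed once more than half of the nodes report it. This is a deliberate simplification: the simple majority rule, the node names and the function agreed_version are assumptions for illustration, and real blockchain networks use more elaborate mechanisms (Proof of Work, discussed later in this post, is one).

```python
from collections import Counter
from typing import Dict, Optional


def agreed_version(reports: Dict[str, str], quorum: float = 0.5) -> Optional[str]:
    """Return the version reported by more than `quorum` of the nodes,
    or None if no version has enough support yet."""
    if not reports:
        return None
    version, votes = Counter(reports.values()).most_common(1)[0]
    return version if votes / len(reports) > quorum else None


# Hypothetical node names, each reporting a fingerprint of the data it holds.
reports = {
    "node-1": "abc123",
    "node-2": "abc123",
    "node-3": "abc123",
    "node-4": "fff999",  # out of date, or up to no good
}
print(agreed_version(reports))                # abc123
print(agreed_version({"a": "x", "b": "y"}))   # None: a 1-1 split
```

Note that nobody in this sketch is the master copy. The answer comes from counting, not from asking an authority.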
7. A Blockchain database network stores transactional data and the means to validate it.
What sort of data is actually stored in a blockchain database? There are two types. The first type is transactional data. It doesn’t have to be monetary (or equivalent) transactions, but it does have to represent an event. Transaction data has a timestamp, a reference to the object being transacted and a value, e.g. ([I] [wrote] [a cracking blog post] for [The Scholarly Kitchen] [On this date]). This is why you hear blockchain described as a ledger – it’s the record of the transactions. The second type is data that enables those transactional statements to be validated. Validation in the blockchain system is very interesting. Interesting because the validations have to be computable; the events you wish to verify MUST be described in terms of rules that a machine can perform matching operations on. This is a very complex area all by itself, so I’m going to leave it there. If you are wondering “but how do you do that really?” then you are thinking about it correctly.
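As a rough sketch of what ‘computable’ means here, the Python below models the example transaction as a small record and checks it against a rule a machine can apply. Everything in it is an assumption made for illustration: the field names, the list of recognized actions, the rule itself and the placeholder dates are invented, and real validation rules are a good deal more involved.

```python
from dataclasses import dataclass
from datetime import date
from typing import Set


@dataclass(frozen=True)
class Transaction:
    actor: str   # who did it
    action: str  # what they did
    item: str    # the object being transacted
    venue: str   # where it was recorded
    when: date   # the timestamp


# The second type of data: rules a machine can check.
# This set of actions is invented for illustration only.
KNOWN_ACTIONS: Set[str] = {"wrote", "reviewed", "cited"}


def is_valid(tx: Transaction, today: date) -> bool:
    """A computable rule: every field is filled in, the action is one the
    network recognizes, and the event is not dated in the future."""
    fields_present = all([tx.actor, tx.action, tx.item, tx.venue])
    return fields_present and tx.action in KNOWN_ACTIONS and tx.when <= today


# The dates below are placeholders, not the real publication date.
tx = Transaction("I", "wrote", "a cracking blog post",
                 "The Scholarly Kitchen", date(2000, 1, 1))
print(is_valid(tx, today=date(2000, 1, 2)))  # True
```

The point is that the rule refers only to things a program can compare: fields, sets and dates. Whether the blog post was actually ‘cracking’ is not something this kind of rule can decide.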
8. A Blockchain database network operates in a zero trust environment.
So our blockchain network MUST consist of parties who have a shared interest in recording the transactions or interparty agreements, or whatever, between parties in the network – because each party will be expected to validate and store information about the other parties’ transactions (events).
Every participant must also, by the act of participating, accept a shared view of what all those agreements and exchanges actually are going to be (the validation data). Shared interest then is a key, indeed vital, motivator needed for a blockchain network of databases to work. But that’s not enough. Shared interest alone does not prevent bad people joining the network, up to no good, who might wish to skew the data to meet their ends, rather than reflect the truth. In a decentralized network, you can’t trust any of the other participants. This is a key point – if you trust one person or a group of people on the system – you’ve just agreed to accept a nucleus of aligned interests which have centralized around that. When terms such as ‘decentralized’ and ‘zero trust’ are used, they specify the fundamentals of how such networks actually function. True decentralization is an incredibly hard thing to actually achieve.
9. Blockchain transactions are broadcast to the entire network and then a consensus must be arrived at in order to validate and then record (immutably) those transactions. All participants must take part in some aspect of this process.
What’s needed is a mechanism to a) incentivize effort to get a consensus on the state of those inter-party interactions that are being logged and b) to be secure against people who might seek to distort or corrupt the data that the network is working on.
And this is how it happens:
- You undertake some sort of transaction with another party on the network. Because it’s all encrypted, you both exchange the information required for each of you to verify that exchange (I shall spare you the pitiless logic of the world of public key encryption – it works, it’s amazing stuff, the planet runs on it, and don’t let the politicians ruin it!).
- Each of you now broadcasts the news of this transaction to every other database (we call them nodes) on the blockchain network. At this point, those nodes that have been told about the transaction, but weren’t actually involved, need to decide what they are going to do about it. In a theoretical world, they’d do the work you just did with your counterparty — to agree that the thing you just did, had actually happened to their satisfaction. This is the validation step. And when enough nodes have done that validation work, then the data is declared to be definitive and everybody in the system now accepts that transactional information is correct. As this all takes some effort, transactions are typically bundled up into groups of transactions and the work is done to validate the transactions within that grouping or BLOCK as it is known… So our linked (transactional) data that I explained waaaay back at the start, consists of BLOCKS of such data appended together into a CHAIN (chain link geddit?) distributed around the system, validated and ultimately accepted as being the true representation of the state of things. Once something is in, you can’t go back on it.
- But of course in a world where some people are doing lots of transactions of whatever sort, and others aren’t, there’s pretty quickly a difference in the effort being expended for the benefit that’s being gained. So there needs to be some sort of incentive system to get nodes in the system (or enough of them) to do the work to do the validation. Enter the Proof of Work.
10. Proof of Work is the incentive system that powers Blockchain validation and thus decentralized consensus
Proof of Work is the vital component needed to keep the verification process honest. Without it, a group of nodes could collude to decide on a version of the truth and if there are enough of them, their version will get added instead of the correct version. So proof of work is a problem that has to be resistant to that. In fact it’s a competition. First node to solve this problem, and have their solution confirmed (by the other nodes), wins!
“But what is the problem?!” I hear you cry. The problem is as follows: take those transactions that have been wrapped up into a block and
- validate them all (according to the rules), Then…
- ‘Hash’ them. Hashing is a cryptographic process that takes data and turns it into a number or other representation (aka the hash). The twist here is that it can’t just be any old hash (these are pretty easy things to do for a computer), it must be a special hash that meets ‘specific’ criteria. I’ll leave it there (and so will you, tech nerds m’kay?!). To find the special hash, the node doing the proof of work must try a lot of hashes until it finds one that meets the criteria.
When a node is first to find the special hash of the data, it broadcasts this fact to the rest of the network. Other nodes in the network then check that the special hash is in fact special (i.e., they double check that it does meet the ‘specific’ criteria and hasn’t just been made up by a node pretending to have ‘won’). This is easy to do and requires very little energy or effort, and so doesn’t need any incentivizing. Once enough nodes have independently done this, the block of data becomes accepted as the latest addition to the system and all nodes update their copy of the data to reflect this. And the circle of life continues…
Finding the special hash takes a lot of computational power and this is why the Proof of Work is so energy intensive. There’s no way around this for this particular way of validating. And the key point here is that Proof of Work, with a reward at the end for the successful node (and therefore the node owner), is the thing that enables the system to work in a decentralized way. People are incentivized to do the work to check the blocks that represent the transactions and add them to the consensus view of who has done what with whom.
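For the curious, the hunt for the special hash can be sketched in Python as follows. This is a toy under stated assumptions, not the procedure of any particular blockchain: I have assumed the ‘specific’ criterion is ‘the hash starts with a given number of zeros’, I have skipped the rule-checking of the individual transactions, and the function names (block_hash, is_special, mine, verify) and the sample transactions are invented for the example.

```python
import hashlib
from typing import List, Tuple


def block_hash(transactions: List[str], previous_hash: str, nonce: int) -> str:
    """Hash the block's contents together with a trial number (the nonce)."""
    payload = previous_hash + "|" + "|".join(transactions) + "|" + str(nonce)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_special(digest: str, difficulty: int) -> bool:
    """Assumed criterion: the hash must start with `difficulty` zeros."""
    return digest.startswith("0" * difficulty)


def mine(transactions: List[str], previous_hash: str,
         difficulty: int) -> Tuple[int, str]:
    """The expensive part: try nonce after nonce until the hash qualifies."""
    nonce = 0
    while True:
        digest = block_hash(transactions, previous_hash, nonce)
        if is_special(digest, difficulty):
            return nonce, digest
        nonce += 1


def verify(transactions: List[str], previous_hash: str, nonce: int,
           claimed_digest: str, difficulty: int) -> bool:
    """The cheap part: one hash is enough to check a claimed win."""
    recomputed = block_hash(transactions, previous_hash, nonce)
    return recomputed == claimed_digest and is_special(recomputed, difficulty)


pending = ["A pays B", "C reviews D's paper"]  # made-up transactions
nonce, digest = mine(pending, previous_hash="0" * 64, difficulty=3)
print(nonce, digest)
print(verify(pending, "0" * 64, nonce, digest, difficulty=3))  # True
```

With this particular criterion, each extra leading zero multiplies the average number of tries by 16 (the hash is written in hexadecimal), while checking a claimed answer always takes a single hash. That lopsidedness is why finding the hash needs an incentive and checking it does not.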
What do they win? Well that’s a very good question. The answer is that they win a number. That number might (and so far in the blockchain world this is the only answer…) represent an amount of currency. But it could represent some other sort of value recognition. But a pat on the back, or other non-monetary rewards, don’t help to pay that electricity bill (or the cost of the hardware).
11. Now, if you’ve got this far, I hope the realization has dawned that this is in no way your regular database.
A. The rules, once agreed, cannot be changed. You have to decide – before you build your blockchain system – exactly what transactions it is going to record, and exactly what information you need to know about each transaction. You cannot change these things after the blockchain has started to be used. This is very different to other database systems, where the ‘schema’ (which describes the fields that can be used, the type of information that can go into those fields, and how that information must be formatted) can be updated from time to time. In the case of blockchain, the ‘schema’ is baked into the mathematical protocols that power the system. If you find yourself needing to change the rules, you have to build a new system, and the participants have to decide which version of that system they are going to participate in. If there’s an argument about whether a set of agreed transactions should be recognized (‘we got hacked’ or ‘we changed our minds…’) then you resolve that by building a system with that version of the truth and again, the participants have to decide which version they will support (why yes, this situation is a complete nightmare). This is the Forking Problem.
B. A blockchain MUST have a critical mass of participants, or it cannot be trusted. Because blockchain is a Peer2Peer system, it needs a certain scale in order to function correctly. In a trustless system, you can only trust what your version of the data is if enough people are participating to make the risk of bad actors corrupting the data mathematically infeasible. Decentralization is not simply a word – it has to be a requirement. In fact, if an application of this technology does NOT pass the test of NEEDING to be decentralized, in my opinion, it should not be considered.
C. A blockchain is spectacularly energy intensive. As the network scales – so the effort to do the verification scales. Not only is each act of verification computationally expensive (in time, money, and energy) but because it’s a competition with only one winner, almost ALL of the effort is for nothing. All the losers get nothing. Their blocks of data that were part of the competition are destroyed. The rate at which information can be stored in the system is subject to the mathematical constraints that I’ve described, i.e., data is only added when the math problem is solved. That math problem can only be solved more rapidly by applying more computing power – and, indeed, electricity – to the process. This limits the amount of data that can be stored per day, but is critical to the function whereby trustworthiness is linked to critical mass. If verification were in the hands of a smaller number of participants, the individuals would be trusting each other specifically, rather than trusting to critical mass generally. This is important when we consider the potential uses of blockchain in our sector. There are far more efficient (and quicker) ways of storing information, which should be used for any system where quality and trustworthiness do not RELY on decentralization.
D. Meaningful incentives are an absolute requirement. This is why currencies have been the use case so far for blockchain. If you are going to spend a lot of energy to participate in the network, you need to be rewarded for that with something you can ‘spend’. Because that energy is actually costing you money (your electricity bill), ideally that reward will be actual cash. For blockchain to work in use cases other than currencies, you need a reward that can either be easily converted into cash, or is worth more to the participants than cash.
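Going back to point A for a moment, the Forking Problem can be pictured with a very small Python sketch. It is an illustration only: blocks are represented by made-up labels rather than real data, and fork_point is a name invented for this example. Two versions of the chain share their early history and then part company.

```python
from typing import List


def fork_point(chain_a: List[str], chain_b: List[str]) -> int:
    """Return how many blocks the two chains share before they diverge."""
    shared = 0
    for block_a, block_b in zip(chain_a, chain_b):
        if block_a != block_b:
            break
        shared += 1
    return shared


original = ["block-1", "block-2", "block-3-disputed", "block-4"]
forked = ["block-1", "block-2", "block-3-replacement"]
print(fork_point(original, forked))  # 2
```

Nothing in the code can say which of the two versions is the ‘right’ one. That is decided by which version the participants choose to keep working on, which is exactly why it is such a nightmare.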
12. Ultimately – blockchain is a synthesis of data, computing power, and the PEOPLE who are motivated to participate in a trustless network.
When you think about the potential for blockchain to be used in a scholarly publishing or communication process, ask yourself these questions:
- Who is going to participate in this network?
- Why would they want to (What’s in it for them?)
- What is it about the data, that would drive the behaviors needed for such a system to grow, survive, and ultimately flourish?
- Why is it best for this data network to be decentralized?
If you can articulate some answers, and you are up for the effort needed to grow a network around that answer to sufficient size, and you are willing to work through the mathematical and algorithmic issues in order to get the network to function, you’ve got yourself a blockchain.
So, do I think there are any applications for blockchain technology?
As it happens I do. Decentralized Identity Management.
A subject for a future post perhaps.
Special note: I’d like to thank my fellow Chef Charlie Rapple for her superb input into this article.
13 Thoughts on "What is the Blockchain Really, and Should You Care? A Guide for the Perplexed Scholarly Publishing Citizen."
Great job. Thanks. But sometimes such stories cry out for graphics. Just a thought.
I do, too: Decentralized and trustworthy research data management. But is it really worth it if the technology is so energy intensive? With the climate change and all?
This is a good basic overview, but some of it seems specific to Bitcoin’s implementation of blockchain. For instance, the reason the Bitcoin blockchain is so “spectacularly energy intensive” is because anyone can join the chain, and your proof of work validates your trustworthiness. A protected blockchain, where only certain trusted organizations maintain the ledger, can perform this work without the energy-intensive validation steps. Of course, that is less decentralized, so the benefit of that blockchain is somewhat different.
Hi Shaun, I left out a discussion about other ways of validating – for length mostly. However… It is my view that the Proof of Stake approach as you indicate is seemingly considerably less decentralised and thus calls into question whether the rest of the technology should even be considered in a far more trusted environment – such as Scholarly Publishing. If the participants have trust based relationships (and we do) then…
Great overview–and so clearly written, which is always the toughest job with new technology that is swimming in buzzwords.
One potentially interesting application for blockchain and scholarly publishing is peer review, and there is an effort to bring that to life.
My colleague Bill Rosenblatt has also written about this, including an article in Publishers Weekly where he discussed three potential applications for book publishers–rights and royalties, consumer e-book distribution, and piracy track-and-trace.
Those of you looking for visual examples, try these here:
http://graphics.reuters.com/TECHNOLOGY-BLOCKCHAIN/010070P11GN/ or https://anders.com/blockchain/. See https://blockchaindemo.io/ for an interactive demo.
It seems to me that this is an example of group-think which implies group prejudices. Would a blockchain exclude data that all agree should be excluded?
In a decentralised world, everyone is thinking for themselves. So there shouldn’t be group think (it requires trust/collaboration). However, you raise an interesting point about what the validation rules actually are in situations other than monetary transactions. Ultimately if you want to say that data in the system needs to be rolled back then you have to ‘fork’ the system. You end up with two versions…
I would think that “A. The rules, once agreed, cannot be changed.” would preclude this from any practical use in peer review. Our Editorial Board is always coming up with new curve balls that require us to go into our manuscript submission software and change settings/set up new processes. In fact, the selling point of our current manuscript submission software was that it was very flexible and adaptable.
It rather suggests that there be “one set of rules for all”, doesn’t it? Explanations to the contrary are most welcome!
Thank you, David, for this matter-of-fact explanation of blockchain. I still can’t quite wrap my head around it (nodes competing to find the special hash of the data — whaaaat??), but this post was very helpful all the same.
Thank you David for this well written explanation. For thinking through your final question the following publication might be helpful.
Blockchain for research. Joris van Rossum et al. Digital Science Report. November 2017.
See the content blockchain project for another interesting experiment: https://content-blockchain.org/