Backblaze Open Sources Reed-Solomon Erasure Coding Source Code

June 16th, 2015

Reed  Solomon Erasure Coding

At Backblaze we have built an extremely cost-effective storage system that enables us to offer a great price on our online backup service. Along the path to building our storage system, we have used time-tested technologies off the shelf, but we have also built in-house technologies ourselves when things weren’t available, or when the price was too high.

We have taken advantage of many open-source projects, and want to do our part in contributing back to the community. Our first foray into open source was our original Storage Pod design, back in September of 2009.

Today, we are releasing our latest open-source project: Backblaze Reed-Solomon, a Java library for erasure coding.

An erasure code takes a “message,” such as a data file, and makes a longer message in a way that the original can be reconstructed from the longer message even if parts of the longer message have been lost. Reed-Solomon is an erasure code with exactly the properties we needed for file storage, and it is simple and straightforward to implement.

Erasure codes and storage

Erasure coding is standard practice for systems that store data reliably, and many of them use Reed-Solomon coding.

The RAID system built into Linux uses Reed-Solomon. It has a carefully tuned Reed-Solomon implementation in C that is part of the RAID module. Microsoft Azure uses a similar, but different, erasure coding strategy. We’re not sure exactly what Amazon S3 and Google Cloud Storage use, because they haven’t said, but it’s bound to be Reed-Solomon or something similar. Facebook’s new cold-storage system also uses Reed-Solomon.

If you want reliable storage that can recover from the loss of parts of the data, then Reed-Solomon is a well-proven technique.

Backblaze Vaults utilize erasure coding

Earlier this year, I wrote about Backblaze Vaults, our new software architecture that allows a file to be stored across multiple Storage Pods, so that the file can be available for download even when some Storage Pods are shut down for maintenance.

To make Backblaze Vaults work, we needed an erasure coding library to compute “parity” and then use it to reconstruct files. When a file is stored in a Vault, it is broken into 17 pieces, all the same size. Then three additional pieces are created that hold parity, resulting in a total of 20 pieces. The original file can then be reconstructed from any 17 of the 20 pieces.

We needed a simple, reliable, and efficient Java library to do Reed-Solomon coding, but didn’t find any. So we built our own. And now we are releasing that code for you to use in your own projects.

Performance

Backblaze Vaults store a vast amount of data and need to be able to ingest it quickly. This means that the Reed-Solomon coding must be fast. When we started designing Vaults, we assumed that we would need to code in C to make things fast. It turned out, though, that modern Java virtual machines are really good, and the just-in-time compiler produces code that runs fast.

Our Java library for Reed-Solomon is as fast as a C implementation, and is much easier to integrate with a software stack written in Java.

A Vault splits data into 17 shards, and has to calculate 3 parity shards from that, so that’s the configuration we use for performance measurements. Running in a single thread on Storage Pod hardware, our library can process incoming data at 149 megabytes per second. (This test was run on a single processor core, on a Pod with an Intel Xeon E5-1620 v2, clocked at 3.70GHz, on data not already in cache memory.)

Where is the open source code?

You can find the source code for Backblaze Reed-Solomon on the Backblaze website, and also at GitHub.

The code is licensed with the MIT License, which means that you can use it in your own projects for free. You can even use it in commercial projects.

We’ve put together a video titled: Reed Solomon Erasure Coding Overview to get you started.

If you’re interested in an overview of the math behind the code, keep reading. If not, you already have what you need to start using the Backblaze Reed-Solomon library. Just download the code, read the documentation, look at the sample code, and you’re good to go.

Reed-Solomon Encoding Matrix Example

Feel free to skip this section if you aren’t into the math.

We are fortunate that mathematicians have been working on matrix algebra, group theory, and information theory for centuries. Reed and Solomon used this body of knowledge to create a coding system that seems like magic. It can take a message, break it into n pieces, add k “parity” pieces, and then reconstruct the original from n of the (n+k) pieces.

The examples below use a “4+2” coding system, where the original file is broken into 4 pieces, and then 2 parity pieces are added. In Backblaze Vaults, we use 17+3 (17 data plus three parity). The math—and the code—works with any numbers as long as you have at least one data shard and don’t have more than 256 shards total. To use Reed-Solomon, you put your data into a matrix. For computer files, each element of the matrix is one byte from the file. The bytes are laid out in a grid to form a matrix. If your data file has “ABCDEFGHIJKLMNOP” in it, you can lay it out like this:

The Original Data
The Original Data

In this example, the four pieces of the file are each 4 bytes long. Each piece is one row of the matrix. The first one is “ABCD”. The second one is “EFGH”. And so on.
The Reed-Solomon algorithm creates a coding matrix that you multiply with your data matrix to create the coded data. The matrix is set up so that the first four rows of the result are the same as the first four rows of the input. That means that the data is left intact, and all it’s really doing is computing the parity.

Applying the Coding Matrix
Erasure Coding

The result is a matrix with two more rows than the original. Those two rows are the parity pieces.

Each row of the coding matrix produces one row of the result. So each row of the coding matrix makes one of the resulting pieces of the file. Because the rows are independent, you can cross out two of the rows and the equation still holds.

Data Loss: 2 of the 6 rows are “lost”
Data Loss: 2 of the 6 rows are lost

With those rows completely gone it looks like this:

Data Loss: The matrix without the 2 “lost” rows
Data Loss: The matrix without the 2 "lost" rows

Because of all the work that mathematicians have done over the years, we know the coding matrix, the matrix on the left, is invertible. There is an inverse matrix that, when multiplied by the coding matrix, produces the identity matrix. As in basic algebra, in matrix algebra you can multiply both sides of an equation by the same thing. In this case, we’ll multiply on the left by the identity matrix:

Multiplying Each Side of the Equation by the Inverse Matrix
Multiplying Each Side of the Equation by the Inverse Matrix

The Inverse Matrix and the Coding Matrix Cancel Out
The Inverse Matrix and the Coding Matrix Cancel Out

This leaves the equation for reconstructing the original data from the pieces that are available:

Reconstructing the Original Data
Reconstructing the Original Data

So to make a decoding matrix, the process is to take the original coding matrix, cross out the rows for the missing pieces, and then find the inverse matrix. You can then multiply the inverse matrix and the pieces that are available to reconstruct the original data.

Summary

That was a quick overview of the math. Once you understand the steps, it’s not super complicated. The Java code goes through the same steps outlined above.

There is one small part of the code that does the actual matrix multiplications that has been carefully optimized for speed. The rest of the code does not need to be fast, so we aimed more for simple and clear.

If you need to store or transmit data, and be able to recover it if some is lost, you might want to look at Reed-Solomon coding. Using our code is an easy way to get started.

Brian Beach

Brian Beach

Brian has been writing software for three decades at HP Labs, Silicon Graphics, Netscape, TiVo, and now Backblaze. His passion is building things that make life better, like the TiVo DVR and Backblaze Online Backup.
  • Roger Shimizu

    We already know the source code, mentioned here and hosted in github, is licensed under MIT/Expat License.

    May I know what’s the license for the image files in this blog post?
    For example:
    https://www.backblaze.com/blog/wp-content/uploads/2015/06/blog-rs-7.png

    I hope image files are also under MIT/Expat License. So we can use them together with the source code.
    Could you kindly help to confirm it? Thank you!

  • Joran Dirk Greef

    Thanks Brian, Backblaze for releasing this under the MIT license. The code is really well written and well structured. I have created a port to Javascript for Node.js with an optional C++ native binding: https://www.npmjs.com/package/reed-solomon

  • Bruce Croxall

    How do you determine what the “parity pieces” of the generator matrix are?

    Thanks for the help.

  • agent lee

    why don’t you guys just use jerasure ?

  • cafehunk

    A couple of micro-optimizations: For divide() in galois.java, you can remove the if >8), instead of the iterative reduction loop.[[And you don’t need to add one additional byte entry (a final 1) in the EXP_TABLE to cover the case when log_result == 0xFFFF, because the product of log(a)*n is always smaller. The comment about the range of values in LOG_TABLE is incorrect: the largest value is 254, not 255, and the comment in EXP_TABLE even says that.]] These changes have only a tiny effect on performance, though, as the functions are only used to compute the matrices.

  • DrPizza

    Quick and fairly direct C++ port, with some SSE3 intrinsics. https://github.com/DrPizza/reed-solomon

    On a Core i7-2600, 10 data blocks, 4 parity blocks, 16 MiB per block, across 8 threads it goes at about 7-8 GiB/s. For 17 data, 4 parity, 200 kiB block size, 8 threads hit about 10 GiB/s.

    It is not very extensively tested. Bugs would not tremendously surprise me.

  • Deodatta Barhate

    Hi,
    How do you solve the problem of partial writes? If the node crashed while writing the file into
    shards, then shards (both data and parity) may contain partial data and then it may not be possible
    to recover data from parity (as we don’t know whether parity is old, new or partial).
    So is there any front end logging used?

  • Jeremy

    Can you talk a bit more about how you decided to go with 17+3, and how the blocks of data are allocated across independent disks and independent pods in a vault?

    As I recall, reconstructing the original data if you’re missing one block requires reading the N-1 remaining blocks and one parity block. So when you’re in a “degraded” state with 17+3, does that mean effectively a 17x read penalty?

    • Brian Beach

      Thanks for the link. I’ll have to read that paper carefully, but my first impression is that RDP supports at most two drive failures. For our current vault deployment, we think that three parity shards are required to provide the degree of safety we need.

      We analyzed the potential failure scenarios, and found that 17+3 balanced cost and operational complexity, while providing the level of file durability we needed.

  • Klaus Post

    I have released a fully compatible Go port of the library & tests. It is an order of magnitude faster, mainly since I can use SSE3 assembler.

    Thanks for releasing this. The library was very easy to understand, and cleanly written and I didn’t find any problems while porting the code.

    I have written an extensive “usage” guide, and also described a bit more what you should mind when implementing the library. Together with this excellent blog post it should be a good start.

    The project can be found at https://github.com/klauspost/reedsolomon

    • Klaus! How can I get a hold of you re: Backblaze related things?

      • Klaus Post

        You can catch me at klauspost at gmail dot com

    • Misiek

      Thanks for this. I love when people write “Our Java library for (…) is as fast as a C implementation”. Java is slow and not memory efficient if you really care about speed and performance. Say it loud and proud.

  • A great explanation, but I have tons of questions now! :)

    So, you have to store the encoding matrix, as well as the original data, right? Or is that computed from the original data, whatever parts are still available? Because if you have to store that as well, that practically doubles the size of the storage needed, right?

    Or, can you reverse-engineer the encoding matrix from the parity that’s left? Or do you store the encoding matrix AND the parity? In the above example, you have 16 bytes original data and 8 bytes parity and 16 bytes of encoding matrix — so a ratio of 16:24 original data:encoded data, right?

    I’m probably really off on this. I just saw a chicken and egg problem if you have to store the encoding matrix as well as the original data, then what stops you from losing the encoding matrix due to disk corruption as well? And then you have to have an encoding matrix encoding matrix?

    Ok, my eyes are crossed. ;)

    • Brian Beach

      The encoding matrix is constant, given the number of data shards and the number of parity shards.

      In our library, when you create a ReedSolomon object, and give it the data count and parity count, it computes the encoding matrix for that configuration. There is a fixed, reproducible algorithm to do that, so you don’t have to store the encoding matrix with the data.