We are thrilled to share the results of a project that has engaged the Backblaze engineering team for the last year: Backblaze Vaults, a major step forward in our cloud storage service’s technology stack. Currently, Backblaze stores over 150 petabytes of data and has recovered over 10 billion files for customers of our cloud backup service. The new storage Vaults will form the core of our cloud services moving forward. Backblaze Vaults are not only incredibly durable, scalable, and performant, but they also dramatically improve availability and operability while remaining incredibly cost-efficient at storing data. We previously shared the design of the original Storage Pod hardware we developed; here we’ll share the architecture and approach of the cloud storage software that makes up a Backblaze Vault.
Backblaze Vault Architecture for Cloud Storage
The Vault design follows the overriding design principle that Backblaze has always followed: “keep it simple.” As with the storage pods themselves, the new Vault storage software relies on tried and true technologies, used in a straightforward way, to build a simple, reliable, and inexpensive system.
A Backblaze Vault is the combination of the Backblaze Vault cloud storage software and the Backblaze Storage Pod hardware.
Putting The Intelligence in the Software
Another design principle for Backblaze is to expect as little as possible from the hardware and abstract the intelligence of the cloud storage into the software. This was the case when we originally designed our Storage Pod hardware and continues as a design goal with Vaults. In addition to leveraging our low-cost Storage Pods, Vaults continue to benefit from the lower cost of consumer-grade hard drives while cleanly handling their common failure modes.
Distributing Data Across 20 Storage Pods
A Backblaze Vault consists of 20 Storage Pods, with the data evenly spread across all 20 pods. Each Storage Pod in a given vault has the same number of drives, and the drives are all the same size.
Drives in the same drive position in each of the 20 Storage Pods are grouped together into a storage unit we call a “tome”. Each file is stored in one tome, and is spread out across the tome for reliability and availability.
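The tome grouping can be pictured with a short sketch. It is only an illustration: the names `PODS_PER_VAULT`, `DRIVES_PER_POD`, and `tome_members` are made up for this example, and the figure of 45 drives per pod is taken from the capacity discussion later in this post.

```python
PODS_PER_VAULT = 20
DRIVES_PER_POD = 45  # per the capacity figures later in this post


def tome_members(drive_position):
    """Return the (pod, drive position) pairs that make up one tome.

    A tome is the set of drives sitting in the same position in each pod.
    """
    if not 0 <= drive_position < DRIVES_PER_POD:
        raise ValueError("drive position out of range")
    return [(pod, drive_position) for pod in range(PODS_PER_VAULT)]


# A vault has one tome per drive position, and each tome spans all 20 pods.
tomes = [tome_members(position) for position in range(DRIVES_PER_POD)]
print(len(tomes), "tomes of", len(tomes[0]), "drives each")
```

Under these assumptions, losing one pod removes exactly one drive from every tome rather than many drives from a single tome.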
Every file uploaded to a Vault is broken into pieces before being stored. Each of those pieces is called a “shard”. Parity shards are added for redundancy, so that a file can be fetched from a vault even if some of the pieces are not available.
Each file is stored as 20 shards: 17 data shards and 3 parity shards. Because those shards are distributed across 20 storage pods in 20 cabinets, the Vault is resilient to the failure of a storage pod, or even a power loss to an entire cabinet.
Files can be written to the Vault when one pod is down, and still have 2 parity shards to protect the data. Even in the extreme and unlikely case where three Storage Pods in a Vault lose power, the files in the vault are still available because they can be reconstructed from the 17 pieces that are available.
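The read and write rules above reduce to two small checks. The sketch below is a simplification that treats one pod as one shard of every file; the function names and the `MIN_PARITY_FOR_WRITE` constant are illustrative, not taken from the Vault software.

```python
DATA_SHARDS = 17
PARITY_SHARDS = 3
TOTAL_SHARDS = DATA_SHARDS + PARITY_SHARDS
MIN_PARITY_FOR_WRITE = 2  # writes must still be protected by 2 parity shards


def can_read(pods_up):
    """A file can be reconstructed from any 17 of its 20 shards."""
    return pods_up >= DATA_SHARDS


def can_write(pods_up):
    """A new file may be written only if at least 2 parity shards can be stored."""
    return pods_up - DATA_SHARDS >= MIN_PARITY_FOR_WRITE


for pods_up in range(TOTAL_SHARDS, 15, -1):
    print(pods_up, "pods up: read =", can_read(pods_up), "write =", can_write(pods_up))
```

With one pod down, both checks pass; with three pods down, reads still succeed but new writes are refused.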
Each of the drives in a Vault has a standard Linux file system, ext4, on it. This is where the shards are stored. There are fancier file systems out there, but we don’t need them for Vaults. All that is needed is a way to write files to disk, and read them back. Ext4 is good at handling power failure on a single drive cleanly, without losing any files. It’s also good at storing lots of files on a single drive, and providing efficient access to them.
Compared to a conventional RAID, we have swapped the layers here by putting the file systems under the replication. Usually, RAID puts the file system on top of the replication, which means that a file system corruption can lose data. With the file system below the replication, a Vault can recover from a file system corruption, because it can lose at most one shard of each file.
Creating Flexible and Optimized Reed-Solomon Erasure Coding
Just like RAID implementations, the Vault software uses Reed-Solomon erasure coding to create the parity shards. But, unlike Linux software RAID, which offers just 1 or 2 parity blocks, our Vault software allows for an arbitrary mix of data and parity. We are currently using 17 data shards plus 3 parity shards, but this could be changed in the future with a simple configuration update.
For Backblaze Vaults, we threw out the Linux RAID software we had been using and wrote a Reed-Solomon implementation from scratch. It was exciting to be able to use our group theory and matrix algebra from college. We’ll be talking more about this in an upcoming blog post.
The beauty of Reed-Solomon is that we can then re-create the original file from any 17 of the shards. If one of the original data shards is unavailable, it can be re-computed from the other 16 original shards, plus one of the parity shards. Even if three of the original data shards are not available, they can be re-created from the other 17 data and parity shards. Matrix algebra is awesome!
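To make the "any 17 of 20" property concrete, here is a toy Reed-Solomon sketch. It is not the Backblaze implementation, which this post does not detail. As a simplifying assumption it works on integers modulo the prime 257 and uses polynomial interpolation, whereas production coders commonly work in the byte-sized field GF(2^8) with matrix arithmetic. The names `encode` and `reconstruct` are illustrative.

```python
P = 257  # small prime field chosen for readability (an assumption of this sketch)


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # modular inverse, Python 3.8+
    return total


def encode(data, parity_count):
    """Return the data values followed by parity_count parity values."""
    k = len(data)
    points = list(enumerate(data))  # data value i is the polynomial's value at x = i
    return list(data) + [_interpolate(points, x) for x in range(k, k + parity_count)]


def reconstruct(shards, k):
    """Recover the k data values from a shard list in which missing entries are None."""
    known = [(x, y) for x, y in enumerate(shards) if y is not None]
    if len(known) < k:
        raise ValueError("not enough shards to reconstruct")
    return [_interpolate(known[:k], x) for x in range(k)]


data = list(b"Backblaze Vaults!")  # 17 data values
shards = encode(data, 3)           # 17 data + 3 parity = 20 values
for lost in (0, 5, 18):            # lose two data shards and one parity shard
    shards[lost] = None
assert reconstruct(shards, len(data)) == data
```

In this toy version each shard is a single number. A real coder would apply the same idea to every byte position of the shards.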
Handling Drive Failures
The reason for distributing the data across multiple Storage Pods and using erasure coding to compute parity is to keep the data safe and available. How are different failures handled?
If a disk drive just up and dies, refusing to read or write any data, the Vault will continue to work. Data can be written to the other 19 drives in the tome, because the policy setting allows files to be written as long as there are 2 parity shards. All of the files that were on the dead drive are still available, and can be read from the other 19 drives in the tome.
When a dead drive is replaced, the Vault software will automatically populate the new drive with the shards that should be there; they can be recomputed from the contents of the other 19 drives.
A Vault can lose up to three drives in the same tome at the same moment without losing any data, and the contents of the drives will be re-created when the drives are replaced.
Handling Data Corruption
Disk drives try hard to correctly return the data stored on them, but once in a while they return the wrong data, or are just unable to read a given sector.
Every shard stored in a Vault has a checksum, so that the software can tell if it has been corrupted. When that happens, the bad shard is recomputed from the other shards, and then re-written to disk. Similarly, if a shard just can’t be read from a drive, it is recomputed and re-written.
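The verify-and-repair cycle might look roughly like the sketch below. The post does not say which checksum the Vault software uses, so SHA-256 is an assumption here. The `rebuild` callback is a hypothetical stand-in for recomputing a shard from the other shards of the file.

```python
import hashlib


def checksum(payload):
    # SHA-256 is an assumption; the post does not name the checksum algorithm.
    return hashlib.sha256(payload).hexdigest()


def store(payload):
    """Keep the shard bytes together with a checksum taken at write time."""
    return {"payload": payload, "checksum": checksum(payload)}


def is_intact(record):
    return checksum(record["payload"]) == record["checksum"]


def scrub(records, rebuild):
    """Rewrite every unreadable (None) or corrupted record; return the indices repaired.

    rebuild(i) is a hypothetical callback that recomputes shard i from the others.
    """
    repaired = []
    for i, record in enumerate(records):
        if record is None or not is_intact(record):
            records[i] = store(rebuild(i))
            repaired.append(i)
    return repaired
```

A caller would run `scrub` over the shards of a file and supply a `rebuild` function backed by the erasure coder.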
Conventional RAID can reconstruct a drive that dies, but does not deal well with corrupted data because it doesn’t checksum the data.
Each vault is assigned a number. We carefully designed the numbering scheme to allow for a lot of vaults to be deployed, and designed the management software to handle scaling up to that level in the Backblaze data centers.
Each vault is given a 7-digit number that looks like a phone number, such as: 555-1001. The first three digits specify the data center number, and the last four specify the vault number within that data center.
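As a small illustration of the numbering scheme, the sketch below splits a vault number into its two parts and formats it back. The function names are hypothetical, and the zero padding in `format_vault_id` is an assumption about how short numbers would be written.

```python
import re

_VAULT_ID = re.compile(r"(\d{3})-(\d{4})")


def parse_vault_id(vault_id):
    """Split a vault number such as '555-1001' into (data center, vault)."""
    match = _VAULT_ID.fullmatch(vault_id)
    if match is None:
        raise ValueError(f"not a valid vault number: {vault_id!r}")
    return int(match.group(1)), int(match.group(2))


def format_vault_id(data_center, vault):
    """Inverse of parse_vault_id, zero padding each part (an assumption)."""
    return f"{data_center:03d}-{vault:04d}"


print(parse_vault_id("555-1001"))
```

Three digits allow for 1,000 data centers and four digits for 10,000 vaults in each, which matches the limits quoted later in this post.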
The overall design scales very well because file uploads (and downloads) go straight to a vault, without having to go through a central point that could become a bottleneck.
There is an authority server that assigns incoming files to specific Vaults. Once that assignment has been made, the client then uploads data directly to the Vault. As the data center scales out and adds more Vaults, the capacity to handle incoming traffic keeps going up. This is horizontal scaling at its best.
We could deploy a new data center with 10,000 Vaults, and it could accept uploads fast enough to reach its full capacity of 90 exabytes in just over a month!
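The "just over a month" figure can be checked with back-of-the-envelope arithmetic. This sketch takes the 20 Gbps per-Vault ingest rate and the 9 petabytes per Vault (10TB drives) quoted elsewhere in this post, uses decimal units, and ignores parity and protocol overhead.

```python
VAULTS = 10_000
GBPS_PER_VAULT = 20        # per-Vault ingest rate quoted in this post
PETABYTES_PER_VAULT = 9    # 900 drives x 10 TB, raw capacity

capacity_bytes = VAULTS * PETABYTES_PER_VAULT * 10**15
ingest_bytes_per_second = VAULTS * GBPS_PER_VAULT * 10**9 / 8
days_to_fill = capacity_bytes / ingest_bytes_per_second / 86_400

print(f"{days_to_fill:.1f} days to fill")  # prints "41.7 days to fill"
```

The number of Vaults cancels out of the division, so under these assumptions the fill time is the same for one Vault as for 10,000.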
Backblaze Vault Benefits
The Backblaze Vault architecture has 6 benefits:
- Extremely Durable
- Infinitely Scalable
- Always Available
- Highly Performant
- Operationally Easier
- Astoundingly Cost Efficient
Extremely Durable
The Vault architecture is designed for 99.999999% annual durability. At cloud-scale, you have to assume hard drives die on a regular basis, and we replace about 10 drives every day. We have published a variety of articles sharing our hard drive failure rates.
The beauty with Vaults is that not only does the software protect against hard drive failures, it also protects against the loss of entire storage pods or even entire racks. A single Vault can have 3 storage pods – a full 135 hard drives – die at the exact same moment without a single byte of data being lost or even becoming unavailable.
Infinitely Scalable
A Backblaze Vault consists of 20 storage pods, each with 45 disk drives, for a total of 900 drives. Depending on the size of the hard drive, each vault will hold:
4TB hard drives => 3.6 petabytes/vault (Deploying today.)
6TB hard drives => 5.4 petabytes/vault (Currently testing.)
8TB hard drives => 7.2 petabytes/vault (Small-scale testing.)
10TB hard drives => 9.0 petabytes/vault (Announced by WD & Seagate.)
At our current growth rate, Backblaze deploys a little over one Vault each month. As the growth rate increases, the deployment rate will also increase. We can incrementally add more storage by adding more and more Vaults. Without changing a line of code, the current implementation supports deploying 10,000 Vaults per location. With 10TB drives, that’s 90 exabytes of data in each location. The implementation also supports up to 1,000 locations, which enables storing a total of 90 zettabytes! (Also known as 90,000,000,000,000 GB.)
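The capacity figures in this section follow from simple multiplication, reproduced in the sketch below. It computes raw drive capacity in decimal units, before any parity overhead; the function and variable names are illustrative.

```python
PODS_PER_VAULT = 20
DRIVES_PER_POD = 45
TB = 10**12


def vault_capacity_bytes(drive_tb):
    """Raw capacity of one Vault built from drives of the given size in TB."""
    return PODS_PER_VAULT * DRIVES_PER_POD * drive_tb * TB


for drive_tb in (4, 6, 8, 10):
    petabytes = vault_capacity_bytes(drive_tb) / 10**15
    print(f"{drive_tb}TB drives: {petabytes:.1f} PB per vault")

per_location = 10_000 * vault_capacity_bytes(10)  # 10,000 Vaults per location
total = 1_000 * per_location                      # 1,000 locations
print(per_location // 10**18, "EB per location,", total // 10**21, "ZB in total")
```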
Always Available
Data backups have always been highly available: if a storage pod was in maintenance, the Backblaze online backup application would contact another storage pod to store data. Previously, however, if a storage pod was unavailable, some restores would pause. For large restores this was not an issue since the software would simply skip the storage pod that was unavailable, prepare the rest of the restore, and come back later. However, for individual file restores and remote access via the Backblaze iPhone and Android apps, it became increasingly important to have all data be highly available at all times.
The Backblaze Vault architecture enables both data backups and restores to be highly available.
With the Vault arrangement of 17 data shards plus three parity shards for each file, all of the data is available as long as 17 of the 20 Storage Pods in the Vault are available. This keeps the data available while allowing for normal maintenance, and rare expected failures.
Highly Performant
The original Backblaze storage pods could individually accept 950 Mbps (megabits per second) of data for storage.
The new Vault pods have more overhead, because they must break each file into pieces, distribute the pieces across the local network to the other storage pods in the vault, and then write them to disk. In spite of this extra overhead, the Vault is able to achieve 1000 Mbps of data arriving at each of the 20 pods.
This does require a new type of Storage Pod, and we’ll be sharing the design of the new pod soon. The net of this: a single Vault can accept a whopping 20 Gbps of data.
Because there is no central bottleneck, adding more Vaults linearly adds more bandwidth.
Operationally Easier
When Backblaze launched in 2008 with a single Storage Pod, many of the operational analyses (e.g. how to balance load) could be done on a simple spreadsheet and manual tasks (e.g. swapping a hard drive) could be done by a single person. As Backblaze grew to nearly 1000 storage pods and over 40,000 hard drives, the systems we developed to streamline and operationalize the cloud storage became more and more advanced. However, because our system relied on Linux RAID, there were certain things we simply could not control.
With the new Vault software, we have direct access to all of the drives and can monitor their individual performance and any indications of upcoming failure. When those indications say that maintenance is needed, we can shut down one of the pods in the Vault without interrupting any service.
Astoundingly Cost Efficient
Even with all of these wonderful benefits that Backblaze Vaults provide, if they raised costs significantly, it would be nearly impossible for us to deploy them since we are committed to keeping our online backup service just $5 per month for completely unlimited data. However, the Vault architecture is nearly cost neutral while providing all these benefits.
When we were running on Linux RAID, we used RAID6 over 15 drives: 13 data drives plus 2 parity. That’s 15.4% storage overhead for parity.
With Backblaze Vaults, we wanted to be able to do maintenance on one pod in a vault and still have it be fully available, both for reading and writing. And we weren’t willing to have fewer than 2 parity shards for every file uploaded, for safety. Using 17 data plus 3 parity drives raises the storage overhead just a little bit, to 17.6%, but still gives us two parity drives even in the infrequent times when one of the pods is in maintenance. In the normal case when all 20 pods in the Vault are running, we have 3 parity drives, which adds even more reliability.
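The two overhead percentages are the ratio of parity to data in each layout. A quick check, with an illustrative function name:

```python
def parity_overhead(data, parity):
    """Extra storage spent on parity, as a fraction of the data stored."""
    return parity / data


print(f"RAID6, 13 data + 2 parity: {parity_overhead(13, 2):.1%}")
print(f"Vault, 17 data + 3 parity: {parity_overhead(17, 3):.1%}")
```

Put the other way round, 13 of every 15 drives (about 86.7%) held data under RAID6, and 17 of every 20 (85%) do in a Vault.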
What Does This Mean For Backblaze Cloud Backup Users?
Any Backblaze customer who is using Backblaze Online Backup 3.0 or higher is able to use the Backblaze Vaults. (Read the knowledge base article to check what version you’re running.) This will happen automatically; there is nothing to configure or change. Over time, Backblaze will migrate all customer data from the existing Storage Pod architecture to the Vault architecture.
Backblaze’s cloud storage Vaults deliver 99.999999% annual durability, horizontal scalability, and 20 Gbps of per-Vault performance, while being operationally efficient and extremely cost effective. Driven by the same mindset that we brought to the storage market with Backblaze Storage Pods, Backblaze Vaults continue our singular focus on building the most cost-efficient cloud storage around.
[4/5/2016 – Updated annual durability to 99.999999% to reflect current operations – Ed.]