Enterprise Drives: Fact or Fiction?

December 4th, 2013


Last month I dug into drive failure rates based on the 25,000+ consumer drives we have and found that consumer drives actually performed quite well. Over 100,000 people read that blog post and one of the most common questions asked was:

“Ok, so the consumer drives don’t fail that often. But aren’t enterprise drives so much more reliable that they would be worth the extra cost?”

Well, I decided to try to find out.

In the Beginning
As many of you know, when Backblaze first started the unlimited online backup service, our founders bootstrapped the company without funding. In this environment one of our first and most critical design decisions was to build our backup software on the premise of data redundancy. That design decision allowed us to use consumer drives instead of enterprise drives in our early Storage Pods as we used the software, not the hardware, to manage redundancy. Given that enterprise drives were often twice the cost of consumer drives, the choice of consumer drives was also a relief for our founders’ thin wallets.

There were warnings back then that using consumer drives would be dangerous, with people saying:

    “Consumer drives won’t survive in the hostile environment of the data center.”
    “Backblaze Storage Pods allow too much vibration – consumer drives won’t survive.”
    “Consumer drives will drop dead in a year. Or two years. Or …”

As we have seen, consumer drives didn’t die in droves, but what about enterprise ones?

Failure Rates
In my post last month on disk drive life expectancy, I went over what an annual failure rate means. It’s the average number of failures you can expect when you run one disk drive for a year. The computation is simple:

Annual Failure Rate = (Number of Drives that Failed / Number of Drive-Years)

Drive-years is a measure of how many drives have been running for how long. This computation is also simple:

Drive-Years = (Number of Drives x Number of Years)

For example, one drive for one year is one drive-year. Twelve drives for one month is also one drive-year.
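The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names `drive_years` and `annual_failure_rate` are made up for this example and are not taken from any Backblaze tooling.

```python
def drive_years(num_drives: int, years_each: float) -> float:
    """Drive-years logged by num_drives drives that each ran for years_each years."""
    return num_drives * years_each


def annual_failure_rate(failures: int, total_drive_years: float) -> float:
    """Failures per drive-year, as a fraction (multiply by 100 for a percentage)."""
    if total_drive_years <= 0:
        raise ValueError("total_drive_years must be positive")
    return failures / total_drive_years


# The two examples from the text: both amount to one drive-year
# (the second only up to floating-point rounding).
one_drive_one_year = drive_years(1, 1.0)
twelve_drives_one_month = drive_years(12, 1 / 12)
```

In practice drives are added and retired at different times, so the total is the sum of each drive's individual running time rather than a single multiplication.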

Backblaze Storage Pods: Consumer-Class Drives
We have detailed day-by-day data about the drives in the Backblaze Storage Pods since mid-April of 2013. With 25,000 drives ranging in age from brand-new to over 4 years old, that’s enough to slice the data in different ways and still get accurate failure rates. Next month, I’ll be going into some of those details, but for the comparison with enterprise drives, we’ll just look at the overall failure rates.

We have data that tracks every drive by serial number, which days it was running, and if/when it was replaced because it failed. We have logged:

    14719 drive-years on the consumer-grade drives in our Storage Pods.
    613 drives that failed and were replaced.

Commercially Available Servers: Enterprise-Class Drives
We store customer data on Backblaze Storage Pods which are purpose-built to store data very densely and cost-efficiently. However, we use commercially available servers for our central servers that store transactional data such as sales records and administrative activities. These servers provide the flexibility and throughput needed for such tasks. These commercially available servers come from Dell and from EMC.

All of these systems were delivered to us with enterprise-class hard drives. These drives were touted as solid long-lasting drives with extended warranties.

The specific systems we have are:

  • Six shelves of enterprise-class drives in Dell PowerVault storage systems.
  • One EMC storage system with 124 enterprise drives that we just brought up this summer. One of the drives has already failed and been replaced.
  • We have also been running one Backblaze Storage Pod full of enterprise drives storing users’ backed-up files as an experiment to see how they do. So far, their failure rate has been statistically consistent with drives in the commercial storage systems.

    In the two years since we started using these enterprise-grade storage systems, they have logged:

      368 drive-years on the enterprise-grade drives.
      17 drives that failed and were replaced.

    Enterprise vs. Consumer Drives
    At first glance, it seems the enterprise drives don’t have that many failures. While true, the failure rate of enterprise drives is actually higher than that of the consumer drives!

                              Enterprise Drives    Consumer Drives
    Drive-Years of Service    368                  14719
    Number of Failures        17                   613
    Annual Failure Rate       4.6%                 4.2%
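The rates in the table follow directly from the logged totals. A minimal check in Python, using only the numbers quoted in this post (the helper name `afr_percent` is invented for the example):

```python
def afr_percent(failures: int, drive_years: float) -> float:
    """Annual failure rate as a percentage: failures per drive-year times 100."""
    return 100.0 * failures / drive_years


# Totals quoted in this post.
consumer_afr = afr_percent(failures=613, drive_years=14719)
enterprise_afr = afr_percent(failures=17, drive_years=368)

print(f"Consumer:   {consumer_afr:.1f}%")    # prints "Consumer:   4.2%"
print(f"Enterprise: {enterprise_afr:.1f}%")  # prints "Enterprise: 4.6%"
```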

    It turns out that the consumer drive failure rate does go up after three years, but all three of the first three years are pretty good. We have no data on enterprise drives older than two years, so we don’t know if they will also have an increase in failure rate. It could be that the vaunted reliability of enterprise drives kicks in after two years, but because we haven’t seen any of that reliability in the first two years, I’m skeptical.

    You might object to these numbers because the usage of the drives is different. The enterprise drives are used heavily. The consumer drives are in continual use storing users’ updated files and they are up and running all the time, but the usage is lighter. On the other hand, the enterprise drives we have are coddled in well-ventilated low-vibration enclosures, while the consumer drives are in Backblaze Storage Pods, which do have a fair amount of vibration. In fact, the most recent design change to the pod was to reduce vibration.

    Overall, I argue that the enterprise drives we have are treated as well as the consumer drives. And the enterprise drives are failing more.

    So, Are Enterprise Drives Worth The Cost?
    From a pure reliability perspective, the data we have says the answer is clear: No.

    Enterprise drives do have one advantage: longer warranties. That’s a benefit only if the higher price you pay for the longer warranty is less than what you expect to spend on replacing the drive.
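To make that comparison concrete, here is a rough sketch. Every input except the 4.2% consumer failure rate is a hypothetical placeholder: the prices are invented (chosen so the enterprise drive costs twice as much), the two extra warranty years are assumed, and the failure rate is treated as constant even though our data shows it rising after three years.

```python
def expected_replacement_cost(afr: float, drive_price: float, years: float) -> float:
    """Expected out-of-pocket replacement spend for one drive over `years`,
    assuming a constant annual failure rate (a simplification)."""
    return afr * years * drive_price


consumer_afr = 0.042          # from the failure-rate table in this post
consumer_price = 100.0        # hypothetical price
enterprise_price = 200.0      # hypothetical price, twice the consumer price
extra_warranty_years = 2      # hypothetical extra warranty coverage

# What the longer warranty costs versus what it is expected to save.
premium = enterprise_price - consumer_price
expected_cost = expected_replacement_cost(
    consumer_afr, consumer_price, extra_warranty_years
)

warranty_pays_off = premium < expected_cost
print(f"Premium paid: {premium:.2f}, expected replacement cost: {expected_cost:.2f}")
print("Warranty pays off" if warranty_pays_off else "Warranty does not pay off")
```

With these placeholder numbers the premium is far larger than the expected replacement spend. Plug in your own prices and warranty terms before drawing a conclusion.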

    This leads to an obvious conclusion: If you’re OK with buying the replacements yourself after the warranty is up, then buy the cheaper consumer drives.

    Brian Beach

    Brian has been writing software for three decades at HP Labs, Silicon Graphics, Netscape, TiVo, and now Backblaze. His passion is building things that make life better, like the TiVo DVR and Backblaze Online Backup.
    • Garrett D’Amore

      So, there is another serious concern. That has to do with firmware handling of errors. Many Consumer grade firmwares just keep retrying a failed I/O indefinitely. If this drive is in a laptop, that’s probably the best choice.

      But if you’re working in a big enterprise system with arrays and multiple levels of redundancy, this is actually tragically bad. You’d far rather just have the failed I/O fail hard and fast, so that all that redundancy you built can go off and deal with it — e.g. ZFS can do self healing.

      So, don’t choose based on *quality*, but *do* choose based upon *application*. If your application is a single drive system, or you don’t care if your I/Os pend forever (and can tolerate long hits to latency waiting for upper layers of the stack to time out), then by all means go for consumer grade drives. But if you’re in a datacenter, and the failure mode handling is important to you (minimum impact to your operations when the drive fails, and you have mirroring, etc.), then go for the enterprise drives.

      It’s a shame that vendors differentiate on price, since I think the internals are largely the same, but that’s market economics for you.

    • lbr

      Please provide example of HDD which “keep retrying a failed I/O indefinitely”.

      The problem is, that not all controllers(RAID/SATA/IDE) will wait long enough for HDD after HDD failed in some way.
      Meaning that if UNC occurs some controllers will “drop” HDD as dead. Some will wait. Some have configurable time-out.
      However, in any case UNC in some(most?) cases means that you need to replace HDD.
      So.. u have either degraded RAID or alive RAID with S.M.A.R.T.(195) on one of the HDDs increased.
      If that was really unrecoverable block, then in case of RAID0 u have a problem either way – data is lost. No “self healing” can possibly occur at this stage. In case if it was not damaged.. maybe corrected with ECC(?) – then yes, maybe offline scan will mark it(sector) as “good”.

      So, long story short, you are talking not on “enterprise vs consumer” topic, but on general compatibility.

      Not to mention the fact that ” If this drive is in a laptop, that’s probably the best choice” – have u ever seen pc reading a CD with a bad block? Laptop with indefinite retry will behave similarly.. no good at all. And “You’d far rather just have the failed I/O fail hard and fast”… right… imagine ECC worked that way.. drives would fail all the time(point is – statement logic does not work or statement is not specific enough).

      I smell W.D. and their RE1 ad ; )

    • Julian

      I’ve come to the conclusion that hard drive warranties are basically worthless, because they don’t send you a new replacement, they send you a refurb. Are you going to trust refurb drives in your storage pods?