Enterprise Drives: Fact or Fiction?

By Brian Beach | December 4th, 2013


Last month I dug into drive failure rates based on the 25,000+ consumer drives we have and found that consumer drives actually performed quite well. Over 100,000 people read that blog post and one of the most common questions asked was:

“Ok, so the consumer drives don’t fail that often. But aren’t enterprise drives so much more reliable that they would be worth the extra cost?”

Well, I decided to try to find out.

In the Beginning
As many of you know, when Backblaze first started the unlimited online backup service, our founders bootstrapped the company without funding. In this environment one of our first and most critical design decisions was to build our backup software on the premise of data redundancy. That design decision allowed us to use consumer drives instead of enterprise drives in our early Storage Pods as we used the software, not the hardware, to manage redundancy. Given that enterprise drives were often twice the cost of consumer drives, the choice of consumer drives was also a relief for our founders’ thin wallets.

There were warnings back then that using consumer drives would be dangerous, with people saying:

    “Consumer drives won’t survive in the hostile environment of the data center.”
    “Backblaze Storage Pods allow too much vibration – consumer drives won’t survive.”
    “Consumer drives will drop dead in a year. Or two years. Or …”

As we have seen, consumer drives didn’t die in droves, but what about enterprise ones?

Failure Rates
In my post last month on disk drive life expectancy, I went over what an annual failure rate means. It’s the average number of failures you can expect when you run one disk drive for a year. The computation is simple:

Annual Failure Rate = (Number of Drives that Failed / Number of Drive-Years)

Drive-years are a measure of how many drives have been running for how long. This computation is also simple:

Drive-Years = (Number of Drives x Number of Years)

For example, one drive for one year is one drive-year. Twelve drives for one month is also one drive-year.
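Both formulas are easy to check in a few lines. This is a throwaway sketch; the failure numbers in it are made up for illustration, not fleet data:

```python
def drive_years(num_drives, years_each):
    """Drive-Years = Number of Drives x Number of Years."""
    return num_drives * years_each

def annual_failure_rate(failures, dy):
    """Annual Failure Rate = Number of Drives that Failed / Number of Drive-Years."""
    return failures / dy

# One drive for one year, and twelve drives for one month,
# each come out to one drive-year.
print(drive_years(1, 1))                    # 1
print(round(drive_years(12, 1 / 12), 10))   # 1.0

# Hypothetical example: 42 failures over 1,000 drive-years is a 4.2% AFR.
print(f"{annual_failure_rate(42, 1000):.1%}")  # 4.2%
```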

Backblaze Storage Pods: Consumer-Class Drives
We have detailed day-by-day data about the drives in the Backblaze Storage Pods since mid-April of 2013. With 25,000 drives ranging in age from brand-new to over 4 years old, that’s enough data to slice in different ways and still get accurate failure rates. Next month, I’ll be going into some of those details, but for the comparison with enterprise drives, we’ll just look at the overall failure rates.

We have data that tracks every drive by serial number, which days it was running, and if/when it was replaced because it failed. We have logged:

    14719 drive-years on the consumer-grade drives in our Storage Pods.
    613 drives that failed and were replaced.

Commercially Available Servers: Enterprise-Class Drives
We store customer data on Backblaze Storage Pods which are purpose-built to store data very densely and cost-efficiently. However, we use commercially available servers for our central servers that store transactional data such as sales records and administrative activities. These servers provide the flexibility and throughput needed for such tasks. These commercially available servers come from Dell and from EMC.

All of these systems were delivered to us with enterprise-class hard drives. These drives were touted as solid long-lasting drives with extended warranties.

The specific systems we have are:

  • Six shelves of enterprise-class drives in Dell PowerVault storage systems.
  • One EMC storage system with 124 enterprise drives that we just brought up this summer. One of the drives has already failed and been replaced.
  • We have also been running one Backblaze Storage Pod full of enterprise drives storing users’ backed-up files as an experiment to see how they do. So far, their failure rate has been statistically consistent with drives in the commercial storage systems.

In the two years since we started using these enterprise-grade storage systems, they have logged:

    368 drive-years on the enterprise-grade drives.
    17 drives that failed and were replaced.

Enterprise vs. Consumer Drives
At first glance, it seems the enterprise drives don’t have that many failures. While that’s true in absolute numbers, the failure rate of enterprise drives is actually higher than that of the consumer drives!

                         Enterprise Drives   Consumer Drives
Drive-Years of Service                 368             14719
Number of Failures                      17               613
Annual Failure Rate                   4.6%              4.2%
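The rates in this table follow directly from the two counts, so they are easy to sanity-check:

```python
def annual_failure_rate(failures, drive_years):
    """Annual Failure Rate = failures / drive-years."""
    return failures / drive_years

enterprise = annual_failure_rate(17, 368)
consumer = annual_failure_rate(613, 14719)

print(f"Enterprise: {enterprise:.1%}")  # Enterprise: 4.6%
print(f"Consumer:   {consumer:.1%}")    # Consumer:   4.2%
```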

It turns out that the consumer drive failure rate does go up after three years, but each of the first three years is pretty good. We have no data on enterprise drives older than two years, so we don’t know if they will also have an increase in failure rate. It could be that the vaunted reliability of enterprise drives kicks in after two years, but because we haven’t seen any of that reliability in the first two years, I’m skeptical.

You might object to these numbers because the usage of the drives is different. The enterprise drives are used heavily. The consumer drives are in continual use storing users’ updated files, and they are up and running all the time, but the usage is lighter. On the other hand, the enterprise drives we have are coddled in well-ventilated low-vibration enclosures, while the consumer drives are in Backblaze Storage Pods, which do have a fair amount of vibration. In fact, the most recent design change to the pod was to reduce vibration.

Overall, I argue that the enterprise drives we have are treated as well as the consumer drives. And the enterprise drives are failing more.

So, Are Enterprise Drives Worth The Cost?
From a pure reliability perspective, the data we have says the answer is clear: No.

Enterprise drives do have one advantage: longer warranties. That’s a benefit only if the higher price you pay for the longer warranty is less than what you expect to spend on replacing the drive.

This leads to an obvious conclusion: If you’re OK with buying the replacements yourself after the warranty is up, then buy the cheaper consumer drives.

Brian Beach

Brian has been writing software for three decades at HP Labs, Silicon Graphics, Netscape, TiVo, and now Backblaze. His passion is building things that make life better, like the TiVo DVR and Backblaze Online Backup.
Category: Backblaze Bits · TechBytes
    • Hexaglow

      Are there physical properties within each pod that are micro-monitored?
      There could potentially be very valuable data here which would help discover factors affecting drive health. Ideally each individual drive should have multiple sensors around it for vibration and heat.

      There could be different amounts of vibration in different areas of the pod or for individual hard drives. Drives with higher vibration are likely to fail earlier, but does a drive with higher vibration affect the drives immediately around it? If so, having a reasonably long time record of individual drive vibration would provide detailed statistics of where in the pod drives are failing. This could even enable you to calculate an acceptable amount of vibration and introduce an upper limit, over which you could reject new drives whose vibration levels may be affecting the life span of other drives. This could save you money in the long term.

      A thermal sensor for individual drives may enable you to map out hot spots in your pods’ airflow and correct them.

      Since the pod idea is a new one I’m guessing there are design characteristics to be discovered and improved on so there probably should be individual drive monitoring.

    • Timo Witte

      IMHO the difference between “Enterprise” and Consumer drives is just the firmware in many cases.
      Enterprise RAID-oriented drives don’t do as many retries as consumer ones. Furthermore they don’t spin down to save power, and so on.
      If you look at the replacement drives from WD, for example, they come with a white sticker which looks custom printed. So I guess the hardware of the drive is the same; if they need to send a replacement, they just flash the appropriate firmware, print out the label, and send it to you!
      If you look in tools like PC-3000 you typically select drives by hardware “platform”; I guess this is a pretty good overview of which drives are essentially identical hardware-wise.

      • IanMak

        I think you are right in the case of Western Digital Blacks vs Western Digital RE drives. I owned both.

        However Seagate Enterprise/Constellation drives are NOT Barracudas. Oh I remember those Barracudas. I claimed warranty on the 7200.8 like 5 times within 1 year. Sure they warranty it and send you infinite refurbished ones but who wants to deal with 5 drive replacements in 1 year? I just threw it out and bought myself a set of WD RE2. My friends had problems with Barracuda 7200.11 and 7200.12. Barracudas were so bad they discontinued the name. Their enterprise drives have a completely different enclosure and circuit board design.

        I’m currently running 4x Seagate Enterprise drives in RAID 0 for max performance. Pretty neat drives. They are fast and don’t fail like other Seagate drives.

    • Question: When did the drives fail? For instance, it might be that a large portion of drive failures occur in the first week or month, then drop off significantly after that.

    • Andrew

      I don’t see a mention of heat. Enterprise drives are expected to be deployed in tight stacks. Consumer-class drives expect a lot of air around them. I had a desktop with 5 disks and it was very hard to stop the disks from overheating. Even a little dust was enough to start causing problems.

      • ZeDestructor

        In actual practice, the consumer drives end up in a shitty, passively-cooled plastic chassis, in all manner of fine hot and humid tropical weather while the enterprise drives get cuddled in chassis with lots of airflow from very loud fans in air-conditioned, low-humidity rooms…

    • tymwltl

      I don’t understand much of anything you folks are talking about, but it’s for sure a lot of fun to read it all and pretend I someday might. Thanks.

    • Logan

      I am glad this review has confirmed my overall theory about the hdd at consumer level vs the enterprise level. Thank you

    • SteveC

      Run a RAID configuration with a BackBlaze.com account as a secondary off-site backup and voilà. You can buy the cheapest drives you can find and never ever worry. This is what I do!

    • ColoradoMatt

      A few thoughts on your experimental methods here.

      Your concept of drive-year is flawed in an experimental sense. For example, if I were to buy 50 Chevys and drive them for 1 year and then buy 1 Ford and drive it for 50 years, are those equivalent machine reliability tests? The obvious answer is “no”. No one expects a vehicle to continuously drive for 50 years, but it is a reasonable expectation for 50 brand new cars to still be driving one year from today. In fact, if one of the 50 Chevys were to die, we would consider that just as much of a fluke as if the Ford were still driving 50 years from now (regardless of your auto loyalties!)

      In your experiment here, you have a maturation threat because your subjects will change over time. In fact, that’s the very thing you are trying to measure: since most hard drive failures are physical failures, you are looking for physical changes over time. If you collapse twelve 30-day periods into the same unit as one 365-day period, you are comparing apples with Fords… a different brand AND a different unit of measure.

      A better design would be to track individual drives and hours of operation. Since you have serial numbers for each you should easily be able to track how long each has been in service. I would have to think this through a bit more but it seems like you could do a logistic regression using time of service as the independent variable and live / die as the dependent variable. Seems like you would need to use group as an interaction variable to compare them… like I said, would have to give that some thought. In any case though, that would give you a statistically valid measure of the impact of hours of service on failure rates, which your current number does not. At any given time, regardless of when you started using each drive, you could run the numbers and do a comparison.
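A minimal sketch of the bucketed alternative this comment describes, using made-up per-drive records rather than Backblaze data: accumulate drive-years of exposure and failures per one-year age bucket instead of pooling everything into one number.

```python
from collections import defaultdict

# Hypothetical records: (age in years when the drive failed or was last
# observed, whether it failed). Real input would come from the
# per-serial-number logs described in the post.
records = [
    (0.5, False), (1.2, True), (2.8, False), (3.5, True),
    (0.9, False), (2.1, False), (1.7, True), (3.9, False),
]

def afr_by_age_bucket(records, bucket_years=1.0):
    """Failure rate per age bucket, so 'year 1' drives and 'year 4'
    drives are no longer collapsed into one drive-year pool.
    Assumes ages don't land exactly on bucket boundaries."""
    dy = defaultdict(float)   # drive-years of exposure per bucket
    fails = defaultdict(int)  # failures per bucket
    for age, failed in records:
        remaining, bucket = age, 0
        while remaining > 0:  # spread service time across buckets
            step = min(bucket_years, remaining)
            dy[bucket] += step
            remaining -= step
            bucket += 1
        if failed:
            fails[int(age // bucket_years)] += 1
    return {b: fails[b] / dy[b] for b in sorted(dy)}

print(afr_by_age_bucket(records))
```

With these toy numbers, the year-0 bucket shows no failures while later buckets show rising rates, which is exactly the age structure a single pooled drive-year figure hides.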

      In addition, it appears you include replaced drives in your “drive-year” calculations which further pollutes your numbers by introducing a history threat. This is because not all of your drives experience the same events (i.e. temperature fluctuations etc.). In the above proposed analysis, not including replaced drives would introduce a mortality threat (i.e. a bad batch of drives might seriously skew your numbers over time) so I would stick with the history threat since you have controlled environments. If you use the proposed analysis, you could potentially uncover experience changes… whereas your “drive-year” calculation simply masks them.

      As with any experimental design, you can’t remove all threats to validity, that’s fine, but you have to choose a method and identify the threats and explain why you are OK with them. You do that above when you point out that your drives are used differently (this is selection threat… group membership matters.) Nothing you can do about that one.

      I understand that from a business perspective, you are just crunching numbers and hey… it looks like these cheap drives last longer. But understand that you don’t know whether that 4.6% is statistically significantly different than that 4.2% in the first place… and if it is, you don’t know why. It may well be that a significant portion of your consumer drives are still in their infancy while most of your enterprise drives are chugging away on “year 39 of driving”… but your methodology has lost all of that information by inappropriately collapsing it into a single metric.

      Just some thoughts that are hopefully helpful.

      • Kevin Samuel Coleman

        The Margin of Error is dramatically higher on the enterprise drive dataset compared to the consumer drive, so this is a bad comparison, agreed.
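The margin-of-error point can be made concrete. A rough sketch that treats each dataset's failure count as Poisson (standard error = sqrt(count)); the interval formula here is an assumption for illustration, not something from the post:

```python
import math

def afr_with_interval(failures, drive_years, z=1.96):
    """AFR plus an approximate 95% interval, treating the failure
    count as Poisson (standard error = sqrt(count))."""
    rate = failures / drive_years
    half = z * math.sqrt(failures) / drive_years
    return rate, rate - half, rate + half

ent = afr_with_interval(17, 368)     # wide interval: few drive-years
con = afr_with_interval(613, 14719)  # narrow interval: many drive-years

print(f"Enterprise: {ent[0]:.1%} ({ent[1]:.1%} to {ent[2]:.1%})")
print(f"Consumer:   {con[0]:.1%} ({con[1]:.1%} to {con[2]:.1%})")
# The consumer rate falls inside the enterprise interval, so the
# 4.6% vs 4.2% gap is not clearly significant.
```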

      • Mekronid

        If they wanted to keep it simple, they could just tack the reliability metrics to I/O ops. Probably still not very accurate but surely a more useful metric than drive-years. Reading this article only left me with a deep impression of hand-waving rather than having said anything useful.

    • facepalmfrank

      Actually premium desktop drives have longer warranties than most server solutions.

    • Tom from North Carolina

      I have HP Proliant servers DL360 Gen8 and Dell PowerEdge R610 servers. All of them use consumer grade SSD drives and I have been using them since January, 2011. The Dell servers are a more recent purchase and quite frankly, are very disappointing. The PERC 6/i controller rejects regular hard disks unless they are Dell certified and performs awfully with SSDs. I recognize that Gen8 HP servers are newer than the X5650’s in the Dell, but as you will see from my tests below, even older HP servers using older SSDs, outperform the Dell RAID controller.

      Here are the stats:
      Dell R610 running four Crucial MX100 512GB disks in a RAID 5 averages 181 MB/sec.
      HP Proliant DL360 (Gen8), running four Crucial MX100 512GB in RAID 5 averages 1045 MB/sec.

      That’s right, the HP RAID controller is running more than 5 times faster with the same brand and model of SSD. Now, here’s the killer that really shows how poorly the PERC 6/i controller performs when using SSDs.

      My older HP Proliant servers using the same processor as the Dells, with older Intel 160GB X25-M SSDs which are slower (SATA II), still kicks the a$$ on the Dell Perc controllers. This configuration using X5650 processors and SATA II Intel SSDs, delivers an average of 447 MB/sec of throughput. *Note: Drive performance was calculated using HD Tune 2.55.

      I was so disappointed in the performance of the Dell Perc 6/i that I ordered a new controller from LSI designed to work with SSDs.

      • ZeDestructor

        You do know RAID controllers also evolve with time.

        Oh, btw, PERC controllers have been LSI-based since the PERC6 (LSI1078 chipset).

        Here’s a much more complete listing of what chipset is on which controller: https://forums.servethehome.com/index.php?threads/lsi-raid-controller-and-hba-complete-listing-plus-oem-models.599/

        • D. Johnson

          “my 2014 SAS2 raid controller outperforms my 2007 SAS1 raid controller with half or a quarter of the ram when connected to storage devices that didn’t even exist when the 2007 controller came out”

          film at 11

      • Timo Witte

        Just use cheap HBAs without battery backup and RAID functionality and do a software RAID! That way you can easily recover from dead controllers, as you don’t have to buy the same one again / don’t waste money on expensive RAID controllers.

        • ZeDestructor

          That’s basically what backblaze and the ZFS, distributed FS (Ceph, Lustre) and SAN cabals do for the most part. It’s just so much nicer to deal with.

    • John Robert

      Thanks for your post… very good information.

    • Stratila Dimitrie

      It looks like Backblaze personnel limit themselves to the “enterprise” name rather than specifying a list of all tested enterprise-class hard drives! For example: WD has its Re, Xe, Se and Ae series of server HDDs, Seagate has its Constellation, Cheetah, and other series, HGST has its Ultrastar and Megascale series, and I just don’t know what series of enterprise-class HDDs Toshiba has.

    • Julian

      I’ve come to the conclusion that hard drive warranties are basically worthless, because they don’t send you a new replacement, they send you a refurb. Are you going to trust refurb drives in your storage pods?

      • M4ssacre

        Actually it doesn’t really matter, because they provide you drives until the warranty is up. And that is what you pay the extra amount for.

        • Julian

          Yeah, but why would you use a refurbished drive?

          • SteveC

            If you are using it in a RAID configuration as these servers are, who really cares? Also anyone that is concerned with data should be running RAID and/or have an off-site backup plan like Backblaze. I have both. For local failures I swap a drive out with another one and fire it back up, and it continues where it left off. Where there is more catastrophic failure from lightning etc., I have my Backblaze backup. I’m happy to say that since I have been storing stuff in this manner I have had multiple failures and not lost one single piece of data. I go with the cheapest drives I can find because it doesn’t matter with this strategy!

          • Mudder Fukker

            In my experience, Refurbs are more reliable than new drives. Of course, my ‘data set’ is prolly less than 1,000 drives (20 years as tech). My guess is that they have had the one or two things that went wrong replaced, and had a good second testing. I have some refurbs around here going on 8+ years.

    • lbr

      Please provide an example of an HDD which will “keep retrying a failed I/O indefinitely”.

      The problem is that not all controllers (RAID/SATA/IDE) will wait long enough after an HDD has failed in some way.
      Meaning that if a UNC occurs, some controllers will “drop” the HDD as dead. Some will wait. Some have a configurable time-out.
      However, in any case a UNC in some (most?) cases means that you need to replace the HDD.
      So.. u have either a degraded RAID or an alive RAID with S.M.A.R.T. (195) increased on one of the HDDs.
      If that was really an unrecoverable block, then in the case of RAID0 u have a problem either way – data is lost. No “self healing” can possibly occur at this stage. If it was not damaged.. maybe corrected with ECC(?) – then yes, maybe an offline scan will mark the sector as “good”.

      So, long story short, you are talking not about “enterprise vs consumer” but about general compatibility.

      Not to mention the statement “If this drive is in a laptop, that’s probably the best choice” – have u ever seen a PC reading a CD with a bad block? A laptop with indefinite retry will behave similarly.. no good at all. And “You’d far rather just have the failed I/O fail hard and fast”… right… imagine if ECC worked that way.. drives would fail all the time (point is, the statement’s logic does not work, or the statement is not specific enough).

      I smell W.D. and their RE1 ad ; )

      • Alec Weder

        The biggest issue with RAID is unrecoverable read errors.

        If you lose a drive, the RAID has to read 100% of the remaining drives, even if there is no data on portions of those drives. If you get an error on rebuild, the entire array will die.

        A UER on SATA of 1 in 10^14 bits read means a read failure every 12.5 terabytes. A 500 GB drive has 0.04E14 bits, so in the worst case rebuilding that drive in a five-drive RAID-5 group means transferring 0.20E14 bits. This means there is a 20% probability of an unrecoverable error during the rebuild. Enterprise-class disks are less prone to this problem.
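The 20% figure is the linear approximation (bits transferred × UER); the exact probability of at least one unrecoverable bit is a little lower. A quick sketch, counting only the reads of the four surviving drives (an assumption; the comment's 0.20E14 counts all five drives' worth of transfer):

```python
import math

def rebuild_uer_probability(drive_bytes, surviving_drives, uer=1e-14):
    """Chance of at least one unrecoverable read error while reading
    every bit of the surviving drives during a RAID-5 rebuild."""
    bits_read = drive_bytes * 8 * surviving_drives
    # 1 - (1 - uer)**bits_read, computed stably for tiny uer
    return -math.expm1(bits_read * math.log1p(-uer))

# Four surviving 500 GB drives: 0.16E14 bits read.
p = rebuild_uer_probability(500e9, 4)
print(f"{p:.1%}")  # roughly 15%; the linear estimate slightly overstates it
```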


        • lbr

          What u said is absolutely true. However it is somewhat inapplicable to modern HDDs – capacities have increased much more than durability. As far as I understand, that’s the main reason for e.g. Dell not recommending RAID5 at all.

          Also, RAID5 dying on rebuild (UNC while reading 100% of the data) does not necessarily mean losing all of the data.

      • D. Johnson

        Please provide an example of a RAID controller that “if UNC occurs [it] will drop HDD as dead” when configured as a RAID0 array [hint: this does not happen].

        An enterprise raid controller will however drop a malfunctioning drive in a redundant array earlier rather than later because the design is not primarily to prevent drive replacement, but to ensure data integrity, even above cost. Though they may not fit as nicely into your prosumer usage model, they are deliberate design choices based on product usage.

        • lbr

          ‘some controllers will “drop”’
          I think I’ve seen it on an Intel ICH(7?) – the controller declaring an HDD dead on boot if a UNC had occurred on it.
          [hint: “this should not happen” does not necessarily mean that it won’t happen]

          Also imo it would happen on early WD GP drives, which had issues with certain RAID controllers not waiting for them to wake up. It deffo happened for me in RAID5/10 configurations.

          “An enterprise raid controller will however drop a malfunctioning drive in a redundant array earlier rather than later..”
          d2607 (at least my two still in production env.) based on LSI SAS2008 won’t drop anything on a UNC (malfunctioning) event.

          Anyway, my point was that compatibility (and/or configuration, or lack of it) issues and consumer-vs-enterprise drive issues are not the same thing.

    • Garrett D’Amore

      So, there is another serious concern: firmware handling of errors. Many consumer-grade firmwares just keep retrying a failed I/O indefinitely. If the drive is in a laptop, that’s probably the best choice.

      But if you’re working in a big enterprise system with arrays and multiple levels of redundancy, this is actually tragically bad. You’d far rather just have the failed I/O fail hard and fast, so that all that redundancy you built can go off and deal with it — e.g. ZFS can do self healing.

      So, don’t choose based on *quality*, but *do* choose based upon *application*. If your application is a single-drive system, or you don’t care if your I/Os pend forever (and can tolerate long hits to latency waiting for upper layers of the stack to time out), then by all means go for consumer-grade drives. But if you’re in a datacenter, and the failure-mode handling is important to you (minimum impact to your operations when a drive fails, and you have mirroring, etc.), then go for the enterprise drives.

      It’s a shame that vendors differentiate on price, since I think the internals are largely the same, but that’s market economics for you.