Reliability Data Set For 41,000 Hard Drives Now Open Source

By Brian Beach | February 4th, 2015


Stats Geeks: Now it’s Your Turn.

Backblaze Online Backup has released the raw data collected from the more than 41,000 disk drives in our data center. To the best of our knowledge this is the largest data set on disk drive performance ever to be made available publicly.

Over the past 16 months, I have been posting information about hard drive reliability based on the raw data that we collect in the Backblaze data center. I have been crunching those numbers to correlate drive failures with drive model numbers, SMART statistics, and other variables.

There are lots of smart people out there who like working with data, and you may be one of them. Now it’s your turn to pore over the data and find hidden treasures of insight. All we ask is that if you find something interesting, you post it publicly for the benefit of the computing community as a whole.

What’s In The Data?

The data that we have released is in two files, one containing the 2013 data and one containing the 2014 data. We’ll add data for 2015 and so on in a similar fashion.

Every day, the software that runs the Backblaze data center takes a snapshot of the state of every drive in the data center, including the drive’s serial number, model number, and all of its SMART data. The SMART data includes the number of hours the drive has been running, the temperature of the drive, whether sectors have gone bad, and many more things. (I did a blog post correlating SMART data with drive failures a few months ago.)

Each day, all of the drive “snapshots” are processed and written to a new daily stats file. Each daily stats file has one row for every drive operational in the data center that day. For example, there are 365 daily stats files in the 2014 data package with each file containing a “snapshot” for each drive operational on any given day.

What Does It Look Like?

Each daily stats file is in CSV (comma-separated value) format. The first line lists the names of the columns, and then each following line has all of the values for those columns. Here are the columns:

    Date – The date of the file in yyyy-mm-dd format.
    Serial Number – The manufacturer-assigned serial number of the drive.
    Model – The manufacturer-assigned model number of the drive.
    Capacity – The drive capacity in bytes.
    Failure – Contains a “0” if the drive is OK. Contains a “1” if this is the last day the drive was operational before failing.
    SMART Stats – 80 columns of data containing the raw and normalized values for 40 different SMART stats as reported by the given drive. Each value is the number reported by the drive.

The Wikipedia page on SMART (https://en.wikipedia.org/wiki/S.M.A.R.T.) has a good description of all of the attributes and of what the raw and normalized values mean. The short version is that the raw value is the data directly from the drive. For example, the Power On Hours attribute reports the number of hours in its raw value. The normalized value is designed to tell you how healthy the drive is. It starts at 100 and goes down toward 0 as the drive gets sick. (Some drives count down from 200.)
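
If you want to poke at a file right away, here is a minimal Python sketch for reading one daily stats file. It assumes the header row uses names like date, serial_number, model, failure, and smart_9_raw; those exact names are my assumption, not something stated above, so check the header of the files you actually download.

    import csv

    def read_daily_stats(path):
        """Yield one dict per drive operational on that day."""
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    # Print the drives that failed on 2014-12-28 (column names are assumed).
    for row in read_daily_stats("2014-12-28.csv"):
        if row["failure"] == "1":
            print(row["date"], row["serial_number"], row["model"], row["smart_9_raw"])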

How To Compute Failure Rates

One of my statistics professors once said, “it’s all about counting.” And that’s certainly true in this case.

A failure rate says what fraction of drives have failed over a given time span. Let’s start by calculating a daily failure rate, which will tell us what fraction of drives fail each day. We’ll start by counting “drive days” and “failures”.

To count drive days, we’ll take a look every day and see how many drives are running. Here’s a week in the life of a (small) data center:

[Figure: blue dots showing the drives running on each day of the example week]

Each of the blue dots represents a drive running on a given day. On Sunday and Monday, there are 15 drives running. Then one goes away, and from Tuesday through Saturday there are 14 drives each day. Adding them up, we get 15 + 15 + 14 + 14 + 14 + 14 + 14 = 100. That’s 100 drive days.

Now, let’s look at drive failures. One drive failed on Monday and was not replaced. Then one died on Wednesday and was promptly replaced. The red dots indicate the drive failures:

[Figure: the same week, with the two drive failures marked as red dots]

So we have 2 drive failures in 100 drive days of operation. To get the daily failure rate, you simply divide. 2 divided by 100 is 0.02, or 2%. The daily failure rate is 2%.

The annual failure rate is the daily failure rate multiplied by 365. If we had a full year made of weeks like the one above, the annual failure rate would be 730%.

Annual failure rates can be higher than 100%. Let’s think this through. Say we keep 100 drives running in our data center at all times, replacing drives immediately when they fail. At a daily failure rate of 2%, that means 2 drives fail each day, and after a year 730 drives will have died. We can have an annual failure rate above 100% if drives last less than a year on average.
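
In code, the example week's arithmetic is just this (a quick sanity-check sketch, not anything from Backblaze's own tooling):

    # Example week: 15 drives on Sunday and Monday, 14 for the rest of the week.
    drive_days = 15 + 15 + 14 + 14 + 14 + 14 + 14   # = 100 drive days
    failures = 2

    daily_failure_rate = failures / drive_days        # 0.02, i.e. 2% per day
    annual_failure_rate = daily_failure_rate * 365    # 7.3, i.e. 730% per year
    print(f"daily: {daily_failure_rate:.0%}, annual: {annual_failure_rate:.0%}")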

Computing failure rates from the data that Backblaze has released is a matter of counting drive days and counting failures. Each row in each daily drive stats file is one drive day. Each failure is marked with a “1” in the failure column. Once a drive has failed, it is removed from subsequent daily drive stats files.

To get the daily failure rate of drives in the Backblaze data center, you can take the number of failures counted in a given group of daily stats files, and divide by the number of rows in the same group of daily stats files. That’s it!
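
Here is a minimal Python sketch of that counting, run over a directory of daily stats files. It assumes the files are named yyyy-mm-dd.csv and that the failure column is labeled "failure" in the header; both are assumptions you should verify against the downloaded data.

    import csv
    import glob

    drive_days = 0
    failures = 0

    # Every row in every daily stats file is one drive day; a "1" in the
    # failure column marks that drive's last operational day.
    for path in sorted(glob.glob("2014-*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                drive_days += 1
                if row["failure"] == "1":
                    failures += 1

    daily_rate = failures / drive_days
    print(f"{drive_days} drive days, {failures} failures")
    print(f"daily failure rate: {daily_rate:.4%}, annualized: {daily_rate * 365:.2%}")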

Where is the Data?

You’ll find links to download the data files at https://www.backblaze.com/hard-drive-test-data.html. You’ll also find instructions on how to create your own sqlite database for the data, and other information related to the files you can download.
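
The page above has the official instructions; as a rough sketch of the idea, loading the non-SMART columns into a SQLite table of your own design might look like this (the table name, schema, and column names here are my own choices, not the documented Backblaze schema):

    import csv
    import glob
    import sqlite3

    conn = sqlite3.connect("drive_stats.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS drive_stats (
                        date TEXT, serial_number TEXT, model TEXT,
                        capacity_bytes INTEGER, failure INTEGER)""")

    for path in sorted(glob.glob("2014-*.csv")):
        with open(path, newline="") as f:
            rows = csv.DictReader(f)
            # Keep only the first five columns; names are assumed and should
            # be checked against the actual header row.
            conn.executemany(
                "INSERT INTO drive_stats VALUES (?, ?, ?, ?, ?)",
                ((r["date"], r["serial_number"], r["model"],
                  r["capacity_bytes"], r["failure"]) for r in rows))

    conn.commit()
    conn.close()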

Let Us Know What You Find

That’s about all you need to know about the data to get started. If you work with the data and find something interesting, let us know!

Brian Beach

Brian has been writing software for three decades at HP Labs, Silicon Graphics, Netscape, TiVo, and now Backblaze. His passion is building things that make life better, like the TiVo DVR and Backblaze Online Backup.
Category:  Cloud Storage
  • Scott

    Not sure if there is a better venue to ask a question about the nature of the data, but I’m looking at the 2014 dataset and I noticed that some serial numbers show up at the end of the first day in service as having accumulated a large number of drive hours (using smart_9_raw variable – I’m assuming this is the number of hours the drive was operational and in service but not certain as I see POH on the Wikipedia page and I can’t confirm if these are referring to the same column). I don’t understand how this could happen if a particular serial number showing up for the first time in the dataset implies that it was the first day the device was put into service. Repaired device from the previous year perhaps?

    A second question – What is going on with devices that have been in service for a number of days and then all of a sudden disappear from the dataset without a failed indicator ever being tripped? Are some devices being removed from service before they fail? If so is this random or are they being removed as somebody notices they are about to fail? The latter could greatly bias any reliability analysis.

    Example:
    Serial number Z3015LM8 encompasses both of these situations. It shows up on Feb 19, 2014 as having a smart_9_raw value of 583 (again, I’m assuming this reflects the number of hours the drive was operational and in service), and can be tracked all the way through the last day it appears in the data on Sep 17, 2014 where the variable failure has a value of 0 (i.e. it didn’t fail). How does this device accumulate 583 hours of service at the end of the first day it’s turned on? Also, why is the device no longer in the dataset if it didn’t fail?

    Thanks!

  • MontyW

    Brian, what model of HDD would you yourself buy for back-up purposes? ($64,000 question!)

  • m4dsk

    Brian, could you please tell us whether most disks labeled as 1 are actual failures or proactive replacements done by the admins? Would there be any chance to distinguish between these 2?

  • Michael T

    Has anyone offered some guidelines (and/or the scripts) to collect this data ourselves?
    Sure I can write my own flavor of it but I’d really rather make my data comparable and scaled the same as the Backblaze data collection process. Regardless of how good/bad the means BB has been using, I think it would be handy if we all did it the same way to consolidate data. I’m spinning a lot of disks and I’d love to provide some of this data to the outside world.

  • jack phelm

    What are attributes 15 and 255 for? I can’t find any reference to them on the web.
    Any help would be appreciated.

  • hartfordfive

    For anyone that might be interested, I’ve created a Go application to import the data into Elasticsearch. You can view it at https://github.com/hartfordfive/backblaze-hd-data-importer. I appreciate any positive feedback anyone can provide me with!

  • Adela

    Hi Brian, I was wondering if, in the data, there’s a way of distinguishing between disks which have been replaced due to actual drive failure and those which have been replaced due to predicted failure? Are there many disks that are replaced due to predictions based on the SMART parameters, or do you mostly wait for the disks to completely fail? Thanks!

  • Patrick Lynch

    Brian,
    Thank you for having the courage to post real numbers. I am involved in hardware delivery, and I know that real-world numbers often bear only a passing resemblance to published numbers.

  • Patrick Lynch

    My experience over the years has shown that there are 2 other components that are often predictors:
    1) Location. Things at the top of the rack may have higher ambient temperature. Although SMART will give temperature, it does not provide a correlation to location in a chassis, in a rack, or in a room.
    2) The other, bigger determiner of lifetime seems to be the quality of the electrical power. Low-quality but within-spec power kills drives much faster than higher-quality power.

  • karl

    Nice to see companies being open and publishing data.

    Perhaps they are failing because they’re crammed into a red box and then stacked on top of each other. The pod’s cooling looks inadequate; I notice the centre drives have no cooling (there are fans on either side of the first and third rows).

    https://www.backblaze.com/blog/why-now-is-the-time-for-backblaze-to-build-a-270-tb-storage-pod/

    However, I have no idea if the said failure rate is ‘normal’ for data centres.

    The Pod 4 drives are not separated enough, and I cannot see any anti-vibration padding — 45 HDDs must create a lot of vibration.

    Why do QNAP and Synology separate their drives into individual drive bays? Perhaps for ease of administration, or to improve cooling and resiliency.

    Perhaps the position of the drives (upright) contributes, but I admit that’s a little far-fetched.

    I would like to see research comparing:

    1) HDDs mounted flat versus upright
    2) Failure rate when the pod has fewer drives
    3) Failure rate when drives are separated into drive bays

    The industry needs more research like this.

    • physics2010

      This company is just providing a data point. They aren’t claiming they are doing rigorous testing. For their given configuration they show how different drive types are holding up. It’s up to the users to extrapolate to their needs. They don’t necessarily have to release any of this data, they are doing so as a public service. Originally all of the drives were suffering through the same harsh environment. Now that they tend to buy what has been more reliable to them, if a random positional element did exist, you would see their more reliable drives failing at a higher rate.

  • Jean-Nicolas

    Hello, linux homeuser here,

    My possible finding, for comments/advice please,
    on the infamous ST3000DM001:
    there may be 3 manufacturing lines for this HDD, with different reliability.

    Prefix   Percentage working on 28/Dec/2014   Percentage not working over whole 2014
    S1F      45%                                 57%
    W1F      36%                                 36%
    Z1F      19%                                 7%

    – assuming the serial number of the HD tells us which manufacturing line was used
    – first simplified approach (not sure whether a failed drive stays in the CSV files forever or just on the day it dies)
    – on the full 2014 data set
    – really quick-and-dirty approach on only one day, to check whether there was an assumption worth digging into
    – maybe some bias in when HDs are bought (do you get random lots of S1, W1 and Z1?)

    cat *.csv | grep -i ST3000DM001 > ST3000DM001_all.txt
    gawk --field-separator , '{print $5,$2,$0}' ST3000DM001_all.txt | grep ^1 > failed_ST3000DM001_all.txt

    Got the failed stats from failed_ST3000DM001_all.txt:
    cat failed_ST3000DM001_all.txt | grep -c W1F

    Got stats on only one day for the working ones:
    cat 2014-12-28.csv | grep -c W1F

    Anybody any idea on this, and on how to go further to check simply whether it is statistically sound?

    Regards

  • H_Trickler

    Many thanks for publicly sharing your data!

  • Wilson Wang

    Thank you for providing such a large amount of data!

    I have one question to ask. For example, in your 2014 data set, some disks only contain SMART data until the middle of the year (e.g. 8/9/2014 for disk 13H3012AS, TOSHIBA DT01ACA300). Does it mean that from that date onwards, this particular disk has been removed from the system?

    If so, is it because of a disk fault? We want to see if there is any relationship between the captured SMART data and disk status.

    Thanks.

  • Jun Xu

    Thanks Backblaze for sharing such good data. However, according to a previous study (Hard Drive SMART Stats, https://www.backblaze.com/blog/hard-drive-smart-stats/), more than 5 SMART attribute values were collected. I noticed that this data set only provides 5, even without #187 & #188. This is a pity. I would like to suggest that the company might also release the values for #3, #7, #12, #187, #188, #189, & #195.

  • Mathew Binkley

    I manage a couple of petabytes (CERN data), and have noticed that smart_187 and smart_197 are closely related to “read-element” or “test element” failures, and smart_5 seems to be correlated to “Drive will fail in less than 24 hours” failures.

    It would be useful if, in addition to the SMART statistics, you could log *how* the drives were failing when possible: run smartctl -a /dev/sdb | grep "^Self-test" and log the number in parentheses. With that, we could do a much better job determining which SMART statistics are useful in predicting various failure modes.

  • amadvance

    Just a clarification: for the disks that disappear without being marked as failed, like WD-WMAVU1876177 on 2014-10-23, is it because you replace them when you expect they are going to fail?

    Thanks for that data!

    • jack phelm

      Still waiting for some comment on this!!

    • Riviera

      I believe they mentioned that migrating smaller drives to higher-capacity drives was one of the main reasons for replacing drives; drives that went over the drive usage statistic thresholds were another reason, and I think those were counted as ‘failed.’

      I agree though, it would be nice to have the actual numbers on these swaps.

  • David

    Wow. This is awesome! Thank you.

  • David

    Well Mick, I have to agree 100% with that comment! I would only add… I don’t trust numbers from manufacturers (as much as you might), and it’s a fact that drive warranties are sliding backwards. What manufacturers should be doing is standardizing on and fully utilizing the S.M.A.R.T. reporting specifications.
    That would level the playing field and make failure stats more relevant and detailed.

  • Jon

    Are drives that are DOA counted or screened out before being used?

  • Mackle

    Very quickly looking through the data – is there any reason why the .csv files dated between 2013-08-20 and 2013-10-14 only contain 900 observations?

    P.S I really appreciate the data!

    • Brian Beach

      We had a glitch in the data collection during that period, so most of the data didn’t get saved. Neither the drive stats nor failures were reported during that time.

      One of the first things I did when I started working at Backblaze in October of 2013 was to notice this problem and make sure it got fixed.

      I’m impressed that you’ve opened the data already and started looking at it.

      • Mackle

        Good work applying the fix (could be a blog post right there)! Did you find that the missing data caused any issues with your analysis?

        As I am using STATA to work with the data (rather than just dropping it into a db), the glitch was more immediately apparent. That said, I’ve always been taught to get to know the data first.

        I normally work with cross-sectional survey data, but have been wanting to work with more diverse data. This time series data coincides with that, and HDD reliability is a topic I am interested in exploring, so it worked out well.

        • Brian Beach

          The 2014 data is complete. When analyzing the 2013 data, I was careful to use data only when it was complete. That means excluding not just the drive-days, but also the failures from that time period.

          • Mackle

            2014 does not appear to be complete – “2014-11-02.csv” contains 0 observations.

  • David

    No thanks, we don’t need manufacturers to post their reliability ratings.
    We need great companies like Backblaze and the heavy lifting done by
    Brian to show us how drives perform in the REAL world! Thanks Brian!!!
    Can’t wait for similar stats on SSDs.

    • Mick’s Macs

      We disagree, David. Love what BackBlaze is doing, but consumers should be better informed and manufacturers should stand behind their products.

      • Kevin Samuel Coleman

        “manufacturers should stand behind their products.”

        Well, if the products aren’t doing well in the real world, simply standing behind them shows they don’t care to improve upon them.

  • Mick’s Macs

    What a fantastic service you guys are providing, Brian. Thanks for going public with this data. We’d like to see HD manufacturers required to post their reliability ratings. So many end users have no idea how high (and how catastrophic) the failure rate can be. We always tell the ones that will listen, “Hard drives are extremely convenient, and extremely insecure, places to store data. Back up, back up, back up.” We’re happy to be partnered with BackBlaze for our catastrophic backup service. :-)

    • Andreas K

      I have to say I think your comments are utopian and you don’t seem to have a grasp of the overall scientific method.

      HDD manufacturers DO publish reliability data as the MTBF rating, but we all know it’s pretty much useless. Why? Because you can’t truly speed up time to test this. So they write formulas to calculate this sort of thing.
      If you are asking them to publish their return/failure rates then that’s also not quite fair since the manufacturer can have only a basic idea of the environment that these drives have been run in before being returned.

      The backblaze data is unique because all the drives are being run in a consistent environment meaning that variation in the failure rates can be more clearly linked to the drive build quality and reliability.

      Anyone who does good research will tell you that real-world observation beats lab testing any day. If someone develops a theory and tests it in a lab to be correct, but it fails further testing in the real world, they don’t disregard the real-world results, they adjust their theory.

      That’s why I hold the BackBlaze data in much higher regard than any drive manufacturer disclosure.

      • Mick’s Macs

        I would agree with your final statement here, Andreas. The BackBlaze data is pretty cool from the perspective of a better, more controlled environment. We have no disagreement there.

        As for your assessment of my world view being “utopian,” as you put it, I can only reply that yours seems reductionistic to leap to that assumption. Insisting, demanding that hardware manufacturers have some sense of the reliability and “CRASH RATING,” of their “vehicles,” is not without _thousands_ of precedents. I’m trying to be clever here and cite the auto industry as an example, but there could be better, more analogous industries where this is the case. I’m not sure how that in any way denigrates my “grasp of the scientific method,” either.

        I don’t want to get catty here, but your comments seem to miss the importance of the ordinary consumer, the average user and their ignorance about how easily all of their data can be lost forever on a failed hard drive. This is what I’m speaking to. I care more about our clients than I do the challenges of knowing how reliable a product you manufacture and sell will be. That’s not my problem to solve for them.

        In our shop of 10 years, the much smaller samples we’ve seen lead us to avoid Seagate drives of all kinds, internal and external. And while that’s “anecdotal” data, given the sample size, some of the BackBlaze data validates that. Additionally, we’ve had off-the-record conversations with dedicated data recovery centers around the country (not to mention the lawsuit that Apple went after Seagate with that prompted a recall on Seagate’s dime) that have us avoiding Seagate as well.

        I don’t know how old you are, but I’ve been doing this professionally for 20 years and Seagate used to be one of the most trusted, “best” drives out there. It’s unlikely that they will regain my trust any time soon.

        Your scientific method may vary. ;-)