What SMART Stats Tell Us About Hard Drives

October 6th, 2016

What if a hard drive could tell you it was going to fail before it actually did? Is that possible? Each day Backblaze records the SMART stats that are reported by the 67,814 hard drives we have spinning in our Sacramento data center. SMART stands for Self-Monitoring, Analysis and Reporting Technology and is a monitoring system included in hard drives that reports on various attributes of the state of a given drive.

While we’ve looked at SMART stats before, this time we’ll dig into the SMART stats we use in determining drive failure and we’ll also look at a few other stats we find interesting.

We use Smartmontools to capture the SMART data. This is done once a day for each hard drive. We add in a few elements, such as drive model, serial number, etc., and create a row in the daily log for each drive. You can download these log files from our website. Drives that have failed are marked as such and their data is no longer logged. Sometimes a drive will be removed from service even though it has not failed, like when we upgrade a Storage Pod by replacing 1TB drives with 4TB drives. In this case, the 1TB drive is not marked as a failure, but its SMART data is no longer logged.
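
For readers who want to build a similar daily log, here is a minimal sketch of what such a collection pass could look like. It is not our actual tooling: the device list, output file, and parsing are illustrative assumptions, and it simply shells out to smartctl and appends one row per drive.

```python
# collect_smart.py -- a minimal sketch of a once-a-day SMART collection pass.
# Assumes smartctl (part of Smartmontools) is installed and can read the drives.
import csv
import datetime
import re
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]        # hypothetical device list
ATTRIBUTES = [5, 187, 188, 197, 198]      # the five stats discussed below
LOG_PATH = "drive_stats_daily.csv"        # hypothetical output file

def read_smart(device):
    """Return (model, serial, {attribute id: raw value}) for one drive."""
    out = subprocess.run(["smartctl", "-i", "-A", device],
                         capture_output=True, text=True).stdout
    model = re.search(r"Device Model:\s+(.+)", out)
    serial = re.search(r"Serial Number:\s+(.+)", out)
    raw = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID; RAW_VALUE is the last column.
        if fields and fields[0].isdigit():
            raw[int(fields[0])] = fields[-1]
    return (model.group(1).strip() if model else "",
            serial.group(1).strip() if serial else "",
            raw)

with open(LOG_PATH, "a", newline="") as f:
    writer = csv.writer(f)
    today = datetime.date.today().isoformat()
    for dev in DEVICES:
        model, serial, raw = read_smart(dev)
        writer.writerow([today, serial, model] +
                        [raw.get(attr, "") for attr in ATTRIBUTES])
```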

SMART stats we use to predict hard drive failure

For the last few years we’ve used the following five SMART stats as a means of helping determine if a drive is going to fail.

Attribute Description
SMART 5 Reallocated Sectors Count
SMART 187 Reported Uncorrectable Errors
SMART 188 Command Timeout
SMART 197 Current Pending Sector Count
SMART 198 Uncorrectable Sector Count

When the RAW value for one of these five attributes is greater than zero, we have a reason to investigate. We also monitor RAID array status, Backblaze Vault array status and other Backblaze internal logs to identify potential drive problems. These tools generally only report exceptions, so on any given day the number of investigations is manageable even though we have nearly 70,000 drives.
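
As a rough sketch of that first-pass check, not our production monitoring, the trigger amounts to something like the function below; the column names follow the published drive stats files (smart_5_raw, smart_187_raw, and so on).

```python
# Sketch: flag a drive for investigation when any of the five SMART stats
# has a RAW value greater than zero.
WATCHED = {
    "smart_5_raw":   "Reallocated Sectors Count",
    "smart_187_raw": "Reported Uncorrectable Errors",
    "smart_188_raw": "Command Timeout",
    "smart_197_raw": "Current Pending Sector Count",
    "smart_198_raw": "Uncorrectable Sector Count",
}

def needs_investigation(drive_row):
    """drive_row: one drive's daily log entry as a dict of column -> value."""
    flagged = []
    for column, name in WATCHED.items():
        value = drive_row.get(column)
        if value not in (None, "") and float(value) > 0:
            flagged.append((name, int(float(value))))
    return flagged  # an empty list means nothing to look at today

# needs_investigation({"smart_5_raw": "2", "smart_197_raw": "0"})
# -> [('Reallocated Sectors Count', 2)]
```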

Let’s stay focused on SMART stats and take a look at the table below, which shows the percentage of failed and operational drives reporting a RAW value greater than zero for each of the SMART stats listed.

[Table: Percentage of failed vs. operational drives with a RAW value greater than zero, for each of the five SMART stats]

While no single SMART stat is found in all failed hard drives, here’s what happens when we consider all five SMART stats as a group.

Operational drives with one or more of our five SMART stats greater than zero – 4.2%

Failed drives with one or more of our five SMART stats greater than zero – 76.7%

That means that 23.3% of failed drives showed no warning from the SMART stats we record. Are these stats useful? I’ll let you decide if you’d like to have a sign of impending drive failure 76.7% of the time. But before you decide, read on.
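
If you want to recreate this kind of rollup from the downloadable data, a few lines of pandas will do it. The sketch below is illustrative rather than our internal reporting: it assumes one day's CSV file and the column names used in the published data, while the figures above aggregate failed drives across many days of logs.

```python
# Sketch: fraction of failed vs. operational drives with at least one of the
# five tracked SMART stats reporting a RAW value greater than zero.
import pandas as pd

FIVE = ["smart_5_raw", "smart_187_raw", "smart_188_raw",
        "smart_197_raw", "smart_198_raw"]

df = pd.read_csv("2016-09-30.csv")            # hypothetical daily log file
has_warning = (df[FIVE].fillna(0) > 0).any(axis=1)

for failed, group in df.groupby("failure"):   # failure: 0 = operational, 1 = failed
    pct = 100.0 * has_warning[group.index].mean()
    label = "failed" if failed == 1 else "operational"
    print(f"{label}: {pct:.1f}% with one or more of the five stats > 0")
```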

Having a given drive stat with a value that is greater than zero may mean nothing at the moment. For example, a drive may have a SMART 5 raw value of 2, meaning two drive sectors have been remapped. On its own such a value means little until combined with other factors. The reality is it can take a fair amount of intelligence (both human and artificial) during the evaluation process to reach the conclusion that an operational drive is going to fail.

One thing that helps is when we observe multiple SMART errors. The following chart shows how often one, two, three, four, or all five of the SMART stats we track report a raw value greater than zero.

[Chart: Number of the five tracked SMART stats reporting a raw value greater than zero]

To clarify, a value of 1 means that of the five SMART stats we track only one has a value greater than zero, while a value of 5 means that all five SMART stats we track have a value greater than zero. But before we decide that multiple errors help, let’s take a look at the correlation between these SMART stats as seen in the chart below.

[Chart: Correlation between the five SMART stats]

In most instances the stats have little correlation and can be considered independent. Only SMART 197 and 198 have a good correlation, meaning we could consider them as “one indicator” versus two. Why do we continue to collect both SMART 197 and SMART 198? Two reasons: 1) the correlation isn’t perfect, so there’s room for error, and 2) not all drive manufacturers report both attributes.

How does understanding the correlation, or lack thereof, of these SMART stats help us? Let’s say a drive reported a SMART 5 raw value of 10 and a SMART 197 raw value of 20. From that we could conclude the drive is deteriorating and should be scheduled for replacement. Whereas if the same drive had a SMART 197 raw value of 5 and a SMART 198 raw value of 20 and no other errors, we might hold off on replacing the drive while awaiting more data, such as the frequency at which the errors occur.
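
To make that reasoning concrete, here is a sketch that first computes the pairwise correlation from a day of the published data (pandas’ default Pearson correlation, which may differ in detail from how the chart above was produced) and then counts warning signs while treating the correlated pair 197/198 as a single indicator. The file name, column names, and the two-indicator rule of thumb are assumptions for illustration.

```python
# Sketch: correlation between the five stats, then a toy "independent
# indicator" count that merges SMART 197 and 198 into one signal.
import pandas as pd

FIVE = ["smart_5_raw", "smart_187_raw", "smart_188_raw",
        "smart_197_raw", "smart_198_raw"]

df = pd.read_csv("2016-09-30.csv")            # hypothetical daily log file
print(df[FIVE].dropna().corr().round(2))      # 197 vs. 198 should be close to 1.0

def independent_indicators(row):
    """row: mapping of column -> numeric raw value for one drive."""
    count = sum(row.get(c, 0) > 0 for c in FIVE[:3])
    count += (row.get("smart_197_raw", 0) > 0) or (row.get("smart_198_raw", 0) > 0)
    return count

# SMART 5 = 10 and SMART 197 = 20  -> 2 indicators: schedule the replacement.
# SMART 197 = 5 and SMART 198 = 20 -> 1 indicator: wait for more data first.
```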

Error Distribution

So far it might sound like we will fail a hard drive if we just observe enough SMART values that are greater than zero, but we also have to factor time into the equation. The SMART stats we track, with the exception of SMART 197, are cumulative in nature, meaning we need to consider the time period over which the errors were reported.

For example, let’s start with a hard drive that jumps from zero to 20 Reported Uncorrectable Errors (SMART 187) in one day. Compare that to a second drive which has a count of 60 SMART 187 errors, with one error occurring on average once a month over a five year period. Which drive is a better candidate for failure?
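
One way to put time into the equation is to look at how fast a cumulative counter is growing rather than at its absolute value. Here is a sketch for a single drive's history pulled from the daily log files; the file name and the seven-day window are illustrative assumptions.

```python
# Sketch: new SMART 187 errors per day for one drive. A sudden jump is far
# more alarming than the same total spread out over years.
import pandas as pd

history = pd.read_csv("one_drive_history.csv",   # hypothetical: one row per day
                      parse_dates=["date"])      # for a single serial number
history = history.sort_values("date")

daily_new = history["smart_187_raw"].diff().fillna(0)
last_week = daily_new.tail(7).sum()              # errors added in the last 7 days
lifetime = history["smart_187_raw"].iloc[-1]     # cumulative lifetime count

print(f"{lifetime:.0f} lifetime errors, {last_week:.0f} of them in the last week")
```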

Another stat to consider: SMART 189 – High Fly Writes

This is a stat we’ve been reviewing to see if it will join our current list of five SMART stats we use today. This stat is the cumulative count of the number of times the recording head “flies” outside its normal operating range. Below we list the percentage of operational and failed drives where the SMART 189 raw value is greater than zero.

    Failed Drives: 47.0%

    Operational Drives: 16.4%

The false positive percentage of operational drives having a greater-than-zero value may at first glance seem to render this stat meaningless. But what if I told you that for most of the operational drives with SMART 189 errors, those errors were distributed fairly evenly over a long period of time, for example, one error a week on average for 52 weeks? In addition, what if I told you that many of the failed drives with this error had a similar number of errors, but distributed over a much shorter period of time, for example, 52 errors over a one-week period? Suddenly SMART 189 looks very interesting for predicting failure by looking for clusters of High Fly Writes over a short period of time. We are currently researching the use of SMART 189 to determine if we can define a useful range of rates at which the errors occur.
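
Here is a sketch of what such a clustering check might look like for a single drive's history. The seven-day window and the threshold of 20 are illustrative assumptions, since defining the useful rate is exactly the open question.

```python
# Sketch: flag SMART 189 (High Fly Writes) only when the errors cluster in time.
import pandas as pd

history = pd.read_csv("one_drive_history.csv", parse_dates=["date"])
history = history.sort_values("date").set_index("date")

new_errors = history["smart_189_raw"].diff().fillna(0)
per_week = new_errors.rolling("7D").sum()     # new High Fly Writes per 7-day window

THRESHOLD = 20                                # illustrative only
if (per_week > THRESHOLD).any():
    print("High Fly Writes are clustering -- worth a closer look")
else:
    print("Errors, if any, are spread out over time")
```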

SMART 12 – Power Cycles

Is it better to turn off your computer when you are not using it, or should you leave it on? The debate has raged on since the first personal computers hit the market in the ’80s. On one hand, turning off a computer “saves” the components inside and saves a little on your electricity bill. On the other hand, the shut-down/start-up process is tough on the components, especially the hard drive.

Will analyzing the SMART 12 data finally allow us to untie this Gordian knot?

Let’s compare the number of power cycles (SMART 12) of failed drives versus operational drives.

    Failed Drives were power cycled on average: 27.7 times

    Operational Drives were power cycled on average: 10.2 times

At first blush, it would seem we should keep our systems running, as the failed drives had roughly 170% more power cycles than the drives that have not failed. Alas, I don’t think we can declare victory just yet. First, we don’t power cycle our drives very often; on average, drives get power-cycled about once every couple of months. That’s not quite the same as turning off your computer every night. Second, we didn’t factor in the age range of the drives. To do that we’d need a lot more data points to get results we could rely on. That means, sadly, we don’t have enough data to reach a conclusion.
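
For anyone who wants to dig into this themselves, the comparison above is easy to compute from the published data; the sketch below (assumed file name and column names) shows one way, and pulling in SMART 9 (power-on hours) would be a natural starting point for factoring in drive age.

```python
# Sketch: average power cycle count (SMART 12) for failed vs. operational drives.
import pandas as pd

df = pd.read_csv("2016-09-30.csv")            # hypothetical daily log file
avg = df.groupby("failure")["smart_12_raw"].mean()
print(avg.rename({0: "operational", 1: "failed"}).round(1))
```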

Perhaps one of our stat-geek readers will be able to tease out a conclusion regarding power cycles. Regardless, everyone is invited to download and review our hard drive stats data including the SMART stats for each drive. If you find anything interesting let us know.

Andy Klein

Andy has 20+ years of experience in technology marketing. He has shared his expertise in computer security and data backup at the Federal Trade Commission, Rootstech, RSA and over 100 other events. His current passion is to get everyone to back up their data before it's too late.

  • More info please

    I love your data. Do you have any examples of the smartmontools scripts that you use? I would like to do the same. Looking at the output I can probably put together a parser to get the same sort of data. But if you had some examples it would be easier than reinventing the wheel and might include some things I hadn’t thought of.

  • Frank Bulk

    I think this data needs to be poured into a machine learning tool ….

  • mt267

    I was wondering, when you have failed drives, do you dispose of them straight away, or run them through testing etc. first? I work for a hosting company that goes through several thousand failed disks a year; some fail our thresholds very clearly, but others require testing, which is a very labour-intensive process: testing 12 disks can take 2 weeks at a time… I was wondering if you ever run tests on your ‘failed’ drives, and if they return inconclusive/pass the tests do you put them back into use?

    Hopefully that makes some sense – I’d appreciate any input as our testing regime is being reviewed at the moment, and we are trying to streamline it!

  • Robert Lucente

    Would you consider making predicting drive failure a kaggle.com competition?

  • Brian

    Do you also collect POH, or Power-On Hours for your drives? That would be another reasonable metric to gather along with SMART errors to help determine if a drive’s age also plays a factor in the distribution of SMART errors.

    Cheers!

  • Gunnar Dalsnes

    I wonder, why do you power cycle every couple of months?

  • Chris Parkin

    Do you return your dead drives to the manufacturer if still under warranty? If so are they happy to replace the ‘desktop’ drives you use even though they have been used in an environment they were not designed for?

  • swampwiz0

    As someone who has gone through many hard drive replacement cycles (going all the way back to the IOmega Zip Drive – remember that, LOL?), I have always had the attitude that when the drive seems to be doing copy-from & copy-to slower than it usually had been, it’s time to replace it.

  • Vincent

    Hello,
    I think this type of study falls into the branch of statistics called survival analysis, because some of the hard drives have not failed yet or have been replaced by larger drives.

    I haven’t had the time to look at Backblaze hard drive stats data yet, but I guess they can be imported in R and analyzed with the help of package “survival”.

    Regards

  • Mark

    I worked for 8 years with a global reverse logistics chain that did laptop, tablet and desktop repairs for the likes of HP, Sony, Asus and Toshiba, to name a few. One of the tasks I was faced with was this very thing: can we predict from SMART data the life span of a drive, and could we have seen the failure coming from the SMART data of a failed drive?

    I collected data from over 350,000 drives (all brands and models) over a 3-year period to try and find any patterns in the data. I left the position 2 years ago so I don’t have the actual numbers in front of me, but we came to a few conclusions:

    A word of note: the term “fail” had a different meaning for each vendor. Sony had zero tolerance for any reallocated sectors, whilst others allowed a certain percentage relative to drive size. It’s worth mentioning that each manufacturer had their own tolerances for when a SMART fail would trigger. Also, customers could have a high reallocated sector count but never experience issues or ever know there was an issue.

    – There was a direct link between Reallocated Sectors Count and how quickly the drive would fail
    – Once the drive had one reallocated sector the drive would continue to ‘fail’ with the reallocated sector count increasing in relation to the POH (time powered on).
    – A high G-sense Error Rate would increase the chances of a reallocated sector
    – Drives with a higher max recorded temperature had a higher fail rate of reallocated sectors than drives with a lower max temp
    – Even one Uncorrectable sector count would lead to most drives being unusable within 3 months
    – There was no correlation (this differs from your findings) that we could attribute directly between the Start/Stop count and drive fail rates
    – Although very rare (there were fewer than 30), a high spin retry count led to a drive failing within a few hours

    As mentioned above, someone can still use a drive long after a SMART error has been reported (depending on the SMART error); if they never hit the faulty sectors then there would generally never be an issue for the end user. In a lot of cases only the first few GB of a drive would contain data, with users simply using the devices for internet browsing, etc. Obviously this is different in your environment.

    • Andy Klein

      Excellent addition to the conversation. Thanks. Your observation about the term “fail” being different by manufacturer is spot on. There also appear to be some differences by model, but that’s harder to determine.
      To add to your observations, I looked at G-sense errors and couldn’t find anything significant, mostly because we didn’t have many drives reporting an error. Our drives are racked, and not tossed around like some laptops… I saw the same thing you did on spin retry errors, but like you we only observed this a few times. The observation on Uncorrectable sector count errors leading to failure within 3 months is really interesting; I’ll take a look at the data to see if it is the same for us.

      • dakishimesan

        Thank you for the information. Do you have a suggestion for reading SMART attributes on Mac/Win? This command line tool seems to be the go-to, but I wanted to ask. Thank you!

        https://sourceforge.net/projects/smartmontools/

    • Tim

      Wow, very interesting.

  • Tim

    Hi. I’m operating ~360,000 disks, mostly from Seagate. We had some discussions with their technicians. Conclusion: we now take a look at ‘End to End Errors’ (SMART ID 184); even small increases here indicate a failed drive. Also, we try to monitor Total LBAs Written/Read to determine the throughput of the drive. We noticed that “desktop drives” are meant to be used for smaller throughput and have a higher failure rate if they are overbooked.

    • Andy Klein

      Only 1.5% of our failed drives reported a SMART 184 raw value greater than 0, and 0.1% of our operational drives reported the same error. I’ll see if I can track those operational drives to see if they fail.

      • Tim

        Interesting, maybe that’s related to the excessively high workload we had on many disks. We’re currently improving our logging to get more detailed stats in the next few months.

    • Milk Manson

      We noticed that “desktop drives” are meant to be used for smaller throughput and have a higher failure rate if they are overbooked…

      Please define “smaller throughput” and “overbooked”. Please. I’m begging you.

      • Tim

        Most vendors provide a ‘workload rate limit’. For example, Seagate says their ST8000NM0045 works for 550 TB/year and is 24/7 certified. Let’s assume the drive writes 550 TB a year with constant throughput. This should result in:
        550 * 1024 * 1024 = 576,716,800 MB
        576,716,800 MB / (365*24*60*60 s) ≈ 18.2 MB/s
        Some of the desktop drives are around 5 MB/s. The fun part is now to determine the amount of written data; on some devices you can just count LBA_TOTAL_WRITTEN and multiply it by the sector size. Now we can compare the maximum recommended throughput vs. the actual one. “Overbooked”: they are writing way more data than they are supposed to. We detected several hundred drives where the actual workload was 3 times higher than recommended by the vendor. I will try to get a statistic about the number of failed drives with too high a load.

        • Milk Manson

          Thank you.

    • Alessandro Rota

      Many hard drives do not show ID 184. What can we do in this case?

      • Tim

        I don’t think that it is possible to enable this ID. I guess these aren’t Seagate SATA drives? Depending on the number of disks you have, you can talk to the vendor. Maybe there is another ID provided by this vendor, but with a similar meaning.

        • Alessandro Rota

          I’m just a basic PC technician without any large pool of hard drives at all. But on my own PCs (notebooks and towers) and on many customers’ PCs I’ve installed Western Digital disks.
          By the way, here is the LOG of the disk on the PC I’m working on:

          Model Family: Western Digital RE4
          Device Model: WDC WD5003ABYX-01WERA0
          Serial Number: WD-WMAYP0192618

          SMART Attributes Data Structure revision number: 16
          Vendor Specific SMART Attributes with Thresholds:
          ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
          1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always – 1
          3 Spin_Up_Time 0x0027 134 133 021 Pre-fail Always – 4258
          4 Start_Stop_Count 0x0032 099 099 000 Old_age Always – 1389
          5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always – 0
          7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always – 0
          9 Power_On_Hours 0x0032 084 084 000 Old_age Always – 11711
          10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always – 0
          11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always – 0
          12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always – 1388
          192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always – 38
          193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always – 1350
          194 Temperature_Celsius 0x0022 113 103 000 Old_age Always – 30
          196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always – 0
          197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always – 0
          198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline – 0
          199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always – 0
          200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline – 0

          • Tim

            AFAIK the RE4 doesn’t have any comparable ID. I would not be too worried, as the RE4 is one of the enterprise drives from WD.

          • Alessandro Rota

            That’s why I chose it for my work PC! ;) The other disk on this PC is a Western Digital Blue (WDC WD5000AAKX-07U6AA0) … I think it is an average disk, isn’t it?

          • Tim

            The WD5000AAKX is meant for desktop purpose/average usage, yep.

  • elgselgs

    Can you please show some commands that you use to collect these data? Thanks.

    • Mark

      smartctl should give you everything you need; take a look at the smartmontools daemon.

  • Michael Schlachter

    Given the amount of raw data available, might this be a good candidate for machine learning to predict failures?

  • maktt

    Have you considered doing a Fourier transform on the stats as well as just the raw numbers? Or even a second- and/or third-order transform? (raw stat count (distance) versus frequency (velocity), acceleration, and impulse). For ease of computation/use a DCT would probably suffice; I’m not sure if phase information would be useful, but it might be.

  • Nate Barbettini

    Great article! I noticed a small error: “You can download these **logs** files from our website”

  • Kevin Rodriguez Roman

    Did you consider how old the hard drives that failed were?

    • Andy Klein

      We did and we found the results fairly consistent over time. The data gets thin in some places, but it doesn’t appear that one SMART stat fails more often when drives are young and another SMART stat fails more often as drives get older. Good question.

  • Roamer

    Have you done any correlation regarding which SMART attributes, or, with multiples, which combinations of SMART attributes, are more likely to result in failure? It seems unlikely that all five attributes are equally likely to leave a drive standing.

  • karl

    So, should one take away from this that in the event of one SMART attribute failure, the drive should be replaced?

    • Andy Klein

      It depends on your environment and your tolerance for drive failure. Our set-up allows for failed drives, while still ensuring the data is protected (RAID, Reed-Solomon encoding, consistency checks, etc.). That’s why we generally want to see multiple signs of a potential problem. If your environment is less tolerant of failure, then it probably makes sense for you to “pay attention” at the first sign of trouble and be ready to act if the condition continues.