Hard Drive SMART Stats

November 12th, 2014


I’ve shared a lot of Backblaze data about hard drive failure statistics. While our system handles a drive failing, we prefer to predict drive failures, and use the hard drives’ built-in SMART metrics to help. The dirty industry secret? SMART stats are inconsistent from hard drive to hard drive.

With nearly 40,000 hard drives and over 100,000,000 GB of data stored for customers, we have a lot of hard-won experience. Below, we cover the 5 SMART stats we have found to be good predictors of drive failure, along with the data we have started to analyze from all of the SMART stats to figure out which other ones predict failure.

S.M.A.R.T.

Every disk drive includes Self-Monitoring, Analysis, and Reporting Technology (SMART; see https://en.wikipedia.org/wiki/S.M.A.R.T.), which reports internal information about the drive. Initially, we collected a handful of stats each day, but at the beginning of 2014 we overhauled our disk drive monitoring to capture a daily snapshot of all of the SMART data for each of the 40,000 hard drives we manage. We used smartmontools to capture the SMART data.
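
If you want to capture something similar yourself, here is a minimal sketch of a daily snapshot script, assuming smartmontools is installed and the script runs with permission to query the drives (typically as root). The device glob, column positions, and CSV layout are illustrative only; this is not our production collection code.

import csv
import datetime
import glob
import subprocess

def read_smart_attributes(device):
    """Return {attribute_id: (normalized_value, raw_value)} for one drive."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    attrs = {}
    for line in result.stdout.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, e.g.
        #   5 Reallocated_Sector_Ct <flag> VALUE WORST THRESH ... RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            normalized = int(fields[3])          # vendor-normalized value
            raw = " ".join(fields[9:])           # raw value (may include extra text)
            attrs[attr_id] = (normalized, raw)
    return attrs

if __name__ == "__main__":
    today = datetime.date.today().isoformat()
    with open("smart_snapshot_%s.csv" % today, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["date", "device", "attribute", "normalized", "raw"])
        for device in sorted(glob.glob("/dev/sd?")):
            for attr_id, (normalized, raw) in sorted(read_smart_attributes(device).items()):
                writer.writerow([today, device, attr_id, normalized, raw])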

But, before we dig into the data, we first need to define what counts as a failure.

What is a Failure?

Backblaze counts a drive as failed when it is removed from a Storage Pod and replaced because it has 1) totally stopped working, or 2) shown evidence of failing soon.

A drive is considered to have stopped working when it appears physically dead (e.g., it won’t power up), doesn’t respond to console commands, or when the RAID system tells us that it can’t be read or written.

To determine whether a drive is going to fail soon, we use SMART statistics as evidence to remove a drive before it fails catastrophically or impedes the operation of the Storage Pod volume.

From experience, we have found the following 5 SMART metrics indicate impending disk drive failure:

  • SMART 5 – Reallocated_Sector_Count.
  • SMART 187 – Reported_Uncorrectable_Errors.
  • SMART 188 – Command_Timeout.
  • SMART 197 – Current_Pending_Sector_Count.
  • SMART 198 – Offline_Uncorrectable.

We chose these 5 stats based on our experience and input from others in the industry because they are consistent across manufacturers and they are good predictors of failure.
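
To make that concrete, here is a minimal sketch (in Python, not our production code) of checking a drive’s raw values against those five attributes. Treating any non-zero raw value as a warning sign mirrors the rule described later in the post for SMART 187; for the other four attributes it is a simplification, not our exact replacement policy.

# The five SMART attributes we watch, keyed by attribute ID.
CRITICAL_ATTRIBUTES = {
    5: "Reallocated_Sector_Count",
    187: "Reported_Uncorrectable_Errors",
    188: "Command_Timeout",
    197: "Current_Pending_Sector_Count",
    198: "Offline_Uncorrectable",
}

def drive_looks_suspect(raw_values):
    """raw_values: {attribute_id: raw_count} from one drive's SMART snapshot."""
    warnings = []
    for attr_id, name in CRITICAL_ATTRIBUTES.items():
        if raw_values.get(attr_id, 0) > 0:
            warnings.append("%s (SMART %d) = %d" % (name, attr_id, raw_values[attr_id]))
    return warnings

# Example: a drive reporting a single uncorrectable read gets flagged.
print(drive_looks_suspect({5: 0, 187: 1, 188: 0, 197: 0, 198: 0}))
# -> ['Reported_Uncorrectable_Errors (SMART 187) = 1']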

The Other SMART Stats

We compiled and placed online our list of all the SMART stats across all hard drives we use. For each stat we display the failure rate charts based on the raw and normalized values we recorded. Remember, this is raw data – and since different disk drive manufacturers report SMART stats differently, be careful how you use this.

Choosing the Right Stats to Use

There are over 70 SMART statistics available, but we use only 5. To give some insight into the analysis we’ve done, we’ll look at three different SMART statistics here. The first one, SMART 187, is one we already use to decide when to replace a drive, so it’s really a test of the analysis. The other two are SMART stats we don’t use right now, but that have potentially interesting correlations with failure.

SMART 187: Reported_Uncorrect – Backblaze uses this one.

Number 187 reports the number of reads that could not be corrected using hardware ECC. Drives with 0 uncorrectable errors hardly ever fail. This is one of the SMART stats we use to determine hard drive failure; once SMART 187 goes above 0, we schedule the drive for replacement.

This first chart shows the failure rates by number of errors. Because this is one of the attributes we use to decide whether a drive has failed, there has to be a strong correlation:

[Chart: blog-chart-smart-stats-187a]

The next question you might ask is: How many drives fall into each of those ranges? That’s answered by the next chart:
[Chart: blog-chart-smart-stats-187b]

This looks at the full time range of the study, and counts “drive years”. Each day that a drive is in one of the ranges counts as 1/365 of a drive year for that range. Those fractions are all added up to produce the chart above. It shows that the vast majority of daily samples come from drives with no errors.
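
Here is a small sketch of that drive-year bookkeeping, assuming one raw SMART 187 value per drive per day; the bucket boundaries are illustrative, not the exact ranges used in the chart.

from collections import defaultdict

# Illustrative error-count buckets as (low, high); high=None means "and up".
BUCKETS = [(0, 0), (1, 9), (10, 99), (100, 999), (1000, None)]

def bucket_for(value):
    for low, high in BUCKETS:
        if value >= low and (high is None or value <= high):
            return (low, high)

def drive_years_by_bucket(daily_samples):
    """daily_samples: raw SMART 187 values, one per drive per day.
    Each daily sample contributes 1/365 of a drive year to its bucket."""
    totals = defaultdict(float)
    for value in daily_samples:
        totals[bucket_for(value)] += 1.0 / 365.0
    return dict(totals)

# Example: a year of error-free samples plus one day showing 3 errors.
print(drive_years_by_bucket([0] * 365 + [3]))
# -> roughly {(0, 0): 1.0, (1, 9): 0.003}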

For SMART 187, the data appears to be consistently reported by the different manufacturers, the definition is well understood, and the reported results are easy to decipher: 0 is good, above 0 is bad. For Backblaze this is a very useful SMART stat.

SMART 12: Power_Cycle_Count – Backblaze does not use this one.

The number of times the power was turned off and turned back on correlates with failures:
[Chart: blog-chart-smart-stats-12a]

We’re not sure whether this is because cycling the power is bad for the drive, or because working on the pods is bad for the drives, or because “new” drives have flaws that are exposed during the first few dozen power cycles and then things settle down.

Most of our drives have very few power cycles. They just happily sit in their Storage Pods holding your data. If one of the drives in a Storage Pod fails, we cycle down the entire Storage Pod to replace the failed drive. This only takes a few minutes and then power is reapplied and everything cycles back up. Occasionally we power cycle a Storage Pod for maintenance and on rare occasions we’ve had power failures, but generally, the drives just stay up.

As a result, the correlation of power cycles to failure is strong, but the power cycles may not be the cause of the failures: each drive has seen relatively few power cycles (fewer than 100), and there are plenty of other possible causes of failure during that time.
[Chart: blog-chart-smart-stats-12b]

In addition to reporting the raw value, drives also report a “normalized” value in the range from 253 (the best) down to 1 (the worst). The drive is supposed to know what its design criteria and failure modes are, interpret the raw value, and tell you whether it’s good or bad. Unfortunately, with the Power_Cycle_Count, the drives all say the value is 100, which doesn’t lead to a very useful chart.

[Chart: blog-chart-smart-stats-12c]

As shown, SMART 12 does not produce a useful normalized value; the drive doesn’t think that power cycling is a problem at all.

You may ask whether power cycle count correlates with failures simply because power cycle count correlates with age, and age correlates with failures. The answer is no. The correlation of power cycle count with age is very weak: 0.05. New drives can have high power cycle counts, and old drives can have low power cycle counts.
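
For anyone who wants to run the same check on their own data, here is a minimal Pearson correlation sketch; the sample numbers are made up, so the value it prints means nothing, while the 0.05 figure above comes from the real data.

from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up per-drive samples: SMART 12 raw value and drive age in days.
power_cycles = [4, 7, 12, 3, 55, 9, 2, 31]
age_in_days = [900, 120, 400, 1300, 60, 700, 1100, 300]
print(round(pearson(power_cycles, age_in_days), 2))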

Because we do not power-cycle our drives very often, this SMART stat is not very useful to us in determining the potential failure of a drive. It also does not answer the age-old question of whether turning off your computer every night is better or worse for the disk – that mystery remains.

SMART 1: Read_Error_Rate – Backblaze does not use this one.

The Wikipedia entry for this one says “The raw value has different structure for different vendors and is often not meaningful as a decimal number.” So the numeric values here probably don’t count anything directly, but it’s clear from the failure rate chart that they have something to do with drives failing, and that non-zero values are worse than zero. Once the value goes above 0, though, bigger is not necessarily worse.

[Chart: blog-chart-smart-stats-1a]

And a lot of the drives have a 0 value, presumably meaning “no problem”:

[Chart: blog-chart-smart-stats-1b]

Unlike the Power_Cycle_Count, the scaled value for the Raw_Read_Error_Rate does what it’s supposed to do: failure rates are higher for lower normalized values, although it’s not a nice smooth progression:
[Chart: blog-chart-smart-stats-1c]

[Chart: blog-chart-smart-stats-1d]

For Backblaze to use this SMART stat, we’d like to have a better sense of the values as reported by each vendor. While a value above 0 is not good, the reported values above 0 are wildly inconsistent, as can be seen in the normalized-value charts above. Since the manufacturers don’t tell us what their attribute values could be, this SMART stat is not very useful, especially across multiple drive manufacturers.

Tell Us What They Mean

Backblaze uses SMART 5, 187, 188, 197 and 198 for determining the failure or potential failure of a hard drive. We would love to use more – ideally the drive vendors would tell us exactly what the SMART attributes mean. Then we, and the rest of the storage community, could examine the data and figure out what’s going on with the drives.

In the meantime, at Backblaze, we’ll continue gathering data and working to correlate it as best we can. One thing we are looking at is breaking down each SMART stat by drive model, but there are challenges with how drive manufacturers change drive model numbers and how firmware changes occur within a given model. We’ll see if there is anything interesting and let you know.

Remember you can find charts like the ones above for all of the SMART attributes on our web site at https://www.backblaze.com/smart. If you see something interesting there and figure out what it means, or know more yourself, be sure to let us know.

 

Brian Beach

Brian has been writing software for three decades at HP Labs, Silicon Graphics, Netscape, TiVo, and now Backblaze. His passion is building things that make life better, like the TiVo DVR and Backblaze Online Backup.
  • panimus

    I am on a Mac using 3 HGST 7K4000 3TB drives. There is a program called DriveDx [https://binaryfruit.com/] that lists 17 selected SMART indicators for these drives. These indicators do NOT include SMART 187 or 188. Does this mean the drive is not reporting these indicators, or do I need a new/better program that will read them? If so, do you know a Mac program that will read these stats?

  • Russell Hockins

    Is there an app that can be installed and would monitor these values and could alert users to potential problems in advance? Could be a big market for something like that.

  • MontyW

    Hi Brian, I realise this blog post is over a year old but can you give some more info on the total number of drives that were removed (1) because they stopped working and (2) because you predicted they were going to fail imminently. I’d like to get an idea of the number of working drives removed that were predicted to fail. Was any further testing done on these drives? Did they indeed fail?

  • Sam McLeod

    Hi Guys,

    I’m re-writing a script I found to report on critical SMART data from all local block devices – https://github.com/sammcj/smart_diskinfo

    It appears that smartmontools 6.4 doesn’t seem to display values 187 and 188 – I’m wondering if someone else has replaced those numbers recently – this is what is currently reported on:

    1 Raw_Read_Error_Rate
    3 Spin_Up_Time
    4 Start_Stop_Count
    5 Reallocated_Sector_Ct
    7 Seek_Error_Rate
    9 Power_On_Hours
    10 Spin_Retry_Count
    11 Calibration_Retry_Count
    12 Power_Cycle_Count
    192 Power-Off_Retract_Count
    193 Load_Cycle_Count
    194 Temperature_Celsius
    196 Reallocated_Event_Count
    197 Current_Pending_Sector
    198 Offline_Uncorrectable
    199 UDMA_CRC_Error_Count
    200 Multi_Zone_Error_Rate

  • Matthew Grab

    Is this just data from early 2014? 26 power cycles in 12 months is a lot for a server. I don’t know if you are using enterprise SAS drives or not. I’m not sure if a Windows reboot would count as a power cycle or not. But the servers in a datacenter that I’m familiar with usually remain powered on at all times. Power maintenance might happen once every 2 years, so a regular server would only do a hard power off and back on of the disk drives 0.5 times per year.

    • Phillip Remaker

      They use consumer grade drives to keep cost down, all SATA. Read their blog entries about their storage pod design.

  • Mathew Binkley

    I’m curious what software you use to log and manage the SMART stats? We have a large storage array with a number of different drive models and would love to start monitoring this.

    I did a regression analysis of a couple hundred drives vs SMART 5/187/197 (the others didn’t prove statistically significant or varied too much between models). “Drive will fail in less than 24 hours” was correlated with SMART 5, but the adjusted R^2 was only 0.022, so while correlated it offered nearly zero predictive value.

    “Read Element Failure”/“Test Element Failure” were strongly correlated with 187/197 (or more accurately, log10(187/197)).

    ===
    Coefficients: Estimate Std. Error t value Pr(>|t|)
    SMART_187_LOG10 0.27318 0.03495 7.815 1.95e-13 ***
    SMART_197_LOG10 0.17677 0.02803 6.306 1.45e-09 ***

    Multiple R-squared: 0.4658, Adjusted R-squared: 0.4612
    ===

    And I suspect the actual correlation is even stronger, and that better tracking of SMART values before failure would show that. But at least this gives us a good initial estimate of how close drives are to failure.

    Read Failure Predictor = 0.273 * SMART_187_LOG10 + 0.177 * SMART_197_LOG10

    We normally consider a drive “good” (even though it’s trending toward failure) if the predictor is < 0.4. If it’s between 0.4 and 0.8, we can do things like set the filesystem on the drive to read-only to extend the lifetime, and anything over 0.8 is a very good candidate for replacement. In case it helps anybody.

    • Brian Beach

      Nice work. Very interesting.

  • Oak

    What does Backblaze do with drives that fail either due to failure in production or at end of life?

    • Tim

      I’m curious too. Since this is customer data, do you just drill a hole and chuck’em, or do you send them in for warranty?

      • veelckoo

        Most likely drives are degaussed (magnetically erased) and destroyed.

  • smartctl -A /dev/sda | grep -E --color "^( 5|187|188|197|198).*|"

  • tahlyn

    ID 187 and 188 not in GSmartControl results

    Using GSmartControl on Win 8.1, I have 5 drives, but only one shows ID 187/188 under the ATTRIBUTES tab which is the Seagate Barracuda LP.

    Here are the drives:

    Hitachi HDS724040ALE640 – 4TB

    Western Digital VelociRaptor – WDC WD3000GLFS-01F8U0 – 300GB

    Western Digital Caviar Green – WDC WD20EADS-00R6B0 – 2TB

    Hitachi Deskstar 7K3000 – Hitachi HDS723030ALA640 – 3TB

    Seagate Barracuda LP – ST32000542AS – 2TB

    Since you say ID 187 is so important, is there a way to obtain the information for ID 187/188 for these other drives?

  • Philip Adderley

    Brian, thanks for a great article; can I ask you what values you deem unacceptable for the other 4 metrics apart from 187?
    SMART 5 – Reallocated_Sector_Count.???
    SMART 188 – Command_Timeout. ???

    SMART 197 – Current_Pending_Sector_Count. ???
    SMART 198 – Offline_Uncorrectable. ??? (i take it > 0 is bad news ;-)

    • Dennis Levens

      All we can do is guess, since they did not share this data (seems it was left out on purpose). From looking at the SMART data they did post (https://www.backblaze.com/blog-smart-stats-2014-8.html) this would be my best guess on the other 4

      SMART 5: Reallocated_Sector_Count
      1-4 keep an eye on it, more than 4 replace

      SMART 188: Command_Timeout
      1-13 keep an eye on it, more than 13 replace

      SMART 197: Current_Pending_Sector_Count
      1 or more replace

      SMART 198: Offline_Uncorrectable
      1 or more replace

      • Philip Adderley

        thanks so much Dennis for taking the time to reply – that’s of great help to me – Philip Adderley

  • For Windows users, can you recommend a command line tool or an installable program that can provide the values for the SMART attributes you wrote about? I tried Crystal Disk Info and the values are hard to figure out, and my hard drive doesn’t show the same attributes as mentioned in the blog.

    • Tom Miller

      argusmonitor.com

      • Tim

        Also check out HDparm to change the APM value of a drive (prevent head parking!)

    • Andy Turner

      HDDGuardian has worked well for me

  • Matthew Lichtenberger

    For Seagate values on Raw Read Error Rate, Seek Error Rate, and Hardware ECC Corrected, the raw values are apparently logarithmic. See http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html

  • Kevin Trumbull

    As you stated, the issue with SMART is that the standard is not all that rigorously defined, which results in variations between manufacturers which make statistical comparisons meaningless.

    The best diagnostic tool I’ve used (on a scale of hundreds of drives) is MHDD. MHDD is a freeware tool that “speaks” raw ATA. The “scan” command has provided me some insight into drive surface issues.

    Scan does the following: 1) start a timer, issue a read for a sector, report time length, repeat until no more sectors. Scan can also optionally try to force the drive to remap bad sectors when encountered.

    What’s especially useful is that “scan” outputs an overview of the latency of the sectors it’s scanning as it goes. I’ve found that high numbers of bad sectors (20 – ~100) are generally indicative of approaching failure. But interestingly, when the latency of large percentages of working sectors starts to climb it seems to be even more indicative of impending failure.

    The latency seems to be generated when the drive’s built-in error correction mechanisms kick in. I haven’t noticed SMART properly indicating that lots of sectors are requiring error correction.

    The issue with MHDD is that it requires the drive to be taken out of service. On the other hand, I do know that Brendan Gregg has managed to get that sort of info from a running server using DTrace (see “Shouting in the Datacenter” on YouTube). There’s a port of DTrace available for Linux now, and if I’m not mistaken there’s also a DTrace work-alike for Linux that might provide some insight.

    Best of luck, and thank you for sharing.
    – Kevin

    • Brian Beach

      Thanks for the info.

  • Gertdus

    Very cool stuff! I would also be very interested in a CSV or TSV file of the raw data. Are you planning to release this?

  • DieterT

    Interesting that 187 suggests (future) drive failure. I have a mere 4 drives that have been running between 32000 and 53000 hours. All drives have between 17 and 282 Reported_Uncorrect ticks. I run short self-tests 6 times a week and a long test once a week on each drive since they’ve been running. When bad blocks/sectors are encountered, 197 increases above 0 and the short or long test will reveal the LBA of the first error encountered. Since the drives are running in RAID1, I remove the drive from the array and ‘fix’ the drive by writing to the LBA that’s failing. At this time, 196 will increase suggesting that the bad sector has been remapped. A subsequent self-test indicates whether the remapping was successful and if so, the drive is added back to the array. After some time, 196 and 197 return back to 0.

    Some drives will frequently/repeatedly show bad sectors. If after repairing these drives, bad sectors continue to show up, then the drive is considered bad and removed/replaced. Of the 4 drives, I have replaced 1 in the last ~50000 hours. That was about 32000 hours ago and concerns the drive that has been running for ~32000 hours.

    While I’m willing to believe your conclusion about 187, for me it’s more of a question of how often 197 goes above 0 and what the self-test is revealing, not just whether 187 > 0. I’m concerned with whether 187 is leveling off at an asymptote or is more likely ascending. As such, it would have been more interesting (for me) to relate the rate of increase of 187 to drive failures.

  • sttv

    Can you release the full database of all raw data values? Your analysis of the data is very unsatisfactory; it could definitely be analyzed better. I am assuming you have everything organized with factor levels? You could host the CSV or TSV file off of your server.

  • AScientist

    I’m a bit confused about the y-axis of these charts. What does annual failure rate as a percentage mean? I thought it would be given n drives that share a property (e.g. 10-20 uncorrected reads) that (annual %)/100*n of those drives fail. But then you show failure rates >100%. Does r = “annual failure rate %” actually mean that n*exp(-r/100 t) have failed after t years?

    • Brian Beach

      Annual Failure Rate is a confusing term, although the math is actually a little simpler than you suggest.

      An annual failure rate of 100% means that if you have one disk drive slot and keep a drive running in it all the time, you can expect an average of one failure a year. If, on the other hand, you have one failure per month in your one drive slot, then your failure rate is 1200%. If you run n drives for t years with an annual failure rate of r, the number of failures is expected to be n * r * t.
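
      To put numbers on that: running 500 drives for 2 years at a 4% annual failure rate gives an expected 500 * 0.04 * 2 = 40 failures.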

      • AScientist

        Ah! thanks for clarifying. Great article btw. Also thanks for including error bars! It drives me nuts when I see blog posts/web articles with only averages and then the author tries to make conclusions based on probably insignificant differences.

  • Reefiasty

    Could you show/share Normalized_Read_Error_Rate grouped by drive model or at least by drive vendor? It may be not useful for someone who handles drives made by many vendors, but if someone uses only one vendor, it can be a different story.

  • bmlsayshi

    SMART 5 – Reallocated_Sector_Count. I have 1 of these on a drive, but only 1. Should I be worried if there is no other indication of drive trouble?

    • Reefiasty

      Do not replace the drive just yet. Keep an eye on the value. If it starts to rise, replace the drive immediately.

      Better have a backup solution in place which will let you survive a drive failure.

      • bmlsayshi

        Thank you

        • Robert Bohannon

          Usually, in my experience, the number will continue to grow and you will become more and more worried about your data as time goes on.

    • phuzz

      Make a backup *right now*, then decide if you need to replace the drive.

  • Sami Liedes

    Could you consider releasing an unbinned dump of the SMART 1 values and HDD models? More likely than not the data is structured (you could start looking at its hex value or individual bits), but all that is lost when binned. For example, if the high 8 bits contain some counter which does not correlate with failure and the low 16 bits contain something that does correlate, you are not going to see any significant correlation if you just interpret them as integers. Having the raw values and their frequencies would ease the analysis quite a bit.

  • Bltserv

    Another really good piece of information about a drive is “Power On Time.” It’s a Factory Log Page that can be looked at with software. I use SCSI ToolBox. Page 3Eh or Page 62 decimal. “PARAMETER CODE 0000h – Power-on Time. This parameter code represents the number of drive power-on minutes. Currently the Power-on Time parameter (0000h) is the only parameter in this Log Page that is visible to OEM/customers.”
    Most Enterprise drives use this Page. And many SATA drives too. SMART goes here to get its reporting intervals too.
    Most drives last about 5 years of continuous operation. From that point on you lose about 10% of that remaining lot each year. That’s from Enterprise Drives running at 10,000 RPM, mostly Fibre in a RAID.

    • brianb2backblaze

      Yes. The total time a drive has been running is an interesting number. It’s reported (in hours) as one of the SMART attributes.

      In my first blog post at Backblaze, I looked at how drive age relates to failure rates. I found that almost 80% of drives last 4 years: https://www.backblaze.com/blog/how-long-do-disk-drives-last/

      • Bltserv

        That’s SMART Attribute 09. SMART grabs that data from the Log Page I described. So if the Mfg of the drive does not include it in its SMART Log output. Some hide it. It’s in the drive’s Log Pages if they support it there. As is the true manufacture date. It’s Log Page 0Eh. I find the true health of a disk drive and its longevity is directly related to its pre-installation handling. If it was in its original 20 Pack from a palletized shipment when you get it, you’re good. But if it’s changed hands and been mishandled at any point, or packaged poorly, the falloff rate is pretty high. You might get a year or 2. And the higher the RPM the more sensitive to handling they become. Seagate did a huge white paper on this several years ago for its larger customers.

  • Arkadiusz Miśkiewicz

    Do you run smart tests (smartctl -t long /dev/xyz for example) and decide based on test results, too?