I’ve shared a lot of Backblaze data about hard drive failure statistics. While our system handles a drive failing, we prefer to predict drive failures, and use the hard drives’ built-in SMART metrics to help. The dirty industry secret? SMART stats are inconsistent from hard drive to hard drive.
With nearly 40,000 hard drives and over 100,000,000 GB of data stored for customers, we have a lot of hard-won experience. Below, see which 5 SMART stats are good predictors of drive failure, along with the data we have started to analyze from all of the SMART stats to find which others predict failure.
Every disk drive includes Self-Monitoring, Analysis, and Reporting Technology (SMART, https://en.wikipedia.org/wiki/S.M.A.R.T.), which reports internal information about the drive. Initially, we collected a handful of stats each day, but at the beginning of 2014 we overhauled our disk drive monitoring to capture a daily snapshot of all of the SMART data for each of the 40,000 hard drives we manage. We used Smartmontools to capture the SMART data.
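To make the daily snapshot idea concrete, here is a minimal sketch of reading SMART attributes with Smartmontools from Python. It is not our production collector: the function names parse_attributes and snapshot are hypothetical, the sample rows are made up for illustration, and it assumes the usual ten-column attribute table that `smartctl -A` prints for ATA drives.

```python
import re
import subprocess


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} from `smartctl -A` text output."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        # RAW_VALUE may carry extra text (e.g. "31 (Min/Max 20/45)");
        # keep only the leading integer.
        match = re.match(r"\d+", parts[9])
        if match is None:
            continue
        attrs[int(parts[0])] = (parts[1], int(match.group()))
    return attrs


def snapshot(device):
    """Run `smartctl -A` on one device (hypothetical helper, needs smartctl installed)."""
    # smartctl uses a bitmask exit status, so a non-zero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_attributes(result.stdout)


# Made-up sample output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (Min/Max 20/45)
"""

attributes = parse_attributes(SAMPLE)
```

Running something like snapshot("/dev/sda") once a day per drive and storing the result would give a per-drive history of raw values. The device path is only an example.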
But, before we dig into the data, we first need to define what counts as a failure.
What is a Failure?
Backblaze counts a drive as failed when it is removed from a Storage Pod and replaced because it has 1) totally stopped working, or 2) shown evidence of failing soon.
A drive is considered to have stopped working when it appears physically dead (e.g., won’t power up), doesn’t respond to console commands, or the RAID system tells us that the drive can’t be read or written.
To determine whether a drive is going to fail soon, we use SMART statistics as evidence, so that we can remove the drive before it fails catastrophically or impedes the operation of the Storage Pod volume.
From experience, we have found the following 5 SMART metrics indicate impending disk drive failure:
- SMART 5 – Reallocated_Sector_Count.
- SMART 187 – Reported_Uncorrectable_Errors.
- SMART 188 – Command_Timeout.
- SMART 197 – Current_Pending_Sector_Count.
- SMART 198 – Offline_Uncorrectable.
Based on our experience and input from others in the industry, we chose these 5 stats because they are consistent across manufacturers and are good predictors of failure.
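As a rough illustration of how these five stats might be checked, the sketch below flags any watched attribute whose raw value exceeds a threshold. The function name warning_signs and the default threshold of 0 are assumptions made for this example, not a rule stated above, and the input is assumed to be a plain mapping from SMART attribute ID to raw value.

```python
# The five SMART attributes listed above, keyed by attribute ID.
WATCHED = {
    5: "Reallocated_Sector_Count",
    187: "Reported_Uncorrectable_Errors",
    188: "Command_Timeout",
    197: "Current_Pending_Sector_Count",
    198: "Offline_Uncorrectable",
}


def warning_signs(raw_values, threshold=0):
    """Return {name: raw_value} for watched attributes above the threshold.

    raw_values maps SMART attribute ID -> raw value. Attributes a drive does
    not report are treated as 0. The threshold is an illustrative assumption.
    """
    return {
        name: raw_values[attr_id]
        for attr_id, name in WATCHED.items()
        if raw_values.get(attr_id, 0) > threshold
    }


# Example: attribute 9 (power-on hours) is ignored because it is not watched.
flags = warning_signs({5: 0, 9: 12000, 187: 3, 197: 0})
```

An empty result means none of the five stats gives a reason to look closer. A non-empty one is a prompt to investigate the drive, not an automatic verdict that it has failed.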