As I mentioned before (here and here), you really can’t trust your hardware to stay healthy all by itself. It can overheat because of bad case design or dirty fans, or it can just burn out because of a bad PSU. It can also die of old age, which can show up as all kinds of weird symptoms, from random freezes to programs that crash all the time. You can test for bad RAM using the free Memtest86+, which is conveniently packaged with Ubuntu’s live CD, and you can test your drives using their built-in SMART capabilities.
SMART (or S.M.A.R.T) stands for Self-Monitoring, Analysis, and Reporting Technology, and it’s basically extra sensors and firmware added to your hard disks so that they can detect hardware failures and other conditions, such as the drive’s temperature. The tool of choice on Linux to access SMART status is Smartmontools, which turned out to be most useful.
The package is installed on Ubuntu (or any other Debian-based distro) by invoking sudo apt-get install smartmontools. You may also want to install the postfix package at the same time, as smartmontools will use the Postfix mail system to warn you if something goes wrong, should you enable the SMART dæmon. The SMART dæmon will run periodic tests, monitor your drives’ health, and send you an automated mail should something bad happen. Saved my life once. Well, saved my data, anyway.
Invoking sudo smartctl -a /dev/adrive will print the current SMART status of your drive (with adrive being something like sda, or whichever drive you’re interested in). The first piece of interesting information is the drive’s identification. On one of my drives, it reads:
=== START OF INFORMATION SECTION ===
Device Model:     WDC XXXXXXXX-XXXXXX
Serial Number:    WD-XXXXXXXXXXXX
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Oct 10 19:34:44 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Should SMART support be disabled, enable it in your machine’s BIOS if possible. It may also be that the drive is not SMART-capable, which is unlikely if the drive is somewhat recent.
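If you want to check this from a script rather than by eye, here is a minimal Python sketch. It assumes you have already captured the text of the information section (for instance from sudo smartctl -i /dev/adrive) and that the lines follow the "Key: value" layout shown above. The function names parse_info_section and smart_enabled are mine, not part of smartmontools.

```python
# Hypothetical helpers; they only parse text that smartctl has already printed.

SAMPLE = """\
=== START OF INFORMATION SECTION ===
Device Model:     WDC XXXXXXXX-XXXXXX
Firmware Version: 01.00A01
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""


def parse_info_section(text):
    """Collect 'Key: value' lines into a dict of lists.

    A list is used because smartctl prints 'SMART support is' twice.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # section banner or blank line: no colon at all
        info.setdefault(key.strip(), []).append(value.strip())
    return info


def smart_enabled(info):
    """True if one of the 'SMART support is' lines says Enabled."""
    return any(v.startswith("Enabled") for v in info.get("SMART support is", []))


print(smart_enabled(parse_info_section(SAMPLE)))  # True
```

Splitting on the first colon only is deliberate: values such as the local time contain colons of their own.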
It also shows you all the performance/statistics registers gathered by your drive. On the same drive, sudo smartctl -a /dev/adrive also prints:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   110   110   021    Pre-fail  Always       -       7458
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6877
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       961199
194 Temperature_Celsius     0x0022   118   111   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
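A table like this is regular enough to check mechanically. The Python sketch below assumes the ten-column layout shown above and takes the text as a string (for instance the output of sudo smartctl -A /dev/adrive). The names parse_attributes, find_problems, and WATCHED are hypothetical, and the list of attributes whose raw value I treat as "should be zero" is my own choice, not something smartmontools defines.

```python
# Hypothetical helpers for the attribute table printed by smartctl.

# Assumption: for these attributes, any non-zero raw value deserves a look.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   118   111   000    Old_age   Always       -       29
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {attribute_name: fields} for every row of the attribute table."""
    attrs = {}
    for line in text.splitlines():
        # Split into at most 10 fields so a raw value containing spaces stays whole.
        fields = line.split(None, 9)
        if len(fields) != 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue  # header, banner, or unrelated line
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "when_failed": fields[8],
            "raw": fields[9].strip(),
        }
    return attrs


def find_problems(attrs):
    """List human-readable warnings; an empty list means nothing was flagged."""
    problems = []
    for name, a in attrs.items():
        # An attribute has failed when its normalized value is at or below the threshold.
        if a["value"] <= a["thresh"]:
            problems.append(f"{name}: value {a['value']} <= threshold {a['thresh']}")
        if name in WATCHED and a["raw"].split()[0] != "0":
            problems.append(f"{name}: raw value is {a['raw']}")
    return problems


print(find_problems(parse_attributes(SAMPLE)))  # []
```

Note that the comparison uses the normalized VALUE column, not the raw one: normalized values start high and decrease as things get worse, which is why a low THRESH is the danger line.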
So this drive’s in good health. It is at a happy 29°C, and it reports no errors of any kind. Had errors occurred, they would show in this report. But it is sometimes wise to run a long low-level test of the entire drive. Invoking
sudo smartctl -t long /dev/sda
will launch a long (full) test on your drive. smartctl will tell you how long it will take (likely a few hours for a 1TB drive). The good thing is that the test is taken care of by the firmware of the drive, and you can continue working normally. If your drive is healthy, you get (after a while) a report (invoking sudo smartctl -a /dev/adrive) such as:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        17         -
otherwise you’d get something like:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read error         70%        17         19137113
and it indicates where the error occurred.
My advice is to get all the data you can from this drive because while read errors do not mean imminent failure, they’re not a good sign. Better safe than sorry.
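If you would rather have a script tell you when it is time to start copying, the self-test log can be parsed too. This is only a sketch in Python: it assumes log rows that start with "#" and end with the remaining percentage, the lifetime in hours, and the LBA of the first error (or "-"), as in the reports above. The name parse_selftest_log is hypothetical, and the text would come from something like sudo smartctl -l selftest /dev/adrive.

```python
import re

# One log row: "# <num>  <description and status>  <NN>%  <hours>  <LBA or ->"
ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest_log(text):
    """Return one dict per self-test log row; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        num, summary, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            # Description and status are kept together, with runs of spaces collapsed.
            "summary": " ".join(summary.split()),
            "remaining_percent": int(remaining),
            "lifetime_hours": int(hours),
            # "-" means the test never hit an error.
            "first_error_lba": None if lba == "-" else int(lba),
        })
    return entries


log = "# 1  Extended offline    Completed: read error         70%        17         19137113"
for entry in parse_selftest_log(log):
    if entry["first_error_lba"] is not None:
        print("error at LBA", entry["first_error_lba"], "-", entry["summary"])
```

I keep the test description and the status in a single field on purpose: without knowing the exact column widths, there is no reliable way to tell where "Extended offline" ends and "Completed: read error" begins.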