
I have an SSD that I suspect is failing silently now and then. I have run badblocks on it, and it is clear that the problem is not bad sectors; it might instead be some race condition in the electronics, in which case a retry would probably read the data correctly.

Normal magnetic disks include ECC, which corrects errors at the cost of extra space. Can Linux add an ECC layer on top of my block device?

I am thinking of something similar to device mapper, so maybe:

dmsetup create-ecc /dev/orig /dev/mapper/with_ecc

so that any read from or write to /dev/mapper/with_ecc is converted to an ECC read/write on /dev/orig.

Edit:

It seems others have been looking for it, too: http://permalink.gmane.org/gmane.linux.kernel.device-mapper.devel/8756

Ole Tange
  • SSDs use ECC as well. What error or behavior are you observing that makes you think the disk is failing? – wingedsubmariner Sep 20 '13 at 12:23
  • *Generally* running `badblocks` on a SSD seems like a bad idea because of the limited write cycle count (and writing something across the entire device is bound to stress even perfect wear leveling algorithms). Just saying. – user Sep 20 '13 at 12:50
  • @wingedsubmariner I never see errors. I see odd behaviour that is not reproducible. Sorry to be so vague. – Ole Tange Sep 20 '13 at 14:10
  • @MichaelKjörling Since the alternative is scrapping the device (life time = 0), then the life time will in my case actually increase. – Ole Tange Sep 20 '13 at 14:12
  • What kind of "odd behavior" are you observing? – user Sep 20 '13 at 14:19
  • @MichaelKjörling `badblocks` (by default) is a read-only test, which an SSD should be fine with. – derobert Sep 20 '13 at 15:42
  • 3
    I'm not sure the exact symptoms you're seeing, but are you sure you're not seeing bad RAM? – derobert Sep 20 '13 at 16:02
  • @derobert Good point. I was thinking read/write test. – user Sep 20 '13 at 17:06

2 Answers


btrfs and zfs are engineered for data integrity.

By default, btrfs duplicates metadata in single-device configurations. I think you can duplicate data too, although I've never done it.
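If you want the data duplicated as well, something like this should work (a sketch: `-d dup` on a single device needs a reasonably recent btrfs-progs, and `/dev/sdX` / `/mnt` are placeholders for your device and mount point):

```shell
# Create a single-device btrfs filesystem that stores two copies
# of both metadata (-m dup) and data (-d dup). Usable capacity
# is roughly halved.
mkfs.btrfs -m dup -d dup /dev/sdX
mount /dev/sdX /mnt

# Or convert the data profile of an existing filesystem to dup:
btrfs balance start -dconvert=dup /mnt
```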

zfs has copies=n, which I think of as RAID 1 on a single disk. Note that the redundancy you choose reduces usable space and can hurt write performance. Fortunately, you can set the number of copies on a per-dataset basis.
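A sketch of how that looks in practice (`tank`, `tank/important`, and `/dev/sdX` are placeholder names):

```shell
# Create a pool on the suspect SSD, then a dataset that keeps
# two copies of every block:
zpool create tank /dev/sdX
zfs create -o copies=2 tank/important

# copies can also be raised later, but only data written after
# the change gets the extra copies:
zfs set copies=3 tank/important
```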

Check this blog post from Richard Elling / Oracle regarding zfs on single device. Unfortunately none of the graph images are loading for me.

Both real and anecdotal evidence suggests that unrecoverable errors can occur while the device is still largely operational. ZFS has the ability to survive such errors without data loss. Very cool. Murphy's Law will ultimately catch up with you, though. In the case where ZFS cannot recover the data, ZFS will tell you which file is corrupted. You can then decide whether or not you should recover it from backups or source media.
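To actively look for such errors rather than wait to stumble on them, you can scrub the pool and then ask ZFS what it found (pool name `tank` is a placeholder):

```shell
# Read and verify every checksummed block in the pool, in the
# background, repairing from the extra copies where possible:
zpool scrub tank

# Afterwards, show error counts and the names of any files
# ZFS could not recover:
zpool status -v tank
```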

random

Run a SMART self-test on the SSD. As root, execute the following command (replace /dev/sda with the device name of your SSD):

smartctl -t long /dev/sda

This will take several hours. When it's completed, you can query the result with

smartctl -a /dev/sda | less

Scroll down to the SMART Self-test log block. If the topmost result says Completed without error, your SSD is fine. If it reports an error, the drive is damaged and you need to save your data as soon as possible.
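You can also jump straight to the relevant sections instead of paging through the full output:

```shell
# Overall health assessment (PASSED/FAILED):
smartctl -H /dev/sda

# Just the self-test log, including the result of the long test:
smartctl -l selftest /dev/sda
```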

Martin von Wittich