
I recently got an automated email saying "WARNING: mismatch_cnt is not 0 on /dev/md3".

I'm running a software RAID5 array using mdadm on CentOS 6.6.

When Googling the message I found this:

> The check operation scans the drives for bad sectors and automatically repairs them. If it finds good sectors that contain bad data (the data in a sector does not agree with what the data from another disk indicates that it should be, for example the parity block + the other data blocks would cause us to think that this data block is incorrect), then no action is taken, but the event is logged. This "do nothing" allows admins to inspect the data in the sector and the data that would be produced by rebuilding the sectors from redundant information and pick the correct data to keep.
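The description above relies on RAID5 parity being the XOR of the data blocks in a stripe, which is why any single missing or suspect block can be recomputed from the others. A toy single-byte illustration (the values are hypothetical):

```shell
# RAID5 parity in miniature: parity is the XOR of the data blocks,
# so any one block can be reconstructed from the parity plus the rest.
d1=0x5a; d2=0x3c          # two "data blocks" (hypothetical values)
p=$(( d1 ^ d2 ))          # the "parity block"
echo $(( p ^ d2 ))        # reconstructs d1: prints 90 (0x5a)
```

A mismatch means this relationship does not hold for some stripe, and md alone cannot tell which of the blocks is the wrong one.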

My question is: how do I inspect the data and pick the correct data to keep? There doesn't seem to be any mention of how to do this anywhere, and I have no idea which files these sectors affect.

user3800918
2 Answers


The stupid and time-consuming method:

For each disk, assemble the RAID with that disk missing, and mount it. Compare the files across those mounts; any file that differs between assemblies contains your mismatch.

Do this in a rescue system, where your RAID is not running. To make sure no changes are made to the RAID members themselves, create read-only loop devices for them.

# losetup --find --show --read-only /dev/diska
/dev/loop0
# losetup --find --show --read-only /dev/diskb
/dev/loop1
# losetup --find --show --read-only /dev/diskc
/dev/loop2

Assemble with one disk missing:

# mdadm --assemble --run --readonly /dev/md42 /dev/loop0 /dev/loop1
mdadm: /dev/md42 has been started with 2 drives (out of 3).
# mount -o ro /dev/md42 loop/
# md5sum loop/file
95e3afde4229e266cb49f1d6e3fba705  file

Assemble with another disk missing (repeat for each disk in turn, so that every disk is the missing one once):

# mdadm --stop /dev/md42
# mdadm --assemble --run --readonly /dev/md42 /dev/loop0 /dev/loop2
mdadm: /dev/md42 has been started with 2 drives (out of 3).
# mount -o ro /dev/md42 loop/
# md5sum loop/file
679c261d076f268a880c0fe847739e64  file

So there you have a differing file. Which version, if either, is the correct one is for you to decide.

Locating the mismatch address directly would certainly be smarter; I don't know whether md can be coerced into giving you those addresses, though. You would then still have to find which file in your filesystem that address belongs to. How easily that can be done depends on the filesystem.

frostschutz
  • I've accepted this as it's the first time I've seen an answer to this (and I did quite a bit of Googling before posting). It's a shame it's so long winded but it would be useful for any critical data. It would be good if the RAID layer could talk to the filesystem and use CRCs to pick the correct data or at least display a list of filenames – user3800918 Dec 15 '14 at 21:10
  • I'm no expert when it comes to recovering data, but - if you have the problematic sector numbers, and the filesystem on the array allows you to map cluster to files as shown on [link](https://wiki.archlinux.org/index.php/Identify_damaged_files), this approach could be made significantly quicker. I've just had this happen on an array with 4 disks, with ~6.5 TB of data - no way am I spending hours upon hours to compute hashes for the 10000s of files on there. – myxal Jan 05 '21 at 20:56
  • @myxal since sometime in 2017 there is a possible syslog warning `mdX: mismatch sector in range yyy-zzz` (in 1K sector units). If there are multiple mismatches, this list may be incomplete (rate limited). At the time of answer (2014) this message did not exist, and mismatch_cnt still only gives a count, no sector offsets, so there was nothing to go on. – frostschutz Jan 06 '21 at 02:12
  • Thanks @frostschutz, I actually ran into another QnA on this - [link](https://unix.stackexchange.com/a/291079/100809) - are you sure about the syslog message addresses being 1K blocks? In my logs, I'm seeing ranges, always of size 8, so I assumed the addresses were of 512B sectors. I did the calculation using that number (r_from / 8 * 3; having a RAID5 with 4 disks), but plugging that into debugfs did not get me files that would change when different degraded arrays were assembled. However, if I assume these are 1K blocks, then I get numbers out of range for the filesystem. – myxal Jan 06 '21 at 12:00
  • @myxal OK scratch that, I'm completely wrong. You're right, it's 512 byte sectors relative to member device data section and you have to extrapolate (depending on number of data drives) to get filesystem offset. Thanks for pointing it out! – frostschutz Jan 06 '21 at 15:21
  • @myxal so for a 6-disk raid5, known corruption at byte offset `2173312768`-`2175409920`, it reports `mismatch sector in range 849160-849168`. That `*512*5` is `2173849600-2173870080`. That's still somewhat off. Today is my bad math day... – frostschutz Jan 06 '21 at 15:34
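The arithmetic being worked out in these comments can be sketched as follows, assuming the kernel-log range addresses 512-byte sectors relative to a member's data area and that multiplying by the number of data disks approximates a byte offset into the array (as the last comment notes, the result can still be somewhat off):

```shell
# Rough sketch of the offset arithmetic from the comments above.
# Assumes "mismatch sector in range 849160-849168" reported in the kernel
# log, 512-byte sectors relative to a member's data area, and a 6-disk
# RAID5 (5 data disks). The result is approximate at best.
start=849160
data_disks=5
echo $(( start * 512 * data_disks ))   # prints 2173849600
```

Treat any offset computed this way as a starting point for inspection, not as an exact location.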

In the case of RAID5, there is no alternate data; it simply means the checksum does not match. Write `repair` to `/sys/block/md0/md/sync_action` to scan the array and recompute any mismatching checksums.
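Concretely, the sysfs knob looks like this (`/dev/md0` stands in for whatever array reported the mismatch; note that `repair` rewrites parity to match the data blocks, so if a data block itself is the corrupt one, this cements the corruption rather than fixing it):

```shell
# Trigger a repair pass on the array (run as root, array name assumed).
echo repair > /sys/block/md0/md/sync_action
# Shows "repair" while the pass runs, "idle" when it has finished.
cat /sys/block/md0/md/sync_action
# After a subsequent "check" pass, this should read 0 again.
cat /sys/block/md0/md/mismatch_cnt
```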

psusi
  • There *is* alternate data, even with RAID5: it's what you get by ignoring the data from one of the disks (as if it had failed) and recomputing it based on the checksum and the other disks. It's even possible for that version of the data to be the "correct" one. – Wyzard Dec 15 '14 at 05:35
  • @Wyzard, I suppose, but then there are several alternates, not just one, and there's no real way to find and inspect them, and the correct one is almost certainly just the regular data and it's the checksum that is wrong. – psusi Dec 15 '14 at 14:47
  • I wouldn't say it's "almost certainly" the checksum is wrong. If there's a discrepancy, it's probably because of corruption on one of the drives, and that could happen on any of them. (The alternative is a bug in the kernel code that calculates the checksum, but that code is well-tested and is the same for everyone, so there'd be mismatches on lots of arrays if that were the problem.) – Wyzard Dec 15 '14 at 15:42
  • And yes, there are multiple versions of the data and no way to automatically determine which one is right. That's why the system doesn't try to repair the inconsistency automatically, but leaves it for a human to examine manually instead. – Wyzard Dec 15 '14 at 15:44
  • @Wyzard, drives don't just suddenly and silently corrupt the data on them. The mismatch is either caused by creating the array initially with --assume-clean when they were not actually clean, or power loss or crash in the middle of a write, in which case, there isn't much point in trying to recover some small part of some file that didn't finish being written. – psusi Dec 15 '14 at 15:53
  • There was a loss of power about a week ago. Once the power came back on all of the RAIDs rebuilt themselves as normal and the following day were all showing clean. A post I read online said that a mismatch_cnt value less than 128 on RAID1 (especially if it has a high number of writes) is usually fine and you should only worry if you see it on RAID5 (hence my worry) – user3800918 Dec 15 '14 at 20:56
  • I have done long SMART tests on all 6 drives overnight and they all passed. I also found the script that sent me the email and re-ran it. The mismatch_cnt is now showing 0. There should have been very few read/writes on that array when the email came through but the server has had high CPU load lately and since the server uses software RAID it could just have been that the scan checked a sector that hadn't finished writing to all disks. – user3800918 Dec 15 '14 at 21:04
  • @user3800918: mismatch_cnt on RAID1 can have "false positives", where inconsistency happened because an mmapped page was modified again between the time it was DMA-copied to one disk and the time it was DMA-copied to the other disk. It's still dirty at this point, but if the file is removed (or it's swap space), then you end up with out-of-sync free space that nothing will ever read (except the md check action). – Peter Cordes Feb 28 '16 at 03:06
  • raid5 is different: non-zero mismatch_cnt is always a problem. This answer isn't very good because there's no way for RAID5 to figure out which data is wrong. It could rewrite the sector on any one drive, based on all the other drives. – Peter Cordes Feb 28 '16 at 03:10
  • @PeterCordes, "because there's no way for raid5 to figure out which data is wrong" -- that's exactly what I said. You agree with me then downvote? – psusi Feb 28 '16 at 04:38
  • The other answer gives a way to look at all possible versions, to **pick one based on things RAID5 doesn't know about**, rather than giving up and picking one more or less randomly. We agree that RAID5 can't find the answer on its own, but I'm disagreeing with your conclusion that we should give up at this point. It would be really nice to figure out which block in the RAID5 might be broken (so you can e.g. check it for consistency, if it's a file format that has a checksum). – Peter Cordes Feb 28 '16 at 04:51
  • @PeterCordes, as I said in my second comment to Wyzard, in theory that may be, but unless you know of a way to actually *do* it, then it isn't possible in practice. – psusi Feb 28 '16 at 21:00
  • [This is as far as I've gotten so far](http://unix.stackexchange.com/questions/266432/md-raid5-translate-md-internal-sector-numbers-to-offsets). If I can map the mismatching sector to offset(s) inside the md device, then find the file containing those blocks, then I'm all set to use frostschutz's answer to this question. (Or if it turns out that the file is ok, copy the file and then rewrite the original to rewrite the stripe with fresh parity.) Frostschutz's answer here already does provide a method that's usable in practice (but very slow and cumbersome, hence not using it directly). – Peter Cordes Feb 29 '16 at 05:12