Does the md subsystem output any messages (to syslog/systemd-journal) to indicate that it's running in a degraded state (or anything else that might indicate that it has successfully reacted to a drive failure, as hinted at here)?

For example, I see lots of errors from `sd` indicating things like `Unrecovered read error`, but I don't see anything like "retried successfully on alternate". Maybe no news is good news?

Back in the day, mirroring software/hardware would generate syslog entries that indicated when a device was degraded or otherwise required attention. Does md not do that?

Background: the systems in question are already deployed and are being remotely monitored (via syslog/journald info, so no mdadm or any other interactive commands/access of any sort are available at this point).

jhfrontz
  • `cat /proc/mdstat` is an essential starting point – roaima Oct 06 '19 at 17:22
  • Previously posted (and closed as "off topic") at [serverfault](https://serverfault.com/questions/980072/what-should-i-expect-to-see-if-md-linux-raid-is-properly-compensating-for-a-fail). – jhfrontz Oct 06 '19 at 17:22
  • @roaima please see "no interactive commands/access of any sort are available". – jhfrontz Oct 06 '19 at 17:23
  • I see that. I still say that it's an essential starting point, but not as an answer because of your restrictions. (I'm looking to see what other options you've got.) – roaima Oct 06 '19 at 17:24
  • Without interactive access, how do you expect to recover from a failure? If a disk fails and is failed out of the array, then when you replace it you need to tell the `md` driver to add the replacement disk. – Stephen Harris Oct 06 '19 at 17:28
  • @StephenHarris Recovery is by dispatching a technician to the site. – jhfrontz Oct 06 '19 at 20:02
  • @roaima that's the message that I was looking for (I was able to search for `Disk failure` and identify an issue). If you make that an answer, I'll accept. – jhfrontz Oct 07 '19 at 23:28

1 Answer

I set up a quick test on a RAID 1 array built from two loop devices.

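```shell
# Create two 100 MiB image files to act as the mirror halves
dd bs=1M count=100 if=/dev/zero >/tmp/0.img
cp /tmp/0.img /tmp/1.img
# Attach each image to the first free loop device and print its name
i0=$(losetup --show --find /tmp/0.img); echo "$i0"
i1=$(losetup --show --find /tmp/1.img); echo "$i1"
# Build a two-device RAID 1 array from the loop devices
mdadm --create /dev/md99 --metadata default --level 1 --raid-devices 2 "$i0" "$i1"
```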

Setting one half of the mirror faulty

```shell
mdadm --manage /dev/md99 --set-faulty "$i1"    # For me, $i1=/dev/loop1
```

gives me this from the kernel (amongst other related RAID1 messages):

```
Oct 6 17:36:10 pi kernel: [4087450.030438] md/raid1:md99: Disk failure on loop1, disabling device
Oct 6 17:36:10 pi kernel: [4087450.030438] md/raid1:md99: Operation continuing on 1 devices.
```
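
Since the monitoring here only has the log stream to work with, a filter on that `Disk failure on` line is one way to turn it into an alert. The following is a minimal sketch, not a tested monitoring setup: `md_failures` is a hypothetical helper name, and the pattern assumes the message format shown in my test above (other kernel versions or RAID levels may word it differently).

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "<array> <device>" for every md "Disk failure on" message.
# Assumes the format "md/<personality>:<array>: Disk failure on <device>, ...".
md_failures() {
    sed -n 's|.*md/\([^:]*\):\([^:]*\): Disk failure on \([^,]*\),.*|\2 \3|p'
}

# Example using the two log lines from the test above
printf '%s\n' \
    'Oct 6 17:36:10 pi kernel: [4087450.030438] md/raid1:md99: Disk failure on loop1, disabling device' \
    'Oct 6 17:36:10 pi kernel: [4087450.030438] md/raid1:md99: Operation continuing on 1 devices.' \
    | md_failures
# prints: md99 loop1
```

On a host where the journal is readable, the same function could be fed from something like `journalctl -k --no-pager | md_failures`; on a central syslog collector it would read the collected log file instead.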
roaima