So, during a recent dnf update on my system, the system logs show that at some point SELinux denied the kernel write access, and the raid device was then marked as "Failed".
I got the system running again by taking the raid device out of fstab. (I had rebooted, since there was a new kernel, and didn't yet know that the raid device had stopped working.)
So, everything now looks fine except that the raid device will not "start", and so I did a
for x in a b c d e f; do
    mdadm --examine /dev/sd${x}1 > mdstats.$x
done
and then looked at the state of each of the raid members in the mdstats.* files. This was actually good news: all the devices were marked "clean", and all had exactly the same number of "Events".
To look at the state of the raid device from the command line, I did
% cat /proc/mdstat
Personalities :
md127 : inactive sdf1[3] sde1[0] sda1[6] sdb1[5] sdd1[1] sdc1[4]
17581198678 blocks super 1.2
unused devices: <none>
I'll include two of the mdadm --examine outputs in case I've missed something; they all look similar.
% mdadm --examine /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : e90d6691:7ad14e07:68ee7f2a:f5bbcfbc
Name : LostCreek:0
Creation Time : Wed Mar 25 11:56:03 2020
Raid Level : raid6
Raid Devices : 6
Avail Dev Size : 5860399104 sectors (2.73 TiB 3.00 TB)
Array Size : 11720531968 KiB (10.92 TiB 12.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 131072 sectors
Super Offset : 8 sectors
Unused Space : before=130992 sectors, after=133120 sectors
State : clean
Device UUID : 100e1983:e323daa4:a28faf75:68afde64
Internal Bitmap : 8 sectors from superblock
Update Time : Fri Aug 5 03:45:48 2022
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : f6f4c925 - correct
Events : 128119
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
and for comparison:
% mdadm --examine /dev/sdf1
/dev/sdf1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : e90d6691:7ad14e07:68ee7f2a:f5bbcfbc
Name : LostCreek:0
Creation Time : Wed Mar 25 11:56:03 2020
Raid Level : raid6
Raid Devices : 6
Avail Dev Size : 5860400015 sectors (2.73 TiB 3.00 TB)
Array Size : 11720531968 KiB (10.92 TiB 12.00 TB)
Used Dev Size : 5860265984 sectors (2.73 TiB 3.00 TB)
Data Offset : 131072 sectors
Super Offset : 8 sectors
Unused Space : before=130992 sectors, after=134031 sectors
State : clean
Device UUID : 8a99c085:51469860:99f42094:5dea9904
Internal Bitmap : 8 sectors from superblock
Update Time : Fri Aug 5 03:45:48 2022
Bad Block Log : 512 entries available at offset 24 sectors
Checksum : 73871a85 - correct
Events : 128119
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAAAA ('A' == active, '.' == missing, 'R' == replacing)
And in these, the only lines that vary between all 6 devices are
1. /dev/sda1
11. Avail Dev Size
16. Unused Space
18. Device UUID : 100e1983:e323daa4:a28faf75:68afde64
23. Checksum : f6f4c925 - correct
29. Device Role : Active device 3
The leading numbers are the line numbers of the files created with the for loop at the top of this post, obtained via a
vimdiff mdstat.?
The "Update Time", "Events", etc. all match. The "Device Role" lines (the last item noted as different) are, as expected, all different, running from 0 to 5.
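For anyone checking their own array this way, the event-count comparison can also be done without vimdiff. A quick sanity check (assuming the mdstats.* files created by the loop at the top of this post) that every member reports the same event count:

```shell
# Print the distinct "Events" values across the saved --examine outputs;
# a consistently-stopped array should show exactly one value.
awk -F: '/Events/ {gsub(/ /,"",$2); print $2}' mdstats.* | sort -u
```

If that prints a single number (128119 in my case), all members agree and no device has fallen behind.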
This all looks like it should work, but after much Google searching, I believe I will need the following command to get it back to a functional state.
mdadm --create /dev/md127 --level=6 --raid-devices=6 --chunk=512 --name=LostCreek:0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 --assume-clean
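One thing that worries me about this (my own caution, not from the linked answers): as I understand it, with --create --assume-clean the order of the devices on the command line becomes the Device Role order, so listing them alphabetically is only safe if the roles happen to be alphabetical. A quick way to read the role order back out of the saved --examine files:

```shell
# Print "file : Device Role : Active device N" sorted by role,
# so the --create device list can be checked against it (role 0 first).
grep -H 'Device Role' mdstats.* | sort -t: -k3
```

From my two examples above, sda1 is "Active device 3" and sdf1 is "Active device 2", so I will want to verify the order for all six before running --create.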
This is taken, in part, from the question/answer of "MDADM - Disaster recovery or move on" and "mdadm re-added disk treated as spare".
I am about to try this, but wanted to see if I'm missing anything. I also haven't found any similar case where no drive actually failed, but SELinux was simply "keeping me safe" by stopping a write in the software raid. (I had SELinux at "permissive" and thought I was safe, but now it is "disabled".)
Hopefully this will be useful to others, especially if it "just works". The issue seems sufficiently rare that I expect to answer my own question, but I'd really rather not have to restore the whole raid from a backup that is over a month old. (And yes, as always, I should have more frequent backups, but losing it won't be the end of the world, thankfully.)
Cheers
Mike