In short
Following the guide How to Set Up a RAID 1 Under GNU/Linux, I have set up a RAID 1. Checking the RAID Array for validity in various ways, everything appeared to be normal.
However, after rebooting the system, the RAID Array was not working. The partition in question was not mounted as instructed in /etc/fstab. Assembling the Array manually worked and no data has been lost.
Newly added internal/external disks, which led to a change of the disk device names (such as a disk being "renamed" from sdd to sde by the system), led me to believe that the problem was related to this renaming. This should be irrelevant, however, as RAID Arrays are (also) built using unique UUIDs.
The actual question is: why does the Array fail to assemble during the boot process? Or, put differently, what does the boot script of Funtoo, the operating system on which all of this takes place, do regarding the mdadm --assemble step?
The long story
Following the above-referenced step-by-step guide, I set up a RAID 1 under Funtoo. Checking the RAID 1 Array for validity was done in several ways, mostly using the functionality of the mdadm tool itself.
Specifically, the Array's details were retrieved using the mdadm tool with the -D flag. The disks that belong to the Array were examined using the -E flag. The respective configuration file, mdadm.conf, can simply be read to verify that it contains the correct instructions (i.e. which md device, what its UUID is, and more). Finally, watching the file /proc/mdstat was important to ensure that both disks were active and "synced".
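In command form, the checks boiled down to something like this (a summary of what I ran, with the device names from my system):
# Array-level report
mdadm -D /dev/md0
# per-member superblock reports
mdadm -E /dev/sdc1
mdadm -E /dev/sdd1
# configuration file and kernel view
cat /etc/mdadm.conf
cat /proc/mdstat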
Below follows even more detailed information about the situation I was confronted with.
mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Thu Jul 18 00:25:05 2013
Raid Level : raid1
Array Size : 1953382208 (1862.89 GiB 2000.26 GB)
Used Dev Size : 1953382208 (1862.89 GiB 2000.26 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu Jul 18 10:33:37 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : Resilience:0 (local to host Resilience)
UUID : 73bf29ca:89bff887:79a26531:b9733d7a
Events : 6455
Number Major Minor RaidDevice State
2 8 33 0 active sync /dev/sdc1
1 8 49 1 active sync /dev/sdd1
From the command history, I did the following
# trying to follow the guide -- various tests...
...
979 18.Jul.13 [ 00:09:07 ] mdadm --zero-superblock /dev/sdd1
980 18.Jul.13 [ 00:09:17 ] mdadm --create /dev/md0 --level=1 --raid-disks=2 missing /dev/sdd1
990 18.Jul.13 [ 00:15:58 ] mdadm --examine --scan
# creating/checking the configuration file
992 18.Jul.13 [ 00:16:17 ] cat /etc/mdadm.conf
993 18.Jul.13 [ 00:16:33 ] mdadm --examine --scan | tee /etc/mdadm.conf
994 18.Jul.13 [ 00:16:35 ] cat /etc/mdadm.conf
# after some faulty attempts, finally it goes
997 18.Jul.13 [ 00:24:45 ] mdadm --stop /dev/md0
998 18.Jul.13 [ 00:24:54 ] mdadm --zero-superblock /dev/sdd1
999 18.Jul.13 [ 00:25:04 ] mdadm --create /dev/md0 --level=1 --raid-disks=2 missing /dev/sdd1
1005 18.Jul.13 [ 00:26:39 ] mdadm --examine --scan | sudo tee /etc/mdadm.conf
1046 18.Jul.13 [ 03:42:57 ] mdadm --add /dev/md0 /dev/sdc1
The configuration file /etc/mdadm.conf reads:
ARRAY /dev/md/0 metadata=1.2 UUID=73bf29ca:89bff887:79a26531:b9733d7a name=Resilience:0
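For completeness, mdadm.conf may also carry a DEVICE line telling mdadm which block devices to scan when assembling; a sketch of what my file would look like with that standard directive added (the ARRAY line is the one above, unchanged):
# scan all partitions listed in /proc/partitions
DEVICE partitions
ARRAY /dev/md/0 metadata=1.2 UUID=73bf29ca:89bff887:79a26531:b9733d7a name=Resilience:0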
All works fine, as can be seen from /proc/mdstat:
Personalities : [raid6] [raid5] [raid4] [raid1] [raid0] [raid10] [linear] [multipath]
md0 : active raid1 sdc1[2] sdd1[1]
1953382208 blocks super 1.2 [2/2] [UU]
unused devices: <none>
Thereafter, the disks were synced and access was fine. Shutting down the system, adding another disk (external, via USB, or internal), and restarting the system caused the RAID 1 to stop working! The reason is, if I am not mistaken, the change of the disk device names.
In this example, the former sdd1 became sde1, while sdd1 was taken by a "new" internal disk or an external USB HDD attached before starting the system and set to be mounted automatically.
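The renaming is easy to see via the persistent names the system also provides (the commands below are generic checks, not copied from my session):
# stable symlinks that survive device-name reshuffling
ls -l /dev/disk/by-id/
ls -l /dev/disk/by-uuid/
# filesystem / RAID-member UUIDs per partition
blkid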
It was very easy to "recover" the "failed" Array by removing all other disks, stopping the Array, and re-assembling it. Some of the commands issued while trying and, finally, successfully getting the Array back were:
# booting and unsuccessfully trying to add the "missing" disk
1091 18.Jul.13 [ 10:22:53 ] mdadm --add /dev/md0 /dev/sdc1
1092 18.Jul.13 [ 10:28:26 ] mdadm --assemble /dev/md0 --scan
1093 18.Jul.13 [ 10:28:39 ] mdadm --assemble /dev/md0 --scan --force
1095 18.Jul.13 [ 10:30:36 ] mdadm --detail /dev/md0
# reading about `mdadm`, trying to "stop", incomplete command though
1096 18.Jul.13 [ 10:30:45 ] mdadm stop
1097 18.Jul.13 [ 10:31:12 ] mdadm --examine /dev/sdd
1098 18.Jul.13 [ 10:31:16 ] mdadm --examine /dev/sdd1
1099 18.Jul.13 [ 10:31:20 ] mdadm --examine /dev/sdc
1100 18.Jul.13 [ 10:31:21 ] mdadm --examine /dev/sdc1
# reading again, stop it -- the right way
1101 18.Jul.13 [ 10:33:19 ] mdadm --stop /dev/md0
# assemble & check
1102 18.Jul.13 [ 10:33:25 ] mdadm --assemble /dev/md0 --scan
1111 18.Jul.13 [ 10:34:17 ] mdadm --examine /dev/sd[cd]1
# does the Array have a UUID?
1112 18.Jul.13 [ 10:37:36 ] UUID=$(mdadm -E /dev/sdd1|perl -ne '/Array UUID : (\S+)/ and print $1')
# below, learning how to report on the Array
1115 18.Jul.13 [ 10:42:26 ] mdadm -D /dev/md0
1116 18.Jul.13 [ 10:45:08 ] mdadm --examine /dev/sd[cd]1 >> raid.status
1197 18.Jul.13 [ 13:16:59 ] mdadm --detail /dev/md0
1198 18.Jul.13 [ 13:17:29 ] mdadm --examine /dev/sd[cd]1
1199 18.Jul.13 [ 13:17:41 ] mdadm --help
1200 18.Jul.13 [ 13:18:41 ] mdadm --monitor /dev/md0
1201 18.Jul.13 [ 13:18:53 ] mdadm --misc /dev/md0
However, I would expect this not to happen, and to be able to play it safe, as with the rest of the disks/partitions, where mounting is based on UUIDs and/or LABELs.
The corresponding entry in /etc/fstab reads (note, I skipped the options nosuid,nodev)
/dev/md0 geo xfs defaults 0 2
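If mounting by the filesystem's UUID makes the entry independent of device naming, the line could look like this instead (the UUID below is a placeholder, not the real one from my system; the real value would come from blkid /dev/md0, and the rest mirrors my current entry):
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  geo  xfs  defaults  0 2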
Details for sdc1, output from mdadm -E /dev/sdc1
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 73bf29ca:89bff887:79a26531:b9733d7a
Name : Resilience:0 (local to host Resilience)
Creation Time : Thu Jul 18 00:25:05 2013
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
Array Size : 1953382208 (1862.89 GiB 2000.26 GB)
Used Dev Size : 3906764416 (1862.89 GiB 2000.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 5552ba2d:8d79c88f:c995d052:cef0aa03
Update Time : Fri Jul 19 11:14:19 2013
Checksum : 385183dd - correct
Events : 6455
Device Role : Active device 0
Array State : AA ('A' == active, '.' == missing)
Details for the partitions in question
Details for sdd1, output from mdadm -E /dev/sdd1
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 73bf29ca:89bff887:79a26531:b9733d7a
Name : Resilience:0 (local to host Resilience)
Creation Time : Thu Jul 18 00:25:05 2013
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
Array Size : 1953382208 (1862.89 GiB 2000.26 GB)
Used Dev Size : 3906764416 (1862.89 GiB 2000.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 076acfd8:af184e75:f83ce3ae:8e778ba0
Update Time : Fri Jul 19 11:14:19 2013
Checksum : c1df68a0 - correct
Events : 6455
Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing)
After adding, again, a "new" internal disk, and rebooting, I experience the same problem.
mdadm -E /dev/sdd1 reports
mdadm: No md superblock detected on /dev/sdd1.
while mdadm -E /dev/sde1
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 73bf29ca:89bff887:79a26531:b9733d7a
Name : Resilience:0 (local to host Resilience)
Creation Time : Thu Jul 18 00:25:05 2013
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
Array Size : 1953382208 (1862.89 GiB 2000.26 GB)
Used Dev Size : 3906764416 (1862.89 GiB 2000.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 076acfd8:af184e75:f83ce3ae:8e778ba0
Update Time : Fri Jul 19 11:34:47 2013
Checksum : c1df6d6c - correct
Events : 6455
Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing)
and mdadm --detail /dev/md0
mdadm: md device /dev/md0 does not appear to be active.
while cat /proc/mdstat reads
Personalities : [raid6] [raid5] [raid4] [raid1] [raid0] [raid10] [linear] [multipath]
md0 : inactive sde1[1](S)
1953382400 blocks super 1.2
unused devices: <none>
Note: as per Gilles' observation, at this point the reported blocks (1953382400) do not match the Array Size (1953382208) reported above (or below). Obviously, something went wrong here.
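My own arithmetic on the two numbers (an interpretation on my part, not something the tools state explicitly): the inactive md0 seems to report the member's available size rather than the array size:
# Avail Dev Size, in 512-byte sectors, halved to 1 KiB blocks
3906764800 / 2 = 1953382400    # what /proc/mdstat shows for the inactive md0
# Used Dev Size, halved the same way
3906764416 / 2 = 1953382208    # the Array Size reported by mdadm -D and -E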
The (partial) output of mdadm -Evvvvs is
/dev/sde1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 73bf29ca:89bff887:79a26531:b9733d7a
Name : Resilience:0 (local to host Resilience)
Creation Time : Thu Jul 18 00:25:05 2013
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
Array Size : 1953382208 (1862.89 GiB 2000.26 GB)
Used Dev Size : 3906764416 (1862.89 GiB 2000.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 076acfd8:af184e75:f83ce3ae:8e778ba0
Update Time : Fri Jul 19 11:34:47 2013
Checksum : c1df6d6c - correct
Events : 6455
Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing)
/dev/sde:
MBR Magic : aa55
Partition[0] : 3907026944 sectors at 2048 (type fd)
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 73bf29ca:89bff887:79a26531:b9733d7a
Name : Resilience:0 (local to host Resilience)
Creation Time : Thu Jul 18 00:25:05 2013
Raid Level : raid1
Raid Devices : 2
Avail Dev Size : 3906764800 (1862.89 GiB 2000.26 GB)
Array Size : 1953382208 (1862.89 GiB 2000.26 GB)
Used Dev Size : 3906764416 (1862.89 GiB 2000.26 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 5552ba2d:8d79c88f:c995d052:cef0aa03
Update Time : Fri Jul 19 11:34:47 2013
Checksum : 385188a9 - correct
Events : 6455
Device Role : Active device 0
Array State : AA ('A' == active, '.' == missing)
/dev/sdc:
MBR Magic : aa55
Partition[0] : 3907026944 sectors at 2048 (type fd)
Checking with fdisk -l, the previous sdc and sdd disks are now sdb and sde (I recognise these disks by their size among the rest of the drives). It seems that "it" is still looking at/for sdc1 and sdd1?
Following suggestions found in the comment section, I added more details.
As per derobert's suggestion in the comments, the Array was stopped and re-assembled successfully:
# stop it!
mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# re-assemble -- looks good!
mdadm --assemble -v --scan
mdadm: looking for devices for /dev/md/0
mdadm: no RAID superblock on /dev/sdf1
mdadm: no RAID superblock on /dev/sdf
mdadm: no RAID superblock on /dev/sde
mdadm: no RAID superblock on /dev/sdd1
mdadm: no RAID superblock on /dev/sdd
mdadm: no RAID superblock on /dev/sdc
mdadm: no RAID superblock on /dev/sdb
mdadm: no RAID superblock on /dev/sda6
mdadm: no RAID superblock on /dev/sda5
mdadm: no RAID superblock on /dev/sda4
mdadm: no RAID superblock on /dev/sda3
mdadm: no RAID superblock on /dev/sda2
mdadm: no RAID superblock on /dev/sda1
mdadm: no RAID superblock on /dev/sda
mdadm: /dev/sde1 is identified as a member of /dev/md/0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md/0, slot 0.
mdadm: added /dev/sde1 to /dev/md/0 as 1
mdadm: added /dev/sdc1 to /dev/md/0 as 0
mdadm: /dev/md/0 has been started with 2 drives.
# double-check
mdadm --detail --scan
ARRAY /dev/md/0 metadata=1.2 name=Resilience:0 UUID=73bf29ca:89bff887:79a26531:b9733d7a
New question: how is this to be fixed without losing data? As per the discussion and recommendations in the comments, is it related to the boot process? Permissions on the mount point in question, perhaps?
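One thing I considered, in case the configuration file has gone stale, is regenerating it from the currently running Array (my own guess, reusing the same redirection as during the initial setup):
mdadm --detail --scan | tee /etc/mdadm.conf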
mdadm was not registered as a boot service. Adding it (see the sketch just below) and rebooting did not fix the issue, though.
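Registering it looked roughly like the following (a sketch on my part: I am assuming Funtoo's OpenRC here, and the exact name of the init script installed by sys-fs/mdadm may differ):
# add the RAID-assembly init script to the boot runlevel
rc-update add mdraid boot
A few more, probably interesting, details from dmesg on where it fails: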
[ 25.356947] md: raid6 personality registered for level 6
[ 25.356952] md: raid5 personality registered for level 5
[ 25.356955] md: raid4 personality registered for level 4
[ 25.383630] md: raid1 personality registered for level 1
[ 25.677100] md: raid0 personality registered for level 0
[ 26.134282] md: raid10 personality registered for level 10
[ 26.257855] md: linear personality registered for level -1
[ 26.382152] md: multipath personality registered for level -4
[ 41.986222] md: bind<sde1>
[ 44.274346] XFS (md0): SB buffer read failed
[ 55.028598] ata7: sas eh calling libata cmd error handler
[ 55.028615] ata7.00: cmd ef/05:fe:00:00:00/00:00:00:00:00/40 tag 0
[ 55.046186] ata7: sas eh calling libata cmd error handler
[ 55.046209] ata7.00: cmd ef/c2:00:00:00:00/00:00:00:00:00/40 tag 0
[ 55.278378] ata8: sas eh calling libata cmd error handler
[ 55.278406] ata8.00: cmd ef/05:fe:00:00:00/00:00:00:00:00/40 tag 0
[ 55.349235] ata8: sas eh calling libata cmd error handler
[ 55.349290] ata8.00: cmd ef/c2:00:00:00:00/00:00:00:00:00/40 tag 0
[ 105.854112] XFS (md0): SB buffer read failed
Further checking the XFS partition in question, sde1:
xfs_check /dev/sde1
xfs_check: /dev/sde1 is not a valid XFS filesystem (unexpected SB magic number 0x00000000)
xfs_check: WARNING - filesystem uses v1 dirs,limited functionality provided.
xfs_check: read failed: Invalid argument
cache_node_purge: refcount was 1, not zero (node=0x22a64a0)
xfs_check: cannot read root inode (22)
bad superblock magic number 0, giving up
It turns out that the XFS partition in question is not healthy.
A damaged XFS filesystem, although not impossible, is likely not the root of the problem here. According to Gilles' comment, partitions that are part of a RAID Array are not, by themselves, XFS filesystems: the filesystem starts at an offset, since there is a RAID header first.
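If so, the filesystem should be examined through the assembled md device rather than through the raw member partition, along these lines (a sketch; xfs_repair -n only reports problems, it does not write anything):
# assemble the Array first, then check the filesystem inside it
mdadm --assemble /dev/md0 --scan
xfs_check /dev/md0
# or a read-only check
xfs_repair -n /dev/md0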
Question(s)
In the beginning
Is it possible to "lock" the RAID 1 Array so that it only works with specific disks/partitions, independently of the disk device names?
Would it be, for example, sufficient to use the Array's UUID in /etc/fstab so as to be immune to changes of the disk device names?
After researching the problem further
- At what stage of Funtoo's boot process is the assembly of the RAID Array attempted? How exactly? Where can it be modified/adjusted?