
Yesterday I wanted to do some maintenance on my server. I shut it down by pressing the power button once, which has always worked fine.

When the server was still shutting down after 10 minutes, I called it a day and forced it off with the power button. (I tried to get in via SSH before forcing it off, but the ssh service had already stopped.)

After doing the maintenance and rebooting the server, I noticed that my RAID5 consisting of 7x 2TB disks no longer worked. It had been split into two arrays, one with 5 disks and one with 2, with all members marked (S) (spare) and both arrays inactive.

I tried `mdadm --assemble --scan --run -f`, which did not help:

mdadm: Merging with already-assembled /dev/md/128
mdadm: failed to add /dev/sdc1 to /dev/md/128: Invalid argument
mdadm: failed to add /dev/sde1 to /dev/md/128: Invalid argument
mdadm: failed to RUN_ARRAY /dev/md/128: Input/output error
mdadm: No arrays found in config file or automatically

It seemed to half-assemble things:

cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md128 : inactive sda1[0] sdg1[6] sdf1[5] sdd1[7] sdb1[1]
      9766891962 blocks super 1.2

unused devices: <none>

I also tried re-assembling it manually using `mdadm --assemble --run /dev/md0 /dev/sd[abcdefg]1 --verbose`:

mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sde1 is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sdf1 is identified as a member of /dev/md0, slot 6.
mdadm: /dev/sdg1 is identified as a member of /dev/md0, slot 5.
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: failed to add /dev/sdc1 to /dev/md0: Invalid argument
mdadm: failed to add /dev/sde1 to /dev/md0: Invalid argument
mdadm: added /dev/sdd1 to /dev/md0 as 4
mdadm: added /dev/sdg1 to /dev/md0 as 5
mdadm: added /dev/sdf1 to /dev/md0 as 6
mdadm: added /dev/sda1 to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

Examining all disks with `mdadm --examine /dev/sd[abcdefg]1` gives this output (see it on hastebin.com), which to me looks like everything is fine.

Here are the disks according to `lsblk`:

NAME                      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                         8:0    0   1,8T  0 disk 
└─sda1                      8:1    0   1,8T  0 part 
sdb                         8:16   0   1,8T  0 disk 
└─sdb1                      8:17   0   1,8T  0 part 
sdc                         8:32   0   1,8T  0 disk 
└─sdc1                      8:33   0   1,8T  0 part 
sdd                         8:48   0   1,8T  0 disk 
└─sdd1                      8:49   0   1,8T  0 part 
sde                         8:64   1   1,8T  0 disk 
└─sde1                      8:65   1   1,8T  0 part 
sdf                         8:80   1   1,8T  0 disk 
└─sdf1                      8:81   1   1,8T  0 part 
sdg                         8:96   1   1,8T  0 disk 
└─sdg1                      8:97   1   1,8T  0 part 

The HDDs in use are not the best, but they work. SMART output for all drives from sda to sdg can also be found on hastebin.com.

Since two disks of my RAID5 produce errors, I assume that all data is already lost. ...

EDIT 1:

`dmesg -T` returns:

[Sa Okt  7 15:41:08 2017] md/raid:md128: device sda1 operational as raid disk 0
[Sa Okt  7 15:41:08 2017] md/raid:md128: device sdf1 operational as raid disk 6
[Sa Okt  7 15:41:08 2017] md/raid:md128: device sdb1 operational as raid disk 1
[Sa Okt  7 15:41:08 2017] md/raid:md128: device sdd1 operational as raid disk 4
[Sa Okt  7 15:41:08 2017] md/raid:md128: device sdg1 operational as raid disk 5
[Sa Okt  7 15:41:08 2017] md/raid:md128: not enough operational devices (2/7 failed)
[Sa Okt  7 15:41:08 2017] md/raid:md128: failed to run raid set.
[Sa Okt  7 15:41:08 2017] md: pers->run() failed ...
[Sa Okt  7 15:41:12 2017] md: md127 stopped.
[Sa Okt  7 15:41:15 2017] md: md128 stopped.
[Sa Okt  7 15:41:20 2017] md: md0 stopped.
[Sa Okt  7 15:41:20 2017] md: sdc1 does not have a valid v1.2 superblock, not importing!
[Sa Okt  7 15:41:20 2017] md: md_import_device returned -22
[Sa Okt  7 15:41:20 2017] md: sde1 does not have a valid v1.2 superblock, not importing!
[Sa Okt  7 15:41:20 2017] md: md_import_device returned -22

How can I repair the superblocks?


Am I doing something wrong here?

Why do I get:

mdadm: failed to add [...] to [...]: Invalid argument?

What argument is invalid here?

How can I debug this further?
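One standard first check when mdadm refuses members is whether all members agree on the event counter reported by `mdadm --examine`; members that fell out of the array earlier show a lower count. A minimal sketch of that comparison, run here against a sample dump with made-up device names and values rather than the real output:

```shell
# in reality: mdadm --examine /dev/sd[abcdefg]1 > /tmp/examine.txt
# sample dump with made-up values so the sketch runs anywhere
cat > /tmp/examine.txt <<'EOF'
/dev/sda1:
         Events : 18627
/dev/sdb1:
         Events : 18627
/dev/sdc1:
         Events : 18624
EOF

# members whose Events counter lags fell out of the array earlier
grep -E '^/dev/|Events' /tmp/examine.txt
```

In this made-up sample, sdc1's counter lags behind the others, which would mark it as the member that dropped out first.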

Flatron
  • please show `mdadm --examine /dev/sd*` for all drives – frostschutz Oct 07 '17 at 14:09
  • Please see the link I also posted in my question: https://hastebin.com/yujebiwite.sql – Flatron Oct 07 '17 at 14:45
  • Missed that, sorry. Very odd it says `Unused Space : before=262056 sectors, after=18446744073709289480 sectors`. Can you check device sizes? (`blockdev --getsize64 /dev/sd*`) Which `mdadm` version? – frostschutz Oct 07 '17 at 15:05
  • There you go: https://hastebin.com/itorilagey.pas – Flatron Oct 07 '17 at 15:35
  • for the partitions? Also `parted -l`, do those two devices stand out somehow, like GPT vs. msdos table scheme? – frostschutz Oct 07 '17 at 16:41
  • I can reproduce your problem - happens if the correct metadata is on a too small partition. Thus please re-check your device (partition) sizes and/or see if your kernel supports your partitioning scheme. – frostschutz Oct 07 '17 at 18:36
  • Thanks so far for the replies. What I did that most likely caused this was restoring the partitions on sdc and sde after the issue first occurred. I saw in lsblk that no sdc1 and sde1 partitions were shown anymore, but they were present in cfdisk. So what I did was use testdisk to restore the partitions. I assume this is where I broke things. My reason for doing this was that mdadm was not able to even try rebuilding the RAID without sdc1 and sde1 "officially" existing. Using /dev/sd[ce] did not work, as the RAID was initially on its own partition, not on the block device itself. – Flatron Oct 08 '17 at 09:15
  • @frostschutz Sorry, I somehow did not notice that you had discovered the problem (or rather the decisive symptom) an hour before my answer. You should write another answer explaining the partition problem so that future readers can find the real solution in an answer and not just in the comments. – Hauke Laging Oct 08 '17 at 09:46
  • If only I knew what exactly is broken. At the moment I don't really know how to proceed, as everything basically looks the way it should. I don't even know what really caused the RAID to fail initially. I can only speculate. Currently I am still trying to get someone to point me in the right direction. @HaukeLaging – Flatron Oct 08 '17 at 13:10
  • You have restored the partitions in the same locations and you still get `Unused Space : before=262056 sectors, after=18446744073709289480 sectors`? – Hauke Laging Oct 08 '17 at 13:13
  • @HaukeLaging I will test this tomorrow after some sleep. I made the partitioning mistake in the middle of the night, *classic*. I will use another disk, partitioned like the rest, and `dd` the remaining RAID partition onto the new disk's partition. – Flatron Oct 08 '17 at 19:43
  • Here is `fdisk -l /dev/sd[abcdefg]` > https://hastebin.com/fojemibubu.nginx - I think the issue can be seen at `sda` and `sde` again as the partitioning is off. After creating a backup I will remove the partitions and re-create them like the others. I am certain the data has not moved but due to my `testdisk` fail I moved the partition. – Flatron Oct 09 '17 at 06:33
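The plan from the last comment, recreating the partition with exactly the same start sector as on the healthy disks, can be rehearsed safely first: `sfdisk` works on plain image files, so a sketch like the following touches no real disk (the paths and the start sector are illustrative, not taken from this system):

```shell
img=/tmp/fake-disk.img
truncate -s 64M "$img"

# write a single partition starting at sector 2048 (a common default;
# the healthy members would be the reference on the real system)
printf 'label: dos\nstart=2048, type=fd\n' | sfdisk "$img" >/dev/null

# dump the table; diffing such dumps between a healthy and a restored
# disk shows immediately whether start or size differ
sfdisk -d "$img"
```

On the real system one would compare `sfdisk -d /dev/sda` (healthy) against `sfdisk -d /dev/sdc` (restored) and fix any mismatching `start=` values.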

1 Answer


Warning: This answer addresses the decisive symptom, but it turned out that the real cause was different from what I suggested.

However it may have happened, the problem is probably this:

Unused Space : before=262056 sectors, after=177 sectors
Unused Space : before=262056 sectors, after=177 sectors
Unused Space : before=262056 sectors, after=18446744073709289480 sectors
Unused Space : before=262056 sectors, after=177 sectors
Unused Space : before=262056 sectors, after=18446744073709289480 sectors
Unused Space : before=262056 sectors, after=177 sectors
Unused Space : before=262056 sectors, after=177 sectors
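That huge `after=` value is the giveaway: it is a negative number printed as unsigned 64-bit. In other words, those two partitions end *before* the point where the metadata says the data area ends, i.e. they are too small. The arithmetic, using the value from the output above (`bc` handles the 64-bit wraparound, assuming it is installed):

```shell
# 2^64 minus the printed value gives the shortfall in sectors
echo '2^64 - 18446744073709289480' | bc
# prints 262136: the partition is 262136 sectors (~128 MiB) too small

# the same fact the other way around: -262136 printed as unsigned 64-bit
printf '%u\n' -262136
# prints 18446744073709289480, exactly the value mdadm reported
```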

I cannot offer a pleasant way of correcting that. You should make a backup of the MD metadata of sdc1, then study the on-disk format and fix it with a hex editor.

Maybe you can simply copy the relevant part with `dd` from one of the other disks. You "just" have to find out where these bytes are.
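The backup itself is the easy part: the v1.2 superblock sits 4096 bytes into the member device, so grabbing the first MiB covers it with room to spare. A sketch that uses a scratch file instead of the real /dev/sdc1, so it is safe to run as-is:

```shell
# stand-in for /dev/sdc1; swap in the real device for the actual backup
member=/tmp/fake-sdc1
truncate -s 16M "$member"

# the v1.2 superblock starts at byte offset 4096 of the member device,
# so the first MiB comfortably covers it
dd if="$member" of=/tmp/sdc1-md-header.bin bs=1M count=1 2>/dev/null

wc -c < /tmp/sdc1-md-header.bin
# 1048576
```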

What is kind of funny is this:

   Checksum : 85f67f98 - correct
   Checksum : 6a4fb921 - correct
   Checksum : 92db2c10 - correct
   Checksum : ad5c81b8 - correct
   Checksum : a657023 - correct
   Checksum : 6880d6c7 - correct
   Checksum : c0c31cf - correct

So correcting the metadata by hand will probably invalidate the checksum. I do not know whether that is a real problem, but at that point it probably makes sense to ask a new question.

Hauke Laging
  • Thanks for your reply. Please have a look at my last comment above. Most likely this caused the issue. – Flatron Oct 08 '17 at 09:23
  • Can you go a bit deeper into how to use what hex editor to change what exactly? I'd greatly appreciate this. – Flatron Oct 08 '17 at 13:12
  • @Flatron Currently I assume that there is nothing wrong with the MD metadata and that the problem is caused by a problem with the partitions. Information about the superblock format is here: https://raid.wiki.kernel.org/index.php/RAID_superblock_formats – Hauke Laging Oct 08 '17 at 13:16