Is there any way to determine what may have happened in the following scenario? We have a RAID5 array that should originally have been built with 5 drives.

As it stands, this is some of the state that I'm seeing:

/dev/md/dcp-data:
           Version : 1.2
     Creation Time : Fri Aug 16 14:15:40 2019
        Raid Level : raid5
        Array Size : 23441679360 (22355.73 GiB 24004.28 GB)
     Used Dev Size : 7813893120 (7451.91 GiB 8001.43 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Jan 21 13:00:36 2020
             State : active, degraded, recovering
    Active Devices : 3
   Working Devices : 4
    Failed Devices : 0
     Spare Devices : 1

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

    Rebuild Status : 16% complete

              Name : localhost:dcp-data
              UUID : 0bd03b0a:59e1665c:d393f6fe:a032dac6
            Events : 165561

    Number   Major   Minor   RaidDevice State
       0       8      113        0      active sync   /dev/sdh1
       1       8      129        1      active sync   /dev/sdi1
       5       8      160        2      spare rebuilding   /dev/sdk
       4       8      177        3      active sync   /dev/sdl1

[root@cinesend ~]# lsblk
NAME                                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdc                                             8:32   0 465.8G  0 disk
└─sdc1                                          8:33   0 465.8G  0 part  /mnt/drive-51a7a5af
sdd                                             8:48   0 465.8G  0 disk
└─sdd1                                          8:49   0 465.8G  0 part  /mnt/drive-299a7133
sde                                             8:64   0 223.6G  0 disk
├─sde1                                          8:65   0   200M  0 part  /boot/efi
├─sde2                                          8:66   0   200M  0 part  /boot
└─sde3                                          8:67   0 223.2G  0 part
  └─luks-cf912397-326e-42eb-a729-bce4de6bff14 253:0    0 223.2G  0 crypt /
sdh                                             8:112  0   7.3T  0 disk
└─sdh1                                          8:113  0   7.3T  0 part
  └─md127                                       9:127  0  21.9T  0 raid5 /mnt/library
sdi                                             8:128  0   7.3T  0 disk
└─sdi1                                          8:129  0   7.3T  0 part
  └─md127                                       9:127  0  21.9T  0 raid5 /mnt/library
sdj                                             8:144  0   7.3T  0 disk
sdk                                             8:160  0   7.3T  0 disk
└─md127                                         9:127  0  21.9T  0 raid5 /mnt/library
sdl                                             8:176  0   7.3T  0 disk
└─sdl1                                          8:177  0   7.3T  0 part
  └─md127                                       9:127  0  21.9T  0 raid5 /mnt/library

There is also one relevant mail message:

Date: Tue, 21 Jan 2020 07:59:30 -0800 (PST)

This is an automatically generated mail message from mdadm
running on cinesend

A DegradedArray event had been detected on md device /dev/md127.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4]
md127 : active (auto-read-only) raid5 sdh1[0] sdl1[4] sdi1[1] sdk[5](S)
      23441679360 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UU_U]
      bitmap: 0/59 pages [0KB], 65536KB chunk

What happened here? Originally, /dev/sdj should have been part of the RAID5 array, but lsblk now shows it with no partitions and it is not a member at all. Meanwhile, /dev/sdk (added as a whole disk, unlike the other members, which are partitions) is rebuilding back into the array as a spare...

But RAID5 shouldn't be able to sustain two drive failures, should it? And the metadata now reports only 4 Raid Devices rather than the 5 we started with.

What are some possible scenarios for what happened here?
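
For what it's worth, this is roughly the inspection I was planning to do next to reconstruct the history; a sketch only, assuming the devices still carry readable md superblocks and that logs from around the failure are still retained (the /var/log/messages path and the date filter are guesses for this host):

[root@cinesend ~]# mdadm --examine /dev/sdh1 /dev/sdi1 /dev/sdl1 /dev/sdk
[root@cinesend ~]# mdadm --examine /dev/sdj
[root@cinesend ~]# journalctl --since "2020-01-20" | grep -iE 'md127|raid5'
[root@cinesend ~]# grep -i md127 /var/log/messages

The --examine output for each member should show its event counter, update time and "Device Role", which ought to reveal which slot dropped out and when; running it against /dev/sdj (no partitions at all, per lsblk) should tell whether it still holds any md metadata from the original array.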

  • RAID6 is capable of coping with 2 concurrent disk failures; RAID5 is not. – Vlastimil Burián Jan 21 '20 at 21:18
  • Right, so I'm a bit confused about what may have happened here - we originally built the array with 5 drives as a RAID5. And yet now there's one "spare" and 3 active... Shouldn't it have failed completely? – Rail24 Jan 21 '20 at 21:57
  • I'm unsure; I've never felt desperate enough to build RAID**5**, simply because it's unreliable. If you're building with only 4 disks, do yourself a favor next time and make it a RAID**6** ... [my implementation procedure](https://unix.stackexchange.com/a/320330/126755). – Vlastimil Burián Jan 22 '20 at 05:15
