I'm running Debian stable with two NVMe drives in RAID 1.
The hardware and hoster are described here: https://www.hetzner.com/dedicated-rootserver/ex62-nvme?country=us
Almost every second day, mdadm monitoring reports a Fail event and leaves the array degraded.
Only one partition is marked as failed each time, as you can see here:

This is an automatically generated mail message from mdadm
running on xxx

A Fail event had been detected on md device /dev/md/2.

It could be related to component device /dev/nvme1n1p3.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 nvme1n1p3[1](F) nvme0n1p3[0]
      465895744 blocks super 1.2 [2/1] [U_]
      bitmap: 4/4 pages [16KB], 65536KB chunk

md0 : active (auto-read-only) raid1 nvme1n1p1[1] nvme0n1p1[0]
      33521664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme0n1p2[0] nvme1n1p2[1]
      523712 blocks super 1.2 [2/2] [UU]

unused devices: <none>

This happens on both disks: one time it's nvme0n1p3, the next time it's nvme1n1p3.
I then just re-add the failed partition with

mdadm --re-add /dev/md2 /dev/nvme0n1p3

or

mdadm --re-add /dev/md2 /dev/nvme1n1p3

and after the resync it works for a day or two.
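
To confirm which arrays are degraded before and after a re-add, the member status map at the end of each "blocks" line in /proc/mdstat can be checked. The following is only a minimal sketch and not part of the original setup: the function name degraded_arrays is made up for illustration, and it assumes the mdstat layout shown in the mail above.

```shell
# List md arrays whose member status map (e.g. [U_]) shows a missing member.
# Reads mdstat-formatted text from the file given as $1 (default: /proc/mdstat).
# "degraded_arrays" is a hypothetical helper name, not an mdadm command.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }               # remember the current array
        /blocks/ && $NF ~ /^\[[U_]*_[U_]*\]$/ {    # status map containing "_"
            print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without an argument it reads /proc/mdstat; for the excerpt above it would print md2. mdadm --detail /dev/md2 shows the per-member state in more detail.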

In dmesg I found this:

[94879.144892] nvme nvme1: I/O 311 QID 1 timeout, reset controller
[94879.252851] nvme nvme1: completing aborted command with status: 0007
[94879.252970] blk_update_request: I/O error, dev nvme1n1, sector 452352001
[94879.253091] nvme nvme1: completing aborted command with status: fffffffc
[94879.253223] blk_update_request: I/O error, dev nvme1n1, sector 68159504
[94879.253418] md: super_written gets error=-5
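
The excerpt is consistent with this sequence: the controller stops answering a command, the kernel resets it, the aborted commands come back as I/O errors, and md fails the member when its superblock write returns error -5 (EIO). To see whether one controller times out more often than the other, the kernel log can be tallied. This is a rough sketch that assumes the message format shown above; nvme_timeouts is a hypothetical helper name.

```shell
# Count "timeout" events per NVMe controller.
# Reads kernel log text on stdin, e.g.:  dmesg | nvme_timeouts
# "nvme_timeouts" is a made-up helper name for illustration.
nvme_timeouts() {
    awk '
        /nvme nvme[0-9]+: .*timeout/ {
            # find the "nvmeN:" field regardless of how the timestamp splits
            for (i = 1; i <= NF; i++)
                if ($i ~ /^nvme[0-9]+:$/) {
                    sub(/:$/, "", $i)
                    count[$i]++
                    break
                }
        }
        END { for (c in count) print c, count[c] }
    ' | sort
}
```

If the timeouts are spread across both controllers, as the alternating failures suggest, that would point away from a single worn-out drive, though it does not prove anything by itself.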

I tried to check the health of the devices with the commands below, but they don't report attributes like "Reallocated_Sector_Ct" or "Reported_Uncorrect".

smartctl -x /dev/nvme1

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KXG50ZNV512G TOSHIBA
Serial Number:                      28SS10F6TYST
Firmware Version:                   AAGA4102
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon May 13 10:34:11 2019 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     78 Celsius
Critical Comp. Temp. Threshold:     82 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    3  3  3  3     1500    1500
 4 -   0.0050W       -        -    4  4  4  4     6000   14000
 5 -   0.0030W       -        -    5  5  5  5    50000   80000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        47 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    57%
Data Units Read:                    31,858,921 [16.3 TB]
Data Units Written:                 293,589,002 [150 TB]
Host Read Commands:                 4,130,502,428
Host Write Commands:                889,121,505
Controller Busy Time:               13,552
Power Cycles:                       7
Power On Hours:                     6,720
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               47 Celsius

Error Information (NVMe Log 0x01, max 128 entries)
No Errors Logged

nvme smart-log /dev/nvme1

Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 47 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 57%
data_units_read                     : 31,858,921
data_units_written                  : 293,589,023
host_read_commands                  : 4,130,502,429
host_write_commands                 : 889,122,059
controller_busy_time                : 13,552
power_cycles                        : 7
power_on_hours                      : 6,720
unsafe_shutdowns                    : 0
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 47 C
Temperature Sensor 2                : 0 C
Temperature Sensor 3                : 0 C
Temperature Sensor 4                : 0 C
Temperature Sensor 5                : 0 C
Temperature Sensor 6                : 0 C
Temperature Sensor 7                : 0 C
Temperature Sensor 8                : 0 C

nvme smart-log-add /dev/nvme1

NVMe Status:INVALID_LOG_PAGE(4109)

smartctl -A /dev/nvme1

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-8-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    57%
Data Units Read:                    31,858,924 [16.3 TB]
Data Units Written:                 293,591,327 [150 TB]
Host Read Commands:                 4,130,502,490
Host Write Commands:                889,172,096
Controller Busy Time:               13,552
Power Cycles:                       7
Power On Hours:                     6,721
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius

I only noticed the issue after Apache failed to start and I had to repair the filesystem with fsck.ext4 -f. Before that, I hadn't set up root mail correctly.

To me this looks like a hardware error, which would mean replacing both NVMe drives.
Is there anything I can try to fix these issues and keep the drives? Or at least a way to get SMART values like "Reported_Uncorrect" or "Offline_Uncorrectable"?
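
For context, NVMe drives do not expose ATA attributes such as Reallocated_Sector_Ct; they report the standardized SMART/Health log (log page 0x02) seen in the outputs above instead. The closest counterparts to the ATA error counters are probably critical_warning, media_errors and num_err_log_entries. Below is a small sketch that pulls those three fields out of the nvme smart-log text output and flags non-zero values. The function name nvme_health_check is invented here, and the parsing assumes the text layout shown above.

```shell
# Print the smart-log fields closest to the ATA error counters and
# return non-zero if any of them is not 0.
# Reads `nvme smart-log` text output on stdin.
# "nvme_health_check" is a hypothetical helper, not part of nvme-cli.
nvme_health_check() {
    awk -F ':' '
        BEGIN { bad = 0 }
        $1 ~ /^(critical_warning|media_errors|num_err_log_entries)[ \t]*$/ {
            name = $1
            gsub(/[ \t]+$/, "", name)      # trim padding after the field name
            value = $2
            gsub(/[ \t,]/, "", value)      # drop spaces and thousands separators
            print name "=" value
            if (value != "0") bad = 1
        }
        END { exit bad }
    '
}
```

It would be used as nvme smart-log /dev/nvme1 | nvme_health_check. All three fields are 0 in the output above, so this check alone does not explain the failures.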

treffner

2 Answers

How about smartctl -A /dev/nvme<xxxx>?

From help:

-A, --attributes Show device SMART vendor-specific Attributes and values

Edward

Since Media and Data Integrity Errors is 0 in the SMART log you shared, there appear to be no uncorrectable ECC or CRC errors on the media. To see whether uncorrectable errors occurred on the PCIe link instead, you can try reading the PCIe AER (Advanced Error Reporting) status for the device.
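
One way to read the AER status is through lspci, which needs the PCI address of the controller. The sketch below resolves that address from sysfs. It assumes the usual /sys/class/nvme/<controller>/device symlink; nvme_pci_addr and the SYSFS_ROOT override are hypothetical names introduced only for this example.

```shell
# Resolve the PCI address behind an NVMe controller name such as nvme1.
# SYSFS_ROOT is a made-up override for testing; it defaults to /sys.
nvme_pci_addr() {
    basename "$(readlink -f "${SYSFS_ROOT:-/sys}/class/nvme/$1/device")"
}

# Possible use (run as root so lspci can read the extended capabilities):
#   lspci -vvv -s "$(nvme_pci_addr nvme1)" | grep -E 'Advanced Error Reporting|UESta|CESta'
```

In the lspci output, UESta lists uncorrectable and CESta correctable error status bits, provided the device and platform expose the AER capability. Newer kernels (4.19 and later, as far as I know) also provide per-device counters in sysfs files such as aer_dev_correctable, aer_dev_nonfatal and aer_dev_fatal, which the 4.9 kernel shown in the question most likely does not have. dmesg | grep -i aer is another quick check.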