
Configuring two PCIe NVMe SSDs as a Linux software RAID1 has roughly halved the read speed instead of boosting it.

In similar Linux software RAID1 setups (also with SSDs) I have seen an increase in read performance, since reads can then be served by two mirrored block devices.

What could be potential reasons and lines of investigation to address this performance issue?

Benchmarking was done using fio with 4k random reads on /dev/md125 (the RAID1) and on its members /dev/nvme1n1 and /dev/nvme0n1. Reading from the members is faster than reading from /dev/md125.

It seems other people using Linux software RAID1 also face a counter-intuitive speed reduction instead of a speed gain for RAID1 reads (see https://serverfault.com/questions/235199/poor-software-raid10-read-performance-on-linux).

Here are some numbers from the fio benchmarks with random 4k reads, run concurrently on the /dev/nvme1n1p1 and /dev/nvme0n1p1 devices:

 fio4k /dev/nvme1n1p1
 [...]
 read: IOPS=637k, BW=2487MiB/s (2608MB/s)(146GiB/60001msec)
 
 fio4k /dev/nvme0n1p1
 read: IOPS=652k, BW=2545MiB/s (2669MB/s)(149GiB/60001msec)

If I create a RAID1 /dev/md125 from both partitions (/dev/nvme1n1p1 and /dev/nvme0n1p1), even skipping the bitmap so as not to cause any negative impact:

  mdadm --verbose  --create /dev/md/raid1_nvmes --bitmap=none --assume-clean --level=1 --raid-devices=2 /dev/nvme0n1p1 /dev/nvme1n1p1
  fio4k /dev/md125
  [...]
  read: IOPS=337k, BW=1317MiB/s (1381MB/s)(77.2GiB/60001msec)
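
For reference, the array composition can be double-checked like this (a minimal sketch, assuming the array came up as /dev/md125; adjust the device name if it assembled under a different number):

    cat /proc/mdstat              # shows members, level, and that no bitmap is present
    mdadm --detail /dev/md125     # detailed array state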

Update: fio command line and other info

This is the fio command used (with the variables BLOCKDEVICE and BLOCKSIZE set according to the values above: BLOCKSIZE=4k and BLOCKDEVICE being /dev/nvme0n1p1, /dev/nvme1n1p1, or /dev/md/raid1_nvmes):

fio --filename="$BLOCKDEVICE" \
    --direct=1 \
    --rw=randread \
    --readonly \
    --bs="$BLOCKSIZE" \
    --ioengine=libaio \
    --iodepth=256 \
    --runtime=60 \
    --numjobs=4 \
    --time_based \
    --group_reporting \
    --name=iops-test-job \
    --eta-newline=1 2>&1

This is the output of the fio tests I ran:

Test: fio benchmark directly on block device /dev/nvme0n1p1

root@ada:/virtualization/machines# cat /usr/local/bin/nn_scripts/nn_fio
#!/bin/bash

set -x
BLOCKDEVICE="$1"
test -b "$BLOCKDEVICE" || { echo "usage: $0 <blockdev> [size_of_io_chunk] [mode: randread]" >&2; exit 1; }

BLOCKSIZE="$2"
test "${BLOCKSIZE%%[kMGT]}" -eq "${BLOCKSIZE%%[kMGT]}" 2>/dev/null || { echo "Run FIO benchmark with block size of 4k";  BLOCKSIZE=4k; }


fio --filename="$BLOCKDEVICE" \
    --direct=1 \
    --rw=randread \
    --readonly \
    --bs="$BLOCKSIZE" \
    --ioengine=libaio \
    --iodepth=256 \
    --runtime=60 \
    --numjobs=4 \
    --time_based \
    --group_reporting \
    --name=iops-test-job \
    --direct=1 \
    --eta-newline=1 2>&1 | tee /root/fio.logs/fio.$(basename "$BLOCKDEVICE:").$BLOCKSIZE.$(date -Iseconds)

root@ada:/virtualization/machines# time /usr/local/bin/nn_scripts/nn_fio /dev/nvme0n1p1
+ BLOCKDEVICE=/dev/nvme0n1p1
+ test -b /dev/nvme0n1p1
+ BLOCKSIZE=
+ test '' -eq ''
+ echo 'Run FIO benchmark with block size of 4k'
Run FIO benchmark with block size of 4k
+ BLOCKSIZE=4k
+ fio --filename=/dev/nvme0n1p1 --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --direct=1 --eta-newline=1
++ basename /dev/nvme0n1p1:
++ date -Iseconds
+ tee /root/fio.logs/fio.nvme0n1p1:.4k.2021-02-26T11:41:03+01:00
tee: '/root/fio.logs/fio.nvme0n1p1:.4k.2021-02-26T11:41:03+01:00': No such file or directory
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.12
Starting 4 processes

iops-test-job: (groupid=0, jobs=4): err= 0: pid=28221: Fri Feb 26 11:42:04 2021
  read: IOPS=626k, BW=2446MiB/s (2565MB/s)(143GiB/60001msec)
    slat (usec): min=2, max=625, avg= 4.59, stdev= 3.06
    clat (usec): min=90, max=10696, avg=1629.07, stdev=128.82
     lat (usec): min=96, max=10700, avg=1633.79, stdev=129.08
    clat percentiles (usec):
     |  1.00th=[ 1401],  5.00th=[ 1434], 10.00th=[ 1450], 20.00th=[ 1516],
     | 30.00th=[ 1582], 40.00th=[ 1614], 50.00th=[ 1647], 60.00th=[ 1663],
     | 70.00th=[ 1696], 80.00th=[ 1729], 90.00th=[ 1762], 95.00th=[ 1811],
     | 99.00th=[ 1909], 99.50th=[ 1975], 99.90th=[ 2245], 99.95th=[ 2606],
     | 99.99th=[ 3458]
   bw (  KiB/s): min=479040, max=691888, per=25.00%, avg=626199.33, stdev=37403.47, samples=477
   iops        : min=119760, max=172972, avg=156549.78, stdev=9350.91, samples=477
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.63%, 4=0.36%, 10=0.01%, 20=0.01%
  cpu          : usr=30.55%, sys=69.28%, ctx=38473, majf=0, minf=6433
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=37573862,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=2446MiB/s (2565MB/s), 2446MiB/s-2446MiB/s (2565MB/s-2565MB/s), io=143GiB (154GB), run=60001-60001msec

Disk stats (read/write):
  nvme0n1: ios=37487591/1001, merge=14/185, ticks=15999825/331, in_queue=24175124, util=100.00%

real    1m0.698s
user    1m20.593s
sys     2m46.774s

Test: fio benchmark on the RAID1 /dev/md127

root@ada:/virtualization/machines# time /usr/local/bin/nn_scripts/nn_fio "$(realpath "/dev/md/ada:raid1_nvmes")"
+ BLOCKDEVICE=/dev/md127
+ test -b /dev/md127
+ BLOCKSIZE=
+ test '' -eq ''
+ echo 'Run FIO benchmark with block size of 4k'
Run FIO benchmark with block size of 4k
+ BLOCKSIZE=4k
+ fio --filename=/dev/md127 --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio --iodepth=256 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --direct=1 --eta-newline=1
++ basename /dev/md127:
++ date -Iseconds
+ tee /root/fio.logs/fio.md127:.4k.2021-02-26T11:49:06+01:00
tee: '/root/fio.logs/fio.md127:.4k.2021-02-26T11:49:06+01:00': No such file or directory
iops-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=256
...
fio-3.12
Starting 4 processes

iops-test-job: (groupid=0, jobs=4): err= 0: pid=67832: Fri Feb 26 11:50:07 2021
  read: IOPS=322k, BW=1257MiB/s (1318MB/s)(73.6GiB/60001msec)
    slat (usec): min=3, max=535, avg=10.44, stdev= 5.29
    clat (usec): min=47, max=14172, avg=3170.20, stdev=142.99
     lat (usec): min=59, max=14179, avg=3180.78, stdev=143.44
    clat percentiles (usec):
     |  1.00th=[ 2900],  5.00th=[ 2966], 10.00th=[ 2999], 20.00th=[ 3032],
     | 30.00th=[ 3097], 40.00th=[ 3163], 50.00th=[ 3195], 60.00th=[ 3228],
     | 70.00th=[ 3261], 80.00th=[ 3294], 90.00th=[ 3326], 95.00th=[ 3359],
     | 99.00th=[ 3425], 99.50th=[ 3458], 99.90th=[ 3621], 99.95th=[ 3818],
     | 99.99th=[ 5866]
   bw (  KiB/s): min=293472, max=350408, per=24.99%, avg=321583.77, stdev=11302.31, samples=477
   iops        : min=73368, max=87602, avg=80395.91, stdev=2825.56, samples=477
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (usec)   : 1000=0.01%
  lat (msec)   : 2=0.01%, 4=99.96%, 10=0.03%, 20=0.01%
  cpu          : usr=18.54%, sys=81.47%, ctx=342, majf=0, minf=11008
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=19303258,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=1257MiB/s (1318MB/s), 1257MiB/s-1257MiB/s (1318MB/s-1318MB/s), io=73.6GiB (79.1GB), run=60001-60001msec

The Linux kernel version is:

root@ada:/virtualization/machines# uname -a
Linux ada 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux

The scheduler used on the NVMe devices is none:

root@ada:/virtualization/machines# grep . /sys/block/{md127,nvme0n1,nvme1n1}/queue/scheduler
/sys/block/md127/queue/scheduler:none
/sys/block/nvme0n1/queue/scheduler:[none] mq-deadline
/sys/block/nvme1n1/queue/scheduler:[none] mq-deadline
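
Another line of investigation, since software RAID adds CPU work: check whether the NVMe completion interrupts are spread across CPUs while the benchmark runs. A minimal sketch using standard tools:

    # interrupt counts per CPU for the nvme queues; compare snapshots taken during a fio run
    grep -i nvme /proc/interrupts
    # or watch the counters change live
    watch -d -n1 'grep -i nvme /proc/interrupts'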

There was a request to provide iostat output for a) direct NVMe SSD performance and b) performance of the RAID1 of the NVMe SSDs.

a) direct nvme performance

tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn Device
 543201.33         2.1G         1.5M       6.2G       4.6M nvme1n1
    20.67         1.3k         1.5M       4.0k       4.6M nvme0n1
    25.67         1.3k         1.5M       4.0k       4.6M md127

b) performance of the raid1

tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn Device
 169797.33       663.3M        32.3k       1.9G      97.0k nvme1n1
 159573.67       623.3M        32.3k       1.8G      97.0k nvme0n1
 329367.33         1.3G        32.0k       3.8G      96.0k md127

c) performance of a parallel fio benchmark on /dev/nvme1n1p1 and /dev/nvme0n1p1

tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn Device
 585589.67         2.2G        20.7M       6.7G      62.0M nvme1n1
 405723.00         1.5G        20.7M       4.6G      62.0M nvme0n1
   421.67         1.1M        20.7M       3.4M      62.0M md127

The two NVMe devices involved are Samsung 970 EVO Plus:

root@ada:/sys/module# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S4EWNM0NC28151E      Samsung SSD 970 EVO Plus 1TB             1         284.89  GB /   1.00  TB    512   B +  0 B   2B2QEXM7
/dev/nvme1n1     S4EWNM0NC28144V      Samsung SSD 970 EVO Plus 1TB             1         284.89  GB /   1.00  TB    512   B +  0 B   2B2QEXM7

They are inserted into PCIe slots using a PCIe adapter card (linked in the comments below). The output of lspci is hence:

root@ada:/sys/module# lspci -vv | grep -i 'nvme ssd controller'
41:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
62:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981
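
Since adapter cards are involved, it seems worth verifying that each SSD negotiated a full PCIe 3.0 x4 link. A sketch using the controller addresses from the lspci output above (reading LnkSta typically requires root):

    lspci -vv -s 41:00.0 | grep -E 'LnkCap|LnkSta'   # expect Speed 8GT/s, Width x4
    lspci -vv -s 62:00.0 | grep -E 'LnkCap|LnkSta'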

The system is a Dell server with 512 GiB RAM and two sockets equipped with AMD EPYC 7551 32-core processors.

During the benchmarks there were no dmesg errors.

humanityANDpeace
  • What's the performance if you put one NVMe drive in a RAID-0 "array"? If that's about half as fast as reading directly from the drive, it's the software RAID getting in the way. – Andrew Henle Feb 16 '21 at 17:26
  • @AndrewHenle thanks for the suggestion. I removed the RAID1 (with about 330k 4kB IOPS) and created, upon your suggestion, a RAID0 (yielding a higher 546k 4kB IOPS), which is still less than the single-drive performance (560k 4kB IOPS). It seems that at best the software RAID layer does not severely reduce the read performance. – humanityANDpeace Feb 16 '21 at 17:53
  • The RAID might be trying to cache pages larger than 4k and thus this benchmark generates a lot of overhead, but that's just a guess. To use more than one drive for reading you also need multiple processes; not sure if fio still defaults to 1 thread only – frostschutz Feb 16 '21 at 20:47
  • @frostschutz Yes, I had missed adding the rather lengthy fio command line. But I used 4 concurrent fio processes, knowing that with a single process serially accessing 4k chunks I could only reach RAID1 speeds equal to a single member drive. Other tools like `hdparm -t`, and also 4 processes of the likes of `dd conv=sync if=/dev/md125 bs=4k of=/dev/null count=2000` at different offsets, did not show largely diverging results either. Maybe RAID1 conceptually cannot reach the level of modern NVMe (in MB/s and IOPS)? – humanityANDpeace Feb 17 '21 at 07:54
  • Please also post what's in your PC (CPU, RAM, motherboard, your SSDs' model, every PCIe card you have, etc). Temps, PCIe topology and SSD specs may have a say in this. Also watch CPU usage and see if something stalls somewhere on one/two/etc threads, or whether it's evenly shared among threads to a round number (e.g. 12%, 25%, 33%, 50%, etc); you're doing software RAID after all. – X.LINK Feb 24 '21 at 12:52
  • @frostschutz "not sure if fio still defaults to 1 thread only" – [fio does default to one "thread" by default](https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-numjobs) but what happens with the default depends on [subtleties depending on platform](https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-thread), whether you are using an async ioengine etc... [insert explanation too lengthy for a comment] – Anon Feb 25 '21 at 06:29
  • @d.c. I provided the information about the kernel version and schedulers. – humanityANDpeace Feb 26 '21 at 11:01
  • Was there any additional output at the end of the second (`/dev/md127`) fio run? – Anon Feb 27 '21 at 11:06
  • Just to check - have you written random data across the entirety of your NVMe devices without doing any trimming before starting these tests? – Anon Feb 28 '21 at 10:03
  • From another post, your hardware seems to be a "DELL server multi-core system with 64 cores 512GiB system to create virtual machines" that should not be older than 2017, considering that you have a dual-socket motherboard, as quad-socket systems have been less common since around 2017. So you either have a dual or single AMD Epyc CPU or a dual Xeon setup, which means that you should have enough PCIe lanes, so no bottlenecks are involved; but that still depends on how you put your SSDs in, and whether there are other PCIe cards there. But on an Intel platform, do mind the QPI links between the CPUs. – X.LINK Feb 28 '21 at 17:27
  • @X.LINK that is correct. The system is a DELL server with two AMD Epyc CPUs (which each provide 32 cores / 64 threads). I have to confess that I had a coworker perform the task of inserting the 2x Samsung SSD 970 EVO Plus 1TB on this [PCIe adapter card](https://www.amazon.com/dp/B07RZZ3TJG). I have added info about this to the question – humanityANDpeace Feb 28 '21 at 17:38
  • @Anon no there are no dmesg entries during the fio runs – humanityANDpeace Feb 28 '21 at 19:04
  • @Anon I will edit "no dmesg errors" into the question so we can remove these comments – humanityANDpeace Feb 28 '21 at 19:44
  • @humanityANDpeace: You should then have a Dell PowerEdge R7425 server with an R740 motherboard; that however doesn't tell which riser you used. Please also post an `lstopo` (https://unix.stackexchange.com/questions/113605/is-there-a-tool-that-i-can-use-to-create-a-diagram-of-my-systems-architecture) map of the whole server so we can see what those risers are connected to. If you do have a GPU riser with x16 ports, that should do it, but the capacitors along the ports aren't there like on the usual x16 ones. It would be stupid for GPUs to only have x1 logical ports, but OEMs are OEMs... – X.LINK Feb 28 '21 at 21:54
  • Especially Dell, which is well known to cripple hardware that isn't "genuine" (e.g. laptop chargers, etc). Your adapters are cheap ones, but may be fine. Do note that we don't know what generation (3.0?) those risers are; this helps. Furthermore, the lspci you've uploaded only shows the SSDs' own controllers, not the riser's, nor what's linking everything at the end of the risers or the motherboard's own PCIe ports. Since you have an EPYC-based system, the CPU-to-CPU bus (Infinity Fabric) isn't the bottleneck; it may be something smaller than that (PCH, etc). Halving the performance is too even. – X.LINK Feb 28 '21 at 22:08

1 Answer


(For those posting questions involving fio, I strongly recommend that you clearly post the full job you are running and fio version number because these things can have a huge impact on whether you get a correct answer to your question)

fio is reporting more kernel overhead in the mdadm case plus check out the difference in the number of job context switches. You may want to look into letting fio do batching -- https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth-batch-submit -- so each call is allowed to submit more in one go. Additionally, you may want to use /dev/md/raid1_nvmes as the RAID device name IF fio is failing to give disk stats for it with your previous command line.
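
For illustration, a batching variant of the original command line might look like this (a sketch only; the batch size of 16 is an arbitrary starting point, not a tuned value):

fio --filename=/dev/md127 --direct=1 --rw=randread --readonly --bs=4k \
  --ioengine=libaio --iodepth=256 \
  --iodepth_batch_submit=16 --iodepth_batch_complete_max=16 \
  --runtime=60 --numjobs=4 --time_based --group_reporting \
  --name=iops-batch-test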

Another thing to check is the speeds you get when you read from both the underlying disks at the same time. An example job is something like this:

fio --direct=1 --rw=randread --readonly --bs=4k --ioengine=libaio \
  --iodepth=1024 --runtime=60 --time_based \
  --name=solo1 --filename=/dev/nvme0n1p1 --stonewall \
  --name=solo2 --filename=/dev/nvme1n1p1 --stonewall \
  --name=duo1 --filename=/dev/nvme0n1p1 --name=duo2 --filename=/dev/nvme1n1p1

Hopefully the solo jobs each run by themselves and the duo jobs run simultaneously, but my fio job format may be a bit rusty, so feel free to play about with it or split the duo run off into a separate fio invocation.
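
It can also help to watch per-CPU and per-thread utilisation during these runs, given the cpu lines in your fio output (usr + sys adding up to nearly 100%). A minimal sketch with the standard sysstat tools, run in a second terminal while fio is going:

mpstat -P ALL 1                          # per-CPU utilisation at one-second intervals
pidstat -t -p "$(pgrep -d, -x fio)" 1    # per-thread CPU usage of the fio processes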

Dead end ideas

The concept of mdadm RAID chunk size sadly won't have any bearing on this particular problem. Unlike RAID 0/4/5/6/10 mdadm's RAID 1 doesn't have chunks (see this answer to mdadm raid1 and what chunksize (or blocksize) on 4k drives? or search for --chunk in the mdadm man page).

If the I/O you are doing is a single sequential stream it is not expected that mdadm RAID1 reads should be any faster than those of a single disk. As above this shouldn't apply in this case because a) the reads are random b) multiple parallel readers (via numjobs in the fio case) were taking place.

Anon
  • Thank you for the answer. I was aware that Linux md providing (at best) the combined read speed of the RAID members depends on parallel reads (the link to superuser, however, is nice here). I will provide the mentioned information in an edit to the question. With regard to the I/O being aligned, I can currently only say that I assume this to be the case, but I will look for a way to check it. Which format of the `iostat` output would be best for gaining insight? – humanityANDpeace Feb 26 '21 at 10:47
  • `iostat -xhz 1` is a nice starting point. You want to keep an eye on the `avgrq-sz` to try and take a guess about alignment. If you REALLY want to deep dive you'll have to look towards something like [blktrace](https://linux.die.net/man/8/blktrace) or by observing the appropriate kernel probes (see http://brendangregg.com/linuxperf.html for an overview). – Anon Feb 27 '21 at 10:10
  • Hi Anon, I am sorry to have let the bounty reward lapse; I should have awarded it to the only available answer, yours (I indeed wanted to wait as long as possible, hoping for more answers...). Regarding `/dev/md/raid1_nvmes` to be used in `fio`, I will run such a test. – humanityANDpeace Feb 28 '21 at 09:31
  • Hi! That's OK and I understand this wasn't the answer you were looking for (it is more questions, after all). I hadn't originally intended to update it, but then I saw the bounty time running out without another answer... FWIW I would draw your attention to the cpu line in the fio outputs you posted. There is a suggestion that you're using up all of one CPU even without mdadm (30.55 + 69.28 = 99.83). If for whatever reason the particular load being generated isn't/can't be spread across more CPUs (both user/kernel/interrupt etc) and mdadm itself introduces CPU overhead then.. [out of space - sorry!] – Anon Feb 28 '21 at 10:21
  • @humanityANDpeace were you ever able to test out the "sending data to two disks simultaneously case"? – Anon Mar 09 '21 at 05:48