I'm trying to understand what I see in iostat, specifically the differences between the output for md and sd devices.
I have a couple of fairly large CentOS Linux servers, each with an E3-1230 CPU, 16 GB RAM and four 2 TB SATA disk drives. Most use the disks as JBOD, but one is configured with software RAID 1+0. The servers carry a very similar type and amount of load, but the %util figures iostat reports on the software RAID server are much higher than on the others, and I'm trying to understand why. All servers are usually 80-90% idle CPU-wise.
Example of iostat on a server without RAID:
avg-cpu: %user %nice %system %iowait %steal %idle
9.26 0.19 1.15 2.55 0.00 86.84
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 2.48 9.45 10.45 13.08 1977.55 1494.06 147.50 2.37 100.61 3.86 9.08
sdc 4.38 24.11 13.25 20.69 1526.18 1289.87 82.97 1.40 41.14 3.94 13.36
sdd 0.06 1.28 1.43 2.50 324.67 587.49 232.32 0.45 113.73 2.77 1.09
sda 0.28 1.06 1.33 0.97 100.89 61.63 70.45 0.06 27.14 2.46 0.57
dm-0 0.00 0.00 0.17 0.24 4.49 1.96 15.96 0.01 18.09 3.38 0.14
dm-1 0.00 0.00 0.09 0.12 0.74 0.99 8.00 0.00 4.65 0.36 0.01
dm-2 0.00 0.00 1.49 3.34 324.67 587.49 188.75 0.45 93.64 2.25 1.09
dm-3 0.00 0.00 17.73 42.82 1526.17 1289.87 46.50 0.35 5.72 2.21 13.36
dm-4 0.00 0.00 0.11 0.03 0.88 0.79 12.17 0.00 19.48 0.87 0.01
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 1.17 1.17 0.00
dm-6 0.00 0.00 12.87 20.44 1976.66 1493.27 104.17 2.77 83.01 2.73 9.08
dm-7 0.00 0.00 1.36 1.58 95.65 58.68 52.52 0.09 29.20 1.55 0.46
Example of iostat on a server with RAID 1+0:
avg-cpu: %user %nice %system %iowait %steal %idle
7.55 0.25 1.01 3.35 0.00 87.84
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb 42.21 31.78 18.47 59.18 8202.18 2040.94 131.91 2.07 26.65 4.02 31.20
sdc 44.93 27.92 18.96 55.88 8570.70 1978.15 140.94 2.21 29.48 4.60 34.45
sdd 45.75 28.69 14.52 55.10 8093.17 1978.16 144.66 0.21 2.95 3.94 27.42
sda 45.05 32.59 18.22 58.37 8471.04 2040.93 137.24 1.57 20.56 5.04 38.59
md1 0.00 0.00 18.17 162.73 3898.45 4013.90 43.74 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 4.89 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.07 0.26 3.30 2.13 16.85 0.04 135.54 73.73 2.38
dm-1 0.00 0.00 0.25 0.22 2.04 1.79 8.00 0.24 500.99 11.64 0.56
dm-2 0.00 0.00 15.55 150.63 2136.73 1712.31 23.16 1.77 10.66 2.93 48.76
dm-3 0.00 0.00 2.31 2.37 1756.39 2297.67 867.42 2.30 492.30 13.08 6.11
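For a quick side-by-side, the four per-disk %util figures from the two outputs above can be averaged with a throwaway awk one-liner (the numbers are copied straight from the sd* rows pasted above):

```shell
# average %util across the four sd* disks on each box,
# using the values from the two iostat outputs in this question
printf '9.08 13.36 1.09 0.57\n'   | awk '{print ($1+$2+$3+$4)/4}'  # non-RAID box: 6.025
printf '31.20 34.45 27.42 38.59\n' | awk '{print ($1+$2+$3+$4)/4}' # RAID 1+0 box: 32.915
```

So on average the RAID 1+0 disks report roughly five times the utilisation of the JBOD disks under comparable load.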
So my questions are:
1) Why is the %util so much higher on the server with RAID than on the ones without?
2) On the non-RAID server the combined %util of the physical devices (sd*) is more or less the same as that of the combined LVM devices (dm-*). Why is that not the case on the RAID server?
3) Why do the software RAID devices (md*) appear virtually idle while the underlying physical devices (sd*) are busy? My first thought was that it might be caused by RAID checking, but /proc/mdstat shows everything is fine.
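Regarding question 3, one plausible explanation is how iostat derives %util: it is the delta of the io_ticks counter (the 10th statistics field on a device's /proc/diskstats line, "milliseconds spent doing I/O") divided by the sampling interval, and on many kernels the md layer simply does not update that counter, which would make md* read as 0.00 regardless of real activity. A minimal sketch of the calculation, using a made-up diskstats line (the device numbers and counter values are synthetic, but the field layout follows the documented /proc/diskstats format):

```shell
# /proc/diskstats line: major minor name + 11 counters;
# io_ticks is the 10th counter, i.e. awk field 13.
# Synthetic sample: the device accumulated 5312 ms of busy time.
printf '8 16 sdb 1000 248 197755 4200 1308 945 149406 9800 0 5312 14000\n' |
  awk '{print $13}'   # prints 5312

# %util over one interval = 100 * (io_ticks_end - io_ticks_start) / interval_ms
# e.g. 312 ms of busy time in a 1000 ms sample:
awk 'BEGIN {print 100 * (5312 - 5000) / 1000}'   # prints 31.2
```

If md really doesn't account io_ticks, the zeros in the md* rows are a reporting artifact rather than evidence that the arrays are idle.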
Edit: Apologies, I thought the question was clear, but it seems there is some confusion about it. The question is obviously not about the difference in %util between drives on a single server, but about why the total/average %util on one server is so different from the other's. I hope that clears up any misunderstanding.