I am using OmniOS, which is based on illumos.
I have a ZFS pool named data, a mirror of two SSDs. The pool is reporting 100 in the %b column; below is the output of iostat -xn:
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    8.0    0.0   61.5  8.7  4.5 1092.6  556.8  39 100 data
Unfortunately, there is not actually much throughput going on: iotop reports only about 23,552 bytes per second.
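As a sanity check (a rough sketch using only the iostat numbers above, and assuming the usual column meanings: asvc_t in milliseconds, actv as the average number of active commands), Little's law ties the numbers together: even at only 8 writes/s, a ~557 ms average service time keeps ~4.5 commands permanently in flight, which alone is enough to pin %b at 100 despite the tiny throughput:

```python
# Values taken from the iostat -xn line above
w_per_s = 8.0        # w/s: writes per second
kw_per_s = 61.5      # kw/s: kilobytes written per second
asvc_ms = 556.8      # asvc_t: average service time, ms
actv = 4.5           # actv: average commands actively being serviced

# Little's law: in-flight = arrival rate x service time
in_flight = w_per_s * asvc_ms / 1000.0
print(round(in_flight, 2))           # → 4.45, matching the reported actv

# Average write size: tiny writes, consistent with metadata traffic
print(round(kw_per_s / w_per_s, 2))  # → 7.69 KB per write
```

So the device is "100% busy" servicing a handful of very slow, very small writes, not a lot of data.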
I also ran iostat -E, and it reported quite a few transport errors; we changed the port and they went away.
I figured there might be an issue with the drives, but SMART reports no issues; I've run multiple smartctl -t short and smartctl -t long passes, with no issues reported.
I ran fmadm faulty and it reported the following:
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 01 18:34:01 5fdf0c4c-5627-ccaa-d41e-fc5b2d282ab2 ZFS-8000-D3 Major
Host : sys1
Platform : xxxx-xxxx Chassis_id : xxxxxxx
Product_sn :
Fault class : fault.fs.zfs.device
Affects : zfs://pool=data/vdev=cad34c3e3be42919
faulted but still in service
Problem in : zfs://pool=data/vdev=cad34c3e3be42919
faulted but still in service
Description : A ZFS device failed. Refer to http://illumos.org/msg/ZFS-8000-D3
for more information.
Response : No automated response will occur.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
As it suggests, I ran zpool status -x, but it reports that all pools are healthy.
I ran some DTrace scripts and found that all of the IO activity reports <none> as the file name, which indicates metadata; so there isn't actually any file IO going on.
When I run kstat -p zone_vfs, it reports the following:
zone_vfs:0:global:100ms_ops 21412
zone_vfs:0:global:10ms_ops 95554
zone_vfs:0:global:10s_ops 1639
zone_vfs:0:global:1s_ops 20752
zone_vfs:0:global:class zone_vfs
zone_vfs:0:global:crtime 0
zone_vfs:0:global:delay_cnt 0
zone_vfs:0:global:delay_time 0
zone_vfs:0:global:nread 69700628762
zone_vfs:0:global:nwritten 42450222087
zone_vfs:0:global:reads 14837387
zone_vfs:0:global:rlentime 229340224122
zone_vfs:0:global:rtime 202749379182
zone_vfs:0:global:snaptime 168018.106250637
zone_vfs:0:global:wlentime 153502283827640
zone_vfs:0:global:writes 2599025
zone_vfs:0:global:wtime 113171882481275
zone_vfs:0:global:zonename global
The high counts of 1s_ops and 10s_ops are very concerning.
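To put a number on that (a quick sketch reusing the counters printed above, on the assumption that the Nms_ops/Ns_ops fields are counts of VFS operations whose latency exceeded each threshold):

```python
# Counters copied from the kstat -p zone_vfs output above
reads, writes = 14837387, 2599025
ops_100ms, ops_1s, ops_10s = 21412, 20752, 1639

total_ops = reads + writes
slow_ops = ops_1s + ops_10s          # operations that took 1 s or longer
pct_slow = 100.0 * slow_ops / total_ops

print(slow_ops)                      # → 22391 ops at >= 1 second
print(round(pct_slow, 3))            # small as a fraction of all ops,
                                     # but a huge absolute latency tail
```

Over 22,000 operations taking a second or more (1,639 of them over ten seconds) is a severe latency tail even though it is a small fraction of total operations.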
I suspect the controller, but I can't be sure. Does anyone have any ideas, or suggestions for where I can get more information?