How to troubleshoot disk controller on Illumos based systems?

Question

I am using OmniOS which is based off of Illumos.

I have a ZFS pool of two SSD's that are mirrored; the pool, known as data is reporting its %b as 100; below is iostat -xn:

r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
0.0    8.0    0.0   61.5  8.7  4.5 1092.6  556.8  39 100 data

Unfortunately, there is not actually a lot of throughput going on; iotop reports about 23552 bytes a second.

I also ran iostat -E and it reported quite a bit of Transport Errors; we changed the port and they went away.

I figured there might be an issue with the drives; SMART reports no issues; I've ran multiple smartctl -t short and smartctl -t long; no issues reported.

I ran fmadm faulty and it reported the following:

--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jun 01 18:34:01 5fdf0c4c-5627-ccaa-d41e-fc5b2d282ab2  ZFS-8000-D3    Major     

Host        : sys1
Platform    : xxxx-xxxx       Chassis_id  : xxxxxxx
Product_sn  : 

Fault class : fault.fs.zfs.device
Affects     : zfs://pool=data/vdev=cad34c3e3be42919
                  faulted but still in service
Problem in  : zfs://pool=data/vdev=cad34c3e3be42919
                  faulted but still in service

Description : A ZFS device failed.  Refer to http://illumos.org/msg/ZFS-8000-D3
              for more information.

Response    : No automated response will occur.

Impact      : Fault tolerance of the pool may be compromised.

Action      : Run 'zpool status -x' and replace the bad device.

Like it suggests I ran zpool status -x and it reports all pools are healthy.

I ran some DTraces and found that all the IO activity is from <none> (for the file); which is metadata; so there actually isn't any file IO going on.

When I run kstat -p zone_vfs it reports the following:

zone_vfs:0:global:100ms_ops     21412
zone_vfs:0:global:10ms_ops      95554
zone_vfs:0:global:10s_ops       1639
zone_vfs:0:global:1s_ops        20752
zone_vfs:0:global:class zone_vfs
zone_vfs:0:global:crtime        0
zone_vfs:0:global:delay_cnt     0
zone_vfs:0:global:delay_time    0
zone_vfs:0:global:nread 69700628762
zone_vfs:0:global:nwritten      42450222087
zone_vfs:0:global:reads 14837387
zone_vfs:0:global:rlentime      229340224122
zone_vfs:0:global:rtime 202749379182
zone_vfs:0:global:snaptime      168018.106250637
zone_vfs:0:global:wlentime      153502283827640
zone_vfs:0:global:writes        2599025
zone_vfs:0:global:wtime 113171882481275
zone_vfs:0:global:zonename      global

The high amount of 1s_ops and 10s_ops are very concerning.

I'm thinking that it's the controller but I can't be sure; anyone have any ideas? Or where I can get more info?

score 1 · Accepted Answer · 2015-06-03T21:59:32.983

The data pool is a lofi encrypted ZFS container; this is the problem.

I'm able to confirm that it's a performance issue with lofi's "virtual" controller because of the following:

lofi + zfs + encryption has a throughput of about 10-25MB/s
lofi + zfs + no-encryption has a throughput of about 30MB/s
No lofi with plain old ZFS has a throughput of about 250MB/s
The data controller reports 100% utilization whereas the real controller has virtually none.
Tested on multiple machines with the same setup and the results were largely identical.

The issue here is lofi; and not the disk controller.

How to troubleshoot disk controller on Illumos based systems?

1 Answers1