Introduction
In this blog post series I will expose the methodology and DTrace scripts which I developed for analyzing a latency problem on the HBA Emulex driver.
In the case that you are facing a problem with an I/O subsystem, the analysis should be started on the higher levels of the I/O stack. To get yourself familiarized with troubleshooting in this area, I recommend reading Gregg Brandan’s blog and his Dtrace book which has two chapters dedicated to analyzing I/O problems.
However, if you do the analysis down the I/O stack and still observe a high latency on block devices, this article might apply to your problem. In today’s blog post I will explain how to measure the end-to-end latency on the HBA driver.
SDT DTrace Probes
First, I looked if there are any existing DTrace providers for the HBA drivers, but I couldn’t find any.
As the next best choice, I looked for sdt (Statically Defined Tracing) providers, but no luck there as well:
dtrace -ln 'sdt:::' | grep emlxs
Note: emlxs is the name of the Emulex driver.
Since the driver code has not been instrumented with the SDT probes, the Function Boundary Tracing provider (fbt) has to be used. As the fbt provider traces the execution of raw kernel code, DTrace scripts that are based on it, may break between kernel and driver releases.
Stack Fishing for I/O Start
As for the next step, the driver functions that initiate I/O operations have to be identified. Therefore, I’ll collect the call stacks when any of the Emulex driver functions are being executed:
dtrace -n 'fbt:emlxs::entry { @[stack()] = count(); }'
I eyeballed the collected stack backtraces to identify the stack which initiates the transfer:
[...truncated...]
emlxs`emlxs_sli4_prep_fcp_iocb+0xda
emlxs`emlxs_send_fcp_cmd+0x137
emlxs`emlxs_fca_transport+0x256
fctl`fc_ulp_transport+0x22a
fcp`fcp_transport+0x27
fcp`fcp_scsi_start+0x149
scsi`scsi_transport+0x233
scsi_vhci`vhci_scsi_start+0x45d
scsi`scsi_transport+0x2b2
sd`sd_start_cmds+0x34f
sd`sd_core_iostart+0x186
sd`sd_mapblockaddr_iostart+0x1ef
sd`sd_xbuf_strategy+0x4e
sd`xbuf_iostart+0x1e2
sd`ddi_xbuf_qstrategy+0xd7
sd`sdstrategy+0x175
genunix`bdev_strategy+0x66
genunix`ldi_strategy+0x56
zfs`vdev_disk_strategy+0x158
zfs`vdev_disk_io_start+0x10c
[...truncated...]
The stack was easily identified because it contains calls to the scsi_transport function. fctl`fc_ulp_transport is the first Solaris HBA generic driver function on the stack before the first emlxs call. Therefore, I’ll consider using the return boundary of this function for recording the I/O start operations.
Stack Fishing for I/O Done
The stack above cannot be used for identifying the complete stack of the I/O done event, becuse the functions will be called in the reversed order, namely the emulex functions first and than the generic ones. However, we can use the I/O start stack for identifying the module names of the functions which process the I/O and then collect stacks backtraces of all the functions involved in the I/O operation:
dtrace -n 'fbt:emlxs::entry,fbt:scsi::entry,fbt:fctl::entry,fbt:fcp::entry { @[stack()] = count(); }'
Obviously, the functions destroying the packet complete the IO:
[...truncated...]
fctl`fc_ulp_uninit_packet+0x2f
fcp`fcp_pkt_teardown+0x8b
scsi`scsi_cache_destroy_pkt+0x46
scsi`scsi_destroy_pkt+0x21
scsi_vhci`vhci_intr+0x7af
scsi`scsi_hba_pkt_comp+0x293
fcp`fcp_post_callback+0x19
fcp`fcp_cmd_callback+0x72
emlxs`emlxs_handle_fcp_event+0x9fa
emlxs`emlxs_proc_channel_event+0x647
emlxs`emlxs_proc_channel+0x62
emlxs`emlxs_thread+0x169
unix`thread_start+0x8
[...truncated...]
The function fctl`fc_ulp_uninit_packet is close to the emulex driver functions on the I/O completion stack, so it could be used for marking the end of the I/O operation.
Cross-Checking by Counting Functions Calls
Let’s check whether the number of the executions of the functions, that have been picked for measuring starting and finishing point of a single IO, match:
dtrace -n 'fbt:emlxs::entry,fbt:scsi::entry,fbt:fctl::entry,fbt:fcp::entry { @[probefunc] = count(); }'
[...truncated...]
emlxs_fca_pkt_init 22356
emlxs_fca_pkt_uninit 22356
fc_ulp_init_packet 22356
fc_ulp_uninit_packet 22356
fcp_pkt_setup 22356
fcp_pkt_teardown 22356
emlxs_CQE_to_IOCB 22358
emlxs_check_dma 22358
emlxs_fca_transport 22358
emlxs_handle_fcp_event 22358
emlxs_initialize_pkt 22358
emlxs_msiid_to_chan 22358
emlxs_node_find_did 22358
emlxs_proc_channel_event 22358
emlxs_select_fcp_channel 22358
emlxs_select_msiid 22358
emlxs_send_fcp_cmd 22358
emlxs_sli4_alloc_xri 22358
emlxs_sli4_bde_setup 22358
emlxs_sli4_free_xri 22358
emlxs_sli4_prep_fcp_iocb 22358
emlxs_sli4_process_wqe_cmpl 22358
emlxs_sli4_write_wqdb 22358
emlxs_unswap_pkt 22358
emlxs_update_sd_bucket 22358
fc_ulp_transport 22358
fcp_cmd_callback 22358
fcp_complete_pkt 22358
fcp_post_callback 22358
fcp_prepare_pkt 22358
fcp_reconfig_wait 22358
fcp_scsi_start 22358
fcp_transport 22358
emlxs_mem_get 22362
emlxs_mem_pool_get 22362
emlxs_mem_pool_put 22362
emlxs_mem_put 22362
scsi_cache_destroy_pkt 22640
scsi_cache_dmafree_attr 22640
scsi_hba_pkt_alloc 22640
scsi_hba_pkt_free 22640
scsi_cache_init_pkt 22642
emlxs_sli4_cqid_to_index 22920
emlxs_sli4_msi_intr 22920
emlxs_sli4_process_cq 22920
emlxs_sli4_process_eq 22920
scsi_pkt2bp 39577
emlxs_swap_fcp_pkt 44716
emlxs_tx_get 44948
emlxs_sli4_issue_iocb_cmd 44983
scsi_destroy_pkt 45280
scsi_init_pkt 45282
scsi_hba_pkt_comp 45286
scsi_transport 45286
emlxs_sli4_write_cqdb 45840
[...truncated...]
The output has confirmed that the number of executions of fctl`fc_ulp_uninit_packet and fctl`fc_ulp_transport are 22356 and 23358 respectively, which corresponds to the number of the I/O operations during the observed time period. The small difference is attributable to two in-flight I/O operations that have not been finished yet, in the moment the count script was interrupted.
It can also be seen that some of the the scsi functions get executed twice as many. The script scsicmds.d helps to clarify this phenomenon:
[...truncated...]
fp synchronize_cache 8712
scsi_vhci synchronize_cache 8712
fp read(10) 14558
scsi_vhci read(10) 14586
fp write(10) 21197
scsi_vhci write(10) 21886
Because of the multipathing (scsi_vhci) the scsi functions will be called twice.
Pairing I/O Start with I/O Done
A little difficulty in associating the I/O start of a given paket with its I/O done event is caused by the fact that I/O start and I/O done are not performed within the same thread. For the sake of efficiency, the thread gets off the CPU after an I/O was submitted. After receiving the data from storage, the HBA device sends an interrupt which afterwards get handled by another thread. As a consequence, the start time must be stored in an associative array instead in a thread variable. Thus, we need to find a unique key to access the start time once the I/O gets finished. An appropriate key can be identified by examining the function arguments of the both functions:
fctl.c:fc_ulp_transport(opaque_t port_handle, fc_packet_t *pkt)
fctl.c:fc_ulp_uninit_packet(opaque_t port_handle, fc_packet_t *pkt)
The second argument, the address of the packet variable (pkt), is unlikely to change during the whole life-cycle of the I/O. Based on this assumption, I wrote the following dtrace script for measuring the latency on the HBA driver:
#!/usr/sbin/dtrace -s
/* hbalatency.d */
fbt::fc_ulp_transport:entry
{
start_time[arg1] = timestamp ;
}
fbt::fc_ulp_uninit_packet:entry
/ start_time[arg1] /
{
this->diff = (timestamp - start_time[arg1])/1000000 ;
@["latency (ms)"] = quantize(this->diff) ;
start_time[arg1] = 0 ;
}
Verification
hbalatency.d can be verified by scsi.d .
dtrace -qCs scsi.d -D QUIET -D PERF_REPORT
scsi_vhci 0
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 368543
1 |@@ 16447
2 |@ 8801
4 |@ 12638
8 |@ 10302
16 | 2190
32 | 306
64 | 87
128 | 2
256 | 0
dtrace -s hba_latency.d
latency (ms)
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 356419
1 |@@ 15541
2 |@ 8432
4 |@ 12351
8 |@ 10079
16 | 2081
32 | 292
64 | 70
128 | 2
256 | 0
Note: I changed the time units from ns to ms in scsi.d.
It seems to be a pretty good match. The minor differencies are attributable to very small delay in starting and initializing the script.
What Do We Measure?
The following functional diagram can be deduced from the collected stacks:
It should be kept in mind that the end-to-end latency measured on the HBA device driver also contains round-trips to storage.
Conclusion
In today’s blog post I scratched the surface of the Emulex HBA driver and built the script for measuring the latency on it. In the next installments I’ll dig into the driver’s internals and provide DTrace scripts for instrumenting different parts of the processing chain.