Emulex HBA Driver Performance Analysis (3) – Measuring Overhead

Introduction

In the previous installment I wrote about the processing workflow within the Emulex HBA driver. Three processing stages were identified: I/O start, I/O interrupt and I/O. In this blog post I’ll explain the internal functioning of the DTrace script which I developed for measuring the duration of each stage and the total amount of time spent in the driver code.

Setting Boundaries

The fbt probes will be used for recording the timestamps at the beginning and the end of each stage. Therefore, the appropriate functions need to be identified. The entry probe of the first emlxs function on the stack is a good choice for recording the starting point of a processing stage.

Since scripts based on fbt trace raw kernel code, it might happen that they have to be adapted for different kernel and driver releases.

Let’s examine the functions for each of the stages.

I/O Start

First, I’ll collect the stacks for the emlxs functions and identify the stack. The stack for I/O will contain some transport functions:

dtrace -n 'fbt:emlxs::entry { @[stack()] = count(); }'
[...truncated...]
emlxs`emlxs_sli4_bde_setup+0x233
emlxs`emlxs_sli4_prep_fcp_iocb+0x10a
emlxs`emlxs_send_fcp_cmd+0x137
emlxs`emlxs_fca_transport+0x256
fctl`fc_ulp_transport+0x22a
[...truncated...]

The first emlxs function on the stack is emlxs_fca_transport:

emlxs_solaris.c:
emlxs_fca_transport(opaque_t fca_port_handle, fc_packet_t *pkt)

The second argument pkt is the pointer to the packet structure. The address of the buffer will be used to uniquely identify the packet while it’s traversing different stages.

Next, I’ll verify that the function executes once and only once for each I/O by looking at the number of executions under load:

dtrace -n 'fbt:emlxs::entry,fbt:fctl:fc_ulp_transport:entry { @[probefunc] = count(); }'
[...truncated...]
fc_ulp_transport        23316
[...truncated...]
emlxs_fca_transport     23316
[...truncated...]

Since this prerequisite is fulfilled, the entry and the return probe can be used to record the timestamp of the beginning and the end of the I/O start stage, respectively:

/* I/O Start Begin */
fbt::emlxs_fca_transport:entry
{
  self->pkt_addr = arg1 ;
  ts_start_begin[self->pkt_addr] = timestamp ;
}

/* I/O Start End */
fbt::emlxs_fca_transport:return
/ self->pkt_addr /
{
  ts_start_end[self->pkt_addr] = timestamp ;
  self->pkt_addr = 0 ;
}

I/O Interrupt

Second, the stack of the interrupt handler thread will be examined:

[...truncated...]
emlxs`emlxs_mem_put+0x9d
emlxs`emlxs_sli4_free_xri+0xe1
emlxs`emlxs_sli4_process_wqe_cmpl+0xff
emlxs`emlxs_sli4_process_cq+0x139
emlxs`emlxs_sli4_process_eq+0xda
emlxs`emlxs_sli4_msi_intr+0x7f
apix`apix_dispatch_by_vector+0x89
apix`apix_dispatch_lowlevel+0x32
unix`switch_sp_and_call+0x13
[...truncated...]

As the processing of the interrupt begins with the function emlxs_sli4_msi_intr, the begin timestamp will be collected in the entry probe. However, this function doesn’t contain any obvious arguments which could identify the scsi packet. From the the previous installment we already know that the first function on the stack which could be associated with the packet is emlxs_sli4_process_wqe_cmpl. Therefore, the start time has to be initialized in the entry probe of emlxs_sli4_msi_intr and then stored in the associative array within the entry probe of emlxs_sli4_process_wqe_cmpl:

/* I/O Interrupt Begin */
fbt::emlxs_sli4_msi_intr:entry
{
  self->start_ts_intr = timestamp ;
}

fbt::emlxs_sli4_process_wqe_cmpl:entry
/ self->start_ts_intr /
{
  self->pkt_addr \
    = (int64_t)args[0]->fc_table[args[2]->RequestTag]->pkt ;
  ts_interrupt_begin[self->pkt_addr] = self->start_ts_intr ;
}

The return probe of fbt::emlxs_sli4_process_wqe_cmpl is suitable for storing the timestamp of the end of the I/O interrupt stage:

/* I/O Interrupt End */
fbt::emlxs_sli4_process_wqe_cmpl:return
/ self->pkt_addr /
{
  ts_interrupt_end[self->pkt_addr] = timestamp ;
}

I/O Done (Completion Thread)

Again, the first thing to do is to explore the stack and find appropriate functions for storing the timestamps:

[...truncated...]
fcp`fcp_cmd_callback+0x72
emlxs`emlxs_handle_fcp_event+0x9fa
emlxs`emlxs_proc_channel_event+0x647
emlxs`emlxs_proc_channel+0x62
emlxs`emlxs_thread+0x169
unix`thread_start+0x8
[...truncated...]

emlxs_thread, the first function on the stack, gets executed only during driver initialization, so its entry and return probes can’t be used for storing the timestamp information. The next choice is emlxs_proc_channel which comes immediately after:

/* I/O Done Begin */
fbt::emlxs_proc_channel:entry
{
  self->start_ts = timestamp ;
}

The entry probe of fcp_cmd_callback might be a good candidate for recording the end time of an I/O done operation, because it is the first non-emlxs function on the stack. I’ll just quickly confirm that it fires once and only once for each I/O:

dtrace -n 'fbt::fcp_cmd_callback:entry,fbt:fctl:fc_ulp_transport:entry { @[probefunc] = count(); }'
fcp_cmd_callback   14800
fc_ulp_transport   14803

Below is the code excerpt for storing the timestamps:

/* I/O Done End */
fbt::fcp_cmd_callback:entry
/ ts_start_begin[arg0] && ts_interrupt_begin[arg0] && self->start_ts /
{
  this->ts_done_end = timestamp ;

  this->pkt_addr = arg0 ;
  this->ts_done_begin = self->start_ts ;
  self->start_ts = 0 ;
[...truncated...]

Elapsed Times

The flow chart below visualizes the processing stages with the points where the timestamps are collected:

Emulex Flow Chart

Emulex Flow Chart

Note that the interrupt handler puts the packets onto the completion queue which is consumed by the I/O done (completion) thread.

The timestamps collected according to the flow chart above are sufficient for calculating the elapsed times for each stage:

duration_start = ts_start_end - ts_start_begin 
duration_interrupt = ts_interrupt_end - ts_interrupt_begin 
duration_done = ts_done_end - ts_done_begin

The total elapsed time for an I/O is the difference between ts_done_end and ts_start_begin:

duration_total = ts_done_end - ts_start_begin

The time spent waiting in the completion queue:

duration_queue = ts_done_begin - ts_interrupt_end

And now comes the most interesting part: by setting the measurement points as in the flow chart above we can calculate the total elapsed time spent in the driver code, which is the key point for analysing driver’s performance:

duration_emlxs = duration_start + ts_done_end - ts_interrupt_begin

Moreover, we can even calculate how much time in total was spent outside of the Solaris kernel (Emulex HBA, SAN network and SAN storage) without knowing anything about those devices:

duration_external = duration_total - duration_emlxs

emlxsoverhead.d

I assembled the code excerpts above into the script emlxsoverhead.d. The script prints the latency of each I/O and the durations of all processing stages. Optionally, the threshold for displaying only longer I/Os can be provided. For example, the following command will display only I/Os lasting more than 1 ms:

dtrace -Cs emlxsoverhead.d

The script prints following data:

TIMESTAMP:     time when I/O was performed 
TOTAL(us):     I/O latency measured on the driver 
EMLXS(us):     overhead of the driver code 
I/O START(us): duration of the I/O start stage 
I/O INTR(us):  time spent in the interrupt handler 
QUEUE(us):     time spent on the completion queue 
I/O DONE:      time spent on the completion thread

Below is the example output:

TIMESTAMP            TOTAL(us) EMLXS(us) I/O START(us)  I/O INTR(us)  QUEUE(us) I/O DONE(us)
2016 Apr  9 19:31:21 1028      35        7              9             12        6
2016 Apr  9 19:31:21 4928      35        8              5             11        10
2016 Apr  9 19:31:21 4721      46        6              9             14        16

In the example above, the first I/O took 1028 microseconds and the most of this time was spent outside  the kernel. The longest processing step in the driver was waiting in the completion queue (12 microseconds).

Longer I/O latencies, where the most time is elapsed in the driver code (EMLXS), are not normal and should be investigated further.

Conclusion

In this installment I explained how to measure the time spent in each processing stage of the Emulex device driver and provided the script for it. In the next installment I’ll explain how I analyzed a production problem with a high latency.

Thanks for sharing

Nenad Noveljic

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.