Emulex HBA Driver Performance Analysis (2) – Packet Flow

Introduction

In the previous installment I described how to measure end-to-end latency on the HBA device driver. In this article I’ll explore the workflow within the Emulex HBA driver code. Since I couldn’t find any documentation on Internet I used DTrace and the driver’s source code to figure this out.

Stack Fishing

First, the call stacks will be collected to get an initial idea about the packet processing workflow in the code. In the previous post we learned which modules participate in the I/O processing (emlxs, fbt and fcp):

dtrace -n 'fbt:emlxs::entry,fbt:scsi::entry,fbt:fctl::entry,fbt:fcp::entry { @[stack()] = count(); }'

Three different types of stack can be identified for processing the following stages of the I/O life-cycle: I/O start, I/O done and I/O interrupt handling.

I/O Start Stack

The I/O start stack can be recognized by the functions like scsi_transport and scsi_start:

[...truncated...]
emlxs`emlxs_sli4_bde_setup+0x233
emlxs`emlxs_sli4_prep_fcp_iocb+0x10a
emlxs`emlxs_send_fcp_cmd+0x137
emlxs`emlxs_fca_transport+0x256
fctl`fc_ulp_transport+0x22a
fcp`fcp_transport+0x27
fcp`fcp_scsi_start+0x149
scsi`scsi_transport+0x233
scsi_vhci`vhci_scsi_start+0x45d
scsi`scsi_transport+0x2b2
sd`sd_start_cmds+0x34f
sd`sd_core_iostart+0x186
sd`sd_mapblockaddr_iostart+0x1ef
sd`sd_xbuf_strategy+0x4e
sd`xbuf_iostart+0x1e2
sd`ddi_xbuf_qstrategy+0xd7
sd`sdstrategy+0x175
genunix`bdev_strategy+0x66
genunix`ldi_strategy+0x56
zfs`vdev_disk_strategy+0x158
[...truncated...]

I/O Done Stack

The I/O done stack can be recognized by the functions *destroy_pkt and *uninit_packet:

[...truncated...]
fctl`fc_ulp_uninit_packet+0x2f
fcp`fcp_pkt_teardown+0x8b
scsi`scsi_cache_destroy_pkt+0x46
scsi`scsi_destroy_pkt+0x21
scsi_vhci`vhci_scsi_destroy_pkt+0x2f
scsi`scsi_destroy_pkt+0x21
sd`sd_destroypkt_for_buf+0x20
sd`sd_return_command+0x17c
sd`sdintr+0x584
scsi`scsi_hba_pkt_comp+0x293
scsi_vhci`vhci_intr+0x7df
scsi`scsi_hba_pkt_comp+0x293
fcp`fcp_post_callback+0x19
fcp`fcp_cmd_callback+0x72
emlxs`emlxs_handle_fcp_event+0x9fa
emlxs`emlxs_proc_channel_event+0x647
emlxs`emlxs_proc_channel+0x62
emlxs`emlxs_thread+0x169
unix`thread_start+0x8
[...truncated...]

Interrupt Stack

The interrupt handler can be recognized by the function switch_sp_and_call which is the solaris Kernel function for dispatching interrupts (see this blog post by Saurabh Mishra):

[...truncated...]
emlxs`emlxs_mem_put+0x9d
emlxs`emlxs_sli4_free_xri+0xe1
emlxs`emlxs_sli4_process_wqe_cmpl+0xff
emlxs`emlxs_sli4_process_cq+0x139
emlxs`emlxs_sli4_process_eq+0xda
emlxs`emlxs_sli4_msi_intr+0x7f
apix`apix_dispatch_by_vector+0x89
apix`apix_dispatch_lowlevel+0x32
unix`switch_sp_and_call+0x13
[...truncated...]

The first Emulex device driver’s function on the stack is emlxs_sli4_msi_intr, which should be the interrupt handler. According to the manual Writing Device Drivers for Oracle® Solaris 11.2 , the device driver’s interrupt handler must be registered in the driver’s attach entry point. This can be confirmed by looking into the source code:

emlxs_hba.c:
emlxs_msi_add(emlxs_hba_t *hba)
{
[...truncated...]
 ret =
   ddi_intr_add_handler(htable[i], EMLXS_SLI_MSI_INTR,
     (char *)hba, (char *)((unsigned long)i));
[...truncated...]

emlxs_fc.h:
#define      EMLXS_INTR_ADD(_hba)    emlxs_msi_add(_hba)
#define      EMLXS_SLI_MSI_INTR      (hba->sli_api.sli_msi_intr)
#define      EMLXS_SLI_ONLINE        (hba->sli_api.sli_online)

emlxs_sli4.c:
emlxs_sli4_online(emlxs_hba_t *hba)
{
...
...
if (EMLXS_INTR_ADD(hba) != DDI_SUCCESS) {
...

emlxs_sli_api_t emlxs_sli4_api = {
...
        emlxs_sli4_online,
...
        emlxs_sli4_msi_intr,
...};

emlxs_hba.c:
hba->sli_api = emlxs_sli4_api;

Workflow

The following model of the workflow can be deduced from the findings above:

I/O start -> I/O interrupt handler -> I/O complete

As for the next step, we will build a DTrace script to verify this assumption. The DTrace script will follow a single packet through the hba driver’s execution workflow. To be able to do this, we have to identify one function on each stack which fullfils the following requirements:
– the packet can be unambigously identified by one of the arguments,
– the function gets executed once and only once for each I/O operation.
For the IO start and stop stacks these functions and arguments have already been identified in the previous blog post (I’ve chosen the address of the pkt variable as the key for the associative arrays used to track the packet across different threads):

fctl.c:fc_ulp_transport(opaque_t port_handle, fc_packet_t *pkt)
fctl.c:fc_ulp_uninit_packet(opaque_t port_handle, fc_packet_t *pkt)

The same is required for the I/O interrupt stack. First, it makes sense to find out which functions on the I/O interrupt stack get executed for each I/O . This can be done by counting function calls and comparing them with the number of exections of fc_ulp_transport, for which we already know that gets fired for each I/O operation:

dtrace -n 'fbt:emlxs::entry,fbt:fctl:fc_ulp_transport:entry { @[probefunc] = count(); }'
[...truncated...]
emlxs_sli4_free_xri          39334
[...truncated...]
emlxs_sli4_process_wqe_cmpl  39334
[...truncated...]
fc_ulp_transport             39334
[...truncated...]
emlxs_mem_put                39340
[...truncated...]
emlxs_sli4_cqid_to_index     39828
emlxs_sli4_msi_intr          39828
emlxs_sli4_process_cq        39828
emlxs_sli4_process_eq        39828
[...truncated...]

emlxs_sli4_process_wqe_cmpl is the first function on the stack which executes for each I/O operation. At the first glance, it’s arguments don’t look promising:

emlxs_sli4.c: 
emlxs_sli4_process_wqe_cmpl(emlxs_hba_t *hba, CQ_DESC_t *cq, CQE_CmplWQ_t *cqe)

I’ll examine the first couple of lines in the source code:

   emlxs_buf_t *sbp;
[...truncated...]
  request_tag = cqe->RequestTag;
[...truncated...]
  sbp = hba->fc_table[request_tag];

cq is some structure that gets passed to the function for each I/O. It contains emlxs_buf_t *sbp
which looks like the buffer which is likely to remain the same during the entire life-cycle of the packet. Let’s examine the structure of this buffer:

emlxs_fc.h
[...truncated...]
typedef struct emlxs_buf
{
  fc_packet_t      *pkt;   /* scsi_pkt reference */
[...truncated...]

Bingo! The buffer variable contains the reference to the scsi packet variable which can be used as the key for the associative arrays which store the timing information when the packet gets processed in  different threads. By looking at the code above, we can deduce how to dereference the pointer to pkt in the DTrace entry probe of the emlxs_sli4_process_wqe_cmpl function:

(int64_t)args[0]->fc_table[args[2]->RequestTag]->pkt

Now, all the information is there to build the emlxssnoop.d script which will be used to verify the assumption about the packet flow I made previously:

#!/usr/sbin/dtrace -Cs

/* emlxssnoop.d */

#pragma D option quiet
 
#define PRINT_PKT(pkt,milestone) \
  printf("%d %-18d %-15s %-15s \n",timestamp,pkt,milestone,probefunc)
 
fbt::fc_ulp_transport:entry
{
  PRINT_PKT(arg1,"START");
}
 
fbt::emlxs_sli4_process_wqe_cmpl:entry
{
  PRINT_PKT((int64_t)args[0]->fc_table[args[2]->RequestTag]->pkt,"INTERRUPT");
}
 
fbt::fc_ulp_uninit_packet:entry
{
  PRINT_PKT(arg1,"DONE");
}

By running the script with the command below, we get the timestamp at each processing stage for each packet in the output file emlxssnoop.log:

dtrace -Cs emlxssnoop.d > emlxssnoop.log

We can pick any buffer (for example 67553736358984) and then sort the output on the timestamp to see that the packet flows through the driver code in the assumed order:

grep 67553736358984 emlxssnoop.log | sort 
[...truncated...]
26733383288531382 -67553736358984  START        fc_ulp_transport
26733383294539371 -67553736358984  INTERRUPT    emlxs_sli4_process_wqe_cmpl
26733383294605233 -67553736358984  DONE         fc_ulp_uninit_packet
26733383479942799 -67553736358984  START        fc_ulp_transport
26733383480366376 -67553736358984  INTERRUPT    emlxs_sli4_process_wqe_cmpl
26733383480393173 -67553736358984  DONE         fc_ulp_uninit_packet
26733383491508798 -67553736358984  START        fc_ulp_transport
26733383491570635 -67553736358984  INTERRUPT    emlxs_sli4_process_wqe_cmpl
26733383491623284 -67553736358984  DONE         fc_ulp_uninit_packet
[...truncated...]

Conclusion

In this installment we learned which threads participate in the processing of packets and in which order they process them: start, interrupt and done. I offered the method for testing the model. By doing so, the foundation is laid for further exploring of the internal functioning of the Emulex driver.

In the next blog post, I’ll describe the methodology and provide the scripts for measuring the elapsed time per packet during different stages of the I/O, which is the key point for analysing the driver performance.

Stay tuned!

Thanks for sharing

Nenad Noveljic

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.