{"id":701,"date":"2016-05-16T14:15:49","date_gmt":"2016-05-16T14:15:49","guid":{"rendered":"http:\/\/nenadnoveljic.com\/blog\/?p=701"},"modified":"2018-05-17T16:13:56","modified_gmt":"2018-05-17T16:13:56","slug":"emulex-hba-driver-performance-analysis-4-case-study","status":"publish","type":"post","link":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/","title":{"rendered":"Comparing Latency on Emulex and QLogic HBA Driver"},"content":{"rendered":"<h1>Introduction<\/h1>\n<p>In <a href=\"http:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-measuring-overhead\/\" target=\"_blank\">the previous installment<\/a>\u00a0I provided the <a href=\"https:\/\/github.com\/nenadnoveljic\/dtrace\/blob\/master\/emlxsoverhead.d\" target=\"_blank\">emlxsoverhead.d<\/a>\u00a0script for measuring overhead on the Emulex HBA and explained its internal functioning. In this installment I&#8217;ll explain how I used this script to diagnose and analyze a real-world production problem with high I\/O latency. The problem appeared on a Solaris server which is used to run a consolidated Oracle database platform, but the findings can be applied to any low latency application running on a Solaris server equipped with Emulex HBAs.<\/p>\n<h1>Defining the Problem<\/h1>\n<p>Oracle database log writer process was reporting I\/O outliers longer than 500 ms several thousands times per day:<\/p>\n<pre><code>Warning: log write elapsed time 836ms, size 2KB<\/code><\/pre>\n<p>Although the throughput was very good and &#8220;only&#8221; 0.001% of the total I\/O operations were running longer than 500 ms, in the case of the log writer of a OLTP database this is a critical problem, because commit doesn&#8217;t return until the block gets written in the file system.<\/p>\n<p>Firstly, I confirmed with the script\u00a0<a href=\"http:\/\/dtracebook.com\/index.php\/File_System:zfsslower.d\">zfsslower.d<\/a> that the outliers are also seen on the file system level. Secondly, I did the same for the block devices with the script <a href=\"https:\/\/chrisgerhard.wordpress.com\/scsi-d\/\">scsi.d<\/a>\u00a0. After confirming high latencies on the block devices I checked with a storage administrator\u00a0whether he was\u00a0seeing latencies on the SAN components, but there weren&#8217;t any. At this point,\u00a0I had to develop the script <a href=\"https:\/\/github.com\/nenadnoveljic\/dtrace\/blob\/master\/emlxsoverhead.d\" target=\"_blank\">emlxsoverhead.d<\/a>\u00a0as I couldn&#8217;t find anything else to measure the latency on the I\/O stack layer lower than the block devices.<\/p>\n<p>I was able to pinpoint the problem after evaluating the output of <a href=\"https:\/\/github.com\/nenadnoveljic\/dtrace\/blob\/master\/emlxsoverhead.d\" target=\"_blank\">emlxsoverhead.d<\/a>\u00a0:<\/p>\n<pre><code>TIMESTAMP            <span style=\"color: #ff0000;\">TOTAL(us)<\/span>        <span style=\"color: #ff0000;\">EMLXS(us)<\/span>        I\/O START(us)    I\/O INTR(us)     <span style=\"color: #ff0000;\">QUEUE(us)<\/span>        I\/O DONE(us)     \r\n[...truncated...]\r\n2016 Apr  7 12:13:04 <span style=\"color: #ff0000;\">1238999<\/span>          <span style=\"color: #ff0000;\">1238914<\/span>          3                6                <span style=\"color: #ff0000;\">1238738<\/span>          165              \r\n2016 Apr  7 12:13:04 1238999          1238914          3                6                1238738          165              \r\n2016 Apr  7 12:13:04 1238999          1238914          3                6                1238738          165              \r\n2016 Apr  7 12:13:04 1238999          1238914          3                6                1238738          165              \r\n2016 Apr  7 12:13:04 1238999          1238914          3                6                1238738          165              \r\n2016 Apr  7 12:13:04 1238999          1238914          3                6                1238738          165              \r\n2016 Apr  7 12:13:04 1238999          1238914          3                6                1238738          165              \r\n2016 Apr  7 12:13:04 1238999          1238914          3                6                1238738          165             \r\n[...truncated...]<\/code><\/pre>\n<p>The column TOTAL shows\u00a0latencies of more than 1.2 seconds which\u00a0are almost entirely\u00a0due to\u00a0waiting on the\u00a0completion queue (column QUEUE). It is a paradox that the whole round trip to SAN lasted only 85 us (microseconds! , TOTAL &#8211; EMLXCS) and then the packet waited for more than 1.2 seconds on the completion queue! I suspected some serious deficiency in the device driver code and decided to pursue this further.<\/p>\n<h1>Drilling Down<\/h1>\n<p>Let&#8217;s recall the model of processing in the driver I built in <a href=\"http:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-measuring-overhead\/\" target=\"_blank\">the previous installment<\/a>\u00a0:<\/p>\n<p><a href=\"http:\/\/nenadnoveljic.com\/blog\/wp-content\/uploads\/2016\/04\/flow-chart_cmpl_queue.gif\" rel=\"attachment wp-att-709\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-709\" src=\"http:\/\/nenadnoveljic.com\/blog\/wp-content\/uploads\/2016\/04\/flow-chart_cmpl_queue.gif\" alt=\"flow chart_cmpl_queue\" width=\"998\" height=\"358\" \/><\/a><\/p>\n<p>The question is, why some packages stay longer in the completion queue.<\/p>\n<p>Let&#8217;s start with examining the activity in the I\/O done thread:<\/p>\n<pre><code>#!\/bin\/sbin\/dtrace\r\n\r\nfbt:::entry\r\n\/ (uintptr_t)curthread-&gt;t_startpc == (uintptr_t)&amp;emlxs`emlxs_thread &amp;&amp; stackdepth &lt;= 5 \/\r\n{\r\nself-&gt;ts_started[caller,probefunc,stackdepth] = timestamp ;\r\n}\r\n\r\nfbt:::return\r\n\/ self-&gt;ts_started[caller,probefunc,stackdepth] \/\r\n{\r\n@[caller,probefunc,stackdepth] = quantize((timestamp - self-&gt;ts_started[caller,probefunc,stackdepth])\/1000000) ;\r\nself-&gt;ts_started[caller,probefunc,stackdepth] = 0 ;\r\n}\r\n\r\ntick-1s\r\n{\r\nprintf(\"\\n\\n%Y\\n\",walltimestamp);\r\nprinta(\"%a %s %d %@d\\n\",@);\r\ntrunc(@);\r\n}\r\n\r\n[...truncated...]\r\n\r\nunix`swtch+0x165 resume 4\r\nvalue\u00a0 ------------- Distribution ------------- count\r\n-1 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 0\r\n0 |@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 59\r\n1 |@@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 116\r\n2 |@@@@@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 270\r\n4 |@@@@@@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 353\r\n8 |@@@@@@@@@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 505\r\n16 |@@@@@@@@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 447\r\n32 |@@@@@@@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 419\r\n64 |@@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 120\r\n128 |@\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 38\r\n256 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 14\r\n512 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1\r\n1024 |\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 0\r\n[...truncated...]<\/code><\/pre>\n<p>Warning: the dtrace script above causes performance overhead. Don&#8217;t run it on a production system.<\/p>\n<p>It can be seen, that there are outliers when executing <em>resume<\/em> procedure lasting up to 2 seconds. <em>resume<\/em> code is called to handle the swithing of the thread on the processor. The outliers can also be confirmed by using a less invasive script:<\/p>\n<pre><code>#!\/usr\/sbin\/dtrace\r\n\r\nfbt::cv_wait:entry\r\n\/ caller &gt;= (uintptr_t)&amp;emlxs`emlxs_thread &amp;&amp; caller &lt; (uintptr_t)&amp;emlxs`emlxs_thread+sizeof(emlxs`emlxs_thread) \/\r\n{\r\nself-&gt;cv_wait = 1 ;\r\n}\r\n\r\nfbt::resume:entry\r\n\/ self-&gt;cv_wait \/\r\n{\r\nself-&gt;ts_started[stackdepth] = timestamp ;\r\n}\r\n\r\nfbt::resume:return\r\n\/ self-&gt;ts_started[stackdepth] \/\r\n{\r\n@[\"resume(ms)\"] = quantize((timestamp - self-&gt;ts_started[stackdepth])\/1000000);\r\nself-&gt;ts_started[stackdepth] = 0 ;\r\n}\r\n\r\ntick-1s\r\n{\r\nprinta(@);\r\ntrunc(@);\r\n}\r\n\r\nresume(ms)\r\n           value  ------------- Distribution ------------- count\r\n              -1 |                                         0\r\n               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            11145\r\n               1 |@@@@@@@@                                 2856\r\n               2 |@@@                                      1017\r\n               4 |                                         166\r\n               8 |                                         6\r\n              16 |                                         2\r\n              32 |                                         4\r\n              64 |                                         29\r\n             128 |                                         4\r\n             256 |                                         0\r\n             512 |                                         0\r\n            1024 |                                         1\r\n            2048 |                                         0<\/code><\/pre>\n<p>Furthermore, the long <em>resumes<\/em> correlate with I\/O outliers.<\/p>\n<p>On one server I replaced the Emulex HBA with QLogic. After doing so, the problem with the occasional slow I\/O by the oracle database log writer process was resolved!<\/p>\n<p>By using DTrace and looking into the source code of the Emulex driver I explored differences in the internal functioning between the both device drivers.<\/p>\n<h1>Comparison with QLogic<\/h1>\n<h2>Solaris Task Queues<\/h2>\n<p>First, I&#8217;ll take a look at the completion stack of the qlc driver:<\/p>\n<pre><code>dtrace -n 'fbt:scsi::entry { @[stack()] = count() }'\r\n\r\n[...truncated...]\r\nscsi_vhci`vhci_intr+0xb6\r\nscsi`scsi_hba_pkt_comp+0x293\r\nfcp`fcp_post_callback+0x19\r\nfcp`fcp_cmd_callback+0x72\r\nqlc`qlc_completion_thread+0xa4\r\ngenunix`taskq_thread+0x27e\r\nunix`thread_start+0x8\r\n[...truncated...]<\/code><\/pre>\n<p>Having <em>genunix`taskq_thread <\/em>on the stack tells us that the QLogic driver, unlike the Emulex, uses the Solaris task queues for queueing incoming packets. This can be confirmed by running the <em>kstat<\/em> command:<\/p>\n<pre><code>kstat -c taskq\r\n\r\n[...truncated...]\r\nmodule: unix\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 instance: 0\r\nname:\u00a0\u00a0 qlc_0_comp_thd\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 class:\u00a0\u00a0\u00a0 taskq\r\ncrtime\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1453.363526517\r\nexecuted\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 0\r\nmaxtasks\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 8\r\nnactive\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 8\r\nnalloc\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 8\r\npid\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 0\r\npriority\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 99\r\nsnaptime\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1199076.69870013\r\ntasks\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 8\r\nthreads\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 8\r\ntotaltime\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 0\r\n[...truncated...]<\/code><\/pre>\n<p>In the example above the maximum of 8 threads is configured, but\u00a0the kernel adjusts the number of threads dynamically depending on the workload.<\/p>\n<p>The interrupt thread puts the packet in the single task queue which is then asynchronously processed by the completion threads. <a href=\"https:\/\/docs.oracle.com\/cd\/E18752_01\/html\/816-4854\/task_queues.html\" target=\"_blank\">Task queues<\/a> are built and optimized for postponing some tasks and delegating their execution to another kernel thread. The facility is typically used by drivers. Usually, the tasks are dispatched from interrupt context.<\/p>\n<p>On the contrary, the Emulex driver doesn&#8217;t rely on task queues. Instead, they create 8 completion queues per HBA which are implemented as circular linked lists. Each completion queue is protected by its own mutex: either an interrupt thread or a completion thread can access a queue. After the interrupt handler puts a packet on a queue it sends a signal to the corresponding completion thread by calling the function <em>cv_signal<\/em> and supplying the reference to the conditional variable. The completion thread, which is waiting in the <em>cv_wait,<\/em> wakes up, gets switched on the CPU, enters the mutex, and finally processes the packet. As a consequence, the time spent in <em>resume<\/em> contributes to the overall I\/O latency.<\/p>\n<p>For those willing to explore the\u00a0implementation in more detail, I&#8217;m providing the references in the source code of the Emulex driver:<\/p>\n<pre><code>emlxs_thread_trigger2 function is called from interrupt context:\r\nemlxs_thread_trigger2(emlxs_thread_t *ethread, void (*func) (), CHANNEL *cp)\r\n{\r\n\r\n\/*\r\n* If the thread lock can be acquired,\r\n* it is in one of these states:\r\n* 1. Thread not started.\r\n* 2. Thread asleep.\r\n* 3. Thread busy.\r\n* 4. Thread ended.\r\n*\/\r\n...\r\nmutex_enter(&amp;ethread-&gt;lock);\r\n...\r\n...\r\nif (ethread-&gt;flags &amp; EMLXS_THREAD_ASLEEP) {\r\ncv_signal(&amp;ethread-&gt;cv_flag);\r\n}\r\n\r\nmutex_exit(&amp;ethread-&gt;lock);\r\n\r\nreturn;\r\n\r\n} \/* emlxs_thread_trigger2() *\/\r\n\r\nemlxs_thread is the startigng function in the completion thread:\r\nemlxs_thread(emlxs_thread_t *ethread)\r\n{\r\n...\r\nmutex_enter(&amp;ethread-&gt;lock);\r\nethread-&gt;flags |= EMLXS_THREAD_STARTED;\r\n\r\nwhile (!(ethread-&gt;flags &amp; EMLXS_THREAD_KILLED)) {\r\nif (!(ethread-&gt;flags &amp; EMLXS_THREAD_TRIGGERED)) {\r\nethread-&gt;flags |= EMLXS_THREAD_ASLEEP;\r\ncv_wait(&amp;ethread-&gt;cv_flag, &amp;ethread-&gt;lock);\r\n}<\/code><\/pre>\n<h2>Thread Priorities<\/h2>\n<p>Besides using task queues there is another difference between both drivers which might have an impact on the process of context switching &#8211; thread priorities. From the <em>kstat -c taskq<\/em> output in the previous section it can be seen that the completion threads are running under the priority 99. In contrast, Emulex threads are running with the lower priority 97:<\/p>\n<pre><code>#!\/usr\/sbin\/dtrace -s\r\n# pragma D option quiet\r\nfbt:::entry\r\n\/ caller &gt;= (uintptr_t)&amp;emlxs`emlxs_thread &amp;&amp; caller &lt; (uintptr_t)&amp;emlxs`emlxs_thread+sizeof(emlxs`emlxs_thread) \/\r\n{\r\nprintf(\"prio: %d\\n\",curthread-&gt;t_pri);\r\nexit(0);\r\n}\r\n\r\nprio: 97<\/code><\/pre>\n<p>The Emulex thread has to compete with the ZFS and lot of other kernel threads which run with higher priority (99). Unfortunately, I couldn&#8217;t find a parameter for adjusting the default priority of the Emulex completion threads.<\/p>\n<h1>Next Steps<\/h1>\n<p>Replacing Emulex HBAs with Qlogic resolved the problem with the I\/O outliers.<\/p>\n<p>Nevertheless, we opened a service request. The Emulex development acknowledged the problem and offered us to rewrite the device driver in the following way: the completion thread will be removed and whole processing would be done in the interrupt thread. This should completely remove the latencies caused by the completion queue. However, I&#8217;m curious to see how this change will impact the kernel CPU consumption as the interrupts require spin mutex locks (the current implementation with the completion thread\u00a0uses adaptive mutex locks).<\/p>\n<h1>Updates<\/h1>\n<h2>18.07.2016<\/h2>\n<p>Oracle delivered the patch for the following bug which indeed helped to reduce the latency on the Emulex HBA driver on the busy server:<\/p>\n<p>Release : Solaris 11.3 SRU # 8.7.0<br \/>\nPlatform : i386<br \/>\nBug(s) addressed :<br \/>\n23075272 : emlxs completion thread model can affect performance on busy systems<br \/>\nPackage(s) included :<br \/>\npkg:\/driver\/fc\/emlxs<\/p>\n<p>For me it is not clear, what exactly has changed, but I can see by observing stack traces that the completion thread was not removed.<\/p>\n<h1>Additional Information for ZFS Users<\/h1>\n<p>Setting <em>user_reserve_hint_pct,<\/em> which is the default parameter for limiting ZFS ARC has a negative impact on the latencies on the HBA driver. I\u00a0elaborated on this\u00a0<a href=\"http:\/\/nenadnoveljic.com\/blog\/arc-resizing-user_reserve_hint_pct\/\" target=\"_blank\">here<\/a> .<\/p>\n<h1>Conclusion<\/h1>\n<p>The Qlogic (QLC) HBA driver has lower latencies than Emulex driver. There are two implementation differences which might explain the difference in the behavior:<\/p>\n<ul>\n<li>the completion thread of the QLC driver is implemented by the means of the Solaris task queues which generally operates with less completion threads.<\/li>\n<li>the QLC completion threads are running with higher priority (99) than Emulex completion threads (97).<\/li>\n<\/ul>\n<p>The Emulex development is willing to rewrite the driver to\u00a0entirely eliminate the completion threads and handle the whole processing in the interrupt thread. I&#8217;m eager to test the changes and write about the results.<\/p>\n<h1>References<\/h1>\n<ul>\n<li>Richard Dougall and Jim Mauro: <a href=\"http:\/\/www.oracle.com\/technetwork\/server-storage\/solaris\/solaris-internals-137242.html\" target=\"_blank\">Solaris Internals<\/a><\/li>\n<li>Brendan Gregg and Jim Mauro:<a href=\"http:\/\/www.dtracebook.com\/index.php\/Main_Page\" target=\"_blank\"> DTrace Dynamic Tracing in Oracle Solaris<\/a><\/li>\n<li>Oracle Corporation:<a href=\"https:\/\/docs.oracle.com\/cd\/E18752_01\/html\/816-4854\/task_queues.html\" target=\"_blank\"> Writing Device Drivers<\/a><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The blog post describes the implementation differences between Emulex and HBA drivers which are relevant for latency. <a href=\"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/\" class=\"more-link\">Continue Reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[24,15,14],"tags":[],"class_list":["post-701","post","type-post","status-publish","format-standard","hentry","category-dtrace","category-emulex","category-solaris"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Comparing Latency on Emulex and QLogic HBA Driver - All-round Database Topics<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Comparing Latency on Emulex and QLogic HBA Driver - All-round Database Topics\" \/>\n<meta property=\"og:description\" content=\"The blog post describes the implementation differences between Emulex and HBA drivers which are relevant for latency. Continue Reading &rarr;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/\" \/>\n<meta property=\"og:site_name\" content=\"All-round Database Topics\" \/>\n<meta property=\"article:published_time\" content=\"2016-05-16T14:15:49+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-05-17T16:13:56+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/nenadnoveljic.com\/blog\/wp-content\/uploads\/2016\/04\/flow-chart_cmpl_queue.gif\" \/>\n<meta name=\"author\" content=\"Nenad Noveljic\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@NenadNoveljic\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nenad Noveljic\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/\"},\"author\":{\"name\":\"Nenad Noveljic\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#\\\/schema\\\/person\\\/51458d9dd86dbbdd19f5add451d44efa\"},\"headline\":\"Comparing Latency on Emulex and QLogic HBA Driver\",\"datePublished\":\"2016-05-16T14:15:49+00:00\",\"dateModified\":\"2018-05-17T16:13:56+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/\"},\"wordCount\":1206,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/#primaryimage\"},\"thumbnailUrl\":\"http:\\\/\\\/nenadnoveljic.com\\\/blog\\\/wp-content\\\/uploads\\\/2016\\\/04\\\/flow-chart_cmpl_queue.gif\",\"articleSection\":[\"DTrace\",\"Emulex\",\"Solaris\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/\",\"url\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/\",\"name\":\"Comparing Latency on Emulex and QLogic HBA Driver - All-round Database Topics\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/#primaryimage\"},\"thumbnailUrl\":\"http:\\\/\\\/nenadnoveljic.com\\\/blog\\\/wp-content\\\/uploads\\\/2016\\\/04\\\/flow-chart_cmpl_queue.gif\",\"datePublished\":\"2016-05-16T14:15:49+00:00\",\"dateModified\":\"2018-05-17T16:13:56+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#\\\/schema\\\/person\\\/51458d9dd86dbbdd19f5add451d44efa\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/#primaryimage\",\"url\":\"http:\\\/\\\/nenadnoveljic.com\\\/blog\\\/wp-content\\\/uploads\\\/2016\\\/04\\\/flow-chart_cmpl_queue.gif\",\"contentUrl\":\"http:\\\/\\\/nenadnoveljic.com\\\/blog\\\/wp-content\\\/uploads\\\/2016\\\/04\\\/flow-chart_cmpl_queue.gif\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/emulex-hba-driver-performance-analysis-4-case-study\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Comparing Latency on Emulex and QLogic HBA Driver\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/\",\"name\":\"All-round Database Topics\",\"description\":\"Nenad Noveljic\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#\\\/schema\\\/person\\\/51458d9dd86dbbdd19f5add451d44efa\",\"name\":\"Nenad Noveljic\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g\",\"caption\":\"Nenad Noveljic\"},\"sameAs\":[\"nenad-noveljic-9b746a6\",\"https:\\\/\\\/x.com\\\/NenadNoveljic\"],\"url\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/author\\\/nenad\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Comparing Latency on Emulex and QLogic HBA Driver - All-round Database Topics","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/","og_locale":"en_US","og_type":"article","og_title":"Comparing Latency on Emulex and QLogic HBA Driver - All-round Database Topics","og_description":"The blog post describes the implementation differences between Emulex and HBA drivers which are relevant for latency. Continue Reading &rarr;","og_url":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/","og_site_name":"All-round Database Topics","article_published_time":"2016-05-16T14:15:49+00:00","article_modified_time":"2018-05-17T16:13:56+00:00","og_image":[{"url":"http:\/\/nenadnoveljic.com\/blog\/wp-content\/uploads\/2016\/04\/flow-chart_cmpl_queue.gif","type":"","width":"","height":""}],"author":"Nenad Noveljic","twitter_card":"summary_large_image","twitter_creator":"@NenadNoveljic","twitter_misc":{"Written by":"Nenad Noveljic","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/#article","isPartOf":{"@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/"},"author":{"name":"Nenad Noveljic","@id":"https:\/\/nenadnoveljic.com\/blog\/#\/schema\/person\/51458d9dd86dbbdd19f5add451d44efa"},"headline":"Comparing Latency on Emulex and QLogic HBA Driver","datePublished":"2016-05-16T14:15:49+00:00","dateModified":"2018-05-17T16:13:56+00:00","mainEntityOfPage":{"@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/"},"wordCount":1206,"commentCount":0,"image":{"@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/#primaryimage"},"thumbnailUrl":"http:\/\/nenadnoveljic.com\/blog\/wp-content\/uploads\/2016\/04\/flow-chart_cmpl_queue.gif","articleSection":["DTrace","Emulex","Solaris"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/","url":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/","name":"Comparing Latency on Emulex and QLogic HBA Driver - All-round Database Topics","isPartOf":{"@id":"https:\/\/nenadnoveljic.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/#primaryimage"},"image":{"@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/#primaryimage"},"thumbnailUrl":"http:\/\/nenadnoveljic.com\/blog\/wp-content\/uploads\/2016\/04\/flow-chart_cmpl_queue.gif","datePublished":"2016-05-16T14:15:49+00:00","dateModified":"2018-05-17T16:13:56+00:00","author":{"@id":"https:\/\/nenadnoveljic.com\/blog\/#\/schema\/person\/51458d9dd86dbbdd19f5add451d44efa"},"breadcrumb":{"@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/#primaryimage","url":"http:\/\/nenadnoveljic.com\/blog\/wp-content\/uploads\/2016\/04\/flow-chart_cmpl_queue.gif","contentUrl":"http:\/\/nenadnoveljic.com\/blog\/wp-content\/uploads\/2016\/04\/flow-chart_cmpl_queue.gif"},{"@type":"BreadcrumbList","@id":"https:\/\/nenadnoveljic.com\/blog\/emulex-hba-driver-performance-analysis-4-case-study\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/nenadnoveljic.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Comparing Latency on Emulex and QLogic HBA Driver"}]},{"@type":"WebSite","@id":"https:\/\/nenadnoveljic.com\/blog\/#website","url":"https:\/\/nenadnoveljic.com\/blog\/","name":"All-round Database Topics","description":"Nenad Noveljic","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/nenadnoveljic.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/nenadnoveljic.com\/blog\/#\/schema\/person\/51458d9dd86dbbdd19f5add451d44efa","name":"Nenad Noveljic","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g","caption":"Nenad Noveljic"},"sameAs":["nenad-noveljic-9b746a6","https:\/\/x.com\/NenadNoveljic"],"url":"https:\/\/nenadnoveljic.com\/blog\/author\/nenad\/"}]}},"_links":{"self":[{"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/posts\/701","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/comments?post=701"}],"version-history":[{"count":1,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/posts\/701\/revisions"}],"predecessor-version":[{"id":908,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/posts\/701\/revisions\/908"}],"wp:attachment":[{"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/media?parent=701"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/categories?post=701"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/tags?post=701"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}