ZFS ARC Resizing (user_reserve_hint_pct)

Introduction

While analyzing an IO latency issue I’ve noticed that a completion thread of the QLogic HBA  driver occasionally spends a lot of time in the cv_wait kernel function. cv_wait waits for the signal coming from the I/O interrupt handler and atomically allocates the mutex. I traced a single execution of cv_wait to narrow down the reason for the elapsed time:


#!/usr/sbin/dtrace
# pragma D option flowindent

fbt::cv_wait:entry
/ arg0 && arg1 && \
  caller >= (uintptr_t)&qlc`qlc_completion_thread && \
  caller < (uintptr_t)&qlc`qlc_completion_thread+sizeof(qlc`qlc_completion_thread)
/
{
self->key = arg0 ;
printf("%s %s %d \n",probemod,probefunc,timestamp);
}

fbt:::entry,
fbt:::return
/ self->key /
{
printf("%s %s %d \n",probemod,probefunc,timestamp);
}

fbt::cv_wait:return
/ self->key /
{
exit(0);
}

Here I used Max Bruning’s cunning trick for tracing a function when called from a specific caller. In this case it is qlc_completion_thread.

Below is the relevent excerpt from the output:

37 -> resume   unix resume 27471744230937588
37   -> do_splx   unix do_splx 27471744238529161

In the recorded case, 8 ms were spent in the kernel function resume until do_splx was called. resume is the kernel function which handles the switching of the thread on the processor. I have captured resume executions taking up to 200 ms which is abnormal. do_splxs, the next function call, gives us a hint about the possible cause. do_splxs is namely the function used to process CPU cross-calls. The cross-calls are performed to clean up processor’s MMU when user processes issue a virtual address unmap operation.

For further reading, I recommend the Solaris Internals book by Richard Dougall and Jim Mauro as it is the best reference on the Solaris kernel.

Measuring Cross Calls

As for the next step, I examined the cross call activity on the server:

#!/usr/sbin/dtrace

sysinfo:::xcalls
{
@[stack(),execname] = count() ;
}

profile:::tick-1s
{
printf("\n\n%Y\n",walltimestamp);
printa("%k %s %@d\n",@);
trunc(@);
}

The following stack trace generated 608’388 cross calls in just one second! Besides that, it can be seen that the stack has been executed by the kmem_task process.

unix`xc_call+0x46
unix`hat_tlb_inval+0x498
unix`x86pte_inval+0xad
unix`hat_pte_unmap+0xec
unix`hat_unload_callback+0x181
unix`hat_unload+0x41
unix`segkmem_free_pages_internal+0x8d
unix`segkmem_free_vn+0x41
unix`segkmem_zio_free+0x2e
genunix`vmem_xfree+0x10a
genunix`vmem_free+0x27
genunix`kmem_slab_destroy+0x87
genunix`kmem_slab_free+0x2c7
genunix`kmem_magazine_destroy+0xdf
genunix`kmem_depot_ws_reap+0x77
genunix`kmem_cache_reap+0x76
genunix`kmem_do_processing+0x28
genunix`kmem_async+0x159
genunix`kmem_sysdc_pinit+0x9f
unix`thread_start+0x8
kmem_task 608388

The total CPU time of kmem_task was also being very high and was permanently inreasing:

ps -ef | grep kmem_task
...  Jul 02 ?        18554:51 kmem_task

ZFS ARC

By looking at the stack, it can be concluded that the cross-calls have been caused by memory deallocations. Furthermore, there is a zfs function on the stack (segkmem_zio_free). Therefore, the main suspect is the ZFS ARC cache. The ARC resizing operations can be tracked with the arcreap.d script. The execution of arc_adjust indeed correlates with the execution of the observed stack by kmem_task which causes cross-call storms. Lastly, I observed the size of ARC which was changing over time and was revolving around 140GB:

kstat -n arcstats | grep c
module: zfs                             instance: 0
name:   arcstats                        class:    misc
c                               141817191504
c_max                           548525592576

kstat -n arcstats | grep c
module: zfs                             instance: 0
name:   arcstats                        class:    misc
c                               141495408624
c_max                           548525592576
kstat -n arcstats | grep c
module: zfs                             instance: 0
name:   arcstats                        class:    misc
c                               142111343152
c_max                           548525592576
kstat -n arcstats | grep c
module: zfs                             instance: 0
name:   arcstats                        class:    misc
c                               142900093648
c_max                           548525592576
kstat -n arcstats | grep c
module: zfs                             instance: 0
name:   arcstats                        class:    misc
c                               142900038656
c_max                           548525592576
kstat -n arcstats | grep c
module: zfs                             instance: 0
name:   arcstats                        class:    misc
c                               142890142016
c_max                           548525592576

As c_max equals the total physical memory on the system, it can be concluded that zfs_arc_max is not set. Inspite of that, the Kernel struggled to keep the memory at a certain size which caused numerous ARC shrinks, which, in turn, were the reason for the fatal cross-call storms.

user_reserve_hint_pct

Everything above was measured on Solaris 11.2 and 11.3. I crosschecked it with Solaris 11.1 and observerd completetely different behaviour.

  • No cross calls.
  • kmem_task consumed a fraction of the CPU time:
ps -ef | grep kmem_task
...  Oct 04 ?         222:09 kmem_task
  • memory consumption constant = max_size
kstat -n arcstats | grep c
name:   arcstats                        class:    misc
c                               256000000000
c_max                           256000000000
  • max_size < total memory

Then I compared the relevant Kernel parameters and found out the difference:

Solaris 11.1:

zfs_arc_max=0x3B9ACA0000

Solaris 11.2,11.3:

user_reserve_hint_pct=60

Instead of setting the hard max limit in Solaris 11.2 and 11.3, the new parameter user_reserve_hint_pct was set. According to the Oracle Documentation the new parameter was introduced for flexibility with the intention to replace zfs_arc_max. However, as it is not a hard limit it introduced the overhead of continous resizing which causes cross-call storms.

As a solution I removed user_reserve_hint_pct and set zfs_arc_max also on Solaris 11.3 which completely resolved the problem with I/O outliers.

Conclusion

I observed increased I/O latencies on the HBA device driver. The analysis showed that the most of the time was spent on switching the completion thread on the CPU. The reason were the cross call storms which, in turn, were caused by constant shrinking of ZFS ARC cache. The side-effect was caused by the kernel parameter user_reserve_hint_pct. As a workaround, I unset the parameter and falled back to zfs_arc_max.

References

Updates

1. December 2016

This problem is related to the problem described in my other blog post Solaris 11.3 Hangs Because of Kernel Object Manager . In the meantime, Oracle released the patches which help mitigate the problem. However, the issue has not been completely resolved yet. Recently, Oracle published the following note Solaris 11.3: kmem cache reaping of buffers with large number of entries can lead to long delays which could cause Cluster node eviction (Doc ID 2205638.1) .

 

 

 

 

Nenad Noveljic

2 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *