Wait chains
I observed user sessions intermittently waiting longer on log file sync. Atypically, there weren't any indications of slow redo log writes. In fact, according to the OS performance statistics, all redo log writes completed within 5 ms. But if the redo writes were fast, what was the log writer (LGWR) actually doing?
If you’ve licensed the Diagnostic and Tuning Pack, Tanel Poder's ash_wait_chains.sql is the perfect tool for displaying such hidden dependencies – it proved invaluable for uncovering the root cause:
@ash_wait_chains.sql program2||event2 1=1 "to_date('04.09.2019 07:05:09','dd.mm.yyyy hh24:mi:ss')" "to_date('04.09.2019 07:06:09','dd.mm.yyyy hh24:mi:ss')"
-- Display ASH Wait Chain Signatures script v0.2 BETA by Tanel Poder ( http://blog.tanelpoder.com )
%This SECONDS AAS
------ ---------- ----------
WAIT_CHAIN
--------------------------------------------------------------------------------
46% 146 2.4
-> (App.exe) log file sync -> (LGWR) control file parallel write
19% 60 1
-> (LGWR) control file parallel write
...
Basically, the user sessions were waiting on the LGWR, which, in turn, was mostly waiting on control file writes to complete. The OS statistics also confirmed sluggish control file writes. In this case, an OS bug had been causing slow writes on the ZFS data file pool during peak load periods. It's worth noting that the I/O operations on the other pools were fast during that time. Understandably, the control file that was placed in the data file pool was affected by this problem too. After separating it from the data files, we predictably haven't seen any significant log file sync waits anymore, even when the data file writes slowed down considerably.
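For reference, here's a minimal SQL*Plus sketch of such a relocation, with purely hypothetical paths: the control_files parameter is changed in the spfile, the instance is shut down, the files are copied to their new location at the OS level, and the instance is started again.
-- Minimal sketch, hypothetical paths: point the control file copies to a pool
-- that is separate from the data files.
alter system set control_files =
  '/u00/oracle/ctl01/DB1/ctl1DB1.dbf',
  '/u00/oracle/ctl02/DB1/ctl2DB1.dbf',
  '/u00/oracle/ctl03/DB1/ctl3DB1.dbf'
  scope = spfile ;
shutdown immediate
-- copy the control files to the new locations at the OS level, then:
startup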
But what information is the LGWR writing into the control files? And when does that happen? After all, when a session issues a commit, the LGWR normally writes only into the redo logs, as described in [Hoogland 2018].
control file parallel write
I captured the LGWR call stacks during control file writes to figure out what's triggering them:
ps -ef | grep DMUR1 | grep lgwr
oracle 8098 7951 0 12:07:45 ? 0:37 ora_lgwr_DMUR1
sudo -u root /usr/sbin/dtrace -n '
pid$target::pwrite:entry / arg0 == 256 / { ustack(); }
' -p 8098
256 is the file descriptor of the first control file, which can be verified with pfiles:
pfiles 8098
256: S_IFREG mode:0640 dev:275,69139 ino:10 uid:1000 gid:1000 size:25706496 (25MB)
O_RDWR|O_DSYNC|O_LARGEFILE FD_CLOEXEC
advisory write lock set by process 3626
/u00/oracle/oradata01/DMUR1/ctl/ctl1DMUR1.dbf
offset:0
257: S_IFREG mode:0640 dev:275,69168 ino:10 uid:1000 gid:1000 size:25706496 (25MB)
O_RDWR|O_DSYNC|O_LARGEFILE FD_CLOEXEC
advisory write lock set by process 3626
/u00/oracle/oradata02/DMUR1/ctl/ctl2DMUR1.dbf
offset:0
258: S_IFREG mode:0640 dev:275,69198 ino:10 uid:1000 gid:1000 size:25706496 (25MB)
O_RDWR|O_DSYNC|O_LARGEFILE FD_CLOEXEC
advisory write lock set by process 3626
/u00/oracle/oradata03/DMUR1/ctl/ctl3DMUR1.dbf
offset:0
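The same paths can, of course, be cross-checked from the database side:
-- Control file locations as the database sees them (should match the pfiles output above).
select name from v$controlfile ;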
The following simple loop was used to create activity on the database:
create table t ( c varchar2(4000) ) ;
begin
for i in 1..1e6
loop
insert into t values (lpad('A',4000)) ;
commit ;
end loop ;
end ;
/
Here’s the captured stack:
libc.so.1`pwrite
oracle`skgfqio+0x5e5
oracle`ksfd_skgfqio+0x22c
oracle`ksfdgo+0x1a7
oracle`ksfd_sbio+0xe12
oracle`ksfdbio+0x109c
oracle`ksfdbio+0x12fa
oracle`kccwbp+0x466
oracle`kccrec_wdf+0x1ab
oracle`kccfhb+0x2b
oracle`kcccmt+0xdd
oracle`kcrfnl+0x3e6
oracle`kcrfwl+0x3b0
oracle`ksb_act_run_int+0x6a
oracle`ksb_act_run+0xad
oracle`ksbabs+0x3c9
oracle`ksbrdp+0xa77
oracle`opirip+0x2c4
oracle`opidrv+0x24b
oracle`sou2o+0x97
kcrfwl is the most interesting function. Its annotation "kernel cache recovery file write/broadcast SCN sWitch logs" (according to Frits Hoogland's Oracle C functions annotations) tells us that it's invoked during a log switch. This information is sufficient to reproduce the behavior observed on the production database.
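Before going into the debugger, the log switch connection can also be loosely cross-checked from ASH (Diagnostic Pack required). This is only a rough sketch: ASH samples once per second, so short control file writes may not be captured at all.
-- Rough sketch: count LGWR samples on "control file parallel write" within a few
-- seconds of the recorded log switch times (ASH may miss short writes entirely).
select h.first_time       as log_switch_time,
       count(a.sample_id) as lgwr_cf_write_samples
from   v$log_history h
       left join v$active_session_history a
              on  a.program like '%(LGWR)%'
              and a.event = 'control file parallel write'
              and a.sample_time between h.first_time
                                    and h.first_time + interval '10' second
where  h.first_time > sysdate - 1/24
group  by h.first_time
order  by h.first_time ;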
log file sync
First, I'll attach gdb to the LGWR and set a conditional breakpoint that fires when the process writes to the control file ($rdi holds the first pwrite argument, i.e. the file descriptor):
attach 8098
b *pwrite if $rdi==256
c
Next, I’ll initiate a log switch:
alter system switch logfile ;
Finally, in another session, I’ll insert and commit a row:
insert into t values (1);
commit ;
As expected, the commit doesn't return, because it's waiting on the LGWR to complete the write:
column sid format 999
column program format a30
column event format a30
column state format a10
set linesize 200
select sid,program,event,state,seconds_in_wait from v$session
where sid in (37,97) ;
SID PROGRAM EVENT STATE SECONDS_IN_WAIT
---- ------------------------------ ------------------------------ ---------- ---------------
37 sqlplus@svdbp01i (TNS V1-V3) log file sync WAITING 138
97 oracle@svdbp01i (LGWR) control file parallel write WAITING 228
The LGWR, in turn, is waiting on the “slow” control file write to finish. Of course, that “slow” write is simulated with the breakpoint, and it won’t finish until I resume the execution.
Also, ash_wait_chains.sql confirms the fingerprint of the production issue:
@ash_wait_chains program2||event2 session_id=37 sysdate-400 sysdate
-- Display ASH Wait Chain Signatures script v0.4 BETA by Tanel Poder ( http://blog.tanelpoder.com )
%This SECONDS AAS
------ ---------- ----------
WAIT_CHAIN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FIRST_SEEN LAST_SEEN
------------------- -------------------
60% 418 0
-> (sqlplus) log file sync -> (LGWR) control file parallel write
2019-09-05 15:29:09 2019-09-05 15:36:06
Conclusion
Frankly, it wasn't my first thought that slower control file writes might impair commit performance. In fact, control file placement isn't even considered in the white paper Configuring Oracle Solaris ZFS for an Oracle Database. The document recommends putting data and redo files into separate pools for performance, but it doesn't even mention control files. Of course, after having understood the impact, we separated the control files from the data files.
Further, this case study reminded me that the underlying issue might not be obvious when observing only the affected session. Therefore, we shouldn't rush into implementing "solutions" before we've really understood the problem. Thanks to Tanel we have ash_wait_chains.sql – a superb tool for identifying the main blocker, which is sometimes hidden in a remote link of the wait chain.
Last but not least, once we've understood the dependencies, we can even go a step further and reproduce the problem with a debugger, which is surely a great way of studying the internals and understanding undocumented subtleties of proprietary software.
References
[Hoogland 2018] Frits Hoogland. (February 27, 2018). A look into Oracle redo, part 5: the log writer writing
Comments
Did the control file parallel write statistic show any significant totals, or at least show up in the wait histogram? I would hope this could draw attention to the problem without monitoring single requests during a log file switch.
Yes, the control file write histograms showed several outliers. Yet, the total control file write wait time was insignificant over a longer time period – as is usually the case with short glitches. There are two reasons why we decided to monitor the lgwr trace for longer writes. One is that lgwr I/Os > 500 ms mean that something is broken. The other is that we get the exact time when the outlier occurred, so we can correlate it with OS performance data and ASH. Of course, we generate histograms of the outliers, so that we don't flood the monitoring console when there is some massive storage or OS problem. This is how the alert looks in this case (a sketch of the underlying histogram query follows after the alert):
WARNING – lgwr outliers 2000 ms found: 21
affected databases: DB1, DB2,…
max: 8481 msec
timestamp(max): 2019102021:11:03
db(max): DB10
0 ###
1000
2000 ######
3000 ###
4000 ####
5000 #####
6000 ####
7000 ########
8000 ###############
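For reference, a minimal sketch of the kind of histogram check discussed above. v$event_histogram accumulates since instance startup; for historical data, the AWR counterpart dba_hist_event_histogram would be needed.
-- Hedged sketch: wait-time buckets for control file writes since instance startup.
-- Outliers show up in the high-millisecond buckets even when the total wait time
-- looks unremarkable.
select event, wait_time_milli, wait_count
from   v$event_histogram
where  event = 'control file parallel write'
order  by wait_time_milli ;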
Thanks Nenad.
We had the same issue last year.
The control files were on the same zpool as the data files. DBWR, CKPT, and LGWR were struggling under long control file parallel writes.
We solved it by moving the control files to a different location. Can you share more details about the OS bug?
Thanks
Hi Matteo,
Find below the bug list:
17388294 – kmem cache of dnodes should be reaped
22347665 – large kmem caches could be reaped proactively
22908271 – arc metadata fragmentation leads to memory shortage
Also, I wouldn’t rely on user_reserve_hint_pc, see https://nenadnoveljic.com/blog/arc-resizing-user_reserve_hint_pct/
Hi Nenad,
If the control files are placed on ASM, does this also apply to this particular situation, or would we see the LGWR as a blocker waiting on control file sequential reads?
Hi Ravi,
I assume that it applies to ASM too. It can be tested with the procedure I described in the blog post.
Hi Nenad,
This is also one more argument for sizing redo logs properly, to avoid frequent log switches and the control file writes triggered by kcrfwl.
Thanks for sharing!
Regards,
Martin
Hi Martin,
You’re welcome.
Larger redo logs would indeed have reduced the number of commit outliers. But as our goal was to eliminate them completely, we moved the control files to a faster file system instead.
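For example, a rough sketch of how we'd check the log switch frequency (v$log_history retention depends on control_file_record_keep_time):
-- Rough sketch: log switches per hour over the last day.
select trunc(first_time, 'HH24') as hour,
       count(*)                  as log_switches
from   v$log_history
where  first_time > sysdate - 1
group  by trunc(first_time, 'HH24')
order  by 1 ;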
Best regards,
Nenad