Bad Commit Performance due to Slow Control File Writes

Wait chains

I observed user sessions intermittently waiting longer on log file sync. Unusually, there weren’t any indications of slow redo log writes. In fact, according to the OS performance statistics, all redo log writes completed within 5 ms. But if the redo writes were fast, what was the log writer (LGWR) actually doing?
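
For illustration, one quick way to see such a discrepancy from inside the database is to compare the latency histogram of the commit wait with that of the redo write itself; a minimal sketch against V$EVENT_HISTOGRAM (the figures will of course look different on every system):

-- latency distribution of commit waits vs. redo log writes
select event, wait_time_milli, wait_count
  from v$event_histogram
 where event in ('log file sync', 'log file parallel write')
 order by event, wait_time_milli ;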

If you’ve licensed the Diagnostic and Tuning Pack, Tanel Poder’s ash_wait_chains.sql is the perfect tool for displaying such hidden dependencies – it proved invaluable for uncovering the root cause:

@ash_wait_chains.sql program2||event2 1=1 "to_date('04.09.2019 07:05:09','dd.mm.yyyy hh24:mi:ss')" "to_date('04.09.2019 07:06:09','dd.mm.yyyy hh24:mi:ss')"                                                   

-- Display ASH Wait Chain Signatures script v0.2 BETA by Tanel Poder ( http://blog.tanelpoder.com )
%This     SECONDS        AAS
------ ---------- ----------
WAIT_CHAIN
--------------------------------------------------------------------------------
  46%         146        2.4
-> (App.exe) log file sync  -> (LGWR) control file parallel write

  19%          60          1
-> (LGWR) control file parallel write

...

Basically, the user sessions were waiting on the LGWR, which, in turn, was mostly waiting on control file writes to complete. The OS statistics also confirmed sluggish control file writes. In this case, an OS bug had been causing slow writes on the ZFS data file pool during peak load periods, while the IO operations on the other pools remained fast. Understandably, the control file placed in the data file pool was affected by this problem too. After separating it from the data files, we predictably haven’t seen any significant log file sync waits anymore, even when the data file writes were slowing down significantly.
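
For completeness, here’s a minimal sketch of how such a relocation can be done; the target paths are placeholders, and the control files themselves have to be copied at the OS level while the database is down:

-- point the instance to the new control file locations (placeholder paths)
alter system set control_files =
  '/u00/oracle/ctlpool/DMUR1/ctl1DMUR1.dbf',
  '/u00/oracle/ctlpool/DMUR1/ctl2DMUR1.dbf',
  '/u00/oracle/ctlpool/DMUR1/ctl3DMUR1.dbf'
  scope = spfile ;

shutdown immediate
-- copy the control files to the new locations at the OS level, then:
startup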

But what information is the LGWR writing into the control files? And when does that happen? After all, when a session issues a commit, the LGWR normally writes only into the redo logs, as described in [Hoogland 2018].

control file parallel write

I captured the LGWR call stacks during control file writes to figure out what’s triggering them:

ps -ef | grep DMUR1 | grep lgwr
oracle  8098  7951   0 12:07:45 ?           0:37 ora_lgwr_DMUR1

sudo -u root /usr/sbin/dtrace -n '
pid$target::pwrite:entry / arg0 == 256 / { ustack(); }
' -p 8098
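
As a side note, the LGWR’s OS PID can also be looked up from inside the database; a small sketch, assuming V$PROCESS.PNAME is populated on this release:

-- OS process id of the log writer
select spid from v$process where pname = 'LGWR' ;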

256 is the file descriptor of the first control file, which can be verified with pfiles:

pfiles 8098
256: S_IFREG mode:0640 dev:275,69139 ino:10 uid:1000 gid:1000 size:25706496 (25MB)
  O_RDWR|O_DSYNC|O_LARGEFILE FD_CLOEXEC
  advisory write lock set by process 3626
  /u00/oracle/oradata01/DMUR1/ctl/ctl1DMUR1.dbf
  offset:0 
257: S_IFREG mode:0640 dev:275,69168 ino:10 uid:1000 gid:1000 size:25706496 (25MB)
  O_RDWR|O_DSYNC|O_LARGEFILE FD_CLOEXEC
  advisory write lock set by process 3626
  /u00/oracle/oradata02/DMUR1/ctl/ctl2DMUR1.dbf
  offset:0 
258: S_IFREG mode:0640 dev:275,69198 ino:10 uid:1000 gid:1000 size:25706496 (25MB)
  O_RDWR|O_DSYNC|O_LARGEFILE FD_CLOEXEC
  advisory write lock set by process 3626
  /u00/oracle/oradata03/DMUR1/ctl/ctl3DMUR1.dbf
  offset:0
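
The three paths can also be cross-checked against the control file names known to the instance:

-- control file locations as seen by the instance
select name from v$controlfile ;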

The following simple loop was used to create activity on the database:

create table t ( c varchar2(4000) ) ;

begin
  for i in 1..1e6
  loop
    insert into t values (lpad('A',4000)) ;
    commit ;
  end loop ;
end ;
/
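
With a 4000-byte row committed on every iteration, the loop generates redo quickly, so log switches should occur frequently. A rough way to verify the switch rate is a sketch against V$LOG_HISTORY (the one-day window is arbitrary):

-- log switches per hour over the last day
select trunc(first_time, 'HH24') switch_hour, count(*) switches
  from v$log_history
 where first_time > sysdate - 1
 group by trunc(first_time, 'HH24')
 order by 1 ;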

Here’s the captured stack:

libc.so.1`pwrite
oracle`skgfqio+0x5e5
oracle`ksfd_skgfqio+0x22c
oracle`ksfdgo+0x1a7
oracle`ksfd_sbio+0xe12
oracle`ksfdbio+0x109c
oracle`ksfdbio+0x12fa
oracle`kccwbp+0x466
oracle`kccrec_wdf+0x1ab
oracle`kccfhb+0x2b
oracle`kcccmt+0xdd
oracle`kcrfnl+0x3e6
oracle`kcrfwl+0x3b0
oracle`ksb_act_run_int+0x6a
oracle`ksb_act_run+0xad
oracle`ksbabs+0x3c9
oracle`ksbrdp+0xa77
oracle`opirip+0x2c4
oracle`opidrv+0x24b
oracle`sou2o+0x97

kcrfwl is the most interesting function. Its annotation “kernel cache recovery file write/broadcast SCN sWitch logs” (according to Frits Hoogland’s Oracle C function annotations) tells us that it’s invoked during a log switch. This information is sufficient to reproduce the behavior observed on the production database.
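
Before reaching for the debugger, a simple cross-check from SQL already hints at the same dependency: the LGWR’s cumulative control file parallel write waits should increase after a manual log switch. A minimal sketch, assuming the LGWR session can be identified by its PROGRAM column:

-- LGWR's control file write waits (run before and after the log switch)
select se.sid, se.total_waits, round(se.time_waited_micro / 1e3) time_waited_ms
  from v$session s, v$session_event se
 where se.sid = s.sid
   and s.program like '%(LGWR)%'
   and se.event = 'control file parallel write' ;

alter system switch logfile ;

-- re-running the first query should show total_waits incremented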

log file sync

First, I’ll attach gdb to the LGWR and set a conditional breakpoint that fires when the process writes to the control file:

attach 8098
b *pwrite if $rdi==256
c

Next, I’ll initiate a log switch:

alter system switch logfile ;

Finally, in another session, I’ll insert and commit a row:

insert into t values (1);
commit ;

As expected, the commit doesn’t return, because the session is waiting on the LGWR to complete:

column sid format 999
column program format a30
column event format a30
column state format a10

set linesize 200

select sid,program,event,state,seconds_in_wait from v$session 
  where sid in (37,97) ;

 SID PROGRAM                        EVENT                          STATE      SECONDS_IN_WAIT
---- ------------------------------ ------------------------------ ---------- ---------------
  37 sqlplus@svdbp01i (TNS V1-V3)   log file sync                  WAITING                138
  97 oracle@svdbp01i (LGWR)         control file parallel write    WAITING                228

The LGWR, in turn, is waiting on the “slow” control file write to finish. Of course, that “slow” write is simulated with the breakpoint, and it won’t finish until I resume the execution.

Also, ash_wait_chains.sql confirms the fingerprint of the production issue:

@ash_wait_chains program2||event2 session_id=37 sysdate-400 sysdate


-- Display ASH Wait Chain Signatures script v0.4 BETA by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS        AAS
------ ---------- ----------
WAIT_CHAIN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FIRST_SEEN          LAST_SEEN
------------------- -------------------
  60%         418          0
-> (sqlplus) log file sync  -> (LGWR) control file parallel write
2019-09-05 15:29:09 2019-09-05 15:36:06

Conclusion

Frankly, it wasn’t my first thought that slower control file writes might impair commit performance. In fact, control file placement isn’t even considered in the white paper Configuring Oracle Solaris ZFS for an Oracle Database. That document recommends putting data and redo files into separate pools for performance, but it doesn’t even mention control files. Of course, after having understood the impact, we separated the control files from the data files.

Further, this case study reminded me that the underlying issue might not be obvious by observing only the affected session. Therefore, we shouldn’t rush into implementing “solutions” before having really understood the problem. Thanks to Tanel, we have ash_wait_chains.sql – a superb tool for identifying the main blocker, which is sometimes hidden in a remote link of the wait chain.

Last but not least, once we’ve understood the dependencies, we can even go a step further and reproduce the problem with a debugger, which is surely a great way of studying the internals and understanding the undocumented subtleties of proprietary software.

References

[Hoogland 2018] Frits Hoogland. (February 27, 2018). A look into Oracle redo, part 5: the log writer writing

Nenad Noveljic

8 Comments

  1. Did the control file parallel write statistic show any significant totals, or at least the wait histogram? I would hope this could draw attention to the problem without monitoring a single request during a log file switch.

    • Yes, the control file write histograms showed several outliers. Yet, the total control file write wait time was insignificant over a longer time period – as is usually the case with short glitches. There are two reasons why we decided to monitor the LGWR trace for longer writes. One is that LGWR IOs > 500 ms mean that something is broken. The other is that we get the exact time when the outlier occurred, so we can correlate it with OS performance data and ASH. Of course, we generate the histograms only for the outliers, so that we don’t flood the monitoring console when there is some massive storage or OS problem. This is what the alert looks like in this case:
      WARNING – lgwr outliers 2000 ms found: 21
      affected databases: DB1, DB2,…
      max: 8481 msec
      timestamp(max): 20191020 21:11:03
      db(max): DB10
      0 ###
      1000
      2000 ######
      3000 ###
      4000 ####
      5000 #####
      6000 ####
      7000 ########
      8000 ###############

  2. Thanks Nenad.
    We had the same issue last year.
    Control files were on the same zpool as data files. DBWR, CKPT and LGWR were struggling under long control file parallel writes.
    We solved it by moving the control files to a different location. Can you share more details about the OS bug?
    Thanks

  3. Hi Nenad,
    In case the control files are placed on ASM, does this also apply, or would we see the LGWR as a blocker waiting on control file sequential reads?

    • Hi Ravi,
      I assume that it applies to ASM too. It can be tested with the procedure I described in the blog post.

  4. Hi Nenad,

    this is also one more argument to size redo logs properly, in order to avoid frequent log switches and the resulting control file writes due to kcrfwl.

    Thanks for sharing!

    Regards,
    Martin

    • Hi Martin,

      You’re welcome.

      Larger redo logs would indeed have reduced the number of commit outliers. As our goal was to eliminate them completely, we moved the control files to a faster file system instead.

      Best regards,
      Nenad
