Warning: log write elapsed time

Maybe you’ve already seen the following warning in the log writer (lgwr) trace file correlating with slow commits:

Warning: log write elapsed time 738ms, size 20KB

Many articles found on the Internet claim that this is an indication of an IO bottleneck and recommend fixing it straight away. But that might be misleading, because the code instrumented by this trace entry comprises much more than an OS write call. I’m going to show here what exactly gets measured and how to precisely identify the underlying problem.

While the low-level analysis is Solaris specific, the conclusions are OS independent.

Firstly, the measured time interval starts in the Oracle C function kcrfw_redo_write_driver, which calls the OS function gethrtime to get the start timestamp. Unsurprisingly, kcrfw_redo_write_driver is the function that performs the redo write [1].

Secondly, kcrfw_postprocess_write, which is executed after the redo buffer has been written, retrieves the end timestamp and calculates the elapsed time.

Finally, the same function writes the warning into the lgwr trace file if the operation lasted longer than 500 ms:

#0  0x00007ffafd13d0c4 in write () from /lib/64/libc.so.1
#1  0x000000000b1910a2 in sdbgrfuwf_write_file ()
#2  0x000000000b19100a in sdbgrfwf_write_file ()
#3  0x000000000b174de0 in dbgtfdFileWrite ()
#4  0x000000000b0e07e7 in dbgtfdFileAccessCbk ()
#5  0x000000000b172665 in dbgtfWriteRec ()
#6  0x000000001d6df55a in dbgtRecVAWriteDisk ()
#7  0x000000000b170cd6 in dbgtTrcVaList_int ()
#8  0x000000000b170226 in dbgtTrc_int ()
#9  0x0000000015adb8f3 in dbktWriteTimestampWCdbInfo ()
#10 0x000000000ac2d62c in kcrfw_postprocess_write ()
#11 0x000000000ac1bf1c in kcrfw_redo_write ()
#12 0x000000000af0462c in kcrfw_redo_write_driver ()
#13 0x00000000194f3fea in ksb_act_run_int ()
#14 0x00000000194f9ebd in ksb_act_run ()
#15 0x000000000af0e263 in ksbcti ()
#16 0x000000000af0263f in ksbabs ()
#17 0x000000000af00bc7 in ksbrdp ()
#18 0x000000000aaccf54 in opirip ()
#19 0x0000000009d10a2b in opidrv ()
#20 0x0000000009d10747 in sou2o ()
#21 0x0000000009d1050f in opimai_real ()
#22 0x0000000009d100a6 in ssthrdmain ()
#23 0x0000000009d0fe17 in main ()
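
As a cross-check, these boundaries can be verified with a small DTrace sketch of my own (not part of the original post): it aggregates the Oracle functions that call gethrtime and dumps the user stack whenever a timestamped trace record is written. Keep in mind that dbktWriteTimestampWCdbInfo presumably fires for any timestamped trace record, not only for this warning.

/* verification sketch; usage: dtrace -s check_boundaries.d -p <lgwr PID> */

/* aggregate the functions that fetch timestamps via gethrtime */
pid$target::gethrtime*:entry
{
  @callers[ufunc(ucaller)] = count() ;
}

/* user stack of a trace record write, comparable to the stack above */
pid$target::dbktWriteTimestampWCdbInfo:entry
{
  ustack() ;
}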

I wrote the following DTrace script to gain insight into what exactly is instrumented by the trace file entry:

/*
(c) 2019 Nenad Noveljic All Rights Reserved

usage: dtrace -s log_write_elapsed.d -p PID
*/

#pragma D option quiet

/* the instrumented interval starts in kcrfw_redo_write_driver */
pid$target::kcrfw_redo_write_driver:entry
{
  self->entered_kcrfw_redo_write_driver = 1 ;
}

/* the first gethrtime return value after entering kcrfw_redo_write_driver
   is the start timestamp */
pid$target::gethrtime*:return
/ self->entered_kcrfw_redo_write_driver /
{
  self->log_write_started = arg1 ;
  self->entered_kcrfw_redo_write_driver = 0 ;
}

/* the instrumented interval ends in kcrfw_postprocess_write */
pid$target::kcrfw_postprocess_write:entry
{
  self->entered_kcrfw_postprocess_write = 1 ;
}

/* the gethrtime return value in kcrfw_postprocess_write is the end
   timestamp; print the independently measured elapsed time */
pid$target::gethrtime*:return
/ self->entered_kcrfw_postprocess_write /
{
  printf("log write elapsed [ms]: %d\n\n", \
    ( arg1 - self->log_write_started ) / 1000000 ) ;
  self->entered_kcrfw_postprocess_write = 0 ;
  self->log_write_started = 0 ;
}

/* time the OS writes to the redo logs within the instrumented interval */
pid$target::pwrite:entry
/ strstr(fds[arg0].fi_pathname,"redo") != 0 && self->log_write_started /
{
  self->pwrite_start = timestamp ;
}

pid$target::pwrite:return
/ self->pwrite_start /
{
  printf("OS write elapsed [ms]: %d\n", \
    ( timestamp - self->pwrite_start ) / 1000000  ) ;
  self->pwrite_start = 0 ;
}

/* trace latch gets within the instrumented interval:
   latch address, wait mode, why, where [2] */
pid$target::kslgetl:entry
/ self->log_write_started /
{
  printf("get latch: 0x%X %d 0x%X %d\n", arg0, arg1, arg2, arg3);
}

This script independently measures the instrumented time by capturing the gethrtime return values and calculating the difference. It also times the underlying OS writes to the redo logs and traces the latch allocations within the observed time period. It was tested on Oracle 18.7 / Solaris 11.4 x64 and assumes synchronous IO (disk_asynch_io=FALSE).

Here’s a sample output:

get latch: 0x6000B2B0 1 0x0 322
get latch: 0x6000B2B0 1 0x0 322
get latch: 0x6001EC40 1 0x0 2749
get latch: 0x6001EC40 1 0x0 2749
get latch: 0x969B61B0 1 0x0 3737
get latch: 0x969B61B0 1 0x0 3737
get latch: 0x969B6250 1 0x1 3737
get latch: 0x969B6250 1 0x1 3737
get latch: 0x969B62F0 1 0x2 3737
get latch: 0x969B62F0 1 0x2 3737
get latch: 0x969B6390 1 0x3 3737
get latch: 0x969B6390 1 0x3 3737
get latch: 0x969B6430 1 0x4 3737
get latch: 0x969B6430 1 0x4 3737
get latch: 0x969B64D0 1 0x5 3737
get latch: 0x969B64D0 1 0x5 3737
get latch: 0x969B6570 1 0x6 3737
get latch: 0x969B6570 1 0x6 3737
get latch: 0x6000B2B0 1 0xBE913438 321
get latch: 0x6000B2B0 1 0xBE913438 321
get latch: 0x6000B2B0 1 0x0 322
get latch: 0x6000B2B0 1 0x0 322
get latch: 0x6001EC40 1 0x0 2749
get latch: 0x6001EC40 1 0x0 2749
get latch: 0x969B61B0 1 0x0 3737
get latch: 0x969B61B0 1 0x0 3737
get latch: 0x969B6250 1 0x1 3737
get latch: 0x969B6250 1 0x1 3737
get latch: 0x969B62F0 1 0x2 3737
get latch: 0x969B62F0 1 0x2 3737
get latch: 0x969B6390 1 0x3 3737
get latch: 0x969B6390 1 0x3 3737
get latch: 0x969B6430 1 0x4 3737
get latch: 0x969B6430 1 0x4 3737
get latch: 0x969B64D0 1 0x5 3737
get latch: 0x969B64D0 1 0x5 3737
get latch: 0x969B6570 1 0x6 3737
get latch: 0x969B6570 1 0x6 3737
get latch: 0x60025EF8 1 0x0 3752 
get latch: 0x60025EF8 1 0x0 3752
OS write elapsed [ms]: 3
log write elapsed [ms]: 3

We can observe two things. One is that the OS write is normally the largest contributor to the instrumented elapsed time. The other is that there are lots of latch allocations within the instrumented time period.

In particular, one of the allocated latches is redo writing:

column name format a12
select name from v$latch where addr like '%60025EF8%' ;

NAME
------------
redo writing

Now it’s time for a little experiment (but not in production!) – I’ll allocate the redo writing latch by directly calling the Oracle C function kslgetl [2] from another session before committing some changed data:

oradebug setmypid
oradebug call kslgetl 0x60025EF8 1 0x0 3752

As expected, the commit hung waiting on the redo writing latch, and it completed immediately after I terminated the latch holder session.

The DTrace script produced the following output:

$ dtrace -s log_write_elapsed.d -p 4198

OS write elapsed [ms]: 13
OS write elapsed [ms]: 5
OS write elapsed [ms]: 2
log write elapsed [ms]: 12872

As you can see, the measured time is now significantly longer than the time it took for the IO operation to complete.
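
If you want to see where such a gap comes from, you could extend the script above with the following clauses (my addition, reusing the self->log_write_started flag and assuming the wait for a busy latch is spent inside kslgetl) to attribute the time to individual latch gets:

/* time spent in each latch get within the instrumented interval */
pid$target::kslgetl:entry
/ self->log_write_started /
{
  self->latch_addr = arg0 ;
  self->kslgetl_start = timestamp ;
}

pid$target::kslgetl:return
/ self->kslgetl_start /
{
  @latch_time[self->latch_addr] = sum(timestamp - self->kslgetl_start) ;
  self->latch_addr = 0 ;
  self->kslgetl_start = 0 ;
}

END
{
  printa("time in kslgetl for latch 0x%X: %@d ns\n", @latch_time) ;
}

In this experiment, the bulk of the missing time should show up under the redo writing latch 0x60025EF8.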

The trace file entry matches the time measured by the DTrace script:

*** 2019-10-07T08:15:33.531014+02:00
Warning: log write elapsed time 12872ms, size 20KB

This proves that slow IO isn't the only possible cause of this warning.

Anyway, if you notice this message, you can use Tanel Poder’s ash_wait_chains.sql to see if the log writer was waiting on something – before compulsively addressing an “IO issue”:

@ash_wait_chains.sql program2||event2 session_id=148 "to_date('07.10.2019 08:15:21','dd.mm.yyyy hh24:mi:ss')" "to_date('07.10.2019 08:15:34','dd.mm.yyyy hh24:mi:ss')"

-- Display ASH Wait Chain Signatures script v0.4 BETA by Tanel Poder ( http://blog.tanelpoder.com )

%This     SECONDS        AAS
------ ---------- ----------
WAIT_CHAIN
--------------------------------------------------------------------------------
FIRST_SEEN          LAST_SEEN
------------------- -------------------
 100%          12         .9
-> (sqlplus) log file sync  -> (LGWR) latch: redo writing  -> [idle blocker
1,253,8655]
2019-10-07 08:15:21 2019-10-07 08:15:32

The output above shows us that the log writer was waiting on the redo writing latch held by session 253, the session that issued the kslgetl call. Notice that this session is correctly annotated as an “idle blocker”, as it didn’t do anything after allocating the latch and, consequently, didn’t produce any samples in v$active_session_history.

In conclusion, the code instrumented by the log writer trace file entry entails not only the OS write call but also the latch handling. Therefore, triage is the first thing to do after spotting long redo writes in the lgwr trace: was it an IO bottleneck or, perhaps, some wait chain? You can crosscheck the former with OS utilities and/or log file parallel write waits. The latter you can investigate with ash_wait_chains.sql. In any case, there isn't any room left for guesswork and speculation.
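
For the IO crosscheck on Solaris, a standalone DTrace sketch like the following one (my own, again assuming synchronous IO, i.e. disk_asynch_io=FALSE) gives the latency distribution of the OS writes to the redo logs:

/* usage: dtrace -s redo_write_latency.d -p <lgwr PID> */

#pragma D option quiet

pid$target::pwrite:entry
/ strstr(fds[arg0].fi_pathname,"redo") != 0 /
{
  self->ts = timestamp ;
}

pid$target::pwrite:return
/ self->ts /
{
  @lat["redo pwrite latency [ns]"] = quantize(timestamp - self->ts) ;
  self->ts = 0 ;
}

If the latency distribution stays low while the warnings keep appearing, a wait chain is the more likely culprit.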

References:

  • [1] Frits Hoogland, A look into Oracle redo, part 3: log writer work cycle overview. February 12, 2018.
  • [2] Andrey Nikolaev, New features of Latches and Mutexes in Oracle 12c. December 9, 2015.
Nenad Noveljic

5 Comments

  1. Thanks for sharing

    • Thanks.
      We’re occasionally seeing lgwr writes lasting much longer than the underpinning log writer IOs. However, we still don’t know the reason. I used latches here just to make the point that IO isn’t the only activity measured by this trace entry. But, of course, it can be something completely different.

  2. Nenad,
    As far as I am aware, the “redo writing latch” is how lgwr announces to the world that it is writing, and nothing should happen between lgwr taking the latch and lgwr starting to write – apart, of course, from O/S descheduling.

    If a session has written to the log buffer with a “commit record”, then it checks the state of the latch, and if the latch is held exclusive it goes into a “log file sync” wait without sending a message to the log writer, because it knows the log writer is writing and will carry on writing and doesn’t need to be told to write.

    Off the top of my head, the only explanation (in the simple case of no LGxx slaves) for “lgwr reporting significantly more time to write” than the O/S is CPU starvation (perhaps because of multiple foregrounds all reaching a commit and spinning on latches) between the get and the write.

    • Jonathan,
      I randomly picked the “redo writing” latch here just to demonstrate that the warning entry measures more than the time spent in the I/O subsystem.

      On the “real” database where the warning is generated, ASH suggests that CPU-bound operations (LOB operations, in particular) indeed took significantly longer than usual when the warning appeared. However, according to OS performance data, the CPUs weren’t starved. So I scheduled a simple CPU benchmark to verify whether we really have a transient CPU performance issue when this problem happens again.

      Thanks,
      Nenad
