Measuring time
In the previous blog post, I described a hanging issue caused by a debugging event in an Oracle 21c database. In a nutshell, the database writers started compulsively signaling each other to generate debugging information. This non-DBWR work distracted the database writers so much that they completely neglected their core business – writing dirty blocks to files. They were spending an extreme amount of time in the part of the code instrumented by the “oradebug request completion” wait event. In this article, I’ll describe the internals of this undocumented wait event. I conducted the research on a 21.5 database.
The first question is, how long does the execution of a single debugging event last? We can find that out by measuring the time between the Oracle C function calls kskthbwt and kskthewt. These functions mark the beginning and the end of a wait event, respectively. The second argument to kskthbwt is the event code, which can be queried from the data dictionary:
select event#,name from v$event_name where name like 'oradebug request completion' ;
EVENT# NAME
---------- ----------------------------------------------------------------
588 oradebug request completion
Now we can use this event code to filter only the event of interest:
sudo bpftrace -e 'uprobe:/u00/oracle/orabase/product/21.5.0.0.220131_a/bin/oracle:kskthbwt+2
/ pid == 1857586 && arg1 == 588 /
{
@start[tid] = nsecs ;
}
uprobe:/u00/oracle/orabase/product/21.5.0.0.220131_a/bin/oracle:kskthewt+2
/ @start[tid] /
{
$duration_us = ( nsecs - @start[tid] ) / 1000 ;
time("%H:%D:%S ");
printf("duration [us] %d\n", $duration_us);
delete(@start[tid])
}'
17:06/10/22:22 duration [us] 5002550
17:06/10/22:22 duration [us] 14
17:06/10/22:22 duration [us] 5405
17:06/10/22:22 duration [us] 4519
17:06/10/22:22 duration [us] 4501
17:06/10/22:22 duration [us] 4518
17:06/10/22:22 duration [us] 5568
17:06/10/22:22 duration [us] 5461
17:06/10/22:22 duration [us] 6338
17:06/10/22:22 duration [us] 5003
17:06/10/22:22 duration [us] 4470
17:06/10/22:22 duration [us] 8890
17:06/10/22:22 duration [us] 5462
17:06/10/22:22 duration [us] 5471
17:06/10/22:22 duration [us] 5761
17:06/10/22:22 duration [us] 5490
17:06/10/22:22 duration [us] 4408
17:06/10/22:22 duration [us] 4397
17:06/10/22:22 duration [us] 5459
17:06/10/22:22 duration [us] 7917
17:06/10/22:22 duration [us] 5412
17:06/10/22:22 duration [us] 4372
17:06/10/22:22 duration [us] 5435
17:06/10/22:22 duration [us] 4362
...
We can see a burst of 24 “oradebug request completion” events. The first occurrence was extreme – it lasted 5 seconds, which is an eternity for a database writer. This is a serious issue.
Off-CPU analysis
The observed database writer process spent almost no time on CPU – the cumulative CPU time for the process hasn’t changed:
ps -p 1857586
PID TTY TIME CMD
1857586 ? 00:01:31 ora_dbw0_dbc
Generally, high-frequency stack sampling at regular intervals is a useful way to gain insight into proprietary software. However, for a non-CPU-bound process like this one, stack sampling would skew the picture: it favors the code that runs on CPU, which is not representative of where a mostly blocked process spends its time. Therefore, we must study off-CPU behavior, as Brendan Gregg describes here. His bcc program offcputime captures the stack at each context switch and aggregates the time each call stack spent blocked.
sudo /usr/share/bcc/tools/offcputime -p 1857586
The following stack caused most off-CPU time:
finish_task_switch
schedule
schedule_timeout
do_semtimedop
ksys_semtimedop
__x64_sys_semtimedop
do_syscall_64
entry_SYSCALL_64_after_hwframe
semtimedop
skgpwwait
ksliwat
kslwaitctx
ksarcv
ksbabs
ksbrdp_int
ksbdispatch
opirip
opidrv
sou2o
opimai_real
ssthrdmain
main
__libc_start_main
[unknown]
- ora_dbw0_dbc (1857586)
39157040
However, in the stack above, the database writer was only idly waiting and periodically waking up to check whether there’s something to do. We know this because the functions semtimedop and ksliwat are on the stack – see Frits Hoogland’s presentation Profiling the logwriter and database writer for more explanation. These stacks were collected before and after the actual problem occurred.
The following stack looks unusual and was causing high blocking time:
finish_task_switch
schedule
schedule_hrtimeout_range_clock
schedule_hrtimeout_range
poll_schedule_timeout.constprop.16
do_select
core_sys_select
kern_select
__x64_sys_select
do_syscall_64
entry_SYSCALL_64_after_hwframe
__libc_select
ksdx_cmdreq_wait
ksdx_cmdreq_wait_for_pending
ksdxdocmdmultex
ksdxdocmdmult
ksudmp_proc_short_stack
kcbbckptmon
ksb_act_run_int
ksb_act_run
ksbcti
ksbabs
ksbrdp_int
ksbdispatch
opirip
opidrv
sou2o
opimai_real
ssthrdmain
main
__libc_start_main
[unknown]
- ora_dbw0_dbc (1857586)
4885099
The program entered the select syscall and then slept for a while (the function schedule_hrtimeout_range is on the stack).
select syscall
The select syscall checks multiple file descriptors to see whether there is something to read or write. Without this system call, the application would need to monitor every file descriptor in a separate thread.
The arguments are as follows:
int select(int nfds, fd_set *restrict readfds,
fd_set *restrict writefds, fd_set *restrict exceptfds,
struct timeval *restrict timeout);
The system call receives three sets of file descriptors in the fd_set structures. nfds, the first argument, is the highest-numbered file descriptor in any of the three sets, plus 1. Finally, the timeout parameter defines how long to block the thread if none of the file descriptors becomes ready. From the Linux kernel function schedule_hrtimeout_range, we can infer that a non-zero timeout was passed to the syscall, and that the program spent a substantial amount of time doing nothing.
The following bpftrace script quantifies how much of the total wait event time was spent sleeping in the select syscall:
sudo bpftrace -e 'uprobe:/u00/oracle/orabase/product/21.5.0.0.220131_a/bin/oracle:kskthbwt+2
/ pid == 1857586 && arg1 == 588 /
{
@start[tid] = nsecs ;
}
tracepoint:syscalls:sys_enter_select
/ @start[tid] /
{
@start_select[tid] = nsecs ;
}
tracepoint:syscalls:sys_exit_select
/ @start_select[tid] /
{
@duration_select_ms = sum( ( nsecs - @start_select[tid] ) / 1000 / 1000 );
@count_select = count();
delete(@start_select[tid]);
}
uprobe:/u00/oracle/orabase/product/21.5.0.0.220131_a/bin/oracle:kskthewt+2
/ @start[tid] /
{
$duration_ms = ( nsecs - @start[tid] ) / 1000 / 1000 ;
printf("Duration wait event [ms] %d\n", $duration_ms);
delete(@start[tid]);
delete(@start_select[tid]);
exit()
}'
Duration wait event [ms] 5000
@count_select: 4612
@duration_select_ms: 4599
Out of 5000 ms spent in the oradebug wait event, 4599 ms were spent waiting in the select syscall.
Tracing select
Next, I inspected the individual select syscalls with bpftrace. bpftrace can access the following arguments of the select syscall:
sudo bpftrace -lv 'tracepoint:syscalls:sys_enter_select'
tracepoint:syscalls:sys_enter_select
int __syscall_nr
int n
fd_set * inp
fd_set * outp
fd_set * exp
struct timeval * tvp
In particular, I wanted to find the duration of each timeout, and how many file descriptors were being handled:
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_select
/ pid == 1857586 /
{
printf("Number of file descriptors: %d Sleep: %d sec %d usec\n", args->n, args->tvp->tv_sec, args->tvp->tv_usec);
}'
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
...
Surprisingly, no file descriptors were passed at all. This means that the select system call was used only to implement a sleep. The sleep interval was 1 ms.
Profiling syscalls
The strace flag -c can be used for profiling system calls:
sudo strace -f -c -p 1857676
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
33.06 0.116057 28 4049 openat
23.29 0.081759 20 4047 read
21.53 0.075589 19 3849 6 select
18.51 0.064976 16 4049 close
1.92 0.006740 33 201 write
0.36 0.001268 24 51 lseek
0.36 0.001262 18 69 rt_sigprocmask
0.28 0.000995 41 24 24 semtimedop
0.23 0.000806 14 54 getrusage
0.15 0.000540 8 63 sched_yield
0.13 0.000452 19 23 tgkill
0.07 0.000256 11 23 22 rt_sigreturn
0.04 0.000151 30 5 lstat
0.03 0.000090 22 4 stat
0.02 0.000054 27 2 chown
0.01 0.000039 19 2 fcntl
0.01 0.000021 21 1 geteuid
------ ----------- ----------- --------- --------- ----------------
100.00 0.351055 21 16516 52 total
We can see around 4000 executions each of the select/openat/read/close system calls. By the way, don’t be confused by the low times in the seconds column – it’s only the CPU time consumed by the syscall, not its duration. The profile looks like a loop that opens a file, reads it, closes it, and then waits 1 ms before the next iteration.
This pattern is visible in a more granular strace, where every syscall is recorded:
sudo strace -o strace3.log -f -p 1857676
...
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
...
The traced database writer was basically sampling the /proc/PID/stat file of the checkpoint process to generate some execution statistics:
ps -p 1857715
PID TTY TIME CMD
1857715 ? 00:05:25 ora_ckpt_dbc
Finally, I aggregated the strace output:
grep openat strace3.log | awk -F '"' '{ print $2 }' | sort | uniq -c | sort -n -k 1
...
12 /proc/1857683/stat
14 /proc/1857597/stat
14 /proc/1857641/stat
14 /proc/1857689/stat
14 /proc/1857705/stat
15 /proc/1857623/stat
15 /proc/1857633/stat
15 /proc/1857655/stat
15 /proc/1857666/stat
15 /proc/1857685/stat
15 /proc/1857696/stat
16 /proc/1857586/stat
16 /proc/1857588/stat
16 /proc/1857600/stat
16 /proc/1857608/stat
16 /proc/1857619/stat
16 /proc/1857657/stat
16 /proc/1857663/stat
16 /proc/1857668/stat
16 /proc/1857674/stat
16 /proc/1857692/stat
16 /proc/1857699/stat
16 /proc/1857701/stat
3634 /proc/1857715/stat
Most /proc/PID/stat samples (3634) targeted the ckpt process (PID 1857715). This number is excessive given the 1 ms pause between samples: 3634 iterations with a roughly 1 ms sleep each already account for several seconds. The database writer was being blocked for seconds just by waiting between samples.
The rest of the reads targeted the other database writers, for example:
ps -p 1857701
PID TTY TIME CMD
1857701 ? 00:01:27 ora_dbwm_dbc
ps -p 1857699
PID TTY TIME CMD
1857699 ? 00:01:29 ora_dbwl_dbc
Summary
A performance problem with log switches can trigger database writers to generate debug information. This includes taking thousands of samples of /proc/PID/stat for the checkpoint process. Since there’s a 1 ms sleep between samples, it can take seconds to collect all the debugging information. This non-DBWR activity distracts the database writers from their main work – writing dirty blocks to files. The consequences can be fatal – we observed the database grind to a halt on several occasions.
The collection of diagnostic information is instrumented by the wait event “oradebug request completion”. If you see the database writers spending too much time in it, you can try increasing the undocumented parameter _log_switch_tracing_timeout to a much higher value – see my previous blog post.