Measuring time
In the previous blog post, I described a hanging issue caused by a debugging event in an Oracle 21c database. In a nutshell, the database writers started compulsively signaling each other to generate debugging information. This non-DBWR work distracted the database writers so much that they completely neglected their core business – writing dirty blocks to files. They were spending an extreme amount of time in the part of the code instrumented by the “oradebug request completion” wait event. In this article, I’ll describe the internals of this undocumented wait event. I conducted the research on a 21.5 database.
The first question is, how long does the execution of a single debugging event last? We can find that out by measuring the time between the Oracle C function calls kskthbwt and kskthewt. These functions mark the beginning and the end of a wait event, respectively. The second argument to kskthbwt is the event code, which can be queried from the data dictionary:
select event#,name from v$event_name where name like 'oradebug request completion' ;
EVENT# NAME
---------- ----------------------------------------------------------------
588 oradebug request completion
Now we can use this event code to filter only the event of interest:
sudo bpftrace -e 'uprobe:/u00/oracle/orabase/product/21.5.0.0.220131_a/bin/oracle:kskthbwt+2
/ pid == 1857586 && arg1 == 588 /
{
@start[tid] = nsecs ;
}
uprobe:/u00/oracle/orabase/product/21.5.0.0.220131_a/bin/oracle:kskthewt+2
/ @start[tid] /
{
$duration_us = ( nsecs - @start[tid] ) / 1000 ;
time("%H:%D:%S ");
printf("duration [us] %d\n", $duration_us);
delete(@start[tid])
}'
17:06/10/22:22 duration [us] 5002550
17:06/10/22:22 duration [us] 14
17:06/10/22:22 duration [us] 5405
17:06/10/22:22 duration [us] 4519
17:06/10/22:22 duration [us] 4501
17:06/10/22:22 duration [us] 4518
17:06/10/22:22 duration [us] 5568
17:06/10/22:22 duration [us] 5461
17:06/10/22:22 duration [us] 6338
17:06/10/22:22 duration [us] 5003
17:06/10/22:22 duration [us] 4470
17:06/10/22:22 duration [us] 8890
17:06/10/22:22 duration [us] 5462
17:06/10/22:22 duration [us] 5471
17:06/10/22:22 duration [us] 5761
17:06/10/22:22 duration [us] 5490
17:06/10/22:22 duration [us] 4408
17:06/10/22:22 duration [us] 4397
17:06/10/22:22 duration [us] 5459
17:06/10/22:22 duration [us] 7917
17:06/10/22:22 duration [us] 5412
17:06/10/22:22 duration [us] 4372
17:06/10/22:22 duration [us] 5435
17:06/10/22:22 duration [us] 4362
...
We can see a burst of 24 “oradebug request completion” events. The first occurrence was extreme – it lasted 5 seconds, which is an eternity for a database writer. This is a serious issue.
Off-CPU analysis
The observed database writer process spent almost no time on CPU – the cumulative CPU time for the process hasn’t changed:
ps -p 1857586
PID TTY TIME CMD
1857586 ? 00:01:31 ora_dbw0_dbc
Generally, high-frequency stack sampling at regular intervals is a useful way to gain insight into proprietary software. However, for a non-CPU-bound process like this one, stack sampling would skew the picture: it favors the code that runs on CPU, which is not representative of where a mostly blocked process spends its time. Therefore, we must study off-CPU behavior, as Brendan Gregg describes here. His bcc program offcputime captures the stack at each context switch and aggregates the time each call stack spent blocked.
sudo /usr/share/bcc/tools/offcputime -p 1857586
The following stack caused most off-CPU time:
finish_task_switch
schedule
schedule_timeout
do_semtimedop
ksys_semtimedop
__x64_sys_semtimedop
do_syscall_64
entry_SYSCALL_64_after_hwframe
semtimedop
skgpwwait
ksliwat
kslwaitctx
ksarcv
ksbabs
ksbrdp_int
ksbdispatch
opirip
opidrv
sou2o
opimai_real
ssthrdmain
main
__libc_start_main
[unknown]
- ora_dbw0_dbc (1857586)
39157040
However, in the stack above, the database writer was only idly waiting and periodically waking up to check whether there’s something to do. We know this because the functions semtimedop and ksliwat are on the stack – see Frits Hoogland’s presentation Profiling the logwriter and database writer for more explanation. These stacks were collected before and after the actual problem occurred.
The following stack looks unusual and was causing high blocking time:
finish_task_switch
schedule
schedule_hrtimeout_range_clock
schedule_hrtimeout_range
poll_schedule_timeout.constprop.16
do_select
core_sys_select
kern_select
__x64_sys_select
do_syscall_64
entry_SYSCALL_64_after_hwframe
__libc_select
ksdx_cmdreq_wait
ksdx_cmdreq_wait_for_pending
ksdxdocmdmultex
ksdxdocmdmult
ksudmp_proc_short_stack
kcbbckptmon
ksb_act_run_int
ksb_act_run
ksbcti
ksbabs
ksbrdp_int
ksbdispatch
opirip
opidrv
sou2o
opimai_real
ssthrdmain
main
__libc_start_main
[unknown]
- ora_dbw0_dbc (1857586)
4885099
The program entered the select syscall and then slept for a while (the function schedule_hrtimeout_range is on the stack).
select syscall
The select syscall checks multiple file descriptors to see whether there is something to read or write. Without this system call, the application would need to monitor every file descriptor in a separate thread.
The arguments are as follows:
int select(int nfds, fd_set *restrict readfds,
fd_set *restrict writefds, fd_set *restrict exceptfds,
struct timeval *restrict timeout);
The system call receives three sets of file descriptors in the fd_set structures. nfds, the first argument, is the highest-numbered file descriptor in any of the three sets, plus 1. Finally, the timeout parameter defines how long to block the thread if none of the file descriptors becomes ready. From the Linux kernel function schedule_hrtimeout_range, we can infer that a non-zero timeout was passed to the syscall, and that the program spent a substantial amount of time doing nothing.
The following bpftrace script quantifies how much of the total wait event time was spent sleeping in the select syscall:
sudo bpftrace -e 'uprobe:/u00/oracle/orabase/product/21.5.0.0.220131_a/bin/oracle:kskthbwt+2
/ pid == 1857586 && arg1 == 588 /
{
@start[tid] = nsecs ;
}
tracepoint:syscalls:sys_enter_select
/ @start[tid] /
{
@start_select[tid] = nsecs ;
}
tracepoint:syscalls:sys_exit_select
/ @start_select[tid] /
{
@duration_select_ms = sum( ( nsecs - @start_select[tid] ) / 1000 / 1000 );
@count_select = count();
delete(@start_select[tid]);
}
uprobe:/u00/oracle/orabase/product/21.5.0.0.220131_a/bin/oracle:kskthewt+2
/ @start[tid] /
{
$duration_ms = ( nsecs - @start[tid] ) / 1000 / 1000 ;
printf("Duration wait event [ms] %d\n", $duration_ms);
delete(@start[tid]);
delete(@start_select[tid]);
exit()
}'
Duration wait event [ms] 5000
@count_select: 4612
@duration_select_ms: 4599
Out of 5000 ms spent in the oradebug wait event, 4599 ms were spent waiting in the select syscall.
Tracing select
Next, I inspected the individual select syscalls with bpftrace. bpftrace can access the following arguments of the select syscall:
sudo bpftrace -lv 'tracepoint:syscalls:sys_enter_select'
tracepoint:syscalls:sys_enter_select
int __syscall_nr
int n
fd_set * inp
fd_set * outp
fd_set * exp
struct timeval * tvp
In particular, I wanted to find the duration of each timeout, and how many file descriptors were being handled:
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_select
/ pid == 1857586 /
{
printf("Number of file descriptors: %d Sleep: %d sec %d usec\n", args->n, args->tvp->tv_sec, args->tvp->tv_usec);
}'
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
Number of file descriptors: 0 Sleep: 0 sec 1000 usec
...
Surprisingly, no file descriptors were passed at all. This means that the select system call was used only to implement a sleep. The sleep interval was 1 ms.
Profiling syscalls
The strace flag -c can be used for profiling system calls:
sudo strace -f -c -p 1857676
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
33.06 0.116057 28 4049 openat
23.29 0.081759 20 4047 read
21.53 0.075589 19 3849 6 select
18.51 0.064976 16 4049 close
1.92 0.006740 33 201 write
0.36 0.001268 24 51 lseek
0.36 0.001262 18 69 rt_sigprocmask
0.28 0.000995 41 24 24 semtimedop
0.23 0.000806 14 54 getrusage
0.15 0.000540 8 63 sched_yield
0.13 0.000452 19 23 tgkill
0.07 0.000256 11 23 22 rt_sigreturn
0.04 0.000151 30 5 lstat
0.03 0.000090 22 4 stat
0.02 0.000054 27 2 chown
0.01 0.000039 19 2 fcntl
0.01 0.000021 21 1 geteuid
------ ----------- ----------- --------- --------- ----------------
100.00 0.351055 21 16516 52 total
We can see around 4000 executions each of the select/openat/read/close system calls. By the way, don’t be confused by the low times in the seconds column – it’s only the CPU time consumed by the syscall, not its duration. The profile looks like a loop that opens a file, reads it, closes it, and then waits 1 ms before the next iteration.
This pattern is visible in a more granular strace, where every syscall is recorded:
sudo strace -o strace3.log -f -p 1857676
...
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
1857676 select(0, 0x7fff18135c90, 0x7fff18135c90, 0x7fff18135c90, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
1857676 openat(AT_FDCWD, "/proc/1857715/stat", O_RDONLY) = 11
1857676 read(11, "1857715 (ora_ckpt_dbc) S 1 18"..., 999) = 332
1857676 close(11) = 0
...
The traced database writer was basically sampling the /proc/PID/stat file of the checkpoint process to generate some execution statistics:
ps -p 1857715
PID TTY TIME CMD
1857715 ? 00:05:25 ora_ckpt_dbc
Finally, I aggregated the strace output:
grep openat strace3.log | awk -F '"' '{ print $2 }' | sort | uniq -c | sort -n -k 1
...
12 /proc/1857683/stat
14 /proc/1857597/stat
14 /proc/1857641/stat
14 /proc/1857689/stat
14 /proc/1857705/stat
15 /proc/1857623/stat
15 /proc/1857633/stat
15 /proc/1857655/stat
15 /proc/1857666/stat
15 /proc/1857685/stat
15 /proc/1857696/stat
16 /proc/1857586/stat
16 /proc/1857588/stat
16 /proc/1857600/stat
16 /proc/1857608/stat
16 /proc/1857619/stat
16 /proc/1857657/stat
16 /proc/1857663/stat
16 /proc/1857668/stat
16 /proc/1857674/stat
16 /proc/1857692/stat
16 /proc/1857699/stat
16 /proc/1857701/stat
3634 /proc/1857715/stat
Most /proc/PID/stat samples (3634) targeted the ckpt process (PID 1857715). This number is excessive given the 1 ms pause between samples: 3634 iterations with a roughly 1 ms sleep each already account for several seconds. The database writer was being blocked for seconds just by waiting between samples.
The rest of the reads targeted the other database writers, for example:
ps -p 1857701
PID TTY TIME CMD
1857701 ? 00:01:27 ora_dbwm_dbc
ps -p 1857699
PID TTY TIME CMD
1857699 ? 00:01:29 ora_dbwl_dbc
Summary
A performance problem with log switches can trigger database writers to generate debug information. This includes taking thousands of samples of /proc/PID/stat for the checkpoint process. Since there’s a 1 ms sleep between samples, it can take seconds to collect all the debugging information. This non-DBWR activity distracts the database writers from their main work – writing dirty blocks to files. The consequences can be fatal – we observed the database grind to a halt on several occasions.
The collection of diagnostic information is instrumented by the wait event “oradebug request completion”. If you see the database writers spending too much time in it, you can try increasing the undocumented parameter _log_switch_tracing_timeout to a much higher value – see my previous blog post.