Database hanging
One of our 21c databases hung. Everything stopped working, including logging in.
I took a systemstate dump:
sqlplus -prelim "/ as sysdba"
oradebug setmypid
oradebug dump systemstate 10
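If you need to locate the trace file the dump is written to, oradebug can print its name; a quick check (this should also work in a preliminary connection, although I haven't verified it on every release):

oradebug tracefile_name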
The systemstate dump showed many processes waiting on 'log file switch (checkpoint incomplete)':
157: USER ospid 1385877 sid 786 ser 49681, waiting for 'log file switch (checkpoint incomplete)'
Cmd: INSERT
Blocked by inst: 1, sid: 266, ser: 13238
Final Blocker inst: 1, sid: 266, ser: 13238
159: USER ospid 1295896 sid 796 ser 48963, waiting for 'log file switch (checkpoint incomplete)'
Cmd: INSERT
Blocked by inst: 1, sid: 266, ser: 13238
Final Blocker inst: 1, sid: 266, ser: 13238
The sessions were waiting for a redo log switch to complete, so the log writer (LGWR) was tagged as the final blocker:
53: LGWR ospid 2299924 sid 266 ser 13238, waiting for 'rdbms ipc message'
But the information about the final blocker is misleading. For the wait event log file switch (checkpoint incomplete), the log writer isn't really the source of the problem; it is, rather, a victim of a checkpoint that hasn't completed yet. It waits idly until the dirty blocks are written to the data files, so that it can proceed with overwriting the redo logs. Consequently, the analysis has to focus on the checkpoint, in particular on the database writers, which do the heavy lifting. But the database writers were mostly idle (waiting on a message) when the systemstate dump was taken.
21: DBW0 ospid 2299814 sid 106 ser 44104,
22: DBW1 ospid 2299818 sid 111 ser 61718, waiting for 'rdbms ipc message'
23: DBW2 ospid 2299821 sid 116 ser 19632, waiting for 'rdbms ipc message'
24: DBW3 ospid 2299825 sid 121 ser 44107, waiting for 'rdbms ipc message'
27: DBW4 ospid 2299831 sid 136 ser 36401, waiting for 'rdbms ipc message'
28: DBW5 ospid 2299837 sid 141 ser 61719, waiting for 'rdbms ipc message'
29: DBW6 ospid 2299840 sid 146 ser 11666, waiting for 'rdbms ipc message'
34: DBW7 ospid 2299852 sid 171 ser 31263, waiting for 'rdbms ipc message'
35: DBW8 ospid 2299856 sid 176 ser 18120, waiting for 'rdbms ipc message'
36: DBW9 ospid 2299859 sid 181 ser 11655, waiting for 'rdbms ipc message'
37: DBWA ospid 2299864 sid 186 ser 61784, waiting for 'rdbms ipc message'
40: DBWB ospid 2299869 sid 201 ser 64146, waiting for 'rdbms ipc message'
41: DBWC ospid 2299873 sid 206 ser 8287, waiting for 'rdbms ipc message'
42: DBWD ospid 2299875 sid 211 ser 34990, waiting for 'rdbms ipc message'
43: DBWE ospid 2299882 sid 216 ser 9750, waiting for 'rdbms ipc message'
44: DBWF ospid 2299887 sid 221 ser 1741, waiting for 'rdbms ipc message'
45: DBWG ospid 2299892 sid 226 ser 57549, waiting for 'rdbms ipc message'
46: DBWH ospid 2299897 sid 231 ser 48083, waiting for 'rdbms ipc message'
47: DBWI ospid 2299899 sid 236 ser 27038, waiting for 'rdbms ipc message'
48: DBWJ ospid 2299901 sid 241 ser 40517, waiting for 'rdbms ipc message'
49: DBWK ospid 2299905 sid 246 ser 60322, waiting for 'rdbms ipc message'
50: DBWL ospid 2299907 sid 251 ser 20829, waiting for 'rdbms ipc message'
51: DBWM ospid 2299915 sid 256 ser 35420, waiting for 'rdbms ipc message'
52: DBWN ospid 2299917 sid 261 ser 2933, waiting for 'rdbms ipc message'
Also, CKPT, another process participating in the checkpoint, was idle:
54: CKPT ospid 2299934 sid 271 ser 65272, waiting for 'rdbms ipc message'
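As a side note, when the instance is still responsive, the checkpoint backlog is directly visible in v$log: redo log groups that can't be reused yet remain in the ACTIVE status until the database writers finish writing the corresponding dirty blocks. A minimal check (during this hang it obviously couldn't be run until after the restart):

select group#, sequence#, status, archived
from v$log
order by sequence# ;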
Database writers
Yet, for no obvious reason, the database writers were continuously dumping call stacks into their trace files, for example:
Short stack dump:
ksedsts()+418<-ksdxcb()+2187<-sspuser()+213<-__sighandler()<-semtimedop()+14<-sskgpwwait()+245<-skgpwwait()+208<-ksliwat()+2238<-kslwaitctx()+218<-ksarcv()+307<-ksbabs()+602<-ksbrdp_int()+1133<-ksbdispatch()+4372<-opirip()+522<-opidrv()+589<-sou2o()+165<-opimai_real()+176<-ssthrdmain()+412<-main()+292<-__libc_start_main()+243<-_start()+46
The interpretation of the call stack above is as follows:
- ksedsts() dumps the short call stack.
- ksdxcb(), sspuser(), __sighandler() are signal handling functions.
- semtimedop(), sskgpwwait(), skgpwwait(), ksliwat(), kslwaitctx(), ksarcv() handle Oracle interprocess communication. The database writers were waiting on a message; that wait time is instrumented by the 'rdbms ipc message' wait event we saw in the systemstate dump.
- ksbabs(), ksbrdp_int(), ksbdispatch() implement generic background process functionality.
For the exact annotations of these functions, look up Frits Hoogland's orafun.info.
Simply put, database writers, idly waiting on a message, were interrupted by a signal. The signal handler was spooling short call stacks into the trace file.
Everything we had seen so far suggested that the database writers didn't have much work to do. The ps aux output confirmed very low total CPU consumption, too. The only activity we noticed was the spooling of debug information into the trace files. Yet the alert log indicated that the database writers were spending a huge amount of time on some "non-DBWR tasks":
DBW 17 executing after 8 secs due to non-DBWR task at ts=1654613015
2022-06-07T16:43:35.872532+02:00
DBW 20 executing after 8 secs due to non-DBWR task at ts=1654613015
2022-06-07T16:43:35.999456+02:00
DBW 6 executing after 8 secs due to non-DBWR task at ts=1654613015
2022-06-07T16:43:36.064177+02:00
DBW 3 executing after 8 secs due to non-DBWR task at ts=1654613015
2022-06-07T16:43:36.127452+02:00
DBW 8 executing after 8 secs due to non-DBWR task at ts=1654613015
2022-06-07T16:43:36.128400+02:00
DBW 21 executing after 8 secs due to non-DBWR task at ts=1654613015
2022-06-07T16:43:36.191568+02:00
DBW 0 executing after 8 secs due to non-DBWR task at ts=1654613015
In other words, the database writers were stalled by an activity unrelated to writing dirty blocks to the data files. Since the processes were off CPU, I continued the analysis within the database. After a restart, I was able to log in again and query the active session history:
@dashtop program2,event2 1=1 "timestamp '2022-06-08 05:00:00'" "timestamp '2022-06-08 08:15:00'"
Total
Seconds AAS %This PROGRAM2 EVENT2 FIRST_SEEN LAST_SEEN
--------- ------- ------- ---------------------------------------- ------------------------------------------ ------------------- -------------------
12950 1.1 43% (DBWnnn) oradebug request completion 2022-06-08 05:06:16 2022-06-08 08:09:24
6100 .5 20% (DIAn) ON CPU 2022-06-08 05:00:07 2022-06-08 06:47:59
4440 .4 15% (DBWnnn) ON CPU 2022-06-08 05:01:39 2022-06-08 08:14:51
2580 .2 9% (pmdtm) enq: RO - fast object reuse [mode=6] 2022-06-08 07:31:00 2022-06-08 08:14:51
690 .1 2% (Wnnn) latch free 2022-06-08 05:02:20 2022-06-08 08:07:11
650 .1 2% (GENn) ON CPU 2022-06-08 05:02:20 2022-06-08 08:09:44
340 .0 1% (DBWnnn) db file async I/O submit 2022-06-08 05:02:00 2022-06-08 06:41:29
The Oracle database writer processes were spending a significant amount of time in code instrumented by the "oradebug request completion" wait event. This wait event isn't documented, but since we had seen the database writers intensively writing debugging information into the trace files, we can assume that this debugging activity is what the wait event instruments.
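dashtop is Tanel Poder's ASH reporting script. If you don't have it at hand, a rough equivalent for this particular wait event can be pulled directly from dba_hist_active_sess_history (assuming the Diagnostics Pack license; each AWR-persisted sample represents roughly 10 seconds of session activity):

select program, event, count(*) * 10 as approx_seconds,
       min(sample_time) as first_seen, max(sample_time) as last_seen
from dba_hist_active_sess_history
where event = 'oradebug request completion'
  and sample_time between timestamp '2022-06-08 05:00:00'
                      and timestamp '2022-06-08 08:15:00'
group by program, event
order by approx_seconds desc ;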
Signals
We previously established that the call stack is printed upon receiving a signal, specifically SIGUSR2.
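The strace log referenced below was captured by attaching to one of the database writer processes, along these lines (<dbwr_pid> is a placeholder for the OS process ID, not a value from this incident):

strace -o strace.log -p <dbwr_pid>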
The strace output reveals the senders (in the si_pid field):
grep "\-\-\- SIG" strace.log
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001359, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001423, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001353, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001390, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001355, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001393, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001361, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001446, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001429, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001411, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001409, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001448, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001382, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001401, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001440, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001369, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001432, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001384, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001419, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001414, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_QUEUE, si_pid=3001438, si_uid=1000, si_value={int=233284845, ptr=0x10de7a4ed}} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001371, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001403, si_uid=1000} ---
--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001421, si_uid=1000} ---
All of the signals above arrived at once. The senders were other database writer processes, for example:
ps -ef | grep 3001438
oracle 3001438 1 0 14:40 ? 00:00:01 ora_dbwk_DB
Simply put, the database writer that was spooling debug information into the trace file received signals from all of the other database writers. But that's not the whole picture: after spooling the debug information, it in turn sent SIGUSR2 to all of the other database writers:
grep kill strace.log
tgkill(3001452, 3001452, SIGUSR2) = 0
tgkill(3001353, 3001353, SIGUSR2) = 0
tgkill(3001355, 3001355, SIGUSR2) = 0
tgkill(3001359, 3001359, SIGUSR2) = 0
tgkill(3001361, 3001361, SIGUSR2) = 0
tgkill(3001369, 3001369, SIGUSR2) = 0
tgkill(3001371, 3001371, SIGUSR2) = 0
tgkill(3001382, 3001382, SIGUSR2) = 0
tgkill(3001384, 3001384, SIGUSR2) = 0
tgkill(3001390, 3001390, SIGUSR2) = 0
tgkill(3001393, 3001393, SIGUSR2) = 0
tgkill(3001401, 3001401, SIGUSR2) = 0
tgkill(3001403, 3001403, SIGUSR2) = 0
tgkill(3001409, 3001409, SIGUSR2) = 0
tgkill(3001411, 3001411, SIGUSR2) = 0
tgkill(3001414, 3001414, SIGUSR2) = 0
tgkill(3001419, 3001419, SIGUSR2) = 0
tgkill(3001421, 3001421, SIGUSR2) = 0
tgkill(3001423, 3001423, SIGUSR2) = 0
tgkill(3001429, 3001429, SIGUSR2) = 0
tgkill(3001432, 3001432, SIGUSR2) = 0
tgkill(3001440, 3001440, SIGUSR2) = 0
tgkill(3001446, 3001446, SIGUSR2) = 0
tgkill(3001448, 3001448, SIGUSR2) = 0
As a consequence, an avalanche of debug information was written into the trace files simultaneously. That explains the extreme cumulative time on "oradebug request completion".
The more database writer processes there are, the worse the problem gets. Since the number of database writer processes is proportional to the number of CPUs, the problem is much worse on larger servers!
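You can check how many database writers an instance has spawned, for example either via the parameter or by counting the background processes:

select value from v$parameter where name = 'db_writer_processes' ;
select count(*) from v$process where pname like 'DBW%' ;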
A subliminal message to the people who like to throw hardware at software problems: sometimes the problem becomes worse.
By the way, you can ask a database writer to spool debug information by sending the SIGUSR2 signal from your shell:
kill -SIGUSR2 3001429
I captured the call stack that sends the SIGUSR2 signal to the other processes, in case I'd later need to investigate the program flow that leads to sending the signal:
sudo /usr/bin/bpftrace -e 'tracepoint:syscalls:sys_enter_tgkill / pid == 3001448 / { printf("%s\n", ustack() )}'
Attaching 1 probe...
syscall+29
skgpwcall+136
ksdx_cmdreq_reissue_attempt+4102
ksdx_cmdreq_check_pending+647
ksdx_cmdreq_wait+4776
ksdx_cmdreq_wait_for_pending+2228
ksdxdocmdmultex+5375
ksdxdocmdmult+55
ksudmp_proc_short_stack+373
kcbbckptmon+710
ksb_act_run_int+117
ksb_act_run+134
ksbcti+1299
ksbabs+1850
ksbrdp_int+1133
ksbdispatch+4330
opirip+522
opidrv+589
sou2o+165
opimai_real+176
ssthrdmain+412
main+292
__libc_start_main+243
kcbbckptmon would be the function to focus on, as it's called regularly (also on databases not affected by this problem), whereas ksudmp_proc_short_stack is called only occasionally:
sudo /usr/bin/bpftrace -e 'u:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2,
u:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:ksudmp_proc_short_stack+2
/pid == 3001355 /
{
time("%H:%D:%S ");
printf("%s \n", probe);
}'
18:06/09/22:39 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:43 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:46 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:49 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:52 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:55 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:58 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:01 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:04 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:kcbbckptmon+2
18:06/09/22:04 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:ksudmp_proc_short_stack+2
18:06/09/22:04 uprobe:/u00/oracle/orabase/product/21.6.0.0.220419_b/bin/oracle:ksudmp_proc_short_stack+2
Workaround
The question is: why did the database writers start spooling this information in the first place? On other databases, they hadn't written any information at all. So I inspected the beginning of the trace file in the hope of identifying the cause:
Starting tracing log seq 10932
Short stack dump:
ksedsts()+418<-ksdxcb()+2187<-sspuser()+213<-__sighandler()<-semtimedop()+14<-sskgpwwait()+245<-skgpwwait()+208<-ksliwat()+2238<-kslwaitctx()+218<-ksarcv()+307<-ksbabs()+602<-ksbrdp_int()+1133<-ksbdispatch()+4372<-opirip()+522<-opidrv()+589<-sou2o()+165<-opimai_real()+176<-ssthrdmain()+412<-main()+292<-__libc_start_main()+243<-_start()+46
It seems that the tracing was triggered by a log switch.
Next, I looked for an undocumented parameter that could control this behaviour:
select ksppinm name, ksppstvl value, ksppdesc description
from sys.x$ksppi n, sys.x$ksppcv v
where n.indx = v.indx and
n.ksppinm like '%trac%'
order by name ;
NAME
--------------------------------------------------------------------------------
VALUE
--------------------------------------------------------------------------------
DESCRIPTION
--------------------------------------------------------------------------------
_log_switch_tracing_timeout
200
dbwr to trace after logswitch timeout in secs
According to the description in the data dictionary, the _log_switch_tracing_timeout parameter defines the threshold for triggering DBWR tracing when a log switch takes too long. If a stalled database writer were delaying a log switch, additional diagnostics on the database writers would indeed be extremely useful. This undocumented parameter was introduced in 21c.
After I increased the parameter to an extreme value, the database writers immediately stopped writing the information into the trace files:
alter system set "_log_switch_tracing_timeout"=10000 scope=both ;
I flipped it back and forth a couple of times to confirm the causality.
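Flipping it back simply means setting the previous value again; 200 was the value shown by the x$ parameter query above before the change:

alter system set "_log_switch_tracing_timeout"=200 scope=both ;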
Instrumentation error?
To find out what caused log switch durations longer than 200 seconds, I inspected the AWR report from the period when the database writers had started spooling debug information. I was surprised to see that there was hardly any activity at all.
In an hour, only two minutes were spent inside the database:
Snap Id Snap Time Sessions Curs/Sess PDBs
--------- ------------------- -------- --------- -----
Elapsed: 60.25 (mins)
DB Time: 1.96 (mins)
There were two waits on log file switch completion, averaging just about 45ms each:
Total Wait Avg % DB Wait
Event Waits Time (sec) Wait time Class
------------------------------ ----------- ---------- --------- ------ --------
log file switch completion 2 .1 44.94ms .1 Configur
Obviously, exceeding the 200-second threshold can't explain the tracing. I speculated that some measurement error might have triggered it. For example, the alert logs of all of our 21c databases contain the following messages:
CTWR process with os pid 624287 took 1086626707 secs to acquire the CT global enqueue.
It's hard to believe that the CTWR process waited 34 years to acquire a lock 🙂 This is rather an instrumentation problem, described in the following MOS note: After Installing 21.5.0.0.220118 the Alert log Shows 'CT enqueue held' Messages with Absurdly Large Values (Doc ID 2848325.1). The CTWR instrumentation bug is fixed with the following one-off patch: BUG 33792885 – CTWR PROCESS TOOK 1075095983 SECS TO ACQUIRE THE CT GLOBAL ENQUEUE.
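Just to back up that number, converting the reported seconds into years is simple arithmetic:

select round(1086626707 / 86400 / 365.25, 1) as years from dual ;
-- roughly 34.4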
I speculated that the above instrumentation bug might also be causing the problem with the database writer tracing, but the patch didn't help.
Monitoring
The wait event "oradebug request completion" is undocumented and not widely known, but it should be monitored. It's unlikely that standard monitoring solutions already cover it, unless they measure the total duration of all wait events classified as "Other":
select name, wait_class from v$event_name
2 where name = 'oradebug request completion' ;
NAME
----------------------------------------------------------------
WAIT_CLASS
----------------------------------------------------------------
oradebug request completion
Other
We prefer measuring individual wait events, so we have to extend our monitoring.
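A minimal starting point is to watch the instance-wide statistics for this event and alert on a sudden increase; a sketch against v$system_event:

select event, total_waits,
       round(time_waited_micro / 1000000) as seconds_waited
from v$system_event
where event = 'oradebug request completion' ;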
Summary
In version 21c, Oracle introduced an undocumented tracing feature: in case of long-running redo log file switches, the database writers spool call stacks into their trace files. In previous releases, you had to set up OS stack sampling yourself if you suspected that database writer activity was prolonging log file switches.
Unfortunately, there seems to be a problem with the triggering threshold: tracing can be activated even when there's no problem with log file switches. The database writers then start signaling each other to generate even more debugging information, and in the end they are distracted by dumping call stacks. As a consequence, the database can grind to a halt. The problem is more pronounced on larger servers, because the number of database writers is proportional to the number of CPUs.
The debugging activity is instrumented with the undocumented wait event 'oradebug request completion'. Also, a database writer leaves an entry in the alert log when it's too busy with non-DBWR work.
The undocumented parameter _log_switch_tracing_timeout defines the threshold for triggering tracing. You can increase it to an extreme value to prevent unwarranted tracing.
This seems to be an unknown issue at the time of writing this blog post.