{"id":4213,"date":"2022-06-09T17:06:50","date_gmt":"2022-06-09T17:06:50","guid":{"rendered":"https:\/\/nenadnoveljic.com\/blog\/?p=4213"},"modified":"2022-06-13T16:25:34","modified_gmt":"2022-06-13T16:25:34","slug":"perilous-new-undocumented-tracing-feature-in-oracle-21c","status":"publish","type":"post","link":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/","title":{"rendered":"Perilous New Undocumented Tracing Feature in Oracle 21c"},"content":{"rendered":"<h1>Database hanging<\/h1>\n<p>One of our 21c databases hung. Everything stopped working, including logging in.<\/p>\n<p>I took a system dump:<\/p>\n<pre><code>sqlplus -prelim \"\/ as sysdba\"\noradebug setmypid\noradebug dump systemstate 10<\/code><\/pre>\n<p>The system dump showed lots of processes waiting on &#8216;<span style=\"color:red\">log file switch (checkpoint incomplete)<\/span>&#8216;:<\/p>\n<pre><code>157: USER ospid 1385877 sid 786 ser 49681, waiting for '<span style=\"color:red\">log file switch (checkpoint incomplete)<\/span>'\n  Cmd: INSERT\n  Blocked by inst: 1, sid: 266, ser: 13238\n  <span style=\"color:blue\">Final Blocker inst: 1, sid: 266, ser: 13238<\/span>\n159: USER ospid 1295896 sid 796 ser 48963, waiting for '<span style=\"color:red\">log file switch (checkpoint incomplete)<\/span>'\n  Cmd: INSERT\n  Blocked by inst: 1, sid: 266, ser: 13238\n  <span style=\"color:blue\">Final Blocker inst: 1, sid: 266, ser: 13238<\/span><\/code><\/pre>\n<p>The sessions were waiting on a redo log switch to complete, so the redo log writer was tagged as a final blocker:<\/p>\n<pre><code>53: <span style=\"color:blue\">LGWR ospid 2299924 sid 266 ser 13238, waiting for 'rdbms ipc message'<\/span><\/code><\/pre>\n<p>But the information about the final blocker is misleading. 
In the case of the wait event <a href=\"https:\/\/docs.oracle.com\/en\/database\/oracle\/oracle-database\/21\/refrn\/descriptions-of-wait-events.html#GUID-A19C59FE-4338-483C-B707-FD2A279DD02A\">log file switch (checkpoint incomplete)<\/a>, log writers aren&#8217;t really the source of the problem. Instead, they&#8217;re the victims of a checkpoint that hasn&#8217;t completed yet. They&#8217;re idly waiting until dirty blocks are written to the database files so that they can proceed with overwriting the redo logs. Consequently, the analysis must focus on the checkpoint, particularly on the database writers, which do the heavy lifting. But the database writers were mostly idle (waiting on a message) when the system dump was taken.<\/p>\n<pre><code>21: DBW0 ospid 2299814 sid 106 ser 44104,\n22: DBW1 ospid 2299818 sid 111 ser 61718, waiting for '<span style=\"color:green\">rdbms ipc message<\/span>'\n23: DBW2 ospid 2299821 sid 116 ser 19632, waiting for 'rdbms ipc message'\n24: DBW3 ospid 2299825 sid 121 ser 44107, waiting for 'rdbms ipc message'\n27: DBW4 ospid 2299831 sid 136 ser 36401, waiting for 'rdbms ipc message'\n28: DBW5 ospid 2299837 sid 141 ser 61719, waiting for 'rdbms ipc message'\n29: DBW6 ospid 2299840 sid 146 ser 11666, waiting for 'rdbms ipc message'\n34: DBW7 ospid 2299852 sid 171 ser 31263, waiting for 'rdbms ipc message'\n35: DBW8 ospid 2299856 sid 176 ser 18120, waiting for 'rdbms ipc message'\n36: DBW9 ospid 2299859 sid 181 ser 11655, waiting for 'rdbms ipc message'\n37: DBWA ospid 2299864 sid 186 ser 61784, waiting for 'rdbms ipc message'\n40: DBWB ospid 2299869 sid 201 ser 64146, waiting for 'rdbms ipc message'\n41: DBWC ospid 2299873 sid 206 ser 8287, waiting for 'rdbms ipc message'\n42: DBWD ospid 2299875 sid 211 ser 34990, waiting for 'rdbms ipc message'\n43: DBWE ospid 2299882 sid 216 ser 9750, waiting for 'rdbms ipc message'\n44: DBWF ospid 2299887 sid 221 ser 1741, waiting for 'rdbms ipc message'\n45: DBWG ospid 2299892 sid 226 ser 57549, waiting for 'rdbms ipc message'\n46: DBWH ospid 2299897 sid 231 ser 48083, waiting for 'rdbms ipc message'\n47: DBWI ospid 2299899 sid 236 ser 27038, waiting for 'rdbms ipc message'\n48: DBWJ ospid 2299901 sid 241 ser 40517, waiting for 'rdbms ipc message'\n49: DBWK ospid 2299905 sid 246 ser 60322, waiting for 'rdbms ipc message'\n50: DBWL ospid 2299907 sid 251 ser 20829, waiting for 'rdbms ipc message'\n51: DBWM ospid 2299915 sid 256 ser 35420, waiting for 'rdbms ipc message'\n52: DBWN ospid 2299917 sid 261 ser 2933, waiting for 'rdbms ipc message'<\/code><\/pre>\n<p>Also, the CKPT, another process participating in the checkpoint, was idle.<\/p>\n<pre><code>54: CKPT ospid 2299934 sid 271 ser 65272, waiting for '<span style=\"color:green\">rdbms ipc message<\/span>'<\/code><\/pre>\n<h1>Database writers<\/h1>\n<p>But, without an obvious reason, the database writers were continuously dumping call stacks into the trace files, for example:<\/p>\n<pre><code>  Short stack dump:\nksedsts()+418&lt;-ksdxcb()+2187&lt;-sspuser()+213&lt;-__sighandler()&lt;-semtimedop()+14&lt;-sskgpwwait()+245&lt;-skgpwwait()+208&lt;-ksliwat()+2238&lt;-kslwaitctx()+218&lt;-ksarcv()+307&lt;-ksbabs()+602&lt;-ksbrdp_int()+1133&lt;-ksbdispatch()+4372&lt;-opirip()+522&lt;-opidrv()+589&lt;-sou2o()+165&lt;-opimai_real()+176&lt;-ssthrdmain()+412&lt;-main()+292&lt;-__libc_start_main()+243&lt;-_start()+46\n<\/code><\/pre>\n<p>The interpretation of the call stack above is as follows:<\/p>\n<ul>\n<li>ksedsts() dumps the short call stack.<\/li>\n<li>ksdxcb(), sspuser(), __sighandler() are signal handler functions.<\/li>\n<li>semtimedop(), sskgpwwait(), skgpwwait(), ksliwat(), kslwaitctx(), ksarcv() handle Oracle interprocess communication. The database writers were waiting on a message. 
That wait time is instrumented by the &#8216;rdbms ipc message&#8217; wait event, which we saw in the system dump.<\/li>\n<li>ksbabs(), ksbrdp_int(), ksbdispatch() implement generic background process functionality.<\/li>\n<\/ul>\n<p>For the exact annotations of these functions, look up <a href=\"http:\/\/orafun.info\/\">Frits Hoogland&#8217;s orafun.info<\/a>.<\/p>\n<p>Simply put, the database writers, idly waiting on a message, were interrupted by a signal. The signal handler was spooling short call stacks into the trace file.<\/p>\n<p>Everything we had seen so far suggested that the database writers didn&#8217;t have much work to do. The ps aux output confirmed very low total CPU consumption, too. The only activity we noticed was spooling debug information into the trace file. But the alert log indicated that the database writers were spending a huge amount of time on some &#8220;non-DBWR tasks&#8221;:<\/p>\n<pre><code>DBW 17 executing after 8 secs due to non-DBWR task at ts=1654613015\n2022-06-07T16:43:35.872532+02:00\nDBW 20 executing after 8 secs due to non-DBWR task at ts=1654613015\n2022-06-07T16:43:35.999456+02:00\nDBW 6 executing after 8 secs due to non-DBWR task at ts=1654613015\n2022-06-07T16:43:36.064177+02:00\nDBW 3 executing after 8 secs due to non-DBWR task at ts=1654613015\n2022-06-07T16:43:36.127452+02:00\nDBW 8 executing after 8 secs due to non-DBWR task at ts=1654613015\n2022-06-07T16:43:36.128400+02:00\nDBW 21 executing after 8 secs due to non-DBWR task at ts=1654613015\n2022-06-07T16:43:36.191568+02:00\nDBW 0 executing after 8 secs due to non-DBWR task at ts=1654613015\n<\/code><\/pre>\n<p>In other words, the database writers were stalled by an activity unrelated to writing dirty blocks to the database files. Since the processes were off CPU, I continued the analysis within the database. 
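As an aside, the frequency of these stalls can be gauged directly from the alert log. A minimal sketch, assuming a POSIX shell; the helper name is mine, and the message text matches the alert-log excerpt above:

```shell
# Count the "non-DBWR task" stall messages in an alert log.
# count_dbwr_stalls is a made-up helper name; pass the path to the
# alert log (under your diagnostic_dest) as the first argument.
count_dbwr_stalls() {
  grep -c "due to non-DBWR task" "$1"
}
```

A sudden jump in that count is a cheap first indicator that the database writers are being kept busy with something other than writing dirty blocks.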
After restart, I was able to log in again and query the active session history:<\/p>\n<pre><code>@dashtop program2,event2 1=1 \"timestamp '2022-06-08 05:00:00'\" \"timestamp '2022-06-08 08:15:00'\"\n\n    Total\n  Seconds     AAS %This   PROGRAM2                                 EVENT2                                     FIRST_SEEN          LAST_SEEN\n--------- ------- ------- ---------------------------------------- ------------------------------------------ ------------------- -------------------\n    <span style=\"color:red\">12950<\/span>     1.1   43%   <span style=\"color:red\">(DBWnnn)<\/span>                                 <span style=\"color:red\">oradebug request completion<\/span>                2022-06-08 05:06:16 2022-06-08 08:09:24\n     6100      .5   20%   (DIAn)                                   ON CPU                                     2022-06-08 05:00:07 2022-06-08 06:47:59\n     4440      .4   15%   (DBWnnn)                                 ON CPU                                     2022-06-08 05:01:39 2022-06-08 08:14:51\n     2580      .2    9%   (pmdtm)                                  enq: RO - fast object reuse [mode=6]       2022-06-08 07:31:00 2022-06-08 08:14:51\n      690      .1    2%   (Wnnn)                                   latch free                                 2022-06-08 05:02:20 2022-06-08 08:07:11\n      650      .1    2%   (GENn)                                   ON CPU                                     2022-06-08 05:02:20 2022-06-08 08:09:44\n      340      .0    1%   (DBWnnn)                                 db file async I\/O submit                   2022-06-08 05:02:00 2022-06-08 06:41:29<\/code><\/pre>\n<p>The Oracle database writer processes were spending a significant amount of time in the code instrumented by the &#8220;oradebug request completion&#8221; wait event. 
This wait event isn&#8217;t documented, but since we saw the database writers intensively writing debug information into the trace files, we can assume that this activity is instrumented with that wait event.<\/p>\n<h1>Signals<\/h1>\n<p>We previously established that the stack is printed upon receiving a signal. In particular, it was the SIGUSR2 signal.<\/p>\n<p>The strace output reveals the senders (in the si_pid field):<\/p>\n<pre><code>grep \"\\-\\-\\- SIG\" strace.log\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001359, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001423, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001353, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001390, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001355, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001393, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001361, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001446, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001429, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001411, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001409, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001448, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001382, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001401, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001440, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001369, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001432, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001384, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001419, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001414, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_QUEUE, si_pid=3001438, si_uid=1000, si_value={int=233284845, ptr=0x10de7a4ed}} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001371, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001403, si_uid=1000} ---\n--- SIGUSR2 {si_signo=SIGUSR2, si_code=SI_TKILL, si_pid=3001421, si_uid=1000} ---<\/code><\/pre>\n<p>All signals above came at once. The senders were other database writer processes, for example:<\/p>\n<pre><code>ps -ef | grep 3001438\noracle   3001438       1  0 14:40 ?        00:00:01 ora_dbwk_DB<\/code><\/pre>\n<p>Simply put, the database writer which spooled debug information into the trace file received signals from all other database writers. But that&#8217;s not the whole picture: after spooling the debug information, the database writer sent SIGUSR2 to all other database writers:<\/p>\n<pre><code>grep kill strace.log\ntgkill(3001452, 3001452, SIGUSR2)       = 0\ntgkill(3001353, 3001353, SIGUSR2)       = 0\ntgkill(3001355, 3001355, SIGUSR2)       = 0\ntgkill(3001359, 3001359, SIGUSR2)       = 0\ntgkill(3001361, 3001361, SIGUSR2)       = 0\ntgkill(3001369, 3001369, SIGUSR2)       = 0\ntgkill(3001371, 3001371, SIGUSR2)       = 0\ntgkill(3001382, 3001382, SIGUSR2)       = 0\ntgkill(3001384, 3001384, SIGUSR2)       = 0\ntgkill(3001390, 3001390, SIGUSR2)       = 0\ntgkill(3001393, 3001393, SIGUSR2)       = 0\ntgkill(3001401, 3001401, SIGUSR2)       = 0\ntgkill(3001403, 3001403, SIGUSR2)       = 0\ntgkill(3001409, 3001409, SIGUSR2)       = 0\ntgkill(3001411, 3001411, SIGUSR2)       = 0\ntgkill(3001414, 3001414, SIGUSR2)       = 0\ntgkill(3001419, 3001419, SIGUSR2)       = 0\ntgkill(3001421, 3001421, SIGUSR2)       = 0\ntgkill(3001423, 3001423, SIGUSR2)       = 0\ntgkill(3001429, 3001429, SIGUSR2)       = 0\ntgkill(3001432, 3001432, SIGUSR2)       = 0\ntgkill(3001440, 3001440, SIGUSR2)       = 0\ntgkill(3001446, 3001446, SIGUSR2)       = 0\ntgkill(3001448, 3001448, SIGUSR2)       = 0<\/code><\/pre>\n<p>As a consequence, we had an avalanche of debug information simultaneously written into the trace files. That explains the extreme &#8220;oradebug request completion&#8221; cumulative time.<\/p>\n<p>The more database writer processes, the worse the problem. Since the number of database writer processes is proportional to the number of CPUs, the problem is much worse on larger servers!<\/p>\n<p style=\"font-size:11px\">A subliminal message to the people who like to throw hardware at software problems: sometimes the problem becomes worse.<\/p>\n<p>By the way, you can ask a database writer to spool debug information by sending the SIGUSR2 signal from your shell:<\/p>\n<pre><code>kill -SIGUSR2 3001429<\/code><\/pre>\n<p>I captured the call stack that sends the SIGUSR2 signal to other processes, in case I&#8217;d later need to investigate the program flow that leads to sending the signal:<\/p>\n<pre><code>sudo \/usr\/bin\/bpftrace -e 'tracepoint:syscalls:sys_enter_tgkill \/ pid == 3001448 \/ { printf(\"%s\\n\", ustack() )}'\nAttaching 1 probe...\n\n        syscall+29\n        skgpwcall+136\n        ksdx_cmdreq_reissue_attempt+4102\n        ksdx_cmdreq_check_pending+647\n        ksdx_cmdreq_wait+4776\n        ksdx_cmdreq_wait_for_pending+2228\n        ksdxdocmdmultex+5375\n        ksdxdocmdmult+55\n        ksudmp_proc_short_stack+373\n        kcbbckptmon+710\n        ksb_act_run_int+117\n        ksb_act_run+134\n        ksbcti+1299\n        ksbabs+1850\n        ksbrdp_int+1133\n        ksbdispatch+4330\n        opirip+522\n        opidrv+589\n        sou2o+165\n        opimai_real+176\n        ssthrdmain+412\n        main+292\n        __libc_start_main+243<\/code><\/pre>\n<p>kcbbckptmon would be the function to focus on, as it&#8217;s called regularly (also on databases not affected by this problem). 
But ksudmp_proc_short_stack is called only occasionally:<\/p>\n<pre><code>sudo \/usr\/bin\/bpftrace -e 'u:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2,\nu:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:ksudmp_proc_short_stack+2\n\/pid == 3001355 \/\n{\n  time(\"%H:%D:%S \");\n  printf(\"%s \\n\", probe);\n}'\n\n18:06\/09\/22:39 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:43 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:46 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:49 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:52 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:55 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:58 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:01 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:04 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:kcbbckptmon+2\n18:06\/09\/22:04 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:ksudmp_proc_short_stack+2\n18:06\/09\/22:04 uprobe:\/u00\/oracle\/orabase\/product\/21.6.0.0.220419_b\/bin\/oracle:ksudmp_proc_short_stack+2\n<\/code><\/pre>\n<h1>Workaround<\/h1>\n<p>The question is: why did the database writers start spooling this information? On other databases, they hadn&#8217;t written any information at all. 
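Before digging into a single trace file, it can help to see which trace files contain stack dumps at all. A minimal sketch, assuming a POSIX shell; the helper name is mine, and the marker string is the one from the dumps shown above:

```shell
# List the trace files in a diag trace directory that contain
# short stack dumps. find_stack_dump_traces is a made-up helper name;
# pass the trace directory (under your diagnostic_dest) as the argument.
find_stack_dump_traces() {
  grep -l "Short stack dump" "$1"/*.trc 2>/dev/null
}
```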
So I inspected the start of the trace file in the hope of identifying the cause:<\/p>\n<pre><code><span style=\"color:red\">Starting tracing log seq 10932<\/span>\n  Short stack dump:\nksedsts()+418&lt;-ksdxcb()+2187&lt;-sspuser()+213&lt;-__sighandler()&lt;-semtimedop()+14&lt;-sskgpwwait()+245&lt;-skgpwwait()+208&lt;-ksliwat()+2238&lt;-kslwaitctx()+218&lt;-ksarcv()+307&lt;-ksbabs()+602&lt;-ksbrdp_int()+1133&lt;-ksbdispatch()+4372&lt;-opirip()+522&lt;-opidrv()+589&lt;-sou2o()+165&lt;-opimai_real()+176&lt;-ssthrdmain()+412&lt;-main()+292&lt;-__libc_start_main()+243&lt;-_start()+46<\/code><\/pre>\n<p>It seems that the tracing was triggered by a log switch.<\/p>\n<p>Next, I looked for an undocumented parameter that could control this behaviour:<\/p>\n<pre><code>select ksppinm name, ksppstvl value, ksppdesc description\n  from sys.x$ksppi n, sys.x$ksppcv v\n  where n.indx = v.indx and\n    n.ksppinm like '%trac%'\n  order by name ;\n\nNAME\n--------------------------------------------------------------------------------\nVALUE\n--------------------------------------------------------------------------------\nDESCRIPTION\n--------------------------------------------------------------------------------\n_log_switch_tracing_timeout\n200\n<span style=\"color:red\">dbwr to trace after logswitch timeout in secs<\/span><\/code><\/pre>\n<p>According to the description in the data dictionary, the _log_switch_tracing_timeout parameter defines the threshold to trigger DBWR tracing when a log switch lasts too long. If a stalled database writer delayed a log switch, additional diagnostics on the database writers would be extremely useful. 
The undocumented parameter came with 21c.<\/p>\n<p>After increasing the parameter to an extreme value, the database writers immediately stopped writing the information into the trace files:<\/p>\n<pre><code>alter system set \"_log_switch_tracing_timeout\"=10000 scope=both ;<\/code><\/pre>\n<p>I flipped it back and forth a couple of times to confirm the causality.<\/p>\n<h1>Instrumentation error?<\/h1>\n<p>To find out what caused the log switch durations longer than 200s, I inspected the AWR report from the period when the database writers started spooling debug information. I was surprised that there was hardly any activity at all.<\/p>\n<p>In an hour, only two minutes were spent inside the database:<\/p>\n<pre><code>              Snap Id      Snap Time      Sessions Curs\/Sess  PDBs\n            --------- ------------------- -------- --------- -----\n   Elapsed:               60.25 (mins)\n   DB Time:                1.96 (mins)<\/code><\/pre>\n<p>There were two log file switches. In total, they took just 45ms:<\/p>\n<pre><code>                                           Total Wait       Avg   % DB Wait\nEvent                                Waits Time (sec)      Wait   time Class\n------------------------------ ----------- ---------- --------- ------ --------\nlog file switch completion               2         .1   44.94ms     .1 Configur<\/code><\/pre>\n<p>Obviously, the tracing can&#8217;t be explained by an exceeded 200s threshold. I speculated that some measurement error might have triggered the tracing. For example, the alert logs of all of our 21c databases contain the following messages:<\/p>\n<pre><code>CTWR process with os pid 624287 took 1086626707 secs to acquire the CT global enqueue. 
<\/code><\/pre>\n<p>It&#8217;s hard to believe that the CTWR process was waiting for 34 years to acquire a lock \ud83d\ude42  This is rather an instrumentation problem, which is described in the following MOS note: <i>After Installing 21.5.0.0.220118 the Alert log Shows &#8216;CT enqueue held&#8217; Messages with Absurdly Large Values (Doc ID 2848325.1)<\/i>. The CTWR instrumentation bug is fixed with the following one-off patch: <i>BUG 33792885 &#8211; CTWR PROCESS TOOK 1075095983 SECS TO ACQUIRE THE CT GLOBAL ENQUEUE<\/i>.<\/p>\n<p>I speculated that the above instrumentation bug might also be causing the database writer tracing problem, but the patch didn&#8217;t help.<\/p>\n<h1>Monitoring<\/h1>\n<p>The wait event &#8220;oradebug request completion&#8221; is undocumented and not widely known, but it should be monitored. It&#8217;s unlikely that standard monitoring solutions already cover it, unless they measure the total duration of all the wait events classified as &#8220;Other&#8221;:<\/p>\n<pre><code>select name, wait_class from v$event_name\n  2    where name = 'oradebug request completion' ;\n\nNAME\n----------------------------------------------------------------\nWAIT_CLASS\n----------------------------------------------------------------\noradebug request completion\nOther<\/code><\/pre>\n<p>We prefer measuring individual wait events, so we have to extend our monitoring.<\/p>\n<h1>Summary<\/h1>\n<p>In version 21c, Oracle introduced an undocumented tracing feature: in the case of long-running redo log file switches, the database writers spool call stacks into the trace file. In previous releases, you had to set up OS stack sampling if you suspected that database writer activity caused longer log file switches.<\/p>\n<p>Unfortunately, there seems to be a problem with the triggering threshold &#8211; tracing can be activated even when there&#8217;s no problem with log file switches. 
The database writers start signaling each other to generate even more debugging information. In the end, the database writers are distracted by dumping call stacks. As a consequence, the database can grind to a halt. The problem is more pronounced on larger servers because the number of database writers is proportional to the number of CPUs.<\/p>\n<p>The debugging process is instrumented with the undocumented wait event &#8216;oradebug request completion&#8217;. Also, the database writer leaves an entry in the alert log when it&#8217;s too busy with non-DBWR work.<\/p>\n<p>The undocumented parameter _log_switch_tracing_timeout defines the threshold for triggering tracing. You can increase it to an extreme value to prevent unwarranted tracing.<\/p>\n<p>This seems to be an unknown issue at the time of writing this blog post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Database writer performance impaired by sudden debug trace. <a href=\"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/\" class=\"more-link\">Continue Reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[46,53,5,37],"tags":[],"class_list":["post-4213","post","type-post","status-publish","format-standard","hentry","category-21c","category-ebpf","category-oracle","category-redo"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Perilous New Undocumented Tracing Feature in Oracle 21c - All-round Database Topics<\/title>\n<meta name=\"description\" content=\"Database writer performance impaired by sudden debug trace.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" 
\/>\n<link rel=\"canonical\" href=\"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Perilous New Undocumented Tracing Feature in Oracle 21c - All-round Database Topics\" \/>\n<meta property=\"og:description\" content=\"Database writer performance impaired by sudden debug trace.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/\" \/>\n<meta property=\"og:site_name\" content=\"All-round Database Topics\" \/>\n<meta property=\"article:published_time\" content=\"2022-06-09T17:06:50+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-06-13T16:25:34+00:00\" \/>\n<meta name=\"author\" content=\"Nenad Noveljic\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@NenadNoveljic\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Nenad Noveljic\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/\"},\"author\":{\"name\":\"Nenad Noveljic\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#\\\/schema\\\/person\\\/51458d9dd86dbbdd19f5add451d44efa\"},\"headline\":\"Perilous New Undocumented Tracing Feature in Oracle 21c\",\"datePublished\":\"2022-06-09T17:06:50+00:00\",\"dateModified\":\"2022-06-13T16:25:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/\"},\"wordCount\":1314,\"commentCount\":1,\"articleSection\":[\"21c\",\"eBPF\",\"Oracle\",\"redo\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/\",\"url\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/\",\"name\":\"Perilous New Undocumented Tracing Feature in Oracle 21c - All-round Database Topics\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#website\"},\"datePublished\":\"2022-06-09T17:06:50+00:00\",\"dateModified\":\"2022-06-13T16:25:34+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#\\\/schema\\\/person\\\/51458d9dd86dbbdd19f5add451d44efa\"},\"description\":\"Database writer performance impaired by sudden debug 
trace.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/perilous-new-undocumented-tracing-feature-in-oracle-21c\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Perilous New Undocumented Tracing Feature in Oracle 21c\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/\",\"name\":\"All-round Database Topics\",\"description\":\"Nenad Noveljic\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/#\\\/schema\\\/person\\\/51458d9dd86dbbdd19f5add451d44efa\",\"name\":\"Nenad Noveljic\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g\",\"caption\":\"Nenad 
Noveljic\"},\"sameAs\":[\"nenad-noveljic-9b746a6\",\"https:\\\/\\\/x.com\\\/NenadNoveljic\"],\"url\":\"https:\\\/\\\/nenadnoveljic.com\\\/blog\\\/author\\\/nenad\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Perilous New Undocumented Tracing Feature in Oracle 21c - All-round Database Topics","description":"Database writer performance impaired by sudden debug trace.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/","og_locale":"en_US","og_type":"article","og_title":"Perilous New Undocumented Tracing Feature in Oracle 21c - All-round Database Topics","og_description":"Database writer performance impaired by sudden debug trace.","og_url":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/","og_site_name":"All-round Database Topics","article_published_time":"2022-06-09T17:06:50+00:00","article_modified_time":"2022-06-13T16:25:34+00:00","author":"Nenad Noveljic","twitter_card":"summary_large_image","twitter_creator":"@NenadNoveljic","twitter_misc":{"Written by":"Nenad Noveljic","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/#article","isPartOf":{"@id":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/"},"author":{"name":"Nenad Noveljic","@id":"https:\/\/nenadnoveljic.com\/blog\/#\/schema\/person\/51458d9dd86dbbdd19f5add451d44efa"},"headline":"Perilous New Undocumented Tracing Feature in Oracle 21c","datePublished":"2022-06-09T17:06:50+00:00","dateModified":"2022-06-13T16:25:34+00:00","mainEntityOfPage":{"@id":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/"},"wordCount":1314,"commentCount":1,"articleSection":["21c","eBPF","Oracle","redo"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/","url":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/","name":"Perilous New Undocumented Tracing Feature in Oracle 21c - All-round Database Topics","isPartOf":{"@id":"https:\/\/nenadnoveljic.com\/blog\/#website"},"datePublished":"2022-06-09T17:06:50+00:00","dateModified":"2022-06-13T16:25:34+00:00","author":{"@id":"https:\/\/nenadnoveljic.com\/blog\/#\/schema\/person\/51458d9dd86dbbdd19f5add451d44efa"},"description":"Database writer performance impaired by sudden debug 
trace.","breadcrumb":{"@id":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/nenadnoveljic.com\/blog\/perilous-new-undocumented-tracing-feature-in-oracle-21c\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/nenadnoveljic.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Perilous New Undocumented Tracing Feature in Oracle 21c"}]},{"@type":"WebSite","@id":"https:\/\/nenadnoveljic.com\/blog\/#website","url":"https:\/\/nenadnoveljic.com\/blog\/","name":"All-round Database Topics","description":"Nenad Noveljic","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/nenadnoveljic.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/nenadnoveljic.com\/blog\/#\/schema\/person\/51458d9dd86dbbdd19f5add451d44efa","name":"Nenad Noveljic","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a97b796613ea48ec8a7b79c8ffe1c685dcffc920c68121f6238d5caab5070670?s=96&d=mm&r=g","caption":"Nenad 
Noveljic"},"sameAs":["nenad-noveljic-9b746a6","https:\/\/x.com\/NenadNoveljic"],"url":"https:\/\/nenadnoveljic.com\/blog\/author\/nenad\/"}]}},"_links":{"self":[{"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/posts\/4213","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/comments?post=4213"}],"version-history":[{"count":1,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/posts\/4213\/revisions"}],"predecessor-version":[{"id":4227,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/posts\/4213\/revisions\/4227"}],"wp:attachment":[{"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/media?parent=4213"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/categories?post=4213"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nenadnoveljic.com\/blog\/wp-json\/wp\/v2\/tags?post=4213"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}