Active Session History
I was almost fooled by missing data while doing a performance analysis based on active session history (ASH).
The application reported a problem on an Oracle 19c database: A PL/SQL procedure call took an hour and a half instead of 5 minutes to complete.
The active session history looked strange:
@dashtop.sql top_level_sql_id,sql_exec_id "session_id = 997 and session_serial# = 58505" "timestamp '2021-02-26 05:36:37'" "timestamp '2021-02-26 08:03:09'"
Total
Seconds AAS %This TOP_LEVEL_SQL SQL_EXEC_ID FIRST_SEEN LAST_SEEN
--------- ------- ------- ------------- ----------- ------------------- -------------------
290 .0 97% 289q4500mqm43 16777216 2021-02-26 05:36:40 2021-02-26 08:03:06
SQL> @https://github.com/tanelpoder/tpt-oracle/blob/master/awr/awr_sqlid.sql 289q4500mqm43
DBID SQL_ID SQL_TEXT COMMAND_TYPE CON_DBID CON_ID
---------- ------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------ ---------- ----------
771238284 289q4500mqm43 BEGIN package.init_procedure; END; 47 771238284 0
If you counted samples for estimating the database time, you could conclude that the session spent only a fraction of the total time in the database (290s out of 87 minutes). But during the whole time period between 05:36:40 and 08:03:06, the session was executing a single PL/SQL procedure call. Consequently, the time must have been spent in the database, but the ASH samples were missing.
Flashback Data Archive
Fortunately, the server dedicated process left some information in its trace. It interacted with the Flashback Data Archive (FBDA) background process:
*** SESSION ID:(997.58505) 2021-02-26T05:36:58.306653+01:00
...
*** 2021-02-26T05:37:20.673061+01:00
...
ktfaTrace: ktfacmtcbk:12071 =Inline Propagate large:19.22.4011802:changes:903, undoblks:17
...
*** 2021-02-26T08:03:08.617529+01:00
ktfaTrace: ktfa_slave_schedule:12387 =FBDA Slave called
ktfaTrace: ktfa_slave_schedule:12390 =FDA slave task scheduled
ktfaTrace: ktfa_slave_schedule:12392 =ktfa_slave_schedule:ktsj_status=0
I inspected the PL/SQL procedure, and identified the suspects: truncate and table statistics gathering.
fbar timer
I setup a test case with FBDA:
create flashback archive fba1 tablespace tools retention 1 year;
create table t (n integer) flashback archive fba1;
First, I suspended the fbda process:
ps -ef | grep fbda | grep DB1
oracle 18082 17800 0 Dec 28 ? 2:23 ora_fbda_DB1
oradebug setospid 18082
oradebug suspend
Then I ran truncate:
truncate table t ;
Truncate was hanging without leaving any samples in ASH:
@ashtop.sql event2 session_id=815 &1min
no rows selected
So I reproduced a variant of this problem.
But what was the process doing?
It was waiting on fbar timer:
column event format a15
column wait_class format a15
select event,wait_class,status,seconds_in_wait from v$session where sid=815 ;
EVENT WAIT_CLASS STATUS SECONDS_IN_WAIT
--------------- --------------- -------- ---------------
fbar timer Idle ACTIVE 546
fbar timer isn’t documented, but I guessed its meaning by looking at the stack of the server dedicated process hanging in truncate:
pstack 27109
27109: oracleDB1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
00007fffa617741a portfs (5, 4, 7fffbffe3480, 0, 1dcd6500, 7fffbffe34a0)
000000000622c44d sskgpwwait () + dd
000000000622c033 skgpwwait () + c3
000000000642e664 ksliwat () + 9b4
000000000642d84f kslwaitctx () + af
000000000a7a3a58 ktfa_do_prepare_ddl () + 1b8
00000000069d448c ktfa_check_ddl_fda_tables () + 179c
0000000006947ce6 opiSem () + 696
00000000069cbaaf opiprs () + 11f
00000000069ca3ff kksParseChildCursor () + 1ff
The server dedicated process is waiting on a message (from fbda). Apparently, there’s some communication between the server dedicated process and fbda in the case of DDLs. Consequently, if fbda is doing something else and doesn’t send the message, or the message gets lost, the user session will keep hanging. On top of that, this time isn’t visible in ASH.
Why is fbar timer not visible in ASH?
Because, it’s wrongly classified as “Idle”:
select wait_class from v$event_name where name = 'fbar timer' ;
WAIT_CLASS
----------------------------------------------------------------
Idle
In contrast, Tanel Poder’s Snapper highlights the problem:
@snapper all 10 1 815
Sampling SID 815 with interval 10 seconds, taking 1 snapshots...
-- Session Snapper v4.30 - by Tanel Poder ( http://blog.tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH , NUM_WAITS, WAITS/SEC, AVERAGES
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
815, USER1 , STAT, in call idle wait time , 1001, 99.53, , , , , ~ per execution
815, USER1 , WAIT, fbar timer , 20002541, 1.99s, 198.9%, [WWWWWWWWWW], 0, 0, 20s average wait
-- End of Stats snap 1, end=2021-03-15 15:39:47, seconds=10.1
-- End of ASH snap 1, end=2021-03-15 15:39:47, seconds=10, samples_taken=92, AAS=0
When truncate completes (after issuing “oradebug resume” in this test case) tkprof will display the wait and elapsed time correctly:
truncate table t
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.82 2757.71 0 0 0 0
Execute 1 0.00 4.50 0 0 0 0
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 0.82 2762.22 0 0 0 0
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 88
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
fbar timer 6 2757.69 2762.21
PGA memory operation 1 0.00 0.00
SQL*Net message to client 1 0.00 0.00
SQL*Net message from client 1 0.00 0.00
********************************************************************************
Bugs
Below is the list of the open bugs which cause the user session to wait on fbar timer:
- Bug 31667263 : STATS HANG WAITING AS FINAL BLOCKER FOR ‘FBAR TIMER’
- Bug 31625755 : FBDA DISABLE SLOW WAITING FOR ‘FBAR TIMER’ FOR HUGE NUMBER OF FBDA TABLES
- Bug 31031183 : FBDA: HANG IN ‘FBAR TIMER'<=’ENQ: TX – CONTENTION’ WITH HIGH GG EXTRACT LOAD
- Bug 32587508 : ORA-1 ERROR WHILE MERGING INTO SYS_FBA_TCRV_% TABLE
Summary
In the case of DDL on a table with a Flashback Data Archive (FBDA), the dedicated server process communicates with the FBDA background process. Waiting on the message is instrumented with the undocumented fbar timer wait event. As this wait event is classified as Idle, it won’t appear in the active session history. The unaccounted database time is a consequence. Therefore, be careful when estimating database time by counting ASH samples when tables with FBDA are involved. SQL trace or Tanel Poder’s Snapper will show the correct information once the issue reappears.
There are several open bugs related to fbar timer. Check the bug descriptions to see if they are relevant for your case.
Never thought about it.. Any idea why would FBA Timer an Idle event?
I have to assume there has to be a reason for that.
Now I will be all paranoid every time I don’t see the info in ASH 😛
It doesn’t seem right to me that fbar timer is classified as an idle event.
I implemented a check that that monitors fbar timer based on v$session to compensate for missing ASH samples.
Additional data collection with SQL trace or snapper would highlight the problem.