Deadlock
I’ve been occasionally seeing deadlocks like the following one:
Resolving deadlock by signaling ORA-00060 to 'instance: 1, os id: 27612, session id: 277'
dump location: /u00/oracle/orabase/diag/rdbms/db1_site1/DB1/trace/DB1_m007_27612.trc
Oracle session identified by:
{
instance: 1 (db1_site1.db1)
os id: 27612
process id: 55, oracle@server1 (M007)
session id: 277
session serial #: 13710
module name: 0 (MMON_SLAVE)
}
is waiting for 'row cache lock' with wait info:
{
p1: 'cache id'=0x10
p2: 'mode'=0x0
p3: 'request'=0x5
time in wait: 2.004491 sec
timeout after: never
wait id: 5324
blocking: 1 session
current sql_id: 1675298304
current sql: alter table WRH$_ACTIVE_SESSION_HISTORY split partition WRH$_ACTIVE_SESSION_HISTORY_1246382628_3941 at (1246382628,3965) into (partition WRH$_ACTIVE_SESSION_HISTORY_1246382628_3941, partition WRH$_ACTIVE_SESSION_HISTORY_1246382628_3965 tablespace SYSAUX) update indexes
wait history:
* time between current wait and wait #1: 0.001227 sec
1. event: 'library cache lock'
time waited: 0.128808 sec
wait id: 5323 p1: 'handle address'=0x87d123c8
p2: 'lock address'=0x818f8638
p3: '100*mode+namespace'=0x2a7900010003
* time between wait #1 and #2: 0.000176 sec
2. event: 'direct path write'
time waited: 0.000472 sec
wait id: 5322 p1: 'file number'=0x2
p2: 'first dba'=0x62e1
p3: 'block cnt'=0x1
* time between wait #2 and #3: 0.000057 sec
3. event: 'direct path write'
time waited: 0.000729 sec
wait id: 5321 p1: 'file number'=0x2
p2: 'first dba'=0x62b5
p3: 'block cnt'=0x3
}
and is blocked by
=> Oracle session identified by:
{
instance: 1 (db1_site1.db1)
os id: 7428
process id: 62, oracle@server1 (J000)
session id: 311
session serial #: 52586
module name: 0 (DBMS_SCHEDULER)
}
which is waiting for 'library cache lock' with wait info:
{
p1: 'handle address'=0x87d123c8
p2: 'lock address'=0x863522d0
p3: '100*mode+namespace'=0x2a7900010003
time in wait: 2.005735 sec
timeout after: 14 min 57 sec
wait id: 652
blocking: 1 session
current sql_id: 3010140026
current sql: call dbms_stats.gather_database_stats_job_proc ( )
wait history:
* time between current wait and wait #1: 0.170488 sec
1. event: 'buffer busy waits'
time waited: 0.000193 sec
wait id: 651 p1: 'file#'=0x1
p2: 'block#'=0x52b5
p3: 'class#'=0x1
* time between wait #1 and #2: 0.033917 sec
2. event: 'latch: shared pool'
time waited: 0.000057 sec
wait id: 650 p1: 'address'=0x602e2348
p2: 'number'=0x250
p3: 'why'=0x0
* time between wait #2 and #3: 0.031947 sec
3. event: 'buffer busy waits'
time waited: 0.000059 sec
wait id: 649 p1: 'file#'=0x3
p2: 'block#'=0x4a7
p3: 'class#'=0x22
}
and is blocked by the session at the start of the chain.
Here’s a short interpretation of the diagnostic information above:
- MMON_SLAVE, which was executing alter table split partition on a workload repository table, was waiting on a row cache lock after acquiring a library cache lock.
- The row cache lock was held by a job queue process, which was gathering statistics.
- This job queue process was waiting on the library cache lock which was held by MMON_SLAVE.
- MMON_SLAVE was a deadlock victim.
Column statistics
Further, p1 of the row cache lock wait contains the cache id in the row cache, which we can use to query v$rowcache.
…
is waiting for 'row cache lock' with wait info:
{
p1: 'cache id'=0x10
...
select distinct parameter from v$rowcache where cache#=16 ;
PARAMETER
--------------------------------
dc_histogram_data
dc_histogram_defs
As we can see, the row cache lock was related to the column statistics.
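As a side note, the wait parameter is reported in hex while v$rowcache.cache# is in decimal; if needed, the conversion can be done directly in SQL:
select to_number('10', 'XX') cache_id from dual ;
The query returns 16, the cache# used above.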
Known bugs
The fingerprint matches the following known bugs:
- 26440681: ORACLE 12.2 DEADLOCK BETWEEN GATHER STATS JOB AND MMON SLAVE
- 8605337 : DEADLOCK BETWEEN GATHER STATS JOB AND MMON SLAVE
I’ve been seeing this issue after upgrading to 18c, but I’ve managed to reproduce it on 12c and 19c as well. In fact, the purpose of this blog post is to provide a reproducible test case which might be useful for fixing the bug and for explaining the circumstances leading to the problem. Interestingly, the Oracle Support documents state that the bug isn’t reproducible.
Call stacks
Luckily, the call stack traces were spooled into the trace files for both processes.
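Had they not been, a comparable stack can also be taken manually from a live process with oradebug in a SYSDBA session (standard oradebug commands; the os id is the one reported in the deadlock message above):
oradebug setospid 27612
oradebug short_stack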
Below is the stack for split partition (I cut out some irrelevant parts):
----- Current SQL Statement for this session (sql_id=67qs60289jx3k) -----
alter table WRH$_ACTIVE_SESSION_HISTORY split partition WRH$_ACTIVE_SESSION_HISTORY_1246382628_3941 at (1246382628,3965) into (partition WRH$_ACTIVE_SESSION_HISTORY_1246382628_3941, partition WRH$_ACTIVE_SESSION_HISTORY_1246382628_3965 tablespace SYSAUX) update indexes
*** 2020-03-21T23:07:24.536095+01:00
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst1()+410 call skdstdst() 7FFFBFFAB890 ?
2DD66E46FF12FB ?
7FFFBFFAB880 ? 7FFFBFF85778 ?
7FFFBFF88D10 ? 7FFFBFF84F58 ?
ksdhng_dumpcb_hung_ call ksedst1() 000000000 ? 000000001 ?
proc_int()+843 7FFFBFFAB880 ? 7FFFBFF85778 ?
7FFFBFF88D10 ? 7FFFBFF84F58 ?
ksdhng_action_execu call ksdhng_dumpcb_hung_ 000000001 ? 0B6B3B310 ?
te()+1165 proc_int() 7FFFBFFAB880 ? 7FFFBFF85778 ?
7FFFBFF88D10 ? 7FFFBFF84F58 ?
ksuintract_exec_pen call ksdhng_action_execu 008D62128 ? 0B6B3B310 ?
ding()+352 te() 7FFFBFFAB880 ? 7FFFBFF85778 ?
7FFFBFF88D10 ? 7FFFBFF84F58 ?
ksuitr()+6419 call ksuintract_exec_pen 008D62128 ? 0B6B3B310 ?
ding() 7FFFBFFAB880 ? 0B6B65C58 ?
7FFFBFF88D10 ? 7FFFBFF84F58 ?
ksu_dispatch_tac()+ call ksuitr() 000000039 ? 0B6B3B310 ?
2384 7FFFBFFAB880 ? 0B6B65C58 ?
7FFFBFF88D10 ? 7FFFBFF84F58 ?
kqrget()+4920 call ksu_dispatch_tac() 000000039 ? 0B6B3B310 ?
7FFFBFFAB880 ? 0B6B65C58 ?
7FFFBFF88D10 ? 7FFFBFF84F58 ?
kqrLockPo()+272 call kqrget() 0868B7810 ? 000000005 ?
0001EC682 ? 0832EC860 ?
7FFFBFF88D10 ? 7FFFBFF84F58 ?
kqrpre2()+1950 call kqrLockPo() 000000010 ? 0832EC780 ?
09670BF20 ? 000000005 ?
000000000 ? 094510000 ?
kqrpre()+44 call kqrpre2() 000000010 ? 7FFFBFFACDA0 ?
7FFFBFFACDA8 ? 000000005 ?
09670BF20 ? 000000000 ?
qoshdd()+155 call kqrpre() 000000010 ? 7FFFBFFACDA0 ?
7FFFBFFACDA8 ? 000000005 ?
09670BF20 ? 000000000 ?
kkpodp_del_stats()+ call qoshdd() 000030E3D ? 7FFFBFFACDA0 ?
199 7FFFBFFACDA8 ? 000000005 ?
09670BF20 ? 000000000 ?
ctcfmo_fragment_mai call kkpodp_del_stats() 000000000 ? 0812A66B8 ?
nt_oper()+1783 000000000 ? 087D123C8 ?
087405A68 ? 000000000 ?
ctcdrv()+32784 call ctcfmo_fragment_mai 08BEC5008 ?
nt_oper() FFFFFFFF8D31A750 ?
000000000 ? 0840DA650 ?
000000000 ? 7FFFBFFAD790 ?
...
As we can see, the function that was executing was kqrget, which is the function for getting a row cache lock (see Frits Hoogland’s Oracle C functions annotations). This is consistent with the deadlock diagram, as the process was waiting on a row cache lock. The row cache lock was being requested in order to delete the statistics on the partitioned dictionary table; we know that because kkpodp_del_stats appears among the callers on the stack.
Next, let’s check the stack for gather statistics:
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst1()+410 call skdstdst() 7FFFBFFE7C30 ?
2DD66E470C094B ?
7FFFBFFE7C20 ? 7FFFBFFC1B18 ?
7FFFBFFC50B0 ? 7FFFBFFC12F8 ?
ksdhng_dumpcb_hung_ call ksedst1() 000000000 ? 000000001 ?
proc_int()+843 7FFFBFFE7C20 ? 7FFFBFFC1B18 ?
7FFFBFFC50B0 ? 7FFFBFFC12F8 ?
ksdhng_action_execu call ksdhng_dumpcb_hung_ 000000001 ? 0B6B3B310 ?
te()+148 proc_int() 7FFFBFFE7C20 ? 7FFFBFFC1B18 ?
7FFFBFFC50B0 ? 7FFFBFFC12F8 ?
ksuintract_exec_pen call ksdhng_action_execu 000000000 ? 0B6B3B310 ?
ding()+352 te() 7FFFBFFE7C20 ? 7FFFBFFC1B18 ?
7FFFBFFC50B0 ? 7FFFBFFC12F8 ?
ksuitr()+6419 call ksuintract_exec_pen 000000000 ? 0B6B3B310 ?
ding() 7FFFBFFE7C20 ? 0B6B6D6A0 ?
7FFFBFFC50B0 ? 7FFFBFFC12F8 ?
ksu_dispatch_tac()+ call ksuitr() 000000039 ? 0B6B3B310 ?
2384 7FFFBFFE7C20 ? 0B6B6D6A0 ?
7FFFBFFC50B0 ? 7FFFBFFC12F8 ?
ksfgti()+108 call ksu_dispatch_tac() 000000039 ? 0B6B3B310 ?
7FFFBFFE7C20 ? 0B6B6D6A0 ?
7FFFBFFC50B0 ? 7FFFBFFC12F8 ?
kglLockWait()+881 call ksfgti() 000000039 ? 0B6B3B310 ?
7FFFBFFE7C20 ? 0B6B6D6A0 ?
7FFFBFFC50B0 ? 7FFFBFFC12F8 ?
kgllkal()+1394 call kglLockWait() 7FFFBDB7DA10 ? 7FFFBFFE9D30 ?
087D123C8 ? 0863522D0 ?
87D1244800000003 ?
7FFFBFFC12F8 ?
kglLock()+19861 call kgllkal() 7FFFBDB7DA10 ? 7FFFBFFE9480 ?
087D123C8 ? 000000003 ?
0863522D0 ? 7FFFBFFE9D30 ?
kglget()+1920 call kglLock() 7FFFBDB7DA10 ? 7FFFBFFE9D30 ?
000000003 ? 000000001 ?
7FFFBD83CF98 ? 7FFFBFFE9D30 ?
qostobkglrel()+510 call kglget() 7FFFBDB7DA10 ? 7FFFBFFE9D30 ?
000000003 ? 000000001 ?
7FFFBD83CF98 ? 7FFFBFFE9D30 ?
qostobctxrel()+311 call qostobkglrel() 7FFFBD83CD88 ? 7FFFBFFE9D30 ?
000000003 ? 08D62EAF8 ?
7FFFBD83CF98 ? 7FFFBFFE9D30 ?
qospscs()+8574 call qostobctxrel() 7FFFBFFEA4F0 ? 7FFFBFFE9D30 ?
000000003 ? 08D62EAF8 ?
7FFFBD83CF98 ? 7FFFBFFE9D30 ?
spefcmpa()+1156 call qospscs() 7FFFBFFECA60 ? 7FFFBD3BC540 ?
000000149 ? 000000210 ?
7FFFBD8863A8 ?
BD8863A800000000 ?
spefmccallstd()+233 call spefcmpa() 7FFFBD8864D8 ? 00000000A ?
000000000 ? 000000210 ?
7FFFBD8863A8 ?
BD8863A800000000 ?
...
Here too, the stack is consistent with the deadlock diagram: the process is waiting on a library cache lock (kglLockWait). The caller is qospscs, the Oracle C function underpinning DBMS_STATS.SET_COLUMN_STATS_CALLOUT.
Debugging
Based on the information in the call stacks, we can use the GNU debugger (gdb) to reproduce the deadlock. Mind that the described procedure relies heavily on Oracle internals, so it might not be valid for all software versions; I tested it on a 19.6.0.0.200114 database.
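The walkthrough below attaches gdb to the dedicated server process of each test session. The OS process id to attach to can be looked up with the usual join of v$session and v$process (shown here for the current session as an example); gdb is then attached with gdb -p <spid>:
select p.spid
  from v$process p, v$session s
 where s.paddr = p.addr
   and s.sid = sys_context('userenv', 'sid') ;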
I modelled the problem with a table having a similar, yet much simpler, structure:
create table t (
n number, constraint t_pk primary key (n) using index local
(
PARTITION t_10 ,
PARTITION t_20
) enable
)
PARTITION BY RANGE (n)
(
PARTITION t_10 VALUES LESS THAN (10),
PARTITION t_20 VALUES LESS THAN (MAXVALUE)
)
;
insert into t select level n from dual connect by level < 20 ;
commit ;
exec dbms_stats.gather_table_stats (null, 'T', cascade => true ) ;
The relevant features are: partitions and a local index. Besides that, the table must be created in the SYS schema.
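Before starting, it’s worth verifying that both the table and the primary key index really are partitioned as intended; a quick sanity check against dba_objects (not part of the reproduction itself):
select object_name, subobject_name, object_type
  from dba_objects
 where object_name in ('T', 'T_PK')
   and subobject_name is not null
 order by object_name, subobject_name ;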
Using the information in the stack traces and gdb, it’s easy to reproduce the problem. First, attach gdb to the server process of the session that will be gathering statistics, set the following breakpoint, continue the execution, and then start the statistics gathering in that session:
b qostobkglrel
c
exec dbms_stats.gather_table_stats (null, 'T', cascade => true ) ;
When statistics gathering reaches the breakpoint, attach gdb to the second session’s server process and run split partition there until it hits the breakpoint on qoshdd:
b qoshdd
c
alter table t split partition t_10 at (5) into (partition t_5, partition t_10 ) update indexes ;
After split partition stops at the breakpoint, continue the execution of gather statistics until kglget:
b kglget
c
Resume split partition until the following breakpoint (the condition on $rdi, the register carrying the first function argument on x86-64, makes the breakpoint fire only for row cache id 16, the dc_histogram caches identified earlier):
b kqrpre if $rdi==16
c
After both breakpoints have been hit, continue the execution of both processes:
c
We can clearly see in the active session history that two sessions are waiting on each other:
@ash_wait_chains2 event2||sql_opname 1=1 &1min
-- Display ASH Wait Chain Signatures script v0.5 BETA by Tanel Poder ( http://blog.tanelpoder.com )
%This SECONDS AAS WAIT_CHAIN FIRST_SEEN LAST_SEEN
------ ---------- ------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ------------------- -------------------
48% 59 1.0 -> row cache lock CREATE TABLE -> library cache lock PL/SQL EXECUTE -> [idle blocker 1,1405,54879 (sqlplus@svdbp01i (TNS V1-V3))] 2020-04-01 18:36:25 2020-04-01 18:37:24
48% 59 1.0 -> library cache lock PL/SQL EXECUTE -> row cache lock CREATE TABLE -> [idle blocker 1,2159,25346 (sqlplus@svdbp01i (TNS V1-V3))] 2020-04-01 18:36:25 2020-04-01 18:37:24
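The mutual blocking can also be confirmed directly in v$session while both sessions are stuck (the sids below are the two test sessions from my run; yours will differ):
select sid, event, blocking_session
  from v$session
 where sid in (1405, 2159) ;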
After the debugger was detached from both processes, Oracle resolved this deadlock:
detach
alter table t split partition t_10 at (5) into (partition t_5, partition t_10 ) update indexes ;
alter table t split partition t_10 at (5) into (partition t_5, partition t_10 ) update indexes
*
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
Curiously, the deadlock doesn’t occur if the table is created outside the SYS schema.
So, what’s the difference?
Library cache lock
We can explore library cache locks when gather statistics reaches the breakpoint on qostobkglrel and split partition is waiting on the library cache lock:
set pagesize 20
column sql_text format a20
column KGLNAOBJ format a10
select ses.sid, ses.serial#, lck.kgllkcnt, lck.kgllkmod, lck.kgllkreq, lck.kglnaobj, s.sql_text
from x$kgllk lck , v$session ses, v$sqlarea s
where kgllkhdl in
(select kgllkhdl from x$kgllk where kgllkreq >0)
and lck.KGLLKUSE = ses.saddr
and ses.sql_id = s.sql_id
;
(The query above is from Riyaj Shamsudeen’s blog post “library cache lock on BUILD$ object”; I enriched it with the sql_text.)
For the table in the SYS schema, gather statistics acquires only a shared lock (2):
SID SERIAL# KGLLKCNT KGLLKMOD KGLLKREQ KGLNAOBJ SQL_TEXT
---------- ---------- ---------- ---------- ---------- ---------- --------------------
1405 42006 0 0 3 T alter table t split
partition t_10 at (5
) into (partition t_
5, partition t_10 )
update indexes
2159 13084 1 2 0 T BEGIN dbms_stats.gat
her_table_stats (nul
l, 'T', cascade => t
rue ) ; END;
In contrast, the gather statistics process grabs the exclusive lock (3) when the table is outside the SYS schema:
SID SERIAL# KGLLKCNT KGLLKMOD KGLLKREQ KGLNAOBJ SQL_TEXT
---------- ---------- ---------- ---------- ---------- ---------- --------------------
1405 18584 0 0 2 T alter table t split
partition t_10 at (5
) into (partition t_
5, partition t_10 )
update indexes
2159 13084 1 1 0 T BEGIN dbms_stats.gat
her_table_stats (nul
l, 'T', cascade => t
rue ) ; END;
2159 13084 1 3 0 T BEGIN dbms_stats.gat
her_table_stats (nul
l, 'T', cascade => t
rue ) ; END;
The exclusive lock acquired by statistics gathering prevents split partition from entering the critical code path where the deadlock is possible. Instead, split partition simply waits until gather statistics releases the lock.
Summary
In summary, both split partition and gather statistics change the column statistics. Changing the column statistics requires getting a library cache lock on the table and a row cache lock on the dc_histogram caches. For tables in the SYS schema, only a shared library cache lock is acquired at a critical point. A shared lock can lead to concurrency problems, because both processes can progress far enough to acquire the library cache and row cache locks in an order in which they block each other. This is why a deadlock sometimes happens when partition maintenance on the workload repository tables runs concurrently with statistics gathering.
In contrast, the exclusive library cache lock prevents such concurrency problems on tables outside the SYS schema.