Private strands
Private strands reduce contention on the public redo log buffer. Without them a process would have to write to public redo buffer more frequently and compete with other processes. [1]
Message Advanced Queues (AQ) and alerts can prevent processes from using private strands.
For example, the following transaction isn’t using private strands:
create table t (n number);
insert into t values (1);
select decode(bitand(ktcxbflg, 4096),0,1,0) used_private_strand
from x$ktcxb, v$session
where ktcxbses = saddr and sid = userenv('sid');
USED_PRIVATE_STRAND
-------------------
0
x$ktcxb is the SQL interface to the transaction table. The unset 13th bit of the bitmap ktcxbflg indicates that the transaction is using a private strand.
In-memory undo buffers
In-memory undo (IMU) buffer is a prerequisite for using a private strand [2]. But in my case, the process couldn’t allocate it:
select n.name, s.value from v$statname n, v$sesstat s
where n.statistic# = s.statistic# and s.sid = userenv('SID')
and n.name like '%IMU%' and value != 0 ;
NAME VALUE
---------------------------------------------------------------- ----------
IMU pool not allocated 1
x$ktifp externalizes the IMU pool:
select ktifptxflg, count(*) from x$ktifp group by ktifptxflg ;
KTIFPTXFLG COUNT(*)
---------- ----------
7 64
1 107
All IMU buffers are currently occupied – none of them has ktifptxflg set to zero.
According to v$transaction there aren’t any open transactions:
select count(*) from v$transaction ;
COUNT(*)
----------
0
But who then has allocated all IMU buffers?
To answer this question, I selected transactions from the fixed table x$ktcxb instead of v$transaction. I also joined the result with the IMU buffer fixed table x$ktifp and v$session:
column sid format 9999
column minutes format 9999999
column program format a23
column event format a48
select
sid,
round(( sysdate - to_date(ktcxbstm,'mm/dd/rr hh24:mi:ss') ) * 24 * 60)
minutes,
decode(bitand(ktcxbflg,2),2,1,0) bit2,
to_number(ktifprpc,'xxxxxxxxxxxxxxxx') -
to_number(ktifprpb,'xxxxxxxxxxxxxxxx') redo_bytes,
to_number(ktifpupc, 'xxxxxxxxxxxxxxxx') -
to_number(ktifpupb, 'xxxxxxxxxxxxxxxx') undo_bytes,
program,
event
from x$ktifp, x$ktcxb, v$session ses
where ktifpxcb = ktcxbxba and ktcxbses = saddr
order by ktcxbstm desc ;
SID MINUTES BIT2 REDO_BYTES UNDO_BYTES PROGRAM EVENT
----- -------- ----- ---------- ---------- ----------------------- ------------------------------------------------
...
1536 47 0 0 72 oracle@serverxx (J07T) Streams AQ: waiting for messages in the queue
333 47 0 0 72 oracle@serverxx (J08S) Streams AQ: waiting for messages in the queue
1098 47 0 0 72 oracle@serverxx (J06I) Streams AQ: waiting for messages in the queue
98 47 0 0 72 oracle@serverxx (J089) Streams AQ: waiting for messages in the queue
1462 47 0 0 72 oracle@serverxx (J03X) Streams AQ: waiting for messages in the queue
730 47 0 0 72 oracle@serverxx (J02N) Streams AQ: waiting for messages in the queue
787 47 0 0 72 oracle@serverxx (J02T) Streams AQ: waiting for messages in the queue
839 47 0 0 72 oracle@serverxx (J07Z) Streams AQ: waiting for messages in the queue
1422 47 0 0 72 oracle@serverxx (J03V) Streams AQ: waiting for messages in the queue
114 47 0 0 72 oracle@serverxx (J07R) Streams AQ: waiting for messages in the queue
55 47 0 0 72 oracle@serverxx (J07Y) Streams AQ: waiting for messages in the queue
1343 47 0 0 72 oracle@serverxx (J07M) pipe get
54 47 0 0 72 oracle@serverxx (J01E) Streams AQ: waiting for messages in the queue
1425 47 0 0 72 oracle@serverxx (J010) Streams AQ: waiting for messages in the queue
394 47 0 0 72 oracle@serverxx (J04Z) Streams AQ: waiting for messages in the queue
918 47 0 0 72 oracle@serverxx (J00C) Streams AQ: waiting for messages in the queue
593 47 0 0 72 oracle@serverxx (J002) Streams AQ: waiting for messages in the queue
1265 47 0 0 72 oracle@serverxx (J06Y) Streams AQ: waiting for messages in the queue
1216 47 0 0 72 oracle@serverxx (J04C) Streams AQ: waiting for messages in the queue
177 47 0 0 72 oracle@serverxx (J068) Streams AQ: waiting for messages in the queue
809 47 0 0 72 oracle@serverxx (J008) Streams AQ: waiting for messages in the queue
1436 47 0 0 72 oracle@serverxx (J07L) Streams AQ: waiting for messages in the queue
1496 47 0 0 72 oracle@serverxx (J02L) Streams AQ: waiting for messages in the queue
665 47 0 0 72 oracle@serverxx (J02H) Streams AQ: waiting for messages in the queue
SID MINUTES BIT2 REDO_BYTES UNDO_BYTES PROGRAM EVENT
----- -------- ----- ---------- ---------- ----------------------- ------------------------------------------------
1120 47 0 0 72 oracle@serverxx (J03W) Streams AQ: waiting for messages in the queue
496 47 0 0 72 oracle@serverxx (J029) Streams AQ: waiting for messages in the queue
1176 47 0 0 72 oracle@serverxx (J01T) Streams AQ: waiting for messages in the queue
1523 47 0 0 72 oracle@serverxx (J020) Streams AQ: waiting for messages in the queue
1366 47 0 0 72 oracle@serverxx (J08V) Streams AQ: waiting for messages in the queue
355 47 0 0 72 oracle@serverxx (J04Y) Streams AQ: waiting for messages in the queue
1470 47 0 0 72 oracle@serverxx (J03Y) Streams AQ: waiting for messages in the queue
1202 47 0 0 72 oracle@serverxx (J00K) Streams AQ: waiting for messages in the queue
641 48 0 0 72 oracle@serverxx (J02K) Streams AQ: waiting for messages in the queue
573 48 0 0 72 oracle@serverxx (J02E) Streams AQ: waiting for messages in the queue
41 48 0 0 72 oracle@serverxx (J01F) Streams AQ: waiting for messages in the queue
487 48 0 0 72 oracle@serverxx (J03J) Streams AQ: waiting for messages in the queue
1407 108 0 0 72 oracle@serverxx (J046) pipe get
1448 108 0 0 72 oracle@serverxx (J012) pipe get
1291 108 0 0 72 oracle@serverxx (J06V) pipe get
864 108 0 0 72 oracle@serverxx (J08Z) pipe get
850 108 0 0 72 oracle@serverxx (J08P) pipe get
SID MINUTES BIT2 REDO_BYTES UNDO_BYTES PROGRAM EVENT
----- -------- ----- ---------- ---------- ----------------------- ------------------------------------------------
1368 468 0 0 72 oracle@serverxx (J03R) pipe get
1457 468 0 0 72 oracle@serverxx (J071) Streams AQ: waiting for messages in the queue
1201 468 0 0 72 oracle@serverxx (J06G) PL/SQL lock timer
1136 468 0 0 72 oracle@serverxx (J036) Streams AQ: waiting for messages in the queue
1189 468 0 0 72 oracle@serverxx (J03B) pipe get
1174 468 0 0 72 oracle@serverxx (J03A) pipe get
1162 468 0 0 72 oracle@serverxx (J06D) pipe get
1145 468 0 0 72 oracle@serverxx (J038) pipe get
521 468 0 0 72 oracle@serverxx (J000) pipe get
171 rows selected.
The transactions above look strange:
- None of them has allocated any redo space.
- They have all allocated exactly 72 bytes in the undo buffer.
- They’ve been holding IMU buffers for hours.
- The 2nd bit in ktcxbflg isn’t set.
ktcxbflg
It’s unusual to leave the 2nd bit of ktcxbflg unset. v$transaction, by the way, filters such transactions out:
BITAND (ktcxbflg, 2) != 0
Let’s compare the transactions above with a regular transaction:
insert into t values (1);
SID MINUTES BIT2 REDO_BYTES UNDO_BYTES PROGRAM EVENT
----- -------- ---------- ---------- ---------- ----------------------- ------------------------------------------------
257 0 1 572 576 sqlplus@svdbp01i (TNS V PGA memory operation
1-V3)
The 2nd bit of ktcxbflg is set, and the transaction is using some redo space.
What has produced the transactions holding the IMU buffers?
Inter-process communication
Based on the wait events pipe get and Streams AQ: waiting for messages in the queue we can conclude that the sessions have been involved in inter-process communication. The errorstack for these wait events provides more details.
AQ
ALTER session SET EVENTS 'wait_event["Streams AQ: waiting for messages in the queue"] {occurence:end_after 100} errorstack()';
----- PL/SQL Call Stack -----
object line object
handle number name
c1a5e6e0 750 package body SYS.DBMS_AQ.DEQUEUE
c31eb7a0 2859 package body K.MSG_EXTL#.GET_NEXT_IN_MSG
c33f9130 489 package body K.MSG_IN#.GET_NEXT_MSG
c6fdb388 1215 procedure K.BGP_MSG_IN
c6fdb388 1404 procedure K.BGP_MSG_IN
bfb9fa78 1 anonymous block
The PL/SQL process was waiting on an AQ.
We can reproduce this behavior with the test case from the Tim Hall’s tutorial:
DECLARE
l_dequeue_options DBMS_AQ.dequeue_options_t;
l_message_properties DBMS_AQ.message_properties_t;
l_message_handle RAW(16);
l_event_msg event_msg_type;
BEGIN
DBMS_AQ.dequeue(queue_name => 'event_queue',
dequeue_options => l_dequeue_options,
message_properties => l_message_properties,
payload => l_event_msg,
msgid => l_message_handle);
END;
/
SID MINUTES BIT2 REDO_BYTES UNDO_BYTES PROGRAM EVENT
----- -------- ---------- ---------- ---------- ----------------------- ------------------------------------------------
265 2 0 0 72 sqlplus@svdbp01i (TNS V Streams AQ: waiting for messages in the queue
1-V3)
There were no incoming messages. Simply put, the dequeue procedure creates a transaction and allocates an IMU buffer. This transaction remains open during the whole waiting time.
Alerts
ALTER session SET EVENTS 'wait_event["pipe get"] {occurence:end_after 100} errorstack()'
c2e8a078 169 package body SYS.DBMS_PIPE.RECEIVE_MESSAGE
c42bdc00 165 package body SYS.DBMS_ALERT.PIPE_WAIT
c42bdc00 363 package body SYS.DBMS_ALERT.WAITONE
This process was waiting for an alert.
Also here, we can easily reproduce the behavior:
declare
L_MSG VARCHAR2(240);
l_status pls_integer ;
begin
DBMS_ALERT.WAITONE('TEST2', L_MSG, L_STATUS, 1000);
insert into t values(1);
end;
exec DBMS_ALERT.SIGNAL('TEST2', 'message_text2');
commit;
SID MINUTES BIT2 REDO_BYTES UNDO_BYTES PROGRAM EVENT
----- -------- ---------- ---------- ---------- ----------------------- ------------------------------------------------
265 1 0 0 72 sqlplus@svdbp01i (TNS V pipe get
1-V3)
Similarly, the session holds the allocated IMU buffer while waiting for an alert.
Is it a bug or a feature?
In both cases, the session is waiting for an incoming message/alert. Even though it’s an idle wait, the process started a pseudo-transaction which allocated 72 bytes in the IMU buffer. It’s a non-obvious and undocumented behavior. Many such processes could use up the IMU pool. Consequently, “real” transactions wouldn’t be able to use private strands.
Avaloq
The banking application Avaloq is particularly at risk as it heavily relies on AQ and alerts. For example, our production database has more than 350 job queue processes waiting for a message.
We need enough IMU buffers to handle the peak load. The number of IMU buffers can’t be explicitly specified. It’s rather derived from the transactions parameter (#IMU buffers = 0.1 * transactions).
MTTR advisor
There is another undocumented quirk related to public redo buffers. A process sometimes counts dirty buffers on all working data sets while holding a redo copy latch to prepare the information for the MTTR advisor. As the number of working data sets equals the number of CPUs, larger servers suffer more. [3]
Our primary goal is to reduce public redo buffer writes. But you can disable the MTTR advisor to alleviate the redo copy latch contention, until increasing the transactions parameter.
ORA-00600: [krcput_overwrite]
We’ve been occasionally hitting ORA-00600: [krcput_overwrite]. The affected Avaloq test databases had the transactions parameter set so low, that processes weren’t using private strands.
The error has being thrown in the Oracle function krcput:
...
krcput()+2249 call ksesin() 008C173D0 ? 000000004 ?
000000000 ? 00000003E ?
000000000 ? 00000003E ?
kcrfw_redo_gen_ext( call krcput() 091ED0FD8 ? 7FFFBFFF18F8 ?
)+22019 7FFFBFFF24C0 ? 000000000 ?
000000002 ? 00000003E ?
kcbchg1_main()+7114 call kcrfw_redo_gen_ext( 000000002 ? 7FFFBFFF24C0 ?
) 7FFFBFFF2480 ? 000000000 ?
7FFFBFFF2188 ? 7FFFBD869CD0 ?
kcbchg()+482 call kcbchg1_main() 000000008 ? 000000002 ?
000000037 ? 7FFFBFFF2DB0 ?
7FFFBFFF2E00 ? 000000000 ?
ktuapundo()+497 call kcbchg() 000000008 ? 000000002 ?
...
krcput writes changed block addresses into the block change tracking buffer [4], and its caller kcrfw_redo_gen_ext copies changes to the public redo buffer. [5]
According to the support note Bug 30295790 – DB Crashes with ORA-00600 [krcput_overwrite] (Doc ID 30295790.8) it’s a concurrency issue: “DBA buffers are updated by CTWR as well as multiple foreground process to read and write krcpe entries in it.”
In other words, the problem occurs during concurrent writes to the public redo log buffer. Private strands reduce these calls, and help avoid the bug.
Summary
Private strands reduce contention on the public redo buffer and improve scalability. A process that hasn’t allocated an IMU buffer can’t use private strands.
A process waiting for an AQ message or an alert allocates an IMU buffer and holds it while waiting. Many such processes could use up the IMU pool. That would prevent other sessions from using private strands, and cause performance issues. This problem is worse on servers with more CPUs, because the MTTR advisor code protected by redo copy latch takes more time to complete.
The banking application Avaloq needs a thorough consideration, because it configures many background job queue processes to wait for AQ messages and alerts. Set the transactions parameter high enough to ensure that private strands are being used during the peak load.
References
- [1] Wei Huang, Oracle Redo Strand. December 18, 2009.
- [2] Frits Hoogland, Redo a blogpost. November 15, 2016.
- [3] Nenad Noveljic, Redo Copy Latch Contention 2: MTTR Advisor. June 16, 2020.
- [4] Tanel Poder, Oracle-l: intermittent long “log file sync” waits. Februar 10, 2020.
- [5] Frits Hoogland, A look into Oracle redo, part 1: redo allocation latches. January 29, 2019.
Updates
October 10, 2020
A correction: The unset (not the set) 13th bit in x$ktcxb.ktcxbflg indicates that the private strands are being used. Thanks to Kun Sun for pointing this out.
October 23, 2020
The fix for the bug 30295790 DB Crashes with ORA-00600 [krcput_overwrite] is included in 19.9 19.9.0.0.DBRU:201020 (OCT 2020).
p1= 7643370632 p1 text = handle address
p2= 4096 p2 text = buffer length
p3 = 60 p3 text = timeout
p1= 4752598408 p1 text = handle address
p2= 4096 p2 text = buffer length
p3 = 60 p3 text = timeout
p1= 5254179256 p1 text = handle address
p2= 4096 p2 text = buffer length
p3 = 60 p3 text = timeout
p1=2061237600 p1 text = handle address
p2= 4096 p2 text = buffer length
p3 = 60 p3 text = timeout
I have one scheduler job which always sleeps on pipe get in 19c. The patch level is 19.12. Once i kill the session and rerun the session the job succeeds. While monitoring i got the above values where p3 always is recorded as timeout.
I am unable to trace where the session is getting timed out. Could you please throw some light on it.
You haven’t mentioned the wait event name. It’s needed to get more information about the blocker.
pipe get
Could you execute following for the session waiting on pipe get:
@init
@ashtop event2 session_id=... &5min