A Case Study: How Advanced Queuing and Alerts Prevented Private Strands Usage

Private strands

Private strands reduce contention on the public redo log buffer. Without them a process would have to write to public redo buffer more frequently and compete with other processes. [1]

Message Advanced Queues (AQ) and alerts can prevent processes from using private strands.

For example, the following transaction isn’t using private strands:

create table t (n number);
insert into t values (1);

select decode(bitand(ktcxbflg, 4096),0,1,0) used_private_strand
    from x$ktcxb, v$session
    where ktcxbses = saddr and sid = userenv('sid');

USED_PRIVATE_STRAND
-------------------
                  0

x$ktcxb is the SQL interface to the transaction table. The unset 13th bit of the bitmap ktcxbflg indicates that the transaction is using a private strand.

In-memory undo buffers

In-memory undo (IMU) buffer is a prerequisite for using a private strand [2]. But in my case, the process couldn’t allocate it:

select n.name, s.value from v$statname n, v$sesstat s 
  where n.statistic# = s.statistic# and s.sid =  userenv('SID') 
    and n.name like '%IMU%' and value != 0 ;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
IMU pool not allocated                                                    1

x$ktifp externalizes the IMU pool:

select ktifptxflg, count(*) from x$ktifp group by ktifptxflg ;

KTIFPTXFLG   COUNT(*)
---------- ----------
         7         64
         1        107

All IMU buffers are currently occupied – none of them has ktifptxflg set to zero.

According to v$transaction there aren’t any open transactions:

select count(*) from v$transaction ;

  COUNT(*)
----------
         0

But who then has allocated all IMU buffers?

To answer this question, I selected transactions from the fixed table x$ktcxb instead of v$transaction. I also joined the result with the IMU buffer fixed table x$ktifp and v$session:

column sid format 9999
column minutes format 9999999
column program format a23
column event format a48

select 
    sid, 
    round(( sysdate - to_date(ktcxbstm,'mm/dd/rr hh24:mi:ss') ) * 24 * 60) 
      minutes, 
    decode(bitand(ktcxbflg,2),2,1,0) bit2,
    to_number(ktifprpc,'xxxxxxxxxxxxxxxx') -
    to_number(ktifprpb,'xxxxxxxxxxxxxxxx')  redo_bytes,
    to_number(ktifpupc, 'xxxxxxxxxxxxxxxx') - 
    to_number(ktifpupb, 'xxxxxxxxxxxxxxxx') undo_bytes,
    program, 
    event
  from x$ktifp, x$ktcxb, v$session ses 
  where ktifpxcb = ktcxbxba and ktcxbses = saddr
  order by ktcxbstm desc ;

  SID  MINUTES  BIT2 REDO_BYTES UNDO_BYTES PROGRAM                 EVENT
----- -------- ----- ---------- ---------- ----------------------- ------------------------------------------------
...
 1536       47     0          0         72 oracle@serverxx (J07T)  Streams AQ: waiting for messages in the queue
  333       47     0          0         72 oracle@serverxx (J08S)  Streams AQ: waiting for messages in the queue
 1098       47     0          0         72 oracle@serverxx (J06I)  Streams AQ: waiting for messages in the queue
   98       47     0          0         72 oracle@serverxx (J089)  Streams AQ: waiting for messages in the queue
 1462       47     0          0         72 oracle@serverxx (J03X)  Streams AQ: waiting for messages in the queue
  730       47     0          0         72 oracle@serverxx (J02N)  Streams AQ: waiting for messages in the queue
  787       47     0          0         72 oracle@serverxx (J02T)  Streams AQ: waiting for messages in the queue
  839       47     0          0         72 oracle@serverxx (J07Z)  Streams AQ: waiting for messages in the queue
 1422       47     0          0         72 oracle@serverxx (J03V)  Streams AQ: waiting for messages in the queue
  114       47     0          0         72 oracle@serverxx (J07R)  Streams AQ: waiting for messages in the queue
   55       47     0          0         72 oracle@serverxx (J07Y)  Streams AQ: waiting for messages in the queue
 1343       47     0          0         72 oracle@serverxx (J07M)  pipe get
   54       47     0          0         72 oracle@serverxx (J01E)  Streams AQ: waiting for messages in the queue
 1425       47     0          0         72 oracle@serverxx (J010)  Streams AQ: waiting for messages in the queue
  394       47     0          0         72 oracle@serverxx (J04Z)  Streams AQ: waiting for messages in the queue
  918       47     0          0         72 oracle@serverxx (J00C)  Streams AQ: waiting for messages in the queue
  593       47     0          0         72 oracle@serverxx (J002)  Streams AQ: waiting for messages in the queue
 1265       47     0          0         72 oracle@serverxx (J06Y)  Streams AQ: waiting for messages in the queue
 1216       47     0          0         72 oracle@serverxx (J04C)  Streams AQ: waiting for messages in the queue
  177       47     0          0         72 oracle@serverxx (J068)  Streams AQ: waiting for messages in the queue
  809       47     0          0         72 oracle@serverxx (J008)  Streams AQ: waiting for messages in the queue
 1436       47     0          0         72 oracle@serverxx (J07L)  Streams AQ: waiting for messages in the queue
 1496       47     0          0         72 oracle@serverxx (J02L)  Streams AQ: waiting for messages in the queue
  665       47     0          0         72 oracle@serverxx (J02H)  Streams AQ: waiting for messages in the queue

  SID  MINUTES  BIT2 REDO_BYTES UNDO_BYTES PROGRAM                 EVENT
----- -------- ----- ---------- ---------- ----------------------- ------------------------------------------------
 1120       47     0          0         72 oracle@serverxx (J03W)  Streams AQ: waiting for messages in the queue
  496       47     0          0         72 oracle@serverxx (J029)  Streams AQ: waiting for messages in the queue
 1176       47     0          0         72 oracle@serverxx (J01T)  Streams AQ: waiting for messages in the queue
 1523       47     0          0         72 oracle@serverxx (J020)  Streams AQ: waiting for messages in the queue
 1366       47     0          0         72 oracle@serverxx (J08V)  Streams AQ: waiting for messages in the queue
  355       47     0          0         72 oracle@serverxx (J04Y)  Streams AQ: waiting for messages in the queue
 1470       47     0          0         72 oracle@serverxx (J03Y)  Streams AQ: waiting for messages in the queue
 1202       47     0          0         72 oracle@serverxx (J00K)  Streams AQ: waiting for messages in the queue
  641       48     0          0         72 oracle@serverxx (J02K)  Streams AQ: waiting for messages in the queue
  573       48     0          0         72 oracle@serverxx (J02E)  Streams AQ: waiting for messages in the queue
   41       48     0          0         72 oracle@serverxx (J01F)  Streams AQ: waiting for messages in the queue
  487       48     0          0         72 oracle@serverxx (J03J)  Streams AQ: waiting for messages in the queue
 1407      108     0          0         72 oracle@serverxx (J046)  pipe get
 1448      108     0          0         72 oracle@serverxx (J012)  pipe get
 1291      108     0          0         72 oracle@serverxx (J06V)  pipe get
  864      108     0          0         72 oracle@serverxx (J08Z)  pipe get
  850      108     0          0         72 oracle@serverxx (J08P)  pipe get

  SID  MINUTES  BIT2 REDO_BYTES UNDO_BYTES PROGRAM                 EVENT
----- -------- ----- ---------- ---------- ----------------------- ------------------------------------------------
 1368      468     0          0         72 oracle@serverxx (J03R)  pipe get
 1457      468     0          0         72 oracle@serverxx (J071)  Streams AQ: waiting for messages in the queue
 1201      468     0          0         72 oracle@serverxx (J06G)  PL/SQL lock timer
 1136      468     0          0         72 oracle@serverxx (J036)  Streams AQ: waiting for messages in the queue
 1189      468     0          0         72 oracle@serverxx (J03B)  pipe get
 1174      468     0          0         72 oracle@serverxx (J03A)  pipe get
 1162      468     0          0         72 oracle@serverxx (J06D)  pipe get
 1145      468     0          0         72 oracle@serverxx (J038)  pipe get
  521      468     0          0         72 oracle@serverxx (J000)  pipe get

171 rows selected.

The transactions above look strange:

  • None of them has allocated any redo space.
  • They have all allocated exactly 72 bytes in the undo buffer.
  • They’ve been holding IMU buffers for hours.
  • The 2nd bit in ktcxbflg isn’t set.

ktcxbflg

It’s unusual to leave the 2nd bit of ktcxbflg unset. v$transaction, by the way, filters such transactions out:

BITAND (ktcxbflg, 2) != 0

Let’s compare the transactions above with a regular transaction:

insert into t values (1);

  SID  MINUTES       BIT2 REDO_BYTES UNDO_BYTES PROGRAM                 EVENT
----- -------- ---------- ---------- ---------- ----------------------- ------------------------------------------------
  257        0          1        572        576 sqlplus@svdbp01i (TNS V PGA memory operation
                                                1-V3)

The 2nd bit of ktcxbflg is set, and the transaction is using some redo space.

What has produced the transactions holding the IMU buffers?

Inter-process communication

Based on the wait events pipe get and Streams AQ: waiting for messages in the queue we can conclude that the sessions have been involved in inter-process communication. The errorstack for these wait events provides more details.

AQ

ALTER session SET EVENTS 'wait_event["Streams AQ: waiting for messages in the queue"] {occurence:end_after 100} errorstack()';
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
c1a5e6e0       750  package body SYS.DBMS_AQ.DEQUEUE
c31eb7a0      2859  package body K.MSG_EXTL#.GET_NEXT_IN_MSG
c33f9130       489  package body K.MSG_IN#.GET_NEXT_MSG
c6fdb388      1215  procedure K.BGP_MSG_IN
c6fdb388      1404  procedure K.BGP_MSG_IN
bfb9fa78         1  anonymous block

The PL/SQL process was waiting on an AQ.

We can reproduce this behavior with the test case from the Tim Hall’s tutorial:

DECLARE
  l_dequeue_options     DBMS_AQ.dequeue_options_t;
  l_message_properties  DBMS_AQ.message_properties_t;
  l_message_handle      RAW(16);
  l_event_msg           event_msg_type;
BEGIN
  DBMS_AQ.dequeue(queue_name          => 'event_queue',
                  dequeue_options     => l_dequeue_options,
                  message_properties  => l_message_properties,
                  payload             => l_event_msg,
                  msgid               => l_message_handle);
END;
/
  SID  MINUTES       BIT2 REDO_BYTES UNDO_BYTES PROGRAM                 EVENT
----- -------- ---------- ---------- ---------- ----------------------- ------------------------------------------------
  265        2          0          0         72 sqlplus@svdbp01i (TNS V Streams AQ: waiting for messages in the queue
                                                1-V3)

There were no incoming messages. Simply put, the dequeue procedure creates a transaction and allocates an IMU buffer. This transaction remains open during the whole waiting time.

Alerts

ALTER session SET EVENTS 'wait_event["pipe get"] {occurence:end_after 100} errorstack()'
c2e8a078       169  package body SYS.DBMS_PIPE.RECEIVE_MESSAGE
c42bdc00       165  package body SYS.DBMS_ALERT.PIPE_WAIT
c42bdc00       363  package body SYS.DBMS_ALERT.WAITONE

This process was waiting for an alert.

Also here, we can easily reproduce the behavior:

declare
  L_MSG                     VARCHAR2(240);
  l_status                  pls_integer ;
begin
  DBMS_ALERT.WAITONE('TEST2', L_MSG, L_STATUS, 1000);
  insert into t values(1);
end;
exec DBMS_ALERT.SIGNAL('TEST2', 'message_text2'); 
commit;
  SID  MINUTES       BIT2 REDO_BYTES UNDO_BYTES PROGRAM                 EVENT
----- -------- ---------- ---------- ---------- ----------------------- ------------------------------------------------
  265        1          0          0         72 sqlplus@svdbp01i (TNS V pipe get
                                                1-V3)

Similarly, the session holds the allocated IMU buffer while waiting for an alert.

Is it a bug or a feature?

In both cases, the session is waiting for an incoming message/alert. Even though it’s an idle wait, the process started a pseudo-transaction which allocated 72 bytes in the IMU buffer. It’s a non-obvious and undocumented behavior. Many such processes could use up the IMU pool. Consequently, “real” transactions wouldn’t be able to use private strands.

Avaloq

The banking application Avaloq is particularly at risk as it heavily relies on AQ and alerts. For example, our production database has more than 350 job queue processes waiting for a message.

We need enough IMU buffers to handle the peak load. The number of IMU buffers can’t be explicitly specified. It’s rather derived from the transactions parameter (#IMU buffers = 0.1 * transactions).

MTTR advisor

There is another undocumented quirk related to public redo buffers. A process sometimes counts dirty buffers on all working data sets while holding a redo copy latch to prepare the information for the MTTR advisor. As the number of working data sets equals the number of CPUs, larger servers suffer more. [3]

Our primary goal is to reduce public redo buffer writes. But you can disable the MTTR advisor to alleviate the redo copy latch contention, until increasing the transactions parameter.

ORA-00600: [krcput_overwrite]

We’ve been occasionally hitting ORA-00600: [krcput_overwrite]. The affected Avaloq test databases had the transactions parameter set so low, that processes weren’t using private strands.

The error has being thrown in the Oracle function krcput:

...
krcput()+2249        call     ksesin()             008C173D0 ? 000000004 ?
                                                   000000000 ? 00000003E ?
                                                   000000000 ? 00000003E ?
kcrfw_redo_gen_ext(  call     krcput()             091ED0FD8 ? 7FFFBFFF18F8 ?
)+22019                                            7FFFBFFF24C0 ? 000000000 ?
                                                   000000002 ? 00000003E ?
kcbchg1_main()+7114  call     kcrfw_redo_gen_ext(  000000002 ? 7FFFBFFF24C0 ?
                              )                    7FFFBFFF2480 ? 000000000 ?
                                                   7FFFBFFF2188 ? 7FFFBD869CD0 ?
kcbchg()+482         call     kcbchg1_main()       000000008 ? 000000002 ?
                                                   000000037 ? 7FFFBFFF2DB0 ?
                                                   7FFFBFFF2E00 ? 000000000 ?
ktuapundo()+497      call     kcbchg()             000000008 ? 000000002 ?
...

krcput writes changed block addresses into the block change tracking buffer [4], and its caller kcrfw_redo_gen_ext copies changes to the public redo buffer. [5]

According to the support note Bug 30295790 – DB Crashes with ORA-00600 [krcput_overwrite] (Doc ID 30295790.8) it’s a concurrency issue: “DBA buffers are updated by CTWR as well as multiple foreground process to read and write krcpe entries in it.”

In other words, the problem occurs during concurrent writes to the public redo log buffer. Private strands reduce these calls, and help avoid the bug.

Summary

Private strands reduce contention on the public redo buffer and improve scalability. A process that hasn’t allocated an IMU buffer can’t use private strands.

A process waiting for an AQ message or an alert allocates an IMU buffer and holds it while waiting. Many such processes could use up the IMU pool. That would prevent other sessions from using private strands, and cause performance issues. This problem is worse on servers with more CPUs, because the MTTR advisor code protected by redo copy latch takes more time to complete.

The banking application Avaloq needs a thorough consideration, because it configures many background job queue processes to wait for AQ messages and alerts. Set the transactions parameter high enough to ensure that private strands are being used during the peak load.

References

  • [1] Wei Huang, Oracle Redo Strand. December 18, 2009.
  • [2] Frits Hoogland, Redo a blogpost. November 15, 2016.
  • [3] Nenad Noveljic, Redo Copy Latch Contention 2: MTTR Advisor. June 16, 2020.
  • [4] Tanel Poder, Oracle-l: intermittent long “log file sync” waits. Februar 10, 2020.
  • [5] Frits Hoogland, A look into Oracle redo, part 1: redo allocation latches. January 29, 2019.

Updates

October 10, 2020

A correction: The unset (not the set) 13th bit in x$ktcxb.ktcxbflg indicates that the private strands are being used. Thanks to Kun Sun for pointing this out.

October 23, 2020

The fix for the bug 30295790 DB Crashes with ORA-00600 [krcput_overwrite] is included in 19.9 19.9.0.0.DBRU:201020 (OCT 2020).

Thanks for sharing

Nenad Noveljic

4 Comments

  1. p1= 7643370632 p1 text = handle address
    p2= 4096 p2 text = buffer length
    p3 = 60 p3 text = timeout

    p1= 4752598408 p1 text = handle address
    p2= 4096 p2 text = buffer length
    p3 = 60 p3 text = timeout

    p1= 5254179256 p1 text = handle address
    p2= 4096 p2 text = buffer length
    p3 = 60 p3 text = timeout

    p1=2061237600 p1 text = handle address
    p2= 4096 p2 text = buffer length
    p3 = 60 p3 text = timeout

    I have one scheduler job which always sleeps on pipe get in 19c. The patch level is 19.12. Once i kill the session and rerun the session the job succeeds. While monitoring i got the above values where p3 always is recorded as timeout.

    I am unable to trace where the session is getting timed out. Could you please throw some light on it.

    • You haven’t mentioned the wait event name. It’s needed to get more information about the blocker.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.