parallel recovery push change

VKTM

A summary of my previous blog post: VKTM background process hung in nanosleep system call due to an OS bug, and stopped increasing the SGA variable ksudbrmseccnt_. Consequently, v$transaction.start_time is wrong, and AWR snapshots aren’t created anymore.

Parallel recovery

Parallel recovery on the standby database is another process which regularly calls nanosleep.

I observed a parallel recovery process waiting on the parallel recovery push change wait event 100% of its time. This wait event isn’t documented, so I inspected it with DTrace.

I’ve got the event code from the data dictionary:

select event# from v$event_name where name = 'parallel recovery push change' ;

    EVENT#
----------
      1021

and embedded it into the following DTrace script:

#!/usr/sbin/dtrace

#pragma D option quiet

pid$target::kskthbwt:entry
/ arg1 == 1021 /
{
  self->event = arg1 ;
  printf ("BEGIN wait event parallel recovery push change\n");
  ustack();
}

pid$target::kslwtectx:entry
/ self->event == 1021 /
{
  printf ("END wait event parallel recovery push change\n");
  self->event = 0 ;
}

pid$target::nanosleep:entry
/ self->event == 1021 /
{
  printf("nanosleep \n");
}

A short explanation of the script above: kskthbwt and kslwtectx are part of the Oracle wait interface API. They mark the beginning and the end of the wait event, respectively. kskthbwt receives the wait event code through the second argument (arg1).

The output:

BEGIN wait event parallel recovery push change

oracle`kskthbwt
oracle`kslwtbctx+0x30e
a.out`kcrm_logmgr_idle_action+0x257
a.out`kcrm_wait_for_reply+0x55
a.out`krr_do_media_recovery+0xd1f5
a.out`krddmr+0xc7b
a.out`krd_do_mrp+0x855
a.out`krd_logmerger_driver+0x23a5
a.out`krp_slave_main+0xbc0
a.out`ksvrdp_int+0x94b
a.out`opirip+0x263
oracle`opidrv+0x24b
oracle`sou2o+0x97
oracle`opimai_real+0x9f
oracle`ssthrdmain+0x266
oracle`main+0xa7
a.out`0xb4c237b

nanosleep

END wait event parallel recovery push change

nanosleep is part of the code instrumented by this wait event. The functions on the stack kcrm_logmgr_idle_action and kcrm_wait_for_reply, and a nanosleep call might indicate that this is an idle wait event. Inspite of that, ASH records it.

Debugging

You can simulate a hanging nanosleep call with the debugger.

First, set the conditional breakpoint on kskthbwt:

b kskthbwt if $rsi == 1021

Second, attach the parallel recovery process and continue:

ps -ef | grep pr00
  oracle 27582 26345   0 13:38:25 ?          15:45 ora_pr00_DB
attach 27582
c

Third, after hitting the conditional breakpoint, set a breakpoint on nanosleep and continue:

Thread 42 hit Breakpoint 1, 0x000000000a09c140 in kskthbwt ()
b nanosleep
c

Thread 42 hit Breakpoint 2, 0x00007fff63e5e3d4 in nanosleep () from /lib/64/libc.so.1

While waiting on the breakpoint, check the active session history:

select event,count(*) from v$active_session_history 
  where 
    program like '%PR00%' and sample_time > sysdate - 1 / 24 / 60 
  group by event order by count(*) desc ;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
parallel recovery push change                                            60

The parallel recovery process has spent all of its time waiting on parallel recovery push change.

It returns to the normal operational mode after detaching the process:

detach
select event,count(*) from v$active_session_history 
  where program like '%PR00%' and sample_time > sysdate - 1 / 24 / 60 
  group by event order by count(*) desc ;

EVENT                                                              COUNT(*)
---------------------------------------------------------------- ----------
parallel recovery push change                                            39
                                                                          5
control file sequential read                                              3
log file sequential read                                                  2
control file parallel write                                               1

Some parallel recovery push change waits are normal. But completely lacking any other activity might be an indication that the process is stuck in nanosleep.

Summary

In summary, a parallel recovery process would grind to a halt when a nanosleep system call hung. Parallel recovery process waiting on parallel recovery push change 100% of its time is an indication of a such problem.

Thanks for sharing

Nenad Noveljic

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.