VKTM
A summary of my previous blog post: VKTM background process hung in nanosleep system call due to an OS bug, and stopped increasing the SGA variable ksudbrmseccnt_. Consequently, v$transaction.start_time is wrong, and AWR snapshots aren’t created anymore.
Parallel recovery
Parallel recovery on the standby database is another process which regularly calls nanosleep.
I observed a parallel recovery process waiting on the parallel recovery push change wait event 100% of its time. This wait event isn’t documented, so I inspected it with DTrace.
I’ve got the event code from the data dictionary:
select event# from v$event_name where name = 'parallel recovery push change' ;
EVENT#
----------
1021
and embedded it into the following DTrace script:
#!/usr/sbin/dtrace
#pragma D option quiet
pid$target::kskthbwt:entry
/ arg1 == 1021 /
{
self->event = arg1 ;
printf ("BEGIN wait event parallel recovery push change\n");
ustack();
}
pid$target::kslwtectx:entry
/ self->event == 1021 /
{
printf ("END wait event parallel recovery push change\n");
self->event = 0 ;
}
pid$target::nanosleep:entry
/ self->event == 1021 /
{
printf("nanosleep \n");
}
A short explanation of the script above: kskthbwt and kslwtectx are part of the Oracle wait interface API. They mark the beginning and the end of the wait event, respectively. kskthbwt receives the wait event code through the second argument (arg1).
The output:
BEGIN wait event parallel recovery push change
oracle`kskthbwt
oracle`kslwtbctx+0x30e
a.out`kcrm_logmgr_idle_action+0x257
a.out`kcrm_wait_for_reply+0x55
a.out`krr_do_media_recovery+0xd1f5
a.out`krddmr+0xc7b
a.out`krd_do_mrp+0x855
a.out`krd_logmerger_driver+0x23a5
a.out`krp_slave_main+0xbc0
a.out`ksvrdp_int+0x94b
a.out`opirip+0x263
oracle`opidrv+0x24b
oracle`sou2o+0x97
oracle`opimai_real+0x9f
oracle`ssthrdmain+0x266
oracle`main+0xa7
a.out`0xb4c237b
nanosleep
END wait event parallel recovery push change
nanosleep is part of the code instrumented by this wait event. The functions on the stack kcrm_logmgr_idle_action and kcrm_wait_for_reply, and a nanosleep call might indicate that this is an idle wait event. Inspite of that, ASH records it.
Debugging
You can simulate a hanging nanosleep call with the debugger.
First, set the conditional breakpoint on kskthbwt:
b kskthbwt if $rsi == 1021
Second, attach the parallel recovery process and continue:
ps -ef | grep pr00
oracle 27582 26345 0 13:38:25 ? 15:45 ora_pr00_DB
attach 27582
c
Third, after hitting the conditional breakpoint, set a breakpoint on nanosleep and continue:
Thread 42 hit Breakpoint 1, 0x000000000a09c140 in kskthbwt ()
b nanosleep
c
Thread 42 hit Breakpoint 2, 0x00007fff63e5e3d4 in nanosleep () from /lib/64/libc.so.1
While waiting on the breakpoint, check the active session history:
select event,count(*) from v$active_session_history
where
program like '%PR00%' and sample_time > sysdate - 1 / 24 / 60
group by event order by count(*) desc ;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
parallel recovery push change 60
The parallel recovery process has spent all of its time waiting on parallel recovery push change.
It returns to the normal operational mode after detaching the process:
detach
select event,count(*) from v$active_session_history
where program like '%PR00%' and sample_time > sysdate - 1 / 24 / 60
group by event order by count(*) desc ;
EVENT COUNT(*)
---------------------------------------------------------------- ----------
parallel recovery push change 39
5
control file sequential read 3
log file sequential read 2
control file parallel write 1
Some parallel recovery push change waits are normal. But completely lacking any other activity might be an indication that the process is stuck in nanosleep.
Summary
In summary, a parallel recovery process would grind to a halt when a nanosleep system call hung. Parallel recovery process waiting on parallel recovery push change 100% of its time is an indication of a such problem.