Avaloq Database Crash

Our site suffered the crash of an Avaloq test database. The following information was found in the alert log:

PMON (ospid: 1112): terminating the instance due to error 471
TIMESTAMP
System state dump requested by (instance=1, osid=1112 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u00/oracle/orabase/diag/rdbms/XXX/trace/XXX_diag_1221_20151231063902.trc
Dumping diagnostic data in directory=[cdmp_20151231063902], requested by (instance=1, osid=1112 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 1112

I looked into the pmon trace XXX_pmon_1112.trc and found out that the PMON terminated the instance because the database writer process DBW2 was not running any more:

Background process DBW2 found dead
Oracle pid = 12
OS pid (from detached process) = 1234
OS pid (from process state) = 1234

Note here the OS pid of the DBW process – 1234. This is an important clue for later.

There weren’t any trace files left by the process 1234, so I assumed that the sudden death was due to some external impact. To find out the cause I used the kill.d dtrace script by Gregg Brendan. The culprit was indeed revealed next day when I harvested the kill.d results :

...
26804   aaafile.sh     9 1234   -1
27913   aaafile.sh     9 1234   -1
28988   aaafile.sh     9 1234   -1
28509   aaafile.sh     9 1234   -1
27393   aaafile.sh     9 1234   -1
29964   aaafile.sh     9 1234   -1
...

The avaloq program aaafile.sh was killing the 1234 process. In this case, the victim happened to be a database writer.

But why would Avaloq do this?

To answer this question I correlated the information in alert log with the Avaloq aaafile log files.

Thu Jan 07 15:19:58 2016
ALTER SYSTEM enable restricted session;
Immediate Kill Session#: 811, Serial#: 3
Immediate Kill Session: sess: 316711c00  OS pid: 12280
Immediate Kill Session#: 1084, Serial#: 77
Immediate Kill Session: sess: 314742c68  OS pid: 11927
Immediate Kill Session#: 1142, Serial#: 1179
Immediate Kill Session: sess: 3167a1f68  OS pid: 12020
Thu Jan 07 15:20:01 2016
opiodr aborting process unknown ospid (12068) as a result of ORA-28
Thu Jan 07 15:20:22 2016
opidrv aborting process J099 ospid (28486) as a result of ORA-28
Thu Jan 07 15:23:30 2016
Completed checkpoint up to RBA [0xd.2.10], SCN: 85281115086
aaafile log:
svavr02a:2016-01-07_15-20:"":INFO: PARAMS: CMD = kill, FILE = 1234
...
svavr02a:2016-01-07_15-20:"":INFO: Processing Command: kill process (1234)
kill: 1234: no such process
...
svavr02a:2016-01-07_15-20:"":INFO: PARAMS: CMD = kill, FILE = 12112
...
svavr02a:2016-01-07_15-20:"":INFO: Processing Command: kill process (12112)
...
svavr02a:2016-01-07_15-20:"":INFO: PARAMS: CMD = kill, FILE = 1234
...
svavr02a:2016-01-07_15-20:"":INFO: Processing Command: kill process (1234)
kill: 1234: no such process
...
svavr02a:2016-01-07_15-20:"":INFO: PARAMS: CMD = kill, FILE = 28486
...
svavr02a:2016-01-07_15-20:"":INFO: Processing Command: kill process (28486)

From the information above the following can be deduced:
– the session and os kills were caused by setting the session level 0 (session level 0 is the application shutdown in Avaloq)
– every database session kill was followed by the kill -9 of the corresponding shadow process
– there were a lot of kill -9 1234 for no obvious reason

To find out where the kill -9 1234 was coming from, I traced the session which sets the session level to zero and found the following query on the v$process view:

      SELECT P.SPID
            ,S.PROCESS
            ,S.MACHINE
            ,S.PROGRAM
      FROM   V$SESSION  S
            ,V$PROCESS  P
      WHERE  S.SID      = :b0
        AND  S.SERIAL#  = :b1
        AND  P.ADDR (+) = S.PADDR;

This query was executed after almost every alter system disconnect session immediate.

Therefore, I concluded that Avaloq is quering the v$process.spid to find out the OS PID. However, as kill oracle session and kill -9 ar not atomic operations it can happen that the shadow process already disappears when by the time the query gets executed returning NULL for SPID.

Since Avaloq selects v$session.process in the same query I assumed that this value is used as a fallback if SPID is NULL. In the Oracle Documentation the column process is described as: “Operating system client process ID”. However, when I selected process from v$session I was suprised to see that for all the JDBC processes the value was set to 1234. The Metalink note “How to Set V$SESSION Properties Using the JDBC Thin Driver (Doc ID 147413.1)” explains that this is the expected behaviour. Hence, Avaloq will kill any process with this PID, if there are JDBC sessions coming from the database server itself. Avaloq doesn’t try to kill all the sessions, but just the ones from the users which are registered in the table SEC_USER. In our case it was the messaging agent which was logged in as user I. In particular, below is the query found in the trace files which selects the sessions for killing:

SELECT SID     SID
,SERIAL# SERIAL
FROM   GV$SESSION
WHERE  AUDSID   > 0
AND  AUDSID  != USERENV('sessionid')
AND  TYPE     = 'USER'
AND ',' || :b0 || ',' NOT LIKE '%,' || USERNAME || ',%'
AND  USERNAME IN (
SELECT ORACLE_USER
FROM   SEC_USER
WHERE  ORACLE_USER NOT IN ('SYS', 'K', 'K_READ')
)
AND  NVL(CLIENT_INFO,' ') NOT LIKE 'BGP_OS.%'
AND  PROGRAM  NOT LIKE 'amqzlaa%';

As a workaround, make sure that you close the JDBC connections of the users in K.SEC_USER – except for K, K_READ and SYS – coming from the database server prior to setting session level to zero.

Update 02/02/2016: Avaloq released the patch #1695233 in the release 3.9.1.5.0. The patch recreates the procedure k.kill_session. The patch is successfully tested in my envoironment.

Thanks for sharing

Nenad Noveljic

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.