Our site suffered the crash of an Avaloq test database. The following information was found in the alert log:
PMON (ospid: 1112): terminating the instance due to error 471 TIMESTAMP System state dump requested by (instance=1, osid=1112 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u00/oracle/orabase/diag/rdbms/XXX/trace/XXX_diag_1221_20151231063902.trc Dumping diagnostic data in directory=[cdmp_20151231063902], requested by (instance=1, osid=1112 (PMON)), summary=[abnormal instance termination]. Instance terminated by PMON, pid = 1112
I looked into the pmon trace XXX_pmon_1112.trc and found out that the PMON terminated the instance because the database writer process DBW2 was not running any more:
Background process DBW2 found dead Oracle pid = 12 OS pid (from detached process) = 1234 OS pid (from process state) = 1234
Note here the OS pid of the DBW process – 1234. This is an important clue for later.
There weren’t any trace files left by the process 1234, so I assumed that the sudden death was due to some external impact. To find out the cause I used the kill.d dtrace script by Gregg Brendan. The culprit was indeed revealed next day when I harvested the kill.d results :
... 26804 aaafile.sh 9 1234 -1 27913 aaafile.sh 9 1234 -1 28988 aaafile.sh 9 1234 -1 28509 aaafile.sh 9 1234 -1 27393 aaafile.sh 9 1234 -1 29964 aaafile.sh 9 1234 -1 ...
The avaloq program aaafile.sh was killing the 1234 process. In this case, the victim happened to be a database writer.
But why would Avaloq do this?
To answer this question I correlated the information in alert log with the Avaloq aaafile log files.
Thu Jan 07 15:19:58 2016 ALTER SYSTEM enable restricted session; Immediate Kill Session#: 811, Serial#: 3 Immediate Kill Session: sess: 316711c00 OS pid: 12280 Immediate Kill Session#: 1084, Serial#: 77 Immediate Kill Session: sess: 314742c68 OS pid: 11927 Immediate Kill Session#: 1142, Serial#: 1179 Immediate Kill Session: sess: 3167a1f68 OS pid: 12020 Thu Jan 07 15:20:01 2016 opiodr aborting process unknown ospid (12068) as a result of ORA-28 Thu Jan 07 15:20:22 2016 opidrv aborting process J099 ospid (28486) as a result of ORA-28 Thu Jan 07 15:23:30 2016 Completed checkpoint up to RBA [0xd.2.10], SCN: 85281115086
aaafile log: svavr02a:2016-01-07_15-20:"":INFO: PARAMS: CMD = kill, FILE = 1234 ... svavr02a:2016-01-07_15-20:"":INFO: Processing Command: kill process (1234) kill: 1234: no such process ... svavr02a:2016-01-07_15-20:"":INFO: PARAMS: CMD = kill, FILE = 12112 ... svavr02a:2016-01-07_15-20:"":INFO: Processing Command: kill process (12112) ... svavr02a:2016-01-07_15-20:"":INFO: PARAMS: CMD = kill, FILE = 1234 ... svavr02a:2016-01-07_15-20:"":INFO: Processing Command: kill process (1234) kill: 1234: no such process ... svavr02a:2016-01-07_15-20:"":INFO: PARAMS: CMD = kill, FILE = 28486 ... svavr02a:2016-01-07_15-20:"":INFO: Processing Command: kill process (28486)
From the information above the following can be deduced:
– the session and os kills were caused by setting the session level 0 (session level 0 is the application shutdown in Avaloq)
– every database session kill was followed by the kill -9 of the corresponding shadow process
– there were a lot of kill -9 1234 for no obvious reason
To find out where the kill -9 1234 was coming from, I traced the session which sets the session level to zero and found the following query on the v$process view:
SELECT P.SPID ,S.PROCESS ,S.MACHINE ,S.PROGRAM FROM V$SESSION S ,V$PROCESS P WHERE S.SID = :b0 AND S.SERIAL# = :b1 AND P.ADDR (+) = S.PADDR;
This query was executed after almost every alter system disconnect session immediate.
Therefore, I concluded that Avaloq is quering the v$process.spid to find out the OS PID. However, as kill oracle session and kill -9 ar not atomic operations it can happen that the shadow process already disappears when by the time the query gets executed returning NULL for SPID.
Since Avaloq selects v$session.process in the same query I assumed that this value is used as a fallback if SPID is NULL. In the Oracle Documentation the column process is described as: “Operating system client process ID”. However, when I selected process from v$session I was suprised to see that for all the JDBC processes the value was set to 1234. The Metalink note “How to Set V$SESSION Properties Using the JDBC Thin Driver (Doc ID 147413.1)” explains that this is the expected behaviour. Hence, Avaloq will kill any process with this PID, if there are JDBC sessions coming from the database server itself. Avaloq doesn’t try to kill all the sessions, but just the ones from the users which are registered in the table SEC_USER. In our case it was the messaging agent which was logged in as user I. In particular, below is the query found in the trace files which selects the sessions for killing:
SELECT SID SID ,SERIAL# SERIAL FROM GV$SESSION WHERE AUDSID > 0 AND AUDSID != USERENV('sessionid') AND TYPE = 'USER' AND ',' || :b0 || ',' NOT LIKE '%,' || USERNAME || ',%' AND USERNAME IN ( SELECT ORACLE_USER FROM SEC_USER WHERE ORACLE_USER NOT IN ('SYS', 'K', 'K_READ') ) AND NVL(CLIENT_INFO,' ') NOT LIKE 'BGP_OS.%' AND PROGRAM NOT LIKE 'amqzlaa%';
As a workaround, make sure that you close the JDBC connections of the users in K.SEC_USER – except for K, K_READ and SYS – coming from the database server prior to setting session level to zero.
Update 02/02/2016: Avaloq released the patch #1695233 in the release 184.108.40.206.0. The patch recreates the procedure k.kill_session. The patch is successfully tested in my envoironment.