Data loss
The application started getting ORA-00600 errors on a recently plugged PDB. As it turned out, it wasn't an ordinary ORA-00600: this test database couldn't be rescued and had to be refreshed from production.
The problem was insidious because the plug operation itself didn't return any errors. ORA-00600 started appearing only after the application had been running for a while and tried to query the missing data.
In this article, I'll describe a scenario of irreversible data loss after plugging a PDB. What's worse, the data loss can remain unnoticed for a while. Once the application starts failing, it isn't obvious what's happening, because the ORA-00600 messages are cryptic and say nothing about the root cause. Keep reading if you're tasked with moving PDBs around; this information can save you a lot of trouble.
ORA-00600 [25027]
We were seeing the following error:
ORA-00600: internal error code, arguments: [25027], [3], [3], [6], [2667591712], [0], [1], [1], [], [], [], []
In particular, the error appeared only when selecting from certain application tables. The problem was reproducible at will.
The error was raised on a consistent read (the kcbgtcr function is on the stack):
...
krtd2abh_pdb_cursor call dbgeEndDDEInvocatio 7F9320B59718 7F931BB45FA8
()+1780 nImpl() 7F9320B999C0 7FFD77A76568 ?
7F9320B999C0 000000082 ?
kcbgtcr()+11012 call krtd2abh_pdb_cursor 7F9320B59718 ? 7F931BB45FA8 ?
() 09F003820 ? 7FFD77A76568 ?
7F9320B999C0 000000006
ktrget2()+877 call kcbgtcr() 7F931B405DE8 000000000
0000005C4 000000000
7F9320B999C0 ? 000000006 ?
kdirfrs()+1705 call ktrget2() 7F931B405DD8 ? 7F931B40A578 ?
0000005C4 000000000
7F9320B999C0 ? 000000006 ?
qerixFetchFastFullS call kdirfrs() 7F931B405DD8 ? 7F931B40A578 ?
can()+815 0000005C4 ? 000000000 ?
7F9320B999C0 ? 000000006 ?
qerghFetch()+603 call qerixFetchFastFullS 7F931B405D01 ? 000000000 ?
can() 012D64FE0 7FFD77A815E0
000000000 ? 000000006 ?
qerwnFetch()+16689 call qerghFetch() 0A31DA108 7F931B404D98
003FACBA0 7FFD77A81800
000007FFF 0A31DA0C8
qervwFetch()+166 call qerwnFetch() 000000000 7F931B44AEE8
004060930 7FFD77A819C0
000007FFF 0A31DA0C8 ?
qerwnFetch()+16689 call qervwFetch() 0A31D8A00 7F931B44B7E8
003FACBA0 7FFD77A81AC0
000007FFF 0A31DA0C8 ?
opifch2()+3211 call qerwnFetch() 000000000 7F931B456E08
012ADE300 7FFD77A82080
000000001 0A31DA0C8 ?
kpoal8()+4185 call opifch2() 000000000 ? 7F931B456E08 ?
7FFD77A82440 7FFD77A82080 ?
000000001 ? 0A31DA0C8 ?
...
There are lots of MOS notes on ORA-00600 [25027], mainly for older Oracle releases. None of the published ORA-00600 [25027] bugs seemed relevant to our case. 25027 is probably a value that developers hardcoded to roughly identify the code location that raised the error. So I had to focus on the other arguments to identify the root cause. The second and the third arguments, [3] and [3], are the container id, and [6] is the tablespace id. That alone isn't enough to narrow down the problem.
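Those two ids are easy to verify. A minimal sanity check, assuming the plugged PDB is still open (con_id 3 and ts# 6 in our case):
-- in the root: which container has con_id 3?
select con_id, name from v$containers where con_id = 3 ;
-- in the affected PDB: which tablespace has ts# 6?
select ts#, name from v$tablespace where ts# = 6 ;
That only confirms where the error comes from, not why.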
Therefore, I focused on deciphering 2667591712, the fifth argument, which was different for each failed query.
Consistent read buffer status trace
In the course of troubleshooting I discovered the undocumented “consistent read buffer status” trace event, which provided the key information for interpreting the fifth argument:
ALTER SESSION SET EVENTS='10200 trace name context forever, level 2' ;
oerr ora 10200
10200, 00000, "consistent read buffer status"
// *Cause:
// *Action:
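The event can be switched off again with the same syntax and the level off:
ALTER SESSION SET EVENTS='10200 trace name context off' ;
With the event enabled, the session trace file recorded the consistent reads leading up to the failure: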
env [0x7f05457e6f60]: (scn: 0x00000893d2c725cf xid: 0x0000.000.00000000 uba: 0x00000000.0000.00 statement num=0 parent xid: 0x0000.000.00000000 st-scn: 0x0000000000000000 hi-scn: 0x0000000000000000 ma-scn: 0x00000893d2c725c2 flg: 0x00000660)
ktrexf(): returning 9 on: 0x7f054ae96588 cr-scn: 0xffffffffffffffff xid: 0x0000.000.00000000 uba: 0x00000000.0000.00 cl-scn: 0xffffffffffffffff sfl: 0
ktrgcm(): completed for block <0x0006 : 0x01823ed5> objd: 0x00045bee
ktrget2(): completed for block <0x0006 : 0x01823ed5> objd: 0x00045bee
ktrget2(): started for block <0x0006 : 0x01823ed5> objd: 0x00045bee
env [0x7f05457e6f60]: (scn: 0x00000893d2c725cf xid: 0x0000.000.00000000 uba: 0x00000000.0000.00 statement num=0 parent xid: 0x0000.000.00000000 st-scn: 0x0000000000000000 hi-scn: 0x0000000000000000 ma-scn: 0x00000893d2c725c2 flg: 0x00000660)
ktrexf(): returning 9 on: 0x7f054ae96588 cr-scn: 0xffffffffffffffff xid: 0x0000.000.00000000 uba: 0x00000000.0000.00 cl-scn: 0xffffffffffffffff sfl: 0
ktrgcm(): completed for block <0x0006 : 0x01823ed5> objd: 0x00045bee
ktrget2(): completed for block <0x0006 : 0x01823ed5> objd: 0x00045bee
ktrget2(): started for block <0x0006 : 0x9f003820> objd: 0x00045bee
env [0x7f05457e6f60]: (scn: 0x00000893d2c725cf xid: 0x0000.000.00000000 uba: 0x00000000.0000.00 statement num=0 parent xid: 0x0000.000.00000000 st-scn: 0x0000000000000000 hi-scn: 0x0000000000000000 ma-scn: 0x00000893d2c725c2 flg: 0x00000660)
ORA-00600: internal error code, arguments: [25027], [3], [3], [6], [2667591712], [0], [1], [1], [], [], [], []
The last accessed block prior to ORA-00600 was the block with the address 0x9f003820. This block address indeed matches the fifth argument in the ORA-00600 error message:
(gdb) p/x 2667591712
$1 = 0x9f003820
A block address consists of a file_id and a block_id (p/x and p/u are gdb print commands that display a value in hexadecimal and as an unsigned integer, respectively).
The block_id is stored in the 22 least significant bits:
p/u 0x9f003820 & 0b1111111111111111111111
$15983 = 14368
The file_id is stored in the 10 most significant bits:
p/u 0x9f003820 >> 22
$15982 = 636
So the block_id 14368 in the file_id 636 is causing the problem.
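If you prefer to stay inside the database, the decomposition can also be done with dbms_utility, which understands data block addresses. A quick cross-check with the fifth ORA-00600 argument:
select dbms_utility.data_block_address_file(2667591712) file_id,
       dbms_utility.data_block_address_block(2667591712) block_id
from dual ;
This should return file 636 and block 14368 for this address.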
Missing file
But the file 636 doesn’t exist:
select 1 from sys.file$ where relfile# = 636 ;
no rows selected
However, x$ktfbue, the extent table, has references to this non-existent file:
select count(*) from x$ktfbue
where KTFBUESEGTSN = 6 and KTFBUEFNO = 636 ;
COUNT(*)
----------
13
There’s also an entry for the requested block in the missing file:
SQL> select count(*) from x$ktfbue
where KTFBUESEGTSN = 6 and KTFBUEFNO = 636
and 14368 between KTFBUEBNO and KTFBUEBNO + KTFBUEBLKS -1 ;
COUNT(*)
----------
1
The following query lists all the missing files:
select KTFBUESEGTSN ts#, ktfbuefno file# from sys.x$ktfbue
minus
select ts#,relfile# from sys.file$ ;
TS# FILE#
---------- ----------
6 636
The alert log provides a clue about the missing file with id 636:
Create pluggable database PDB1 using '/u00/oracle/orabase/admin/PDB1C/log/PDB1_2022-02-09_15_17_28_recover_final.xml' NOCOPY TEMPFILE REUSE
...
PDB1(4):Deleting old file#523 from file$
PDB1(4):Deleting old file#636 from file$
...
PDB1(4):Adding new file#673 to file$(old file#523). fopr-0, newblks-720896, oldblks-32768
PDB1(4):Successfully created internal service PDB1 at open
According to the alert log, the plug operation removed the datafiles from file$ and added them again under new file_ids. But there wasn't any entry showing that file 636 was added. We checked the directory with the datafiles and indeed saw that the file wasn't there. Now we understood what happened. We didn't have a manifest when plugging the database, so we created one based on the directory with the datafiles instead. That manifest was incomplete because the file was missing. Unfortunately, Oracle didn't raise any error during the plug operation, so the problem remained unnoticed until the application started to retrieve the missing data.
Demo
I created the tablespace DATA with two small datafiles and the table T, which spans both datafiles:
create tablespace data datafile size 5M autoextend off ;
alter tablespace data add datafile size 5M autoextend off ;
create user u no authentication default tablespace data ;
alter user u quota unlimited on data ;
grant create table to u ;
create table u.t (c varchar2(4000));
insert into u.t select lpad('A',4000,'A') from dual connect by level <= 1000 ;
commit;
select relative_fno,file_id,count(*) from dba_extents where segment_name='T'
group by relative_fno,file_id ;
RELATIVE_FNO FILE_ID COUNT(*)
------------ ---------- ----------
15 15 19
16 16 4
select file_id, relative_fno, file_name from dba_data_files
where tablespace_name = 'DATA' ;
FILE_ID RELATIVE_FNO
---------- ------------
FILE_NAME
--------------------------------------------------------------------------------
15 15
/u00/oracle/oradata01/DBER20C/PDB12/DBER20C/D822EC6D95DC3533E0531FE1900A73AD/da
tafile/o1_mf_data_k0sxfg1w_.dbf
16 16
/u00/oracle/oradata01/DBER20C/PDB12/DBER20C/D822EC6D95DC3533E0531FE1900A73AD/da
tafile/o1_mf_data_k0sxfg9f_.dbf
select ts# from file$ where file# in (15,16);
TS#
----------
8
8
Then, I unplugged the PDB, deleted a file and plugged the database again:
connect / as sysdba
alter pluggable database PDB12 close ;
ALTER PLUGGABLE DATABASE PDB12 UNPLUG INTO '/tmp/PDB12.xml' ;
DROP PLUGGABLE DATABASE PDB12 KEEP DATAFILES;
!rm /u00/oracle/oradata01/DBER20C/PDB12/DBER20C/D822EC6D95DC3533E0531FE1900A73AD/datafile/o1_mf_data_k0sxfg9f_.dbf
!rm /tmp/PDB12.xml
begin
DBMS_PDB.RECOVER (
pdb_descr_file => '/tmp/PDB12.xml',
pdb_name => 'PDB12',
filenames => '/u00/oracle/oradata01/DBER20C/PDB12/DBER20C/D822EC6D95DC3533E0531FE1900A73AD/datafile/'
);
end ;
/
Neither “create pluggable database” nor “database open” raised any errors:
create pluggable database PDB12 using '/tmp/PDB12.xml'
NOCOPY
TEMPFILE REUSE;
alter pluggable database PDB12 open ;
alter pluggable database PDB12 close ;
alter pluggable database PDB12 open ;
alter session set container=PDB12 ;
As expected, the deleted file 16 isn't there:
select file_id, relative_fno from dba_data_files
where tablespace_name = 'DATA' ;
FILE_ID RELATIVE_FNO
---------- ------------
21 15
Alert log
The alert log confirms that the file wasn’t added:
PDB12(4):Buffer Cache flush finished: 4
Completed: create pluggable database PDB12 using '/tmp/PDB12.xml'
NOCOPY
TEMPFILE REUSE
2022-02-16T14:09:56.495509+01:00
alter pluggable database PDB12 open
2022-02-16T14:09:56.498522+01:00
PDB12(4):Pluggable database PDB12 opening in read write
PDB12(4):SUPLOG: Initialize PDB SUPLOG SGA, old value 0x0, new value 0x18
PDB12(4):Autotune of undo retention is turned on.
PDB12(4):Endian type of dictionary set to little
PDB12(4):Undo initialization recovery: Parallel FPTR complete: start:2258534527 end:2258534529 diff:2 ms (0.0 seconds)
PDB12(4):Undo initialization recovery: err:0 start: 2258534527 end: 2258534529 diff: 2 ms (0.0 seconds)
PDB12(4):[1556525] Successfully onlined Undo Tablespace 2.
PDB12(4):Undo initialization online undo segments: err:0 start: 2258534529 end: 2258534562 diff: 33 ms (0.0 seconds)
PDB12(4):Undo initialization finished serial:0 start:2258534527 end:2258534563 diff:36 ms (0.0 seconds)
PDB12(4):Deleting old file#10 from file$
PDB12(4):Deleting old file#11 from file$
PDB12(4):Deleting old file#12 from file$
PDB12(4):Deleting old file#13 from file$
PDB12(4):Deleting old file#14 from file$
PDB12(4):Deleting old file#15 from file$
PDB12(4):Deleting old file#16 from file$
PDB12(4):Adding new file#20 to file$(old file#10). fopr-0, newblks-60800, oldblks-16000
PDB12(4):Adding new file#18 to file$(old file#11). fopr-0, newblks-25600, oldblks-12800
PDB12(4):Adding new file#19 to file$(old file#12). fopr-0, newblks-26880, oldblks-1280
PDB12(4):Adding new file#17 to file$(old file#13). fopr-0, newblks-12800, oldblks-12800
PDB12(4):Adding new file#22 to file$(old file#14). fopr-0, newblks-12800, oldblks-12800
PDB12(4):Adding new file#21 to file$(old file#15). fopr-0, newblks-640, oldblks-640
PDB12(4):Marking tablespace #7 invalid since it is not present in the describe file
PDB12(4):Successfully created internal service PDB12 at open
****************************************************************
The alert log lists all deleted and added files. You can compare both lists to identify missing files.
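For example, a rough comparison can be scripted against the alert log. This is only a sketch: it assumes GNU grep, that alert.log stands for your actual alert log file, and that the plug messages look exactly like the ones above:
grep -o 'Deleting old file#[0-9]*' alert.log | grep -oE '[0-9]+' | sort > deleted.txt
grep -o 'old file#[0-9]*)' alert.log | grep -oE '[0-9]+' | sort > added.txt
comm -23 deleted.txt added.txt
The last command prints the file ids that were deleted from file$ but never re-added.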
Extracting file_id and block_id
We reproduced the error:
SQL> select * from u.t ;
select * from u.t
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [25027], [4], [4], [8], [67108992],
[0], [1], [1], [], [], [], []
The fifth ORA-00600 argument is again the block address, and we already know how to extract the file_id and block_id from it:
p/u 67108992 >> 22
16
p/u 67108992 & 0b1111111111111111111111
128
Our block is registered in x$ktfbue:
select count(*) from x$ktfbue
where KTFBUESEGTSN = 8 and KTFBUEFNO = 16
and 128 between KTFBUEBNO and KTFBUEBNO + KTFBUEBLKS -1 ;
COUNT(*)
----------
1
Diagnostic queries
The following query lists the missing datafiles:
select KTFBUESEGTSN ts#, ktfbuefno file# from sys.x$ktfbue
minus
select ts#,relfile# from sys.file$ ;
TS# FILE#
---------- ----------
8 16
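If you also want to see the tablespace name, sys.ts$ can be joined in. This is just a variation of the same query:
select distinct t.name tablespace_name, e.ktfbuefno file#
from sys.x$ktfbue e, sys.ts$ t
where e.ktfbuesegtsn = t.ts#
and (e.ktfbuesegtsn, e.ktfbuefno) not in (select ts#, relfile# from sys.file$) ;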
And the following query lists the affected segments:
SELECT distinct ds.owner, ds.segment_name
FROM sys.sys_dba_segs ds, sys.x$ktfbue e
WHERE e.ktfbuesegfno = ds.relative_fno
AND e.ktfbuesegbno = ds.header_block
AND e.ktfbuesegtsn = ds.tablespace_id
AND BITAND (NVL (ds.segment_flags, 0), 1) = 1
AND BITAND (NVL (ds.segment_flags, 0), 65536) = 0
and e.ktfbuefno not in (select relfile# from sys.file$) ;
OWNER
--------------------------------------------------------------------------------
SEGMENT_NAME
--------------------------------------------------------------------------------
U
T
Internals
Last but not least, I'll show how to track consistent reads with a debugger.
Let's get the object_id first:
select object_id, to_char(object_id,'xxxxxxxxx') hex from dba_objects where object_name = 'T'
2 ;
OBJECT_ID HEX
---------- ----------
23968 5da0
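Before setting a breakpoint, gdb has to be attached to the dedicated server process of the session that will run the query. A minimal sketch for finding its OS process id, assuming a dedicated server connection:
select p.spid
from v$process p, v$session s
where p.addr = s.paddr
and s.sid = sys_context('userenv','sid') ;
Then attach from another shell with gdb -p <spid>.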
The Oracle C function kcbgtcr performs the consistent read. It receives a pointer to a structure containing a 64-bit variable that stores the object_id in the 32 most significant bits and the block address in the 32 least significant bits. The following breakpoint commands extract the object_id, file_id and block_id and print them in both decimal and hexadecimal notation (tested on Oracle 19.13 and 19.14):
b kcbgtcr
commands 1
set $value = *(long *)($rdi+0x4)
set $object_id = $value>>32
printf "object_id: 0x%x %u\n", $object_id, $object_id
set $file_id = ( $value & 0x00000000ffffffff ) >> 22
printf "file_id: 0x%x %u\n", $file_id, $file_id
set $block_id = $value & 0b1111111111111111111111
printf "block_id: 0x%x %u\n", $block_id, $block_id
end
You’ll hit the breakpoint for every fetched row:
Breakpoint 1, 0x00000000129d3250 in kcbgtcr ()
object_id: 0x5da0 23968
file_id: 0xf 15
block_id: 0x82 130
If the affected table contains many rows and you would like to hit the breakpoint only when the missing block is read, you can set up the following condition:
cond 2 ( ( *(long *)($rdi+0x4) & 0x00000000ffffffff ) >> 22 ) == 16 && ( *(long *)($rdi+0x4) & 0b1111111111111111111111 ) == 128
object_id: 0x5da0 23968
file_id: 0x10 16
block_id: 0x80 128
Conclusion
You have to be extremely careful when plugging a PDB, especially if you don't have a reliable manifest. When a datafile is missing, the operation “succeeds”, i.e. it doesn't raise any errors. However, there will be data loss, which might remain unnoticed for a while. Sessions will get a cryptic ORA-00600 when trying to access the missing data.
Therefore, you might want to test the plug operation first, especially if the PDB was unplugged by someone else. There are two checks you can do on your own. One is comparing the lists of deleted and added file_ids in the alert log. The other is querying for missing files with the undocumented x$ktfbue fixed table.
Updates
May 27, 2022
After 3+ months of back and forth, Oracle support concluded it’s expected behavior after a user error. Handle with care 🙂
July 5, 2022
Oracle support filed the following bug:
Bug 34318403 – ORA-0600 ERROR DURING PDB PLUG-IN WHEN DATAFILE IS REMOVED