Data Loss After Plugging a PDB

Data loss

The application started getting ORA-00600 errors on a recently plugged PDB. As it turned out, it wasn’t an ordinary ORA-00600. This test database couldn’t be rescued and had to be refreshed from production.

This problem was insidious – the plug operation itself didn’t return any errors. ORA-00600 started appearing only after the application had been running for a while and tried to query the missing data.

In this article, I’ll describe a scenario of irreversible data loss after plugging a PDB. What’s worse, the data loss can remain unnoticed for a while. Once the application starts failing, it isn’t obvious what’s happening. The ORA-00600 messages are cryptic and reveal nothing about the root cause. Keep reading if you’re tasked with moving PDBs around – this information can save you a lot of trouble.

ORA-00600 [25027]

We were seeing the following error:

ORA-00600: internal error code, arguments: [25027], [3], [3], [6], [2667591712], [0], [1], [1], [], [], [], []

The error appeared only when selecting data from some of the application tables. The problem was reproducible at will.

The error was raised on a consistent read (the kcbgtcr function is on the stack):

...
krtd2abh_pdb_cursor  call     dbgeEndDDEInvocatio  7F9320B59718 7F931BB45FA8
()+1780                       nImpl()              7F9320B999C0 7FFD77A76568 ?
                                                   7F9320B999C0 000000082 ?
kcbgtcr()+11012      call     krtd2abh_pdb_cursor  7F9320B59718 ? 7F931BB45FA8 ?
                              ()                   09F003820 ? 7FFD77A76568 ?
                                                   7F9320B999C0 000000006
ktrget2()+877        call     kcbgtcr()            7F931B405DE8 000000000
                                                   0000005C4 000000000
                                                   7F9320B999C0 ? 000000006 ?
kdirfrs()+1705       call     ktrget2()            7F931B405DD8 ? 7F931B40A578 ?
                                                   0000005C4 000000000
                                                   7F9320B999C0 ? 000000006 ?
qerixFetchFastFullS  call     kdirfrs()            7F931B405DD8 ? 7F931B40A578 ?
can()+815                                          0000005C4 ? 000000000 ?
                                                   7F9320B999C0 ? 000000006 ?
qerghFetch()+603     call     qerixFetchFastFullS  7F931B405D01 ? 000000000 ?
                              can()                012D64FE0 7FFD77A815E0
                                                   000000000 ? 000000006 ?
qerwnFetch()+16689   call     qerghFetch()         0A31DA108 7F931B404D98
                                                   003FACBA0 7FFD77A81800
                                                   000007FFF 0A31DA0C8
qervwFetch()+166     call     qerwnFetch()         000000000 7F931B44AEE8
                                                   004060930 7FFD77A819C0
                                                   000007FFF 0A31DA0C8 ?
qerwnFetch()+16689   call     qervwFetch()         0A31D8A00 7F931B44B7E8
                                                   003FACBA0 7FFD77A81AC0
                                                   000007FFF 0A31DA0C8 ?
opifch2()+3211       call     qerwnFetch()         000000000 7F931B456E08
                                                   012ADE300 7FFD77A82080
                                                   000000001 0A31DA0C8 ?
kpoal8()+4185        call     opifch2()            000000000 ? 7F931B456E08 ?
                                                   7FFD77A82440 7FFD77A82080 ?
                                                   000000001 ? 0A31DA0C8 ?
...

There are lots of MOS notes about ORA-00600 [25027], mainly for older Oracle releases, but none of the published bugs seemed relevant to our case. 25027 is probably a value that the developers hardcoded to roughly identify the code segment that raised the error, so I had to look at the other arguments. The second and third arguments, [3] and [3], are the container id, and [6] is the tablespace id. That’s not enough to narrow down the root cause.
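
You can at least translate those ids into names (a quick sanity check; V$CONTAINERS and V$TABLESPACE are documented views, the first query is meant for the root container, the second for the affected PDB):

select con_id, name from v$containers where con_id = 3 ;
select ts#, name from v$tablespace where ts# = 6 ;

Still, that only tells you where the problem is, not what it is.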

Therefore, I focused on deciphering 2667591712, the fifth argument, which was different for each failed query.

Consistent read buffer status trace

In the course of troubleshooting I discovered the undocumented “consistent read buffer status” trace event, which provided the key information for interpreting the fifth argument:

ALTER SESSION SET EVENTS='10200 trace name context forever, level 2' ;
oerr ora 10200
10200, 00000, "consistent read buffer status"
// *Cause:
// *Action:
env [0x7f05457e6f60]: (scn: 0x00000893d2c725cf   xid: 0x0000.000.00000000  uba: 0x00000000.0000.00  statement num=0  parent xid:  0x0000.000.00000000  st-scn: 0x0000000000000000  hi-scn: 0x0000000000000000  ma-scn: 0x00000893d2c725c2  flg: 0x00000660)
ktrexf(): returning 9 on:  0x7f054ae96588  cr-scn: 0xffffffffffffffff   xid: 0x0000.000.00000000  uba: 0x00000000.0000.00  cl-scn: 0xffffffffffffffff  sfl: 0
ktrgcm(): completed for block  <0x0006 : 0x01823ed5> objd: 0x00045bee
ktrget2(): completed for  block <0x0006 : 0x01823ed5> objd: 0x00045bee
ktrget2(): started for block  <0x0006 : 0x01823ed5> objd: 0x00045bee
env [0x7f05457e6f60]: (scn: 0x00000893d2c725cf   xid: 0x0000.000.00000000  uba: 0x00000000.0000.00  statement num=0  parent xid:  0x0000.000.00000000  st-scn: 0x0000000000000000  hi-scn: 0x0000000000000000  ma-scn: 0x00000893d2c725c2  flg: 0x00000660)
ktrexf(): returning 9 on:  0x7f054ae96588  cr-scn: 0xffffffffffffffff   xid: 0x0000.000.00000000  uba: 0x00000000.0000.00  cl-scn: 0xffffffffffffffff  sfl: 0
ktrgcm(): completed for block  <0x0006 : 0x01823ed5> objd: 0x00045bee
ktrget2(): completed for  block <0x0006 : 0x01823ed5> objd: 0x00045bee
ktrget2(): started for block  <0x0006 : 0x9f003820> objd: 0x00045bee
env [0x7f05457e6f60]: (scn: 0x00000893d2c725cf   xid: 0x0000.000.00000000  uba: 0x00000000.0000.00  statement num=0  parent xid:  0x0000.000.00000000  st-scn: 0x0000000000000000  hi-scn: 0x0000000000000000  ma-scn: 0x00000893d2c725c2  flg: 0x00000660)
ORA-00600: internal error code, arguments: [25027], [3], [3], [6], [2667591712], [0], [1], [1], [], [], [], []

The last accessed block prior to ORA-00600 was the block with the address 0x9f003820. This block address indeed matches the fifth argument in the ORA-00600 error message:

(gdb) p/x 2667591712
$1 = 0x9f003820

A block address consists of a file_id and a block_id.

block_id is stored in the 22 least significant bits:

p/u 0x9f003820 & 0b1111111111111111111111
$15983 = 14368

The file_id is stored in the 10 most significant bits:

p/u 0x9f003820 >> 22
$15982 = 636

So block_id 14368 in file_id 636 is causing the problem.
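
By the way, you don’t need gdb for this decomposition. The documented DBMS_UTILITY helpers do the same bit arithmetic (a minimal sketch, using the fifth ORA-00600 argument from our case):

-- expected to return 636 and 14368, matching the gdb arithmetic above
select dbms_utility.data_block_address_file(2667591712) rel_file#,
       dbms_utility.data_block_address_block(2667591712) block#
  from dual ;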

Missing file

But file 636 doesn’t exist:

select 1 from sys.file$ where relfile# = 636 ;

no rows selected

However, x$ktfbue, the extent table, has references to this non-existent file:

select count(*) from x$ktfbue 
  where KTFBUESEGTSN = 6 and KTFBUEFNO = 636 ;


  COUNT(*)
----------
        13

There’s also an entry for the requested block in the missing file:

select count(*) from x$ktfbue 
  where KTFBUESEGTSN = 6 and KTFBUEFNO = 636 
    and 14368 between KTFBUEBNO and KTFBUEBNO + KTFBUEBLKS -1 ;

  COUNT(*)
----------
         1
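
If you want to see the block ranges in the missing file that are still referenced, you can list the extents themselves (a sketch; column names as exposed by x$ktfbue, run as SYSDBA):

select ktfbuesegfno seg_file#, ktfbuesegbno seg_block#,
       ktfbuebno block#, ktfbueblks blocks
  from x$ktfbue
 where ktfbuesegtsn = 6 and ktfbuefno = 636
 order by ktfbuebno ;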

The following query lists all the missing files:

select KTFBUESEGTSN ts#, ktfbuefno file# from sys.x$ktfbue
minus
select ts#,relfile# from sys.file$ ; 

       TS#      FILE#
---------- ----------
         6        636

The alert log provides the clue about the missing file 636:

Create pluggable database PDB1 using '/u00/oracle/orabase/admin/PDB1C/log/PDB1_2022-02-09_15_17_28_recover_final.xml' NOCOPY TEMPFILE REUSE
...
PDB1(4):Deleting old file#523 from file$
PDB1(4):Deleting old file#636 from file$
...
PDB1(4):Adding new file#673 to file$(old file#523).             fopr-0, newblks-720896, oldblks-32768
PDB1(4):Successfully created internal service PDB1 at open

According to the alert log, the plug operation removed the datafiles from file$ and added them again under new file_ids. But there wasn’t any entry adding the old file 636 back. We checked the datafile directory and indeed saw that the file wasn’t there. Now we understood what had happened. We didn’t have a manifest when plugging the database; we created it from the datafile directory instead. That manifest was incomplete because the file was already missing. Unfortunately, Oracle didn’t raise any error on plug-in, so the problem remained unnoticed until the application started to retrieve the missing data.

Demo

I created the tablespace DATA with two small datafiles and the table T, which is spread across both datafiles:

create tablespace data datafile size 5M autoextend off ;
alter tablespace data add datafile size 5M autoextend off ;

create user u no authentication default tablespace data ;
alter user u quota unlimited on data ;
grant create table to u ;
create table u.t (c varchar2(4000));

insert into u.t select lpad('A',4000,'A') from dual connect by level <= 1000 ;
commit;

select relative_fno,file_id,count(*) from dba_extents where segment_name='T' 
  group by relative_fno,file_id ;

RELATIVE_FNO    FILE_ID   COUNT(*)
------------ ---------- ----------
          15         15         19
          16         16          4

select file_id, relative_fno, file_name from dba_data_files 
  where tablespace_name = 'DATA' ;

   FILE_ID RELATIVE_FNO
---------- ------------
FILE_NAME
--------------------------------------------------------------------------------
        15           15
/u00/oracle/oradata01/DBER20C/PDB12/DBER20C/D822EC6D95DC3533E0531FE1900A73AD/da
tafile/o1_mf_data_k0sxfg1w_.dbf

        16           16
/u00/oracle/oradata01/DBER20C/PDB12/DBER20C/D822EC6D95DC3533E0531FE1900A73AD/da
tafile/o1_mf_data_k0sxfg9f_.dbf

select ts# from file$ where file# in (15,16);

       TS#
----------
         8
         8

Then, I unplugged the PDB, deleted a datafile along with the manifest, regenerated the manifest from the datafile directory with DBMS_PDB.RECOVER, and plugged the database back in:

connect / as sysdba
alter pluggable database PDB12 close ;

ALTER PLUGGABLE DATABASE PDB12 UNPLUG INTO '/tmp/PDB12.xml' ;
DROP PLUGGABLE DATABASE PDB12 KEEP DATAFILES; 

!rm /u00/oracle/oradata01/DBER20C/PDB12/DBER20C/D822EC6D95DC3533E0531FE1900A73AD/datafile/o1_mf_data_k0sxfg9f_.dbf

!rm /tmp/PDB12.xml

begin
DBMS_PDB.RECOVER (
   pdb_descr_file => '/tmp/PDB12.xml', 
   pdb_name       => 'PDB12',
   filenames      => '/u00/oracle/oradata01/DBER20C/PDB12/DBER20C/D822EC6D95DC3533E0531FE1900A73AD/datafile/'
);
end ;
/

Neither “create pluggable database” nor “database open” raised any errors:

create pluggable database PDB12 using '/tmp/PDB12.xml'
NOCOPY
TEMPFILE REUSE;

alter pluggable database PDB12 open ;

alter pluggable database PDB12 close ;
alter pluggable database PDB12 open ;

alter session set container=PDB12 ;

As expected, the deleted file 16 isn’t there:

select file_id, relative_fno from dba_data_files
where tablespace_name = 'DATA' ;

   FILE_ID RELATIVE_FNO
---------- ------------
        21           15

Alert log

The alert log confirms that the file wasn’t added:

PDB12(4):Buffer Cache flush finished: 4
Completed: create pluggable database PDB12 using '/tmp/PDB12.xml'
NOCOPY
TEMPFILE REUSE
2022-02-16T14:09:56.495509+01:00
alter pluggable database PDB12 open
2022-02-16T14:09:56.498522+01:00
PDB12(4):Pluggable database PDB12 opening in read write
PDB12(4):SUPLOG: Initialize PDB SUPLOG SGA, old value 0x0, new value 0x18
PDB12(4):Autotune of undo retention is turned on.
PDB12(4):Endian type of dictionary set to little
PDB12(4):Undo initialization recovery: Parallel FPTR complete: start:2258534527 end:2258534529 diff:2 ms (0.0 seconds)
PDB12(4):Undo initialization recovery: err:0 start: 2258534527 end: 2258534529 diff: 2 ms (0.0 seconds)
PDB12(4):[1556525] Successfully onlined Undo Tablespace 2.
PDB12(4):Undo initialization online undo segments: err:0 start: 2258534529 end: 2258534562 diff: 33 ms (0.0 seconds)
PDB12(4):Undo initialization finished serial:0 start:2258534527 end:2258534563 diff:36 ms (0.0 seconds)
PDB12(4):Deleting old file#10 from file$
PDB12(4):Deleting old file#11 from file$
PDB12(4):Deleting old file#12 from file$
PDB12(4):Deleting old file#13 from file$
PDB12(4):Deleting old file#14 from file$
PDB12(4):Deleting old file#15 from file$
PDB12(4):Deleting old file#16 from file$
PDB12(4):Adding new file#20 to file$(old file#10).             fopr-0, newblks-60800, oldblks-16000
PDB12(4):Adding new file#18 to file$(old file#11).             fopr-0, newblks-25600, oldblks-12800
PDB12(4):Adding new file#19 to file$(old file#12).             fopr-0, newblks-26880, oldblks-1280
PDB12(4):Adding new file#17 to file$(old file#13).             fopr-0, newblks-12800, oldblks-12800
PDB12(4):Adding new file#22 to file$(old file#14).             fopr-0, newblks-12800, oldblks-12800
PDB12(4):Adding new file#21 to file$(old file#15).             fopr-0, newblks-640, oldblks-640
PDB12(4):Marking tablespace #7 invalid since it is not present               in the describe file
PDB12(4):Successfully created internal service PDB12 at open
****************************************************************

The alert log lists all the deleted and added files. You can compare both lists to identify missing files.
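
If you prefer querying over scraping the alert log file, the same messages can be pulled from V$DIAG_ALERT_EXT (a sketch, assuming you additionally restrict the time window to the plug operation):

select originating_timestamp, message_text
  from v$diag_alert_ext
 where message_text like '%Deleting old file#%'
    or message_text like '%Adding new file#%'
 order by originating_timestamp ;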

Extracting file_id and block_id

We reproduced the error:

SQL> select * from u.t ;
select * from u.t
                *
ERROR at line 1:
ORA-00600: internal error code, arguments: [25027], [4], [4], [8], [67108992],
[0], [1], [1], [], [], [], []

The fifth ORA-00600 argument is again the block address. We already learned how to extract the file_id and block_id:

p/u 67108992 >> 22
16

p/u 67108992 & 0b1111111111111111111111
128

Our block is registered in x$ktfbue:

select count(*) from x$ktfbue 
  where KTFBUESEGTSN = 8 and KTFBUEFNO = 16
    and 128 between KTFBUEBNO and KTFBUEBNO + KTFBUEBLKS -1 ;

  COUNT(*)
----------
         1

Diagnostic query

The following query lists the missing datafiles:

select KTFBUESEGTSN ts#, ktfbuefno file# from sys.x$ktfbue
minus
select ts#,relfile# from sys.file$ ; 

       TS#      FILE#
---------- ----------
         8         16

And the following query lists the affected segments:

SELECT distinct ds.owner, ds.segment_name
  FROM sys.sys_dba_segs ds, sys.x$ktfbue e
  WHERE  e.ktfbuesegfno = ds.relative_fno
    AND e.ktfbuesegbno = ds.header_block
    AND e.ktfbuesegtsn = ds.tablespace_id
    AND BITAND (NVL (ds.segment_flags, 0), 1) = 1
    AND BITAND (NVL (ds.segment_flags, 0), 65536) = 0
    and e.ktfbuefno not in (select relfile# from sys.file$) ;

OWNER
--------------------------------------------------------------------------------
SEGMENT_NAME
--------------------------------------------------------------------------------
U
T
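
To confirm which of the listed segments are really unreadable, you can attempt a full scan of each of them. A minimal PL/SQL sketch (assumption: the affected segments are tables you are allowed to full scan; run as SYSDBA):

set serveroutput on
declare
  l_cnt number;
begin
  for r in (select distinct ds.owner, ds.segment_name
              from sys.sys_dba_segs ds, sys.x$ktfbue e
             where e.ktfbuesegfno = ds.relative_fno
               and e.ktfbuesegbno = ds.header_block
               and e.ktfbuesegtsn = ds.tablespace_id
               and bitand(nvl(ds.segment_flags, 0), 1) = 1
               and bitand(nvl(ds.segment_flags, 0), 65536) = 0
               and e.ktfbuefno not in (select relfile# from sys.file$)) loop
    begin
      -- force a full scan so that the extents in the missing file are actually read
      execute immediate 'select /*+ full(t) */ count(*) from "' ||
        r.owner || '"."' || r.segment_name || '" t' into l_cnt;
      dbms_output.put_line(r.owner || '.' || r.segment_name || ': readable, ' || l_cnt || ' rows');
    exception
      when others then
        dbms_output.put_line(r.owner || '.' || r.segment_name || ': ' || sqlerrm);
    end;
  end loop;
end;
/

In our demo, this should report the ORA-00600 for U.T instead of a row count.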

Internals

Last but not least, I’ll show how to track consistent reads with a debugger.

Let’s get the object_id first:

select object_id, to_char(object_id,'xxxxxxxxx') hex from dba_objects 
  where object_name = 'T' ;

 OBJECT_ID HEX
---------- ----------
     23968       5da0

The Oracle C function kcbgtcr does a consistent read. It receives a pointer to a structure containing a 64-bit variable that stores the object_id in the 32 most significant bits and the block address in the 32 least significant bits. The following breakpoint commands extract the object_id, file_id and block_id and print them in both hexadecimal and decimal notation (tested on Oracle 19.13 and 19.14):

b kcbgtcr
commands 1
set $value = *(long *)($rdi+0x4)
set $object_id = $value>>32
printf "object_id: 0x%x %u\n", $object_id, $object_id
set $file_id = ( $value & 0x00000000ffffffff ) >> 22
printf "file_id: 0x%x %u\n", $file_id, $file_id
set $block_id = $value & 0b1111111111111111111111
printf "block_id: 0x%x %u\n", $block_id, $block_id
end

You’ll hit the breakpoint for every fetched row:

Breakpoint 1, 0x00000000129d3250 in kcbgtcr ()
object_id: 0x5da0 23968
file_id: 0xf 15
block_id: 0x82 130

If the affected table contains many rows and you would like to hit the breakpoint only on the attempt to read the missing block, you can set up the following condition:

cond 2  ( ( *(long *)($rdi+0x4) & 0x00000000ffffffff ) >> 22 ) == 16 && ( *(long *)($rdi+0x4) & 0b1111111111111111111111 ) == 128
object_id: 0x5da0 23968
file_id: 0x10 16
block_id: 0x80 128

Conclusion

You have to be extremely careful when plugging a PDB, especially if you don’t have a reliable manifest. When a datafile is missing, the operation “succeeds”, i.e. it doesn’t raise any errors. However, there will be data loss, which might remain unnoticed for a while. Sessions will get a cryptic ORA-00600 when trying to access the missing data.

Therefore, you might want to test the plug operation first, especially if the PDB was unplugged by someone else. There are two checks you can do on your own. One is comparing the lists of deleted and added file_ids in the alert log. The other is querying the missing files from the undocumented data dictionary tables.

Updates

May 27, 2022

After 3+ months of back and forth, Oracle support concluded it’s expected behavior after a user error. Handle with care 🙂

July 5, 2022

Oracle support filed the following bug:
Bug 34318403 – ORA-0600 ERROR DURING PDB PLUG-IN WHEN DATAFILE IS REMOVED

11 Comments

      • So there’s no SR with My Oracle Support? That’s step 1 for debugging any ORA-600. And your investigation is great, but we need a case to tie things together so the right remediation, if any, can be applied and documented.

  1. Hi Nenad,

    would you please explain this in details.

    (gdb) p/x 2667591712
    $1 = 0x9f003820

    Block address consists of file_id and block_id.

    block_id is stored in the 22 least significant bits:

    p/u 0x9f003820 & 0b1111111111111111111111
    $15983 = 14368

    The file_id is stored in the 10 most significant bits:

    p/u 0x9f003820 >> 22
    $15982 = 636

    • Hi Sparsh

      1. p/x 2667591712

      p/x is the gdb command that prints a number in hexadecimal notation. I had to convert it to compare the ORA-00600 output in decimal notation to block addresses in the buffer trace, which were in hexadecimal notation.

      2. p/u 0x9f003820 & 0b1111111111111111111111

      0b1111111111111111111111 in binary notation is 1111111111111111111111 (22 bits set):

      p/t 0b1111111111111111111111
      1111111111111111111111

      This means that the bitwise AND operation (&) leaves the 22 least significant bits unchanged and sets the higher bits to 0. Since the block_id is stored in the 22 least significant bits of the block address, the block_id can be extracted by a bitwise AND with 0b1111111111111111111111.

      3. p/u 0x9f003820 >> 22

      0x9f003820 is a 32-bit number. The command above shifts the number to the right by 22 bits to get the upper 10 bits, where the file_id is stored. p/u prints the value as an unsigned integer.

      Best regards,
      Nenad

  2. Awesome explanation. Good catch. Good that a simple validation/verification based on the alert log provides some relief.
