PL/SQL Exception Handling Performance Degradation in Oracle 12.2

The aim of this blog post is to warn about a potential performance issue in Oracle 12.2 related to PL/SQL exceptions.

The test case is rather simple. First, I’ll create an empty table.

create table t (a number) ;

Secondly, I’ll run the following anonymous block which executes a simple query on an empty table in a loop.

declare
  l_c number ;
  i number ;
begin
  for i in 1..10000
  loop
    begin
      select a into l_c from t ;
    exception when no_data_found then
      null ;
    end ;
  end loop ;
end ;  
/

The key point here is that the no_data_found exception handler will run with each loop iteration. Of course, null is a lousy exception handler, but I’m using it here just for demonstration purposes.

In my case, the PL/SQL block completed instantly on a 12.1 database, whereas on 12.2 the execution took 12.49 seconds. That’s way too long.

Interestingly, the executed Oracle code is not instrumented. What I mean by that is, the CPU and the total wait time don’t account for the elapsed time.


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      2.25      12.49          0          0          0           1
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        2      2.25      12.49          0          0          0           1

...

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  Disk file operations I/O                        1        0.00          0.00
  SQL*Net message to client                       1        0.00          0.00
  SQL*Net message from client                     1        0.00          0.00

Therefore, we need to use some OS diagnostics for further analysis. First of all, I executed pstack on the long running process on an Oracle 12.2 instance to get a rough idea which Oracle functions were being executed. This is what I got:

 ffff80ffbcf9298a __read () + a
 000000000a1f32b3 slmsrd () + 83
 0000000017ec5f6b lmsaprb () + 5b
 0000000009cc7c8e lmsagb1 () + 11e
 0000000009cc7b1e lmsagbf () + 4e
 0000000009cc7240 kgebse () + 180
 0000000009cc704c kgebem () + 4c
 0000000009cc6f19 ksemgm () + 69
 0000000009714f85 opiodr () + 455
...

According to Frits Hoogland’s sf.sh , kge functions are some kernel generic error functions. In fact, it is easy to verify that those functions are called when the PL/SQL exception handling code is being executed. In addition, we can inspect the arguments of the functions above with the following DTrace script:

pid$target::kge*:entry,
pid$target::lms*:entry
{
  printf("%x %x %x %x %x %x %x %x \n",arg0,arg1,arg2,arg3,arg4,arg5,arg6,arg7);
}

...
lmsapsc ffff80ffbd5b02c8 57b ffff80ffbfff62db 1f5 2 203a 0 3aad71e4 
lmsapbn ffff80ffbd5b02c8 2 57b ffff80ffbfff62db 1f5 0 0 0 
...

As you can see, 0x57b is being passed as an argument to some of the functions. 0x57b is 1403 in decimal notation which, in turn, is nothing else but the error code for the “no data found” condition.

As for the next step, I’m going to compare the number of function calls in both Oracle versions.

12.2:

  kgecss                                                            2
  kge_is_oserr                                                  20000
  kge_is_resig_mandatory                                        20000
  kgeade                                                        20000
  kgebem                                                        20000
  kgebse                                                        20000
  kgefrp                                                        20000
  kgekeep                                                       20000
  kgepop                                                        20000
  kgeselv                                                       20000
  kgespf                                                        20000
  kgespo                                                        20000
  kgetop                                                        20000
  lmsagb1                                                       20000
  lmsagbf                                                       20000
  lmsapbn                                                       20000
  lmsaprb                                                       20000
  lmsapsb                                                       20000
  lmsapsc                                                       20000
  lmsapts                                                       20000
  slmsrd                                                        20000
  kge_is_resig_mandatory_errframe                               40000
  kgeresl                                                       40000
  kgegec                                                        80000

12.1:

  kgecss                                                            1
  kge_is_oserr                                                  10000
  kge_is_resig_mandatory                                        10000
  kgeade                                                        10000
  kgebem                                                        10000
  kgebse                                                        10000
  kgefrp                                                        10000
  kgegbc                                                        10000
  kgekeep                                                       10000
  kgepop                                                        10000
  kgeselv                                                       10000
  kgespf                                                        10000
  kgespo                                                        10000
  lmsagb1                                                       10000
  lmsagbf                                                       10000
  lmsapsb                                                       10000
  lmsapsc                                                       10000
  kge_is_resig_mandatory_errframe                               20000
  kgeresl                                                       20000
  kgetop                                                        20000
  kgegec                                                        50000

Although the most obvious difference is twice as much calls for almost every function in 12.2 compared to 12.1, this is not the main reason for the severe performance decrease. As a matter of fact, an attentive reader might notice that the slmsrd calls are entirely missing in the 12.1 output.

Again, I’ll use Frits’ sf.sh to figure out the purpose of the Oracle kernel function slmsrd:

./sf.sh slmsrd
slmsrd -- system specific call

slms stands for “system specific call”. And if “rd” stands for read, the question is, what is being read during exception handling? Knowing that every read operation results in a system call, we can truss the Oracle dedicated process to answer this question:

truss -f -tread -p 5522
...
5222:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
5522:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
5522:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
5522:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
5522:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
5522:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
5522:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
5522:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
5522:   read(9, "\f\0 x05\0\0 P\0 y05\0\0".., 512)      = 512
...

As you can see, not only are there thousands of reads on the file behind the descriptor 9, but also the same piece of information is read 20000 times for 10000 raised exceptions.

Next, with pfiles we can find out which file is being read:

pfiles 5522
5522:   oracleAVAR19 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
...
   9: S_IFREG mode:0755 dev:286,65600 ino:4840002 uid:1000 gid:1000 size:1344512
      O_RDONLY|O_LARGEFILE FD_CLOEXEC
      .../rdbms/mesg/oraus.msb
...

Finally, the culprit is revealed. It’s oraus.msb that contains all of the error messages. Not only is the error message text completely unnecessary in this case, but also Oracle database doesn’t bother with caching it.

Just imagine a mess which might arise on a highly consolidated platform with applications frequently executing exception handling code and by doing so, heavily reading from oraus.msb at the same time. In addition, a session trace wouldn’t provide any clue that the the application is waiting on a completely irrelevant piece of information to be read from the filesystem.

Workarounds

Unfortunately, I couldn’t find any documents on Metalink about the problem. However, I’d like to suggest some possible mitigation countermeasures. Before doing that, it’s worth noting that the Oracle home in this case was placed on a separate ZFS file system which was not configured for caching data, i.e. primarycache was set to metadata (only). The reason for doing this is to avoid cache thrashing by loading Oracle homes into the cache during OS backups. Frankly, it’s the first time I’ve ever seen that the query elapsed time is depending on the Oracle home file access latency. Anyway, after setting primarycache to all, the code started to run fast in 12.2 as well.

In case you don’t have a possibility of either caching Oracle home or putting it on an SSD drive, there are also couple of things you could do on the application side to reduce the number of message file reads.

First of all, check if you can create conditions in the environment which would lead to less exceptions in total. For instance, if you insert a row into the table above, no_data_found won’t fire any more and the code will execute fast.

Secondly, you could increase the exception handler’s scope if feasible. For example, you can take it out of the loop:

declare
  l_c number ;
  i number ;
begin
  for i in 1..10000
  loop
    select a into l_c from t ;
  end loop ;
exception when no_data_found then
  null ;
end ;  
/

Lastly, you could replace the entire exception handler with an explicit cursor like this:

declare
  l_c number ;
  i number ;
  cursor c is select a from t ;
begin
  open c ;
  for i in 1..10000
  loop
    fetch c into l_c ;
    if c%notfound then 
	  null;
	end if ;
  end loop ;
end ;  
/

Summary

In conclusion, a performance issue has been introduced in Oracle 12.2. The message file is repeatedly being read by dedicated processes, even in cases when no error has to be thrown. In particular, applications which frequently raise PL/SQL exceptions could be affected. To make the problem worse, the trace file doesn’t provide any clues to the root-cause. Currently, there doesn’t seem to be any information on Metalink about it. Nevertheless, I highlighted different possibilities of mitigating the issue.

Update February 17, 2022

The problem is fixed in recent Oracle releases. I tested it on 19.14 and 21.4 PDB. There’s only one read per execution:

sudo /bin/strace  -p 3924265 -c

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.81    0.000088           0       284           getrusage
 14.29    0.000015          15         1           read
  1.90    0.000002           2         1           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.000105           0       286           total
Thanks for sharing

Nenad Noveljic

4 Comments

  1. Hi Nenad, great post and investigation!
    Please mind it is important to show the line beneath the function found with sf.sh, which indicates what part of the function name corresponds to known code layers. When you look at slmsrd, you see that the first two letters of the slmsrd function mean it’s in the system specific code layer.
    $ ./sf.sh slmsrd
    slmsrd — system specific call

    So ‘sl’ means system specific call, I would say ‘ms’ would mean message or error message, and ‘rd’ obviously read.

    In fact, if you use the ‘wildcard mode’ on sf.sh with sl, you’ll see it’s a code layer that performs access to system calls:
    $ ./sf.sh -w sl
    sl system specific call
    slcpu get cpu usage
    slcpums get cpu usage (for getrusage)
    sltrgatime64 arbitrary time lookup
    sltrgftime64 time lookup main (for clock_gettime)
    sltrgftime64_cgt clock_gettime

    • Hi Frits,

      First of all, let me thank you for the kind words. I’m honored.

      Also, thank you for the additional clarification on sf.sh. In particular, I wasn’t event aware of the wild card mode.

      By the way, it’s such a helpful utility when analyzing Oracle function calls.

  2. About the comment on filesystem caching for the Oracle home(s): in the past I’ve experienced performance issues when it the Oracle home was placed on a high latency device (NFS in that case). There are multiple files (I’ve forgotten which exactly) which are very intensely read and quite probably Oracle assumes that they’ll be cached.

    • Actually, I’ve been completely unaware of this dependency. I mean I’ve never observed that performance depended on Oracle home file access latency, at least not to such extent. On the other hand, it might be that I haven’t hit any such issue so far, simply because I’ve never installed Oracle homes on a high latency file system.

      Anyway, I’ll keep your comment in mind, especially in cases where the total of CPU and wait time doesn’t account for the elapsed time.

      However, in this case, I’d rather attribute this coding anti-pattern to a bug than to a deliberate decision.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.