Estimating DB Time With ASH: Logon Storm Pitfall

This blog post is inspired by John Beresniewicz’s presentation DB Time, Average Active Sessions, and ASH Math.

In a nutshell, DB time can be estimated as sum(usecs_per_row) over an observed time period, where usecs_per_row is the time between adjacent samples of the same session in active session history (ASH). Average active sessions (AAS) can then be calculated as DB time / elapsed time.
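To make the ASH math concrete, here is a minimal simulation in Python (my own sketch, not part of John's scripts): sessions with known active intervals are sampled once per second, and DB time is estimated by attributing one sampling interval to each captured sample.

```python
# Simulated ASH math (illustrative sketch, not Oracle code):
# each captured sample of an active session accounts for one
# sampling interval of estimated DB time.

SAMPLE_INTERVAL_S = 1  # ASH samples active sessions once per second

def estimate_db_time(active_intervals, window_s):
    """active_intervals: (start_s, end_s) pairs during which a session
    was active. Returns (true DB time, ASH estimate, estimated AAS)."""
    true_db_time = sum(end - start for start, end in active_intervals)
    # Sample at t = 0, 1, 2, ...: count the sessions active at each tick.
    samples = sum(
        1
        for t in range(0, window_s, SAMPLE_INTERVAL_S)
        for start, end in active_intervals
        if start <= t < end
    )
    ash_db_time = samples * SAMPLE_INTERVAL_S
    return true_db_time, ash_db_time, ash_db_time / window_s

# 30 sessions, each busy for the whole 60-second window:
true_t, est_t, aas = estimate_db_time([(0, 60)] * 30, 60)
print(true_t, est_t, aas)  # 1800 1800 30.0
```

With long-running active sessions, every second of DB time produces a sample, so the estimate converges to the true value.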

John also wrote two queries for estimating DB time and AAS based on ASH samples and comparing the result with more precise time model statistics: ASHmetricsAAS.sql and ASH_TImeModel_AAS.sql, respectively.

Generally, I got very good results while experimenting with John’s queries. For example, let’s consider a workload where multiple concurrent sessions are executing a SQL statement in a PL/SQL loop:

for i in {1..30}
do
sqlplus /nolog << EOD &
connect / as sysdba
declare 
  l integer ;
begin
  for i in 1..1e9
  loop
    select count(*) into l from dba_objects,dba_objects ;
  end loop;
end ;
/
EOD
done

We can see that the estimate is perfect:

@ASHmetricsAAS.sql

END_TIME               MDBTIME  ASHDBTIME   ASHCOUNT
------------------- ---------- ---------- ----------
2020-03-20 11:29:49 29.8652114         30       1802
2020-03-20 11:30:48 30.3782081         30       1774
2020-03-20 11:31:48 29.9119818         30       1802

@ASH_TImeModel_AAS.sql

END_TIME            DBTIME_PCTDIFF    ASH_AAS TIMEMODEL_AAS
------------------- -------------- ---------- -------------
2020-03-20 11:29:49              0       29.9          29.9
2020-03-20 11:30:48            1.6       29.9          30.4
2020-03-20 11:31:48              0       29.9          29.9

(John’s scripts also estimate the CPU time, but I removed this part from the output.)

However, I’ve also noticed some discrepancies, whose root cause I traced back to a large number of logon operations.

I’ll demonstrate this by producing a logon storm:

for j in {1..10000}
do
  for i in {1..30}
  do
sqlplus /nolog << EOD &
connect / as sysdba
EOD
  done
done

The ASH-based estimate is completely unreliable: only around 1% of the total DB time is recorded in ASH:

@ASHmetricsAAS.sql

END_TIME               MDBTIME  ASHDBTIME   ASHCOUNT
------------------- ---------- ---------- ----------
2020-03-20 11:38:54 95.5325744 .916666667         57
2020-03-20 11:39:53 70.7569325 .440677966         31
2020-03-20 11:40:54 57.5036575 .803278689         54

@ASH_TImeModel_AAS.sql

END_TIME            DBTIME_PCTDIFF    ASH_AAS TIMEMODEL_AAS
------------------- -------------- ---------- -------------
2020-03-20 11:38:54           99.1         .9          95.5
2020-03-20 11:39:53           99.4         .4          70.8
2020-03-20 11:40:54           98.6         .8          57.5

Where does this discrepancy come from?

Most of the logon time is consumed before the session has even been created. This time is added to the total DB time, but since the entry in v$session exists only for a tiny fraction of the total logon time, very few ASH samples are captured.
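A back-of-the-envelope model makes this concrete. All numbers in the following Python sketch are illustrative assumptions chosen to resemble the output above, not measurements:

```python
# Back-of-the-envelope model of the logon storm. All numbers are
# illustrative assumptions: logons arrive at a steady rate, each
# burns logon_db_s of DB time, but its v$session entry exists
# (and is active) for only visible_s of that time.
logon_rate = 280     # logons per second (assumed)
logon_db_s = 0.34    # DB time per logon, in seconds (assumed)
visible_s = 0.005    # session lifetime within a logon, in seconds (assumed)

# The time model accounts for the full logon time:
timemodel_aas = logon_rate * logon_db_s   # ~95 average active sessions
# ASH can only sample the part where the session exists:
ash_aas = logon_rate * visible_s          # ~1.4 average active sessions
```

Under these assumptions the time model reports ~95 average active sessions while the expected ASH-based estimate is only ~1.4, in the same ballpark as the measured discrepancy.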

I found a way to measure the exact fraction of the time during which the session exists and is active.

First, I stop the dedicated server process immediately after it is forked, using the stop_forked.d script running in the background. I described this script in my blog post Debugging Session Creation.

/* stop_forked.d */

#pragma D option quiet

proc:::exec
/ execname == "sqlplus" /
{
  self->parent = pid ;
}

proc:::exec-success
/ self->parent &&
  strtok(strtok(curpsinfo->pr_psargs," "),"oracle") == $1 /
{
  printf("DB: %s, pid: %d \n",
    strtok(strtok(curpsinfo->pr_psargs," "),"oracle"),
    pid
  );
  stop();
}

The script writes out the process ID of the just-forked dedicated server process:

sudo -u root /usr/sbin/dtrace -w -s stop_forked.d '"DB1"'
DB: DB1, pid: 2083

The stopped dedicated server process resumes once the measure_login_time.d script is started.

/* measure_login_time.d */

#pragma D option quiet

pid$target::main:entry
{
  self->started = timestamp;
}

pid$target::kzsini1:entry
{
  self->session_started = timestamp;
}

pid$target::kpoauth:return
{
  this->completed = timestamp;
  printf("Total time [ns]: %d\n", this->completed -  self->started);
  printf("Session time [ns]: %d\n", this->completed -  self->session_started);
}

sudo -u root /usr/sbin/dtrace -s measure_login_time.d -p 2083

The script above measures two elapsed times:

  1. Total logon time is the elapsed time between entering main and returning from kpoauth. kpoauth is the Oracle C function that handles authentication.
  2. Session time is the time during which the session exists and is marked as active. It is measured as the elapsed time between entering kzsini1 and returning from kpoauth. kzsini1 is the Oracle C function that initializes the session and creates the entry in v$session.

A sample output is below:

Total time [ns]: 337947071
Session time [ns]: 4764305
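The capture ratio follows directly from these two measurements; a quick check in Python, with the values copied from the DTrace output above:

```python
# Session time as a fraction of the total logon time,
# using the values from the DTrace output above:
total_ns = 337947071    # main entry -> kpoauth return
session_ns = 4764305    # kzsini1 entry -> kpoauth return
ratio = session_ns / total_ns
print(f"{ratio:.1%}")  # 1.4%
```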

As you can see, the session time is only 1.4% of the total logon time, which matches the DBTIME_PCTDIFF output of the ASH_TImeModel_AAS.sql script and explains the difference between the time model and ASH.

In conclusion, a large positive difference between the actual DB time and its ASH estimate might indicate that a substantial amount of processing time is spent on logons, due to, for example, ill-configured connection pools, or even lack thereof.

This discrepancy isn’t an anomaly; it’s how ASH is designed to work, as it samples only active sessions. Therefore, it’s good to be aware of the activities that contribute to the total DB time but don’t result in ASH samples.


Nenad Noveljic
