JOIN CARDINALITY MISESTIMATE in Oracle 12.2

In this blog post I’ll describe a case of wrong join cardinality estimation caused by adaptive dynamic sampling (DS). As the title indicates, the issue was introduced in release 12.2.

First of all, the test case is based on the assumption that optimizer_dynamic_sampling is set to its default value of 2:

show parameter optimizer_dynamic_sampling

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
optimizer_dynamic_sampling           integer     2

Also, before running the test I’m verifying that optimizer adaptive statistics have been enabled:

show parameter optimizer_adaptive_statistics

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
optimizer_adaptive_statistics        boolean     TRUE

Then, I’m creating three tables – t1, t2 and t3.

create table t1 ( n1 number ) ;

insert into t1
  select trunc(level/4)+1
    from dual connect by level <= 320000 ;
commit ;

exec dbms_stats.gather_table_stats( null, 'T1' ) ;

create table t2 ( n1 number , n2 number ) ;    
create table t3 ( n1 number , n2 number ) ;    

exec dbms_stats.gather_table_stats( null, 'T2' ) ;
exec dbms_stats.gather_table_stats( null, 'T3' ) ;

insert into t2
  select level, level
    from dual connect by level <=150000 ;

insert into t3
  select level, trunc(level/3) 
    from dual connect by level <=450000 ;

commit ;
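
At this point the dictionary statistics still describe the empty tables. A quick sanity check along these lines (assuming t2 and t3 live in the current schema) should show num_rows = 0 for both tables, in contrast to the 150,000 and 450,000 rows actually loaded:

-- stored statistics vs. actual content
select table_name, num_rows
  from user_tab_statistics
 where table_name in ( 'T2', 'T3' ) ;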

It’s worth noting that I intentionally gathered statistics on the empty tables t2 and t3 before loading any data. By doing so, I’m misleading the optimizer into computing wrong join cardinalities. As a consequence of this shameless manipulation, SQL plan directives (SPDs) for DS will be created at the next execution, for example:

-- execute twice
select distinct t3.n1
from t3 
  join t2 on t3.n2 = t2.n2 
where t2.n1 = 1  ;

Since SPDs are flushed to the data dictionary only periodically, I’m forcing the flush so that they immediately show up in the DBA views:

exec dbms_spd.flush_sql_plan_directive ;

set long 1000
SELECT type,reason, state, d.notes
  FROM   dba_sql_plan_directives d, dba_sql_plan_dir_objects o
  WHERE  d.directive_id=o.directive_id
    AND  o.object_name in ( 'T2', 'T3' ) ;

TYPE                    REASON                               STATE
----------------------- ------------------------------------ ----------
NOTES
--------------------------------------------------------------------------------
DYNAMIC_SAMPLING        SINGLE TABLE CARDINALITY MISESTIMATE USABLE
<spd_note>
  <internal_state>NEW</internal_state>
  <redundant>NO</redundant>
  <spd_text>{(TEST.T3)}</spd_text>
</spd_note>

DYNAMIC_SAMPLING        JOIN CARDINALITY MISESTIMATE         USABLE
<spd_note>
  <internal_state>NEW</internal_state>
  <redundant>NO</redundant>
  <spd_text>{F(TEST.T2) - (TEST.T3)}</spd_text>
</spd_note>

DYNAMIC_SAMPLING        JOIN CARDINALITY MISESTIMATE         USABLE
<spd_note>
  <internal_state>NEW</internal_state>
  <redundant>NO</redundant>
  <spd_text>{F(TEST.T2) - (TEST.T3)}</spd_text>
</spd_note>

As you can see, SPDs for DS were indeed generated. They will kick in during the next execution, in spite of freshly gathered statistics:

EXEC dbms_stats.gather_table_stats( null, 'T2' );
EXEC dbms_stats.gather_table_stats( null, 'T3' );

SELECT /*+ gather_plan_statistics */ t3.n1
  FROM t3 JOIN t2 ON t3.n2 = t2.n2
  WHERE t2.n1 IN (SELECT /*+ no_unnest */ n1 FROM t1 ) ;
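
The runtime statistics in the plan below (A-Rows, A-Time) can be displayed with dbms_xplan.display_cursor right after running the query; the format options shown are one possible choice:

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST')) ;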

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |    240K|00:06:03.69 |      54M|       |       |          |
|*  1 |  FILTER             |      |      1 |        |    240K|00:06:03.69 |      54M|       |       |          |
|*  2 |   HASH JOIN         |      |      1 |      1 |    449K|00:00:00.49 |    1317 |  9268K|  2474K| 8131K (0)|
|   3 |    TABLE ACCESS FULL| T2   |      1 |    150K|    150K|00:00:00.01 |     317 |       |       |          |
|   4 |    TABLE ACCESS FULL| T3   |      1 |    450K|    450K|00:00:00.02 |     997 |       |       |          |
|*  5 |   TABLE ACCESS FULL | T1   |    150K|      1 |  80271 |00:18:27.73 |      54M|       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter( IS NOT NULL)
   2 - access("T3"."N2"="T2"."N2")
   5 - filter("N1"=:B1)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - 1 Sql Plan Directive used for this statement

If you pay closer attention to the estimated cardinality in step 2 of the execution plan above, you’ll notice a huge discrepancy between the estimated and the actual number of rows: the optimizer estimated just one row, whereas 449K rows were returned. In fact, a one-row estimate is often a clue that the calculation went completely wrong.

In contrast, this error doesn’t happen at all when DS is not being used.
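
One possible way to get a comparison run without DS – just one approach, not necessarily how the plan below was produced – is to disable adaptive statistics in the session:

alter session set optimizer_adaptive_statistics = false ;

This is the plan without DS: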

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |    240K|00:06:47.81 |      54M|       |       |          |
|*  1 |  FILTER             |      |      1 |        |    240K|00:06:47.81 |      54M|       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    444K|    449K|00:00:00.50 |    1320 |  9268K|  2474K| 8123K (0)|
|   3 |    TABLE ACCESS FULL| T2   |      1 |    150K|    150K|00:00:00.01 |     317 |       |       |          |
|   4 |    TABLE ACCESS FULL| T3   |      1 |    450K|    450K|00:00:00.02 |     997 |       |       |          |
|*  5 |   TABLE ACCESS FULL | T1   |    150K|      1 |  80264 |00:18:13.30 |      54M|       |       |          |
-----------------------------------------------------------------------------------------------------------------

But what exactly went on here?

To answer this question I’m going to use the information generated by two types of traces. One is the well-known CBO 10053 trace. The other is the DS trace, which I became aware of after reading Stefan Koehler’s blog post. This is how I activated both traces:

ALTER SESSION SET EVENTS 'trace[RDBMS.SQL_DS] disk=high';
ALTER SESSION SET EVENTS='10053 trace name context forever, level 1';
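
The trace file location can be obtained from v$diag_info, and both traces can be switched off again once the test is done:

SELECT value FROM v$diag_info WHERE name = 'Default Trace File';

ALTER SESSION SET EVENTS 'trace[RDBMS.SQL_DS] off';
ALTER SESSION SET EVENTS '10053 trace name context off';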

By looking at the trace file we can indeed confirm that DS produced a fairly accurate guess:

qksdsScaleResult(): Dumping scaled result (status = SUCCESS)
qksdsDumpResult(): DS Results: #exps=1, smp obj=T3
qksdsDumpResult():    T.CARD = qksdsDumpResult(): (mid=449810.0, low=449810.0, hig=449810.0)qksdsDumpResult(): 
qksdsDumpResult(): end dumping results 
    SPD: qosdGetFObjKeyFromStmt: sqlText = SELECT /* DS_SVC */ /*+ dynamic_sampling(0) no_sql_tune no_monitoring optimizer_features_enable(default) no_parallel  */ NVL(SUM(C1),0) FROM (SELECT /*+ qb_name("innerQuery") NO_INDEX_FFS( "T3#1")  */ 1 AS C1 FROM "T3" "T3#1", "T2" "T2#0" WHERE ("T3#1"."N2"="T2#0"."N2")) innerQuery (objid = 1800385981567557396)

As a matter of fact, it even generated a finding for future reference:

    SPD: Generating finding id: type = 2, reason = 7, objcnt = 3, obItr = 0, objid = 1800385981567557396, objtyp = 4, vecsize = 0, obItr = 1, objid = 94576, objtyp = 1, vecsize = 0, obItr = 2, objid = 94577, objtyp = 1, vecsize = 0, fid = 4662502719207033285
    SPD: qosdCreateDir4DSResult retCode = CREATED, fid = 0

Furthermore, it can easily be verified that this finding is persisted as an SPD. Moreover, the newly created SPDs containing the DS result will supersede the SPDs for DS itself, thereby completely eliminating DS from future optimizations:

SELECT type,reason, state, d.notes
  FROM   dba_sql_plan_directives d, dba_sql_plan_dir_objects o
  WHERE  d.directive_id=o.directive_id
    AND  o.object_name in ( 'T2', 'T3' ) ;

TYPE                    REASON                               STATE
----------------------- ------------------------------------ ----------
NOTES
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
DYNAMIC_SAMPLING        SINGLE TABLE CARDINALITY MISESTIMATE USABLE
<spd_note>
  <internal_state>NEW</internal_state>
  <redundant>NO</redundant>
  <spd_text>{(TEST.T3)}</spd_text>
</spd_note>

DYNAMIC_SAMPLING        JOIN CARDINALITY MISESTIMATE         SUPERSEDED
<spd_note>
  <internal_state>HAS_STATS</internal_state>
  <redundant>NO</redundant>
  <spd_text>{F(TEST.T2) - (TEST.T3)}</spd_text>
</spd_note>

DYNAMIC_SAMPLING        JOIN CARDINALITY MISESTIMATE         SUPERSEDED
<spd_note>
  <internal_state>HAS_STATS</internal_state>
  <redundant>NO</redundant>
  <spd_text>{F(TEST.T2) - (TEST.T3)}</spd_text>
</spd_note>

DYNAMIC_SAMPLING_RESULT VERIFY CARDINALITY ESTIMATE          USABLE
<spd_note>
  <internal_state>NEW</internal_state>
  <redundant>NO</redundant>
  <spd_text>{(TEST.T2, num_rows=150000) - (TEST.T3, num_rows=450000) - (SQL_ID:1jz21hbnyursn, 
T.CARD=450347[-2 -2])}</spd_text>
</spd_note>

DYNAMIC_SAMPLING_RESULT VERIFY CARDINALITY ESTIMATE          USABLE
<spd_note>
  <internal_state>NEW</internal_state>
  <redundant>NO</redundant>
  <spd_text>{(TEST.T2, num_rows=150000) - (TEST.T3, num_rows=450000) - (SQL_ID:1jz21hbnyursn, 
T.CARD=450347[-2 -2])}</spd_text>
</spd_note>

Anyway, given that DS estimated the cardinality perfectly and that these estimates are successfully applied to subsequent executions, the following question inevitably arises:

What went wrong in the ongoing optimization process?

We can find a clue by examining the trace further:

>> Join Card adjusted from 444734.345351 to 0.000000 due to adaptive dynamic sampling, prelen=2
Adjusted Join Cards: adjRatio=0.000000 cardHjSmj=0.000000 cardHjSmjNPF=0.000000 cardNlj=0.000000 cardNSQ=0.000000 cardNSQ_na=444734.345351
Join Card - Rounded: 1 Computed: 0.000000

Surprisingly, the cardinality was adjusted to zero after DS, which explains the estimated cardinality of “1” in the execution plan – the computed cardinality of 0 is rounded up to 1. In my opinion, this is a bug. However, I couldn’t find any information about it on Metalink.

As far as previous releases are concerned, the same test case runs correctly on 12.1. The main difference between the two releases with regard to this issue is that 12.2 persists the estimated cardinalities after performing DS (see Mauro Pagano’s blog posts for more information). This is undoubtedly a significant enhancement over the previous release, where the information was only stored in the result cache. Unfortunately, some boundary conditions apparently weren’t handled properly when this change was implemented. Consequently, you might observe catastrophically wrong join cardinality estimates (and bad execution plans, too) when dynamic sampling is performed.

Updates

March 20, 2018

More Realistic Test Case

Meanwhile, I published another test case which is a bit closer to reality. As you can see there, the execution plan changes for the better with the second execution. Also, the difference between the good and the bad plan is dramatic. Besides that, I reengineered the test case so that no hint is used anymore to prevent unnesting. Instead, the optimizer decides by itself not to do the transformation, as a result of a wrong cardinality estimate which, in turn, is caused by the adjustment-to-zero issue.

Workaround

The problem was apparently introduced with the bug fix “21802552 : AUTODOP CALCULATES DOP OF 1 FOR SPECIFIC QUERY”. So, switching off the bug fix resolves the issue:

alter session set "_fix_control"='21802552:off';
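
Whether the fix is really disabled in the session can be verified, for instance, through v$session_fix_control (value = 0 means the fix is off):

select bugno, value, description
  from v$session_fix_control
 where bugno = 21802552
   and session_id = sys_context('userenv', 'sid') ;

With the fix disabled, the join cardinality estimate is reasonable again: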

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |    240K|00:06:35.51 |      54M|       |       |          |
|*  1 |  FILTER             |      |      1 |        |    240K|00:06:35.51 |      54M|       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    458K|    449K|00:00:00.50 |    1295 |  9268K|  2474K| 8130K (0)|
|   3 |    TABLE ACCESS FULL| T2   |      1 |    150K|    150K|00:00:00.01 |     314 |       |       |          |
|   4 |    TABLE ACCESS FULL| T3   |      1 |    450K|    450K|00:00:00.02 |     978 |       |       |          |
|*  5 |   TABLE ACCESS FULL | T1   |    150K|      1 |  80001 |00:19:23.80 |      54M|       |       |          |
-----------------------------------------------------------------------------------------------------------------

Also, it can easily be verified in the optimizer trace that the adjustment is now done correctly:

Join Card adjusted from 444734.345351 to 458393.000000 due to adaptive dynamic sampling, prelen=2

Big thanks to Nigel Bayliss, the product manager for the Oracle optimizer, for reaching out after reading this blog post and for persevering in tracking down the root cause and the workaround.

Bug number – May 9, 2018

The following bug has been raised for the problem: Bug 27989222 : WRONG CARDINALITY ESTIMATION WITH STALE DIRECTIVES WHEN FIX 21802552 IS ENABLED.

Workaround side-effects – July 5, 2019

I changed “alter system” to “alter session” in the workaround, as switching off the fix globally can produce side-effects elsewhere. Thanks to Patrick Jolliffe for pointing that out. Of course, you can also set the fix control for a single SQL statement by hint.
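
For instance (the opt_param hint accepts underscore fix control settings):

SELECT /*+ opt_param('_fix_control' '21802552:off') */ t3.n1
  FROM t3 JOIN t2 ON t3.n2 = t2.n2
  WHERE t2.n1 IN (SELECT /*+ no_unnest */ n1 FROM t1 ) ;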

19c – October 28, 2020

The bug is still here – tested on 19.7.0.0.200414. Continue to be careful with the workaround, as it can cause problems elsewhere. We switched off the bug fix 21802552 globally and have been enabling it on a case-by-case basis.
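
A sketch of that approach – switch the fix off instance-wide and re-enable it selectively where the workaround misbehaves:

alter system set "_fix_control" = '21802552:off' scope=both;

-- re-enable for an individual session, if necessary:
alter session set "_fix_control" = '21802552:on';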

21c – September 2, 2021

The bug isn’t fixed yet.

Nenad Noveljic