A Data Model Fix for Suboptimal Semi-Join and Anti-Join Cardinality Estimates

Problem definition

This blog post is a demonstration of how a simple change in a data model can massively improve a join cardinality estimate.

The original execution plan consists of ~250 lines, but I reduced the minimum test case for reproducing the problem to only two tables. The first table T_SUBSET_OF_ACCOUNTS contains a subset of accounts:

create table T_SUBSET_OF_ACCOUNTS ( id number ) ;

The accounts can be either active or closed. The table, though, doesn’t contain that information. Instead, additional information about closed accounts is stored in the separate table T_SUBSET_OF_ACCOUNTS:

create table T_CLOSED_ACCOUNTS ( id number ) ;

Consequently, a semi-join is required to retrieve only closed accounts:

select * from T_SUBSET_OF_ACCOUNTS a 
  where exists ( select 1 from T_CLOSED_ACCOUNTS c where a.id = c.id );

Similarly, active accounts are retrieved with an anti-join:

select * from T_SUBSET_OF_ACCOUNTS a 
  where not exists ( select 1 from T_CLOSED_ACCOUNTS c where a.id = c.id ) ;

A problem arises because of the specific combination of two dataset properties.

One is that the majority of the accounts are still active. This means that they don’t have a matching row in the table T_CLOSED_ACCOUNTS. In other words, a semi-join between both tables returns only few rows.

Another is that the accounts have being permanently opened and closed. As a consequence, the range of ID values in T_SUBSET_OF_ACCOUNTS is contained in T_CLOSED_ACCOUNTS. Because of the overlapping ranges, optimizer thinks that the join will return many rows.

Simply put, the overlapping ranges of the join columns lead to high join cardinality estimates, but since both tables contain only few matches, the semi-join cardinality is being massively overestimated. By the same token, the anti-join cardinality is being massively underestimated.

Dataset

It’s fairly easy to simulate the production dataset.

The following statement inserts even numbers between 400 and 800 into T_SUBSET_OF_ACCOUNTS:

insert into T_SUBSET_OF_ACCOUNTS select level + 400 from dual 
  where mod(level, 2) = 0 connect by level <= 400 ;
commit ;

select min(id), max(id), count(*) from T_SUBSET_OF_ACCOUNTS ;

   MIN(ID)    MAX(ID)   COUNT(*)
---------- ---------- ----------
       402        800        200

exec dbms_stats.gather_table_stats(null, 'T_SUBSET_OF_ACCOUNTS');

The following statement inserts odd numbers between 1 and 8000 into T_CLOSED_ACCOUNTS:

insert into T_CLOSED_ACCOUNTS select level from dual 
  where mod(level, 2) = 1 connect by level <= 8000 ;

commit ;

select min(id), max(id), count(*) from T_CLOSED_ACCOUNTS ;

   MIN(ID)    MAX(ID)   COUNT(*)
---------- ---------- ----------
         1       7999       4000

exec dbms_stats.gather_table_stats(null, 'T_CLOSED_ACCOUNTS');

Since T_SUBSET_OF_ACCOUNTS stores only even and T_CLOSED_ACCOUNTS only odd numbers, there aren’t any matching rows. Simply put, I’m simulating the condition that the T_SUBSET_OF_ACCOUNTS table contains only active accounts.

Semi-join

But since the ID value ranges fully overlap, optimizer thinks that join will return many rows. The result is disastrous – the cardinality is overestimated by orders of magnitude:

select /*+ gather_plan_statistics */ * from T_SUBSET_OF_ACCOUNTS a 
  where exists ( select 1 from T_CLOSED_ACCOUNTS c where a.id = c.id ) ;

select * from table(dbms_xplan.display_cursor(NULL,NULL, 'LAST, ALLSTATS')) ;

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                 | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                      |      1 |        |      0 |00:00:00.01 |      21 |       |       |          |
|*  1 |  HASH JOIN SEMI    |                      |      1 |    200 |      0 |00:00:00.01 |      21 |  2078K|  2078K| 1526K (0)|
|   2 |   TABLE ACCESS FULL| T_SUBSET_OF_ACCOUNTS |      1 |    200 |    200 |00:00:00.01 |       6 |       |       |          |
|   3 |   TABLE ACCESS FULL| T_CLOSED_ACCOUNTS    |      1 |   4000 |   4000 |00:00:00.01 |      15 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."ID"="C"."ID")

Anti-join

For the same reason, optimizer massively underestimates the anti-join cardinality:

select /*+ gather_plan_statistics */ * from T_SUBSET_OF_ACCOUNTS a 
  where not exists ( select 1 from T_CLOSED_ACCOUNTS c where a.id = c.id ) ;

select * from table(dbms_xplan.display_cursor(NULL,NULL, 'LAST, ALLSTATS')) ;

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                 | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                      |      1 |        |    200 |00:00:00.01 |      21 |       |       |          |
|*  1 |  HASH JOIN ANTI    |                      |      1 |      2 |    200 |00:00:00.01 |      21 |  2546K|  2546K| 1502K (0)|
|   2 |   TABLE ACCESS FULL| T_SUBSET_OF_ACCOUNTS |      1 |    200 |    200 |00:00:00.01 |       6 |       |       |          |
|   3 |   TABLE ACCESS FULL| T_CLOSED_ACCOUNTS    |      1 |   4000 |   4000 |00:00:00.01 |      15 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."ID"="C"."ID")

The NOT EXISTS subquery above selects only active accounts. That was exactly the problem in the production query when I first spotted the issue. The active accounts were used as the input to other operations in the execution plan, and their cardinality misestimate resulted in a bad plan.

Outer join to anti-join transformation

By the way, the application query didn’t use exactly NOT EXISTS for producing the anti-join. It was written as an outer join instead:

select a.* from T_SUBSET_OF_ACCOUNTS a 
  left outer join T_CLOSED_ACCOUNTS c on a.id = c.id 
  where c.id is null ;

Optimizer, however, produced the identical execution plan because it transformed the outer join to the anti-join:

select * from table(dbms_xplan.display_cursor(NULL,NULL, 'LAST, ALLSTATS')) ;

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name                 | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                      |      1 |        |    200 |00:00:00.01 |      21 |       |       |          |
|*  1 |  HASH JOIN ANTI    |                      |      1 |      2 |    200 |00:00:00.01 |      21 |  2546K|  2546K| 1486K (0)|
|   2 |   TABLE ACCESS FULL| T_SUBSET_OF_ACCOUNTS |      1 |    200 |    200 |00:00:00.01 |       6 |       |       |          |
|   3 |   TABLE ACCESS FULL| T_CLOSED_ACCOUNTS    |      1 |   4000 |   4000 |00:00:00.01 |      15 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("A"."ID"="C"."ID")

This transformation can be disabled by setting the undocumented parameter “_optimizer_outer_to_anti_enabled” to false:

alter session set "_optimizer_outer_to_anti_enabled" = false ;
---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name                 | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |                      |      1 |        |    200 |00:00:00.01 |      21 |       |       |          |
|*  1 |  FILTER             |                      |      1 |        |    200 |00:00:00.01 |      21 |       |       |          |
|*  2 |   HASH JOIN OUTER   |                      |      1 |      2 |    200 |00:00:00.01 |      21 |  2546K|  2546K| 1478K (0)|
|   3 |    TABLE ACCESS FULL| T_SUBSET_OF_ACCOUNTS |      1 |    200 |    200 |00:00:00.01 |       6 |       |       |          |
|   4 |    TABLE ACCESS FULL| T_CLOSED_ACCOUNTS    |      1 |   4000 |   4000 |00:00:00.01 |      15 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("C"."ID" IS NULL)
   2 - access("A"."ID"="C"."ID")

As we can see, the outer to anti-join transformation isn’t performed after disabling it with the hidden parameter. The cardinality estimate is, though, equally bad for the new plan.

Solution

In such cases, you can introduce the STATE column into the base table T_SUBSET_OF_ACCOUNTS and use it as a filter, instead of filtering data with semi-joins and anti-joins. Not only will the STATE column eliminate the join, but it’ll also yield a much better cardinality estimate:

alter table T_SUBSET_OF_ACCOUNTS add ( state number ) ;

update T_SUBSET_OF_ACCOUNTS set state = 1 ;

commit ;

exec dbms_stats.gather_table_stats(null, 'T_SUBSET_OF_ACCOUNTS');

select /*+ gather_plan_statistics */ * from T_SUBSET_OF_ACCOUNTS 
  where state = 1 ;

select * from table(dbms_xplan.display_cursor(NULL,NULL, 'LAST, ALLSTATS')) ;

The cardinality estimate is accurate:

----------------------------------------------------------------------------------------------------
| Id  | Operation         | Name                 | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                      |      1 |        |    200 |00:00:00.01 |       8 |
|*  1 |  TABLE ACCESS FULL| T_SUBSET_OF_ACCOUNTS |      1 |    200 |    200 |00:00:00.01 |       8 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("STATE"=1)

Summary

In summary, an application stores a subset of the accounts in a table. The accounts can be either active or closed. Unfortunately, this table doesn’t have a column to indicate the account state. Instead, it stores the information about closed accounts in a separate table. Therefore, it requires a semi-join to retrieve closed accounts and an anti-join to query the active accounts. Since the join values in both tables overlap, optimizer comes up with a large semi-join cardinality estimate. But since the most of the accounts are still active, and therefore not contained in the closed accounts table, the query, in reality, returns only few rows. In other words, optimizer overestimates the semi-join and underestimates the anti-join cardinality. To avoid these misestimates – and eliminate joins altogether – a STATE column can be introduced to indicate whether the account is active or closed.

Further, we can even generalize this problem and the solution. When a semi-join returns few rows but the values of the join column overlap, the semi-join cardinality estimate will be massively overestimated. The anti-join cardinality will be underestimated in a such case. The problem can be solved by introducing a new column which indicates whether the value is contained in another table.

This small tweak on data model can drastically improve join cardinality estimate.

Thanks for sharing

Nenad Noveljic

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.