Wrong Join-Cardinality with Extended Statistics

Multi-column join

There’s a boundary condition with extended statistics that leads the join cardinality calculation astray.

I’ll use the following model to show it:

drop table t1 ;
drop table t2 ;

create table t1
(
  id1 integer,
  id2 integer
);

insert into t1 select 1, mod(rownum, 200)
  from dual connect by level <= 1e5 ;
commit ;

exec dbms_stats.gather_table_stats(null, 't1');

create table t2
(
  id1 integer,
  id2 integer,
  filter integer
);

insert into t2
  select 1, rownum,
         case when rownum <= 400 then 1
              else trunc(dbms_random.value(2, 300))
         end
  from dual connect by level <= 2e5 ;
commit ;

begin
  dbms_stats.gather_table_stats(
    null, 't2',
    method_opt => 'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 300 filter'
  );
end;
/

select /*+ gather_plan_statistics */ 1
  from t1, t2
  where 
    t1.id1 = t2.id1 and t1.id2 = t2.id2
    and t2.filter = 1 ;
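
The runtime plans in this post (with Starts and A-Rows) can be displayed with dbms_xplan.display_cursor, and the “Join Card” lines quoted further down come from the optimizer trace. One way to obtain both, for instance:

-- show the last executed plan together with runtime statistics:
select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST')) ;

-- write the optimizer trace containing the "Join Card" lines:
alter session set events '10053 trace name context forever, level 1' ;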

The model above has the following characteristics:

  • t1 and t2 are joined on the composite key id1,id2
  • id1, the first column in the composite join key, is redundant in both tables – it’s always 1.
  • The join keeps almost all rows from t1 – 99,500 of 100,000 – because every t1.id2 value except 0 is matched by a filtered row from t2 (see the checks below).
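
We can verify these characteristics directly:

select count(*) from t2 where filter = 1 ;
-- 400 rows survive the filter

select count(*)
  from t1
  where exists (select 1 from t2
                 where t2.id1 = t1.id1
                   and t2.id2 = t1.id2
                   and t2.filter = 1) ;
-- 99500: only the 500 t1 rows with id2 = 0 have no match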

Selectivity

The cardinality estimate is fairly accurate:

----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  99500 |00:00:00.03 |    1683 |       |       |          |
|*  1 |  HASH JOIN         |      |      1 |    100K|  99500 |00:00:00.03 |    1683 |  2078K|  2078K| 1530K (0)|
|*  2 |   TABLE ACCESS FULL| T2   |      1 |    400 |    400 |00:00:00.01 |     499 |       |       |          |
|   3 |   TABLE ACCESS FULL| T1   |      1 |    100K|    100K|00:00:00.01 |    1184 |       |       |          |
----------------------------------------------------------------------------------------------------------------

The optimizer used the following selectivity:

Join Card:  100000.000000 = outer (100000.000000) * inner (400.000000) * sel (0.002500)

The standard join selectivity formula (assuming no nulls) is:

sel = 1/greater(NDV(t1.id1,t1.id2),NDV(t2.id1,t2.id2))

set linesize 100
column table_name format a3
column column_name format a30
column density format 9.9000EEEE
select table_name, column_name, num_distinct, density from user_tab_cols 
  where table_name in ('T1','T2') order by table_name, column_name ;
  
TAB COLUMN_NAME                    NUM_DISTINCT      DENSITY
--- ------------------------------ ------------ ------------
T1  ID1                                       1   5.0775E-06
T1  ID2                                     200   5.0000E-06
T2  FILTER                                  299   3.2890E-03
T2  ID1                                       1   2.5378E-06
T2  ID2                                  200000   5.0000E-06

Since NDV(t1.id1) = NDV(t2.id1) = 1, the selectivity becomes:

sel = 1/greater(NDV(t1.id2),NDV(t2.id2)) [1]

Further, because of the filter on t2, NDV(t2.id2) is the NDV of the filtered result set (not the total NDV). Note that there is no stored statistic for a filtered NDV. The optimizer derives it from the total NDV and the estimated cardinality of the filter predicate, using the calculation for the statistical problem known as “sampling without replacement”. [2]
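
For this model, the commonly used approximation of that calculation – filtered NDV ≈ d * (1 - (1 - s/n)^(n/d)) for s rows selected out of n rows with d distinct values – can be evaluated directly (a sketch; the optimizer’s exact internal arithmetic may differ):

-- n = 200000 rows, d = NDV(id2) = 200000, s = 400 filtered rows
select round(200000 * (1 - power(1 - 400/200000, 200000/200000))) filtered_ndv
  from dual ;
-- -> 400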

We can see the result of that calculation by observing the cardinality estimate of the following query:

select distinct id2 from t2 where filter = 1 ;

-----------------------------------------------------------------------
| Id  | Operation          | Name | E-Rows |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |        |       |       |          |
|   1 |  HASH UNIQUE       |      |    400 |  2170K|  2170K| 1421K (0)|
|*  2 |   TABLE ACCESS FULL| T2   |    400 |       |       |          |
-----------------------------------------------------------------------

This number indeed matches the value recorded in the optimizer trace: 1/400 = 0.0025.
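
The estimate is even exact here: since id2 = rownum and filter = 1 only for the first 400 rows, the filtered rows carry exactly 400 distinct id2 values, which we can confirm directly:

select count(distinct id2) from t2 where filter = 1 ;
-- 400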

Extended statistics

Extended statistics on t2(id1,id2) shouldn’t change anything, because NDV(id1) * NDV(id2) = NDV(id1,id2). But the estimated cardinality declined by orders of magnitude after creating them:

select
  dbms_stats.create_extended_stats(
    ownname   => null,
    tabname   => 't2',
    extension => '(id1, id2)'
  ) from dual ;

begin
  dbms_stats.gather_table_stats(
    null, 't2',
    method_opt => 'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 300 filter'
  );
end;
/

----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  99500 |00:00:00.04 |    1683 |       |       |          |
|*  1 |  HASH JOIN         |      |      1 |    200 |  99500 |00:00:00.04 |    1683 |  2078K|  2078K| 1568K (0)|
|*  2 |   TABLE ACCESS FULL| T2   |      1 |    400 |    400 |00:00:00.01 |     499 |       |       |          |
|   3 |   TABLE ACCESS FULL| T1   |      1 |    100K|    100K|00:00:00.01 |    1184 |       |       |          |
----------------------------------------------------------------------------------------------------------------

The join selectivity plummeted:

Join selectivity using 1 ColGroups: 5.0000e-06 (sel1 = 0.000000, sel2 = 0.000000)
Join Card:  200.000000 = outer (400.000000) * inner (100000.000000) * sel (5.0000e-06)

All statistics are correct, but the selectivity calculation changed – it’s now calculated based on the total NDV instead of the filtered NDV.

TAB COLUMN_NAME                    NUM_DISTINCT      DENSITY
--- ------------------------------ ------------ ------------
T1  ID1                                       1   5.0775E-06
T1  ID2                                     200   5.0000E-06
T2  FILTER                                  299   3.2890E-03
T2  ID1                                       1   2.5378E-06
T2  ID2                                  200000   5.0000E-06
T2  SYS_STU3RTXGYOX7NS$MIUDXQDMQ0C       200000   5.0000E-06
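
Plugging the column-group NDV into the join selectivity formula reproduces the traced values:

sel = 1/greater(NDV(t1.id2), NDV(column group)) = 1/greater(200, 200000) = 5.0000e-06
Join Card = 400 * 100000 * 5.0000e-06 = 200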

The selectivity didn’t improve even after disabling the extended statistics:

alter session set "_optimizer_enable_extended_stats" = false ;

The optimizer trace now showed:

Using concatenated index cardinality sanity check for table T2
Revised join sel: 5.0000e-06 = 5.0000e-06 * (1/200000.00) * (1/5.0000e-06)
Join Card:  200.000000 = outer (400.000000) * inner (100000.000000) * sel (5.0000e-06)

The entry “Using concatenated index cardinality sanity check for table T2” is misleading, as I hadn’t created any indexes yet. Apparently, a sanity check kicked in and derived an ad-hoc join selectivity, but that selectivity is still wrong.

Then I disabled that sanity check too:

alter session set "_optimizer_join_sel_sanity_check" = false ;

The selectivity didn’t change either:

Join Card:  200.000000 = outer (400.000000) * inner (100000.000000) * sel (5.0000e-06)

Since NDV(id2) = NDV(id1,id2), we can’t tell which of the two NDVs the optimizer took.

I manipulated the column statistics to find that out:

begin
  dbms_stats.set_column_stats(
    null, 't2', 'id2',
    distcnt => 150000
  );
end;
/

The manipulated statistics:

TAB COLUMN_NAME                    NUM_DISTINCT      DENSITY
--- ------------------------------ ------------ ------------
T2  ID2                                  150000   6.6667E-06
T2  SYS_STU3RTXGYOX7NS$MIUDXQDMQ0C       200000   5.0000E-06

It was NDV(id2):

Join Card:  266.666667 = outer (400.000000) * inner (100000.000000) * sel (6.6667e-06)
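
The arithmetic confirms it: sel = 1/150000 = 6.6667e-06, and 400 * 100000 * 6.6667e-06 = 266.67. Had the optimizer taken the column-group NDV of 200000, the estimate would have remained 200.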

The optimizer didn’t use the filtered NDV even after deactivating the extended statistics. In other words, the mere presence of extended statistics ruins the calculation. The behavior is the same if we create a unique index or a primary key on (id1, id2) instead of extended statistics.
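
For example (a sketch; the index name is mine, and the column group would have to be dropped first to test the index in isolation):

exec dbms_stats.drop_extended_stats(null, 't2', '(id1, id2)');

create unique index t2_id1_id2_ux on t2 (id1, id2) ;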

What can we do?

Solution

The solution is to create matching extended statistics on the other table:

select
  dbms_stats.create_extended_stats(
    ownname   => null,
    tabname   => 't1',
    extension => '(id1, id2)'
  ) from dual ;

exec dbms_stats.gather_table_stats(null, 't1');

The cardinality estimate became accurate again:

Join Card:  100000.000000 = outer (100000.000000) * inner (400.000000) * sel (0.002500)
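
The selectivity is once again the reciprocal of the filtered NDV: sel = 1/400 = 0.0025, and 100000 * 400 * 0.0025 = 100000.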

Summary

In summary, for a correct join cardinality calculation we need matching extended statistics (or indexes) on both joined tables. With extended statistics on only one table, combined with a filter on that table, the optimizer uses the total NDV instead of the filtered NDV. As a consequence, selectivity and cardinality can be massively underestimated.

References

  • [1] Jonathan Lewis, Cost-Based Oracle Fundamentals. 2006.
  • [2] Alberto Dell’Era, SELECT WITHOUT REPLACEMENT. 2007.

Nenad Noveljic
