Density Calculation with Hybrid Histograms

This blog post describes how the calculation of new density has modified with the introduction of hybrid histograms in Oracle 12c. I assume that a reader is familiar with the concept of new density which Alberto Dell’Era elaborated on in his white paper New Density calculation in 11g . The article is based on Oracle version 12.1.

Height-Balanced Histograms

The new density in height-balanced histograms is calculated during the optimization process and is used for estimating the cardinality of unpopular values. The following formula is used for the calculation of the new density with height-balanced histograms:

NewDensity = [(BktCnt - PopBktCnt) / BktCnt] / (NDV - PopValCnt)
  • NDV: number of distinct values
  • PopValCnt: number of popular values
  • BktCnt: total number of buckets in a height-balanced histogram
  • PopBktCnt: number of buckets containing popular values

Side note: for the sake of simplicity the formula assumes that there aren’t any null values in the column. Handling the case with null values is trivial – just subtract the num_nulls from the denominator.

The cardinality of an unpopular value is calculated by multiplying NewDensity with the total number of rows in the table:

card = num_rows * NewDensity.

Let’s take a closer look at:

[(BktCnt - PopBktCnt) / BktCnt]

This expression calculates the ratio between the number of rows containing unpopular values and the total number of rows. Alberto Dell’Era coined the term Not-Populars SubTable (NPST) for the subset of the table with unpopular values.

Next, let’s examine the premise underpinning the formula for NPST-factor. Since in height-balanced histograms all the buckets are of  the same size (or at least almost the same size), the ratio

[(BktCnt - PopBktCnt) / BktCnt]

will perfectly match the fraction of the rows having unpopular values.

Hybrid Histograms

In contrast, in hybrid histograms Oracle will group all samples having the same value in the same bucket. As a consequence, the buckets will vary in size.

Let’s see what happens with the formula in the case of the data distribution with a strong peak when using hybrid histograms. For this purpose, I’ll create a table containing 600 integer values, where all the values except one will have only one row. The remaining value will be inserted 300 times and will therefore occupy the half of the table.


drop table t ;

create table t (a number) ;

insert into t 
SELECT rownum
FROM dual
CONNECT BY level <= 300;

INSERT INTO T
SELECT 301
FROM dual
CONNECT BY level <= 300 ; 

commit ; 

exec dbms_stats.gather_table_stats( null,'T',method_opt=>'for columns size 254 A' );

There are 254 buckets in total, but only one containing the popular value:

column val format a3
column pop format 9999
column cnt format 9999
column samples format 9999

select 
  uth.endpoint_actual_value val
  ,case when 
  (uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets)>0 
  then 1 else 0 end pop
  ,uth.endpoint_repeat_count cnt
  ,(endpoint_number - nvl ( lag(endpoint_number)over (order by endpoint_number asc) , 0 )) samples
  from
    user_tab_histograms uth
    ,user_tab_col_statistics ucs
  where
    uth.table_name   = ucs.table_name
    and uth.column_name   = ucs.column_name
    and uth.table_name    = 'T'
    and uth.column_name   = 'A'
order by endpoint_repeat_count ;

VAL   POP   CNT SAMPLES
--- ----- ----- -------
1       0     1       1
2       0     1       1
3       0     1       1
...
299     0     1       1
301     1   300     301

Notice the skew in samples per bucket – the bucket with the popular value contains 301 samples. In contrast, the buckets with the unpopular values have only one or two samples.

The query for identifying popular values is based on Mohamed Houri’s query from his blog post 12c hybrid histogram .

The factoring expression

(BktCnt - PopBktCnt)/BktCnt

gives us

(254 -1 ) / 254 = 0.9960

which would mean that 99.60% of the table was populated by unpopular values. Obviously, the value is out of the ballpark as we know that only the half of the table is populated by unpopular values. This error is a consequence of uneven sizes of hybrid histograms. Therefore, the ratio of the count of the buckets can’t be used any more for calculating the percentage of the samples with unpopular values. The samples has to be counted instead.

New Formula

Therefore, after replacing

(BktCnt - PopBktCnt)/BktCnt

with

num_rows_unpopular/num_rows

we get the following:

NewDensityHybrid
= num_rows_unpopular / num_rows / NDV_unpopular
= (num_rows - num_rows_popular) / num_rows / (NDV - PopValCnt)

Letting

num_rows_popular = sum(endpoint_repeat_count_populars) * (num_rows/sample_size)

NewDensityHybrid
= (num_rows-sum(endpoint_repeat_count_populars)*(num_rows/sample_size))/(NDV-PopValCnt)/num_rows
= (1-sum(endpoint_repeat_count_populars)/sample_size)/(NDV-PopValCnt)

Finally, I’m providing the query for calculating the NewDensity value with hybrid histograms which you can download from my github repository :

References

Thanks for sharing

Nenad Noveljic

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.