Tables Containing “Current” and “Historical” Rows

I’ve seen quite a few applications with tables containing both “current” and “historical” rows. In this blog post I’m going to demonstrate how devastating this pattern can be to cardinality estimates.

To illustrate this point, I’m going to create a table which stores product_ids with prices:

create table prices (product_id number, price number, is_current number) ;

is_current is 1 for current prices and 0 for the historical rows. Any given product_id can have only one current price. Consequently, product_id is unique for is_current = 1. Note that this is something a human would know intuitively, but the optimizer misses this information completely.
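As an aside, the one-current-price rule itself could be enforced declaratively. Here is a minimal sketch, not part of the original test case, using a function-based unique index (the index name is made up):

-- Hypothetical index: enforces at most one is_current = 1 row per product_id;
-- historical rows map to NULL and are therefore not indexed
create unique index prices_current_uk
  on prices (case when is_current = 1 then product_id end) ;

Such an index enforces the rule, but it doesn’t help the optimizer with the cardinality estimates discussed below.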

Let’s continue with generating a data set for the test case. First, I’m going to populate the table with 100000 current prices:

exec dbms_random.seed(1) ;

insert into prices 
  select rownum,trunc(dbms_random.value(100,10000),2),1 
    from dual connect by level <=100000 ; 

Second, I’m going to insert historical records (is_current=0):

insert into prices 
  select 
    trunc(dbms_random.normal * 50000 ),
    trunc(dbms_random.value(100,10000),2),0 
    from dual connect by level <=300000 ;

Finally, I’m going to add a lot of history records for a small number of products. This will lead to several popular values which will show up in the histogram as endpoint values with large repeat counts:

insert into prices 
  select mod(rownum,10),trunc(dbms_random.value(100,10000),2),0
    from dual connect by level <= 100000 ; 
commit ; 
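The third insert produces exactly ten popular values. On 12c and later we can quickly verify them with the following illustrative check:

select product_id, count(*) 
  from prices 
  where is_current = 0 
  group by product_id 
  order by count(*) desc 
  fetch first 10 rows only ;

Each of the product_ids 0 through 9 comes back with roughly 10000 rows. With the data in place, let’s gather the statistics, including histograms: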

begin
  dbms_stats.gather_table_stats( 
    user,'PRICES',method_opt=>'for all columns size skewonly'
  );
end;
/

At the end, we’ve got a table with 500000 rows, of which 100000 are current:

select is_current, count(*) from prices group by is_current ;

IS_CURRENT   COUNT(*)
         1     100000
         0     400000
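We can also confirm the claim from above that product_id is unique among the current rows; both counts return 100000:

select count(*), count(distinct product_id) 
  from prices 
  where is_current = 1 ;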

Self-Join

To select all the prices of currently active products we have to perform the following self-join:

select count(*) from prices active,prices history 
  where active.is_current = 1 and active.product_id = history.product_id ;

COUNT(*)
332901

As we already know, there can be at most one active price for any given product (remember, the optimizer doesn’t have this information!). Therefore, the self-join is one-to-many: each history row matches at most one active row. As a consequence, the join can’t possibly return more rows than the total number of rows in the table. But let’s check the optimizer’s estimate:


--------------------------------------------------------------------------------------
| Id  | Operation           | Name   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |     1 |    13 |       |  1053  (38)| 00:00:01 |
|   1 |  SORT AGGREGATE     |        |     1 |    13 |       |            |          |
|*  2 |   HASH JOIN         |        |   205M|  2551M|  1984K|  1053  (38)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| PRICES |   100K|   781K|       |   164   (2)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| PRICES |   500K|  2441K|       |   163   (1)| 00:00:01 |
--------------------------------------------------------------------------------------

Ouch. The optimizer estimated that the join would return 205 million rows, which isn’t even in the ballpark: the actual result is 332901 rows.

By running the following query, we can verify that the largest contributors to the join cardinality estimate are the popular values:

select
  (select
     sum(
       endpoint_repeat_count * endpoint_repeat_count * num_rows * num_rows /
       ucs.sample_size / ucs.sample_size
     )
     from
       user_tab_histograms uth,
       user_tab_col_statistics ucs,
       user_tables ut
     where
       uth.table_name      = ucs.table_name
       and uth.table_name  = ut.table_name
       and uth.column_name = ucs.column_name
       and uth.table_name  = 'PRICES'
       and uth.column_name = 'PRODUCT_ID'
       and (uth.endpoint_repeat_count - ucs.sample_size / ucs.num_buckets) > 0
  )
  *
  (select
     (select samples from
        (select
           endpoint_value,
           (endpoint_number
             - nvl(lag(endpoint_number) over (order by endpoint_number asc), 0)
           ) samples
           from user_tab_histograms
           where table_name = 'PRICES' and column_name = 'IS_CURRENT'
        ) where endpoint_value = 1
     ) / (select num_rows from user_tables where table_name = 'PRICES')
     ratio_current
     from dual
  ) join_card_populars
  from dual ;

205599507.076327
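As a quick sanity check of this figure, consider the data we generated: each of the ten popular product_ids (0 through 9) occurs roughly 10000 times, so for each of them the optimizer expects about 10000 matching rows on either side of the join. Multiplying the two frequencies and applying the is_current = 1 selectivity of 100000/500000 = 0.2 gives 10 * 10000 * 10000 * 0.2 = 200 million rows, which is in line with the estimate above.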

The query above assumes that there is a frequency histogram on the column is_current and a hybrid histogram on the column product_id.
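Both assumptions can be verified in the dictionary (the histogram types you actually get may vary with the Oracle version and the sampling):

select column_name, histogram, num_buckets 
  from user_tab_col_statistics 
  where table_name = 'PRICES' ;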

Partitioning

Luckily, there is a solution to this problem that doesn’t require any changes to the application code. The secret weapon is list partitioning the table on is_current:


create table prices_partitioned
 (product_id number, price number, is_current number)
 partition by list (is_current)
 ( partition current_p values(1), partition history values(0) ) ;

insert into prices_partitioned select * from prices ;

commit ;

begin 
  dbms_stats.gather_table_stats(
    user,'PRICES_PARTITIONED',method_opt=>'for all columns size skewonly'
  );
end ;
/  

Relying on the partition statistics, the optimizer can now accurately estimate the number of “current” rows that is driving the join:


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name               | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                    |     1 |    13 |       |   731   (1)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE         |                    |     1 |    13 |       |            |          |       |       |
|*  2 |   HASH JOIN             |                    |   271K|  3442K|  1984K|   731   (1)| 00:00:01 |       |       |
|   3 |    PARTITION LIST SINGLE|                    | 99991 |   781K|       |   131   (1)| 00:00:01 |   KEY |   KEY |
|   4 |     TABLE ACCESS FULL   | PRICES_PARTITIONED | 99991 |   781K|       |   131   (1)| 00:00:01 |     1 |     1 |
|   5 |    PARTITION LIST ALL   |                    |   500K|  2441K|       |   261   (1)| 00:00:01 |     1 |     2 |
|   6 |     TABLE ACCESS FULL   | PRICES_PARTITIONED |   500K|  2441K|       |   261   (1)| 00:00:01 |     1 |     2 |
----------------------------------------------------------------------------------------------------------------------

select num_rows from user_tab_partitions 
  where table_name='PRICES_PARTITIONED' and partition_name='CURRENT_P';

NUM_ROWS
100000

As we can see, this solution is very effective: the estimated join cardinality of 271K is now in the same ballpark as the actual 332901 rows. But it comes with a price tag: you would need to license Enterprise Edition and the Partitioning option.

Conclusion

If your data model foresees storing “current” and “historical” rows in the same table, there is a chance that there will be queries doing self-joins driven by the “current” rows. On skewed data distributions, this model can give rise to wildly wrong join cardinality estimates. The more pronounced the skew, the larger the error. Table partitioning turns out to be a good solution in such cases.
