Slow DMLs With Larger Blocks

I tested the performance of a simple INSERT in a PL/SQL loop. Unsurprisingly, the block size doesn’t have any measurable impact on performance, as the total amount of inserted data remains the same regardless of the block size.
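
For reference, the test was of roughly the following shape (a minimal sketch; the table name, the single NUMBER column and the iteration count are assumptions, not the original script):

create table t ( n number );

begin
  for i in 1 .. 100000 loop
    -- one single-row insert per iteration
    insert into t values ( i );
  end loop;
  commit;
end;
/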

However, after turning on block checksums, performance slumps with larger block sizes. This deterioration becomes particularly dramatic when db_block_checking is set to FULL. The graph below shows how the elapsed time changes with the block size and different checksum modes:

(The elapsed times are normalized by dividing them by the elapsed time of the execution without a checksum.)
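
For reference, the checking mode can be switched between runs with the db_block_checking parameter, along these lines (a sketch, assuming the parameter is changed dynamically at the system level):

alter system set db_block_checking = OFF;
alter system set db_block_checking = LOW;
alter system set db_block_checking = MEDIUM;
alter system set db_block_checking = FULL;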

We can infer the following from the graph:

  • Block size doesn’t have any impact on performance when checksum is turned off.
  • LOW checksum doesn’t have any impact either.
  • The overhead skyrockets for larger blocks with MEDIUM and FULL checksums.

The last observation is counterintuitive and, as such, worth a closer look. After all, the cost of calculating the checksum for a single block is expected to be proportional to the block size. However, since the total amount of data doesn’t change, the elapsed time should stay the same, right?
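
A quick back-of-the-envelope calculation illustrates that expectation: a 32k block holds 16 times as much data as a 2k block, so the same amount of inserted data fills 1/16 as many blocks. A check that is 16 times more expensive but performed 16 times less often should add up to roughly the same total cost.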

Let’s see why that’s not the case.

Since the whole elapsed time is spent on CPU, I started the analysis with stack profiling (sampling the user stacks of the server process at 97 Hz for 60 seconds):

dtrace -x ustackframes=100 -n 'profile-97 /pid == 26348/ { @[ustack()] = count(); } tick-60s { exit(0); }' 

The most frequently sampled stack with the 32k block size, expectedly, contains some block checking functions:

libc.so.1`qsort+0x575
oracle`kd4_entries_overlap+0x42
oracle`kdb4chk1+0xecd
oracle`kd4chk+0x79
oracle`kdgchk+0x2ac
oracle`kcbchk_ctx+0xe1
oracle`kco_blkchk+0x9a
oracle`kcoapl+0x825
oracle`kcbapl+0x147
oracle`kcrfw_redo_gen_ext+0x1054
oracle`kcbchg1_main+0x1bca
oracle`kcbchg1+0xe3
oracle`ktuchg2+0x903
oracle`ktbchg2+0x118
oracle`kdtchg+0x453
oracle`kdtwrp+0xa46
oracle`kdtSimpleInsRow+0x1b3
oracle`qerltcSimpleSingleInsRowCBK+0x3a
oracle`qerltcSingleRowLoad+0x13f
oracle`qerltcFetch+0x1ca
oracle`insexe+0x33e
...

But the topmost function is the generic C function qsort, which sorts an array. We can infer the cause of the performance decrease by analyzing its behavior. That isn’t difficult, since the function is well known and documented. It’s worth noting that the number of elements to sort (nel) is passed as the second argument:

void qsort(void *base, size_t nel, size_t width,
   int (*compar)(const void *, const void *));

Obviously, the execution duration depends on the number of elements to sort. Moreover, as a comparison sort, qsort performs O(n log n) comparisons on average, so the per-call cost grows slightly faster than linearly with the element count.

I wrote the following DTrace script to capture some qsort execution statistics:

pid$target::qsort:entry
{
  self->start_time = timestamp;
  /* the number of elements to sort is the second argument (arg1) */
  @elements_count["sorted elements total"]=sum(arg1);
  @elements_histogram["sorted elements"]=quantize(arg1);
}

pid$target::qsort:return
/ self->start_time /
{
  this->elapsed=timestamp - self->start_time;
  @time["elapsed [ns]"]=sum(this->elapsed);
  @time_histogram=quantize(this->elapsed);
  @count["#calls"]=count();
  self->start_time = 0;
}

In particular, the script collects the following information:

  • number of elements sorted – total and histogram (“sorted elements”)
  • elapsed time – total and histogram (“elapsed”)
  • total number of qsort calls (“#calls”)
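
The script can be attached to the already running server process, for example (assuming it is saved as qsort.d; 26348 is the process from the profiling step above):

dtrace -s qsort.d -p 26348

The aggregations are printed when the script is stopped, for instance with Ctrl-C.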

Let’s compare the statistics for 2k, 8k and 32k executions:

2k:

sorted elements total                                       8000225
sorted elements                                   
   value  ------------- Distribution ------------- count    
       1 |                                         0        
       2 |@                                        1274     
       4 |@                                        2548     
       8 |@@                                       5096     
      16 |@@@@                                     10192    
      32 |@@@@@@@@                                 20375    
      64 |@@@@@@@@@@@@@@@@                         40768    
     128 |@@@@@@@@                                 19747    
     256 |                                         0        

elapsed [ns]                                              699793753

   value  ------------- Distribution ------------- count    
    1024 |                                         0        
    2048 |                                         1119     
    4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      86631    
    8192 |@@@@                                     10289    
   16384 |@                                        1733     
   32768 |                                         199      
   65536 |                                         23       
  131072 |                                         4        
  262144 |                                         2        
  524288 |                                         0        

#calls                                                       100000

8k:

sorted elements total                                      33179680
sorted elements                                   
   value  ------------- Distribution ------------- count    
       1 |                                         0        
       2 |                                         304      
       4 |                                         608      
       8 |                                         1216     
      16 |@                                        2432 
      32 |@@                                       4864     
      64 |@@@@                                     9682     
     128 |@@@@@@@@                                 19329    
     256 |@@@@@@@@@@@@@@@@                         38766    
     512 |@@@@@@@@@                                22801    
    1024 |                                         0        

elapsed [ns]                                             1563842848

   value  ------------- Distribution ------------- count    
    1024 |                                         0        
    2048 |                                         393      
    4096 |@@@@@@@@@                                21604    
    8192 |@@@@@@@@@@@@@                            32092    
   16384 |@@@@@@@@@@@@@@@@@@                       44842    
   32768 |                                         538      
   65536 |                                         297      
  131072 |                                         235      
  262144 |                                         1        
  524288 |                                         0        

#calls                                                       100002

32k:

sorted elements total                                     134287264
sorted elements                                   
   value  ------------- Distribution ------------- count    
       1 |                                         0        
       2 |                                         74       
       4 |                                         148      
       8 |                                         296      
      16 |                                         592      
      32 |                                         1184     
      64 |@                                        2368     
     128 |@@                                       4738     
     256 |@@@@                                     9472     
     512 |@@@@@@@@                                 18944    
    1024 |@@@@@@@@@@@@@@@                          38687    
    2048 |@@@@@@@@@                                23500    
    4096 |                                         0        

elapsed [ns]                                             6116756180

   value  ------------- Distribution ------------- count    
    1024 |                                         0        
    2048 |                                         23       
    4096 |@@                                       5202     
    8192 |@@@                                      7717     
   16384 |@@@@@@                                   14295    
   32768 |@@@@@@@@@@@                              27297    
   65536 |@@@@@@@@@@@@@@@@@@                       44033    
  131072 |                                         734      
  262144 |                                         356      
  524288 |                                         346      
 1048576 |                                         0        

#calls                                                       100003

Thanks to the histograms we can clearly see that the number of elements sorted per call is proportional to the block size. Consequently, the elapsed time per qsort execution increases too. The key point is that the number of executions doesn’t depend on the block size, because the checksum seems to be computed for every insert.

In other words, the overhead increases because calculating the checksum for a larger block requires more work, while the number of checksum calculations remains the same after increasing the block size.
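
The measurements above bear this out: roughly 8.0 million elements were sorted in 0.70 s with 2k blocks, 33.2 million in 1.56 s with 8k blocks, and 134.3 million in 6.12 s with 32k blocks, while the number of qsort calls stayed at about 100,000 in all three runs. The total sorting work grows roughly in proportion to the block size, and the elapsed time follows.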

In conclusion, the DML performance sucks with large blocks and checksums turned on.

Thanks for sharing

Nenad Noveljic

2 Comments

  1. One question that I’d like to pursue is “how many rows per block”. Is the sort size an indication of the number of rows, and does that mean your row consisted of just a single column holding (e.g.) a sequence number?

    If this is the case then the observation seems important for indexes with small key values that might jump from (e.g.) 250 entries per leaf block to 1,000 entries per leaf block as you moved from 8KB to 32KB, but less so for tables, which might only go from 40 rows to 160 rows on the same change.

    • Originally, I tested with a simple table containing just one integer column and without any indexes. The average sort size (qsort input) was roughly “rows per block”/12, irrespective of the block size.

      Now I did a second set of tests with and without an index:

      i_full_t_32k.log:Elapsed: 00:00:31.69
      i_full_t_8k.log:Elapsed: 00:00:17.23
      i_medium_t_32k.log:Elapsed: 00:00:10.48
      i_medium_t_8k.log:Elapsed: 00:00:09.99
      i_low_t_32k.log:Elapsed: 00:00:05.97
      i_low_t_8k.log:Elapsed: 00:00:10.61
      i_off_t_32k.log:Elapsed: 00:00:08.79
      i_off_t_8k.log:Elapsed: 00:00:07.52
      noi_full_t_32k.log:Elapsed: 00:00:18.19
      noi_full_t_8k.log:Elapsed: 00:00:09.32
      noi_medium_t_32k.log:Elapsed: 00:00:05.39
      noi_medium_t_8k.log:Elapsed: 00:00:05.80
      noi_low_t_32k.log:Elapsed: 00:00:04.41
      noi_low_t_8k.log:Elapsed: 00:00:04.51
      noi_off_t_32k.log:Elapsed: 00:00:04.30
      noi_off_t_8k.log:Elapsed: 00:00:04.30

      (The prefix “i” is for the tests with index, “noi” without.)

      I suggest the following arithmetic to predict the index overhead with checksum enabled.

      Let dt be the table-only overhead (i.e. without an index):

      dt = noi_full – noi_off

      Let di be the predicted index-only overhead:

      di = dt * leaf_blocks/blocks = ( noi_full – noi_off ) * leaf_blocks/blocks

      The measured overhead for the table with an index, i_full – i_off, is:

      i_full – i_off = dt + di = ( noi_full – noi_off ) + ( noi_full – noi_off ) * leaf_blocks/blocks = ( noi_full – noi_off ) * (1 + leaf_blocks/blocks)

      For our case:

      SQL> select table_name,leaf_blocks from dba_indexes where table_name in ('T_8K','T_32K');

      TABLE_NAME   LEAF_BLOCKS
      ----------   -----------
      T_32K                 49
      T_8K                 199

      SQL> select table_name,blocks from dba_tables where table_name in ('T_8K','T_32K');

      TABLE_NAME   BLOCKS
      ----------   ------
      T_8K            244
      T_32K            60

      Verification for 8k:
      i_full – i_off = 17.23 – 7.52 = 9.71
      ( noi_full – noi_off ) * (1 + leaf_blocks/blocks) = ( 9.32 – 4.30 ) * ( 1 + 199/244) = 9.11

      Verification for 32k:
      i_full – i_off = 31.69 – 8.79 = 22.9
      ( noi_full – noi_off ) * (1 + leaf_blocks/blocks) = ( 18.19 – 4.30 ) * ( 1 + 49/60 ) = 25.23

      The predicted values are within the measurement error.
