is SIZE AUTO ever wrong? or am I wrong?

From: neil kodner <nkodner_at_gmail.com>
Date: Fri, 20 Nov 2009 12:05:13 -0500
Message-Id: <A7D6B815-CF33-4EF0-B03C-06C920775A3A_at_gmail.com>



Does Oracle ever make the wrong decision when using FOR COLUMN column_name SIZE AUTO?

I ask because I have table letter_bin with 4778159 rows, claims with 487842 rows.

letter_bin is indexed by both clm_id and prty_id

Here are the distributions of the counts of the columns:

table claims column clmt_prty_id

     count min max avg stddev
---------- ---------- ---------- ---------- ----------

    319156 1 29 1.52855657 .939775607

table letter_bin column prty_id

     count min max avg stddev
---------- ---------- ---------- ---------- ----------

    234326 1 1712371 20.3912157 3542.44185

table letter_bin column clm_id

     count min max avg stddev
---------- ---------- ---------- ---------- ----------

    470855 1 1424 10.147916 7.88601772

I just now realized that the numbers for letter_bin include NULLs.

excluding NULLs, I get

table letter_bin column clm_id

     count min max avg stddev
---------- ---------- ---------- ---------- ----------

    470859 1 1424 10.1464345 7.82825864

table letter_bin column prty_id

     count min max avg stddev
---------- ---------- ---------- ---------- ----------

    234326 1 42314 13.0835759 188.911964

letter_bin is typically queried by both clm_id and prty_id at the same time. although both columns are NULL able, when the table is queried in a certain fashion, both values are used, one as a literal and one as a result of a join. prty_id is supplied, clm_id is the result of querying by clmt_prty_id

I analyzed letter bin using for columns prty_id size auto, and for columns clm_id size auto. both 100% samples, execute separately.

When I compute statistics on column prty_id, I get 254 buckets.which I expect. When I compute statistics on clm_id, I get no histogram although I think I should be expecting one. To me, the concept of 'skewed' is somewhat nebulous but I think I ought to get a histogram here.

I prefer letter_bin to by clm_id and not by prty_id. I say this because the average number of rows in letter_bin for a clm_id should be 10 with a lower SD

an example query would be along the lines of select xyz from letter_bin a,claims b where a.prty_id = 123 and a.clm_id=b.clm_id and b.clmt_prty_id=456. The query typically uses access path of letter_bin-party_id and then join to claims. I would expect it to be the other way around.

select count(*) "count",count(distinct &2)"distinct",min(cnt)"min", max(cnt)"max", avg(cnt)"avg", stddev(cnt)"stddev"   from ( select /*+ PARALLEL (a 6) */ &2, count(*) cnt from &1 a group by &2 );

Now here's where the plot thickens: This query is executed thousands of times in a batch job. I think it would be more efficient for letter_bin to be accessed by clm_id, and its not. It's getting accessed by prty_id and HJ to claims. Am I being unreasonable, or are my stats no good?

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Nov 20 2009 - 11:05:13 CST

Original text of this message