Home » RDBMS Server » Performance Tuning » highly skewed data distribution
highly skewed data distribution [message #134574] Thu, 25 August 2005 10:52 Go to next message
sebastianR
Messages: 33
Registered: August 2005
Member
Hi,

what shall I do if an index range scan is performing poor because of highly skewed data distribution? Do I have highly skewed data distribution at all?

I am searching in a column named KEY_VAL values like:

-'1%'
-'12%'
-'123%'
-'1234%'

At '123%' the CBO switches from full table scan to use my index which is the absolut right decision.

I've selected the count of '1%', '2%', '3%', ... ,'9%', 'A%', 'Z%' with the following result:
KEY_VAL | count(KEY_VAL)
==================
'1%'      61 000
'2%'      48 000
'3%'      38 000
'4%'      12 000
'5%'      19 000
'6%'      18 000
'7%'      17 000
'8%'      23 000
'9%'      25 000
'A%'      11 000
...
...
'Z%'      27 000
==================
total rows 870 000 (640 000 distinct KEY_VALs)

Is this list evidence for skewed or non skewed data distribution?

I have made a histogram like this for the respective index and table:
exec dbms_stats.gather_table_stats(ownname => 'myscheme', tabname => 'table1', method_opt => 'FOR COLUMNS KEY_VAL SIZE 250');
exec dbms_stats.gather_index_stats(ownname => 'myscheme', indname => 'IDX_OBJ_KEY_1');


In the table there are 640887 DISTINCT values of KEY_VAL where there are 870 000 rows in total.

And now there is a TestCase which totally makes me wonder:
Searching '123%' returned 593 rows in 12 seconds.
But searching 'AA0%' returns 996 rows in only 4 seconds!
After intense testing 'AA0%' remains the exception to be that fast. But I don't understand why?

1. question:
Am I right putting the histograms bucket size as high as possible, due I have many distinct values?
Ad1) Is there a way to set the bucket size even higher than 254?
Ad2) Am I wrong using a histogram in this case at all?

2. question:
Why does 'AA0%' return that fast (4 seconds), where '123%' (and nearly all other tested cases) need at least 12 seconds?
AA0% returns more rows!!!, therefore I thought it would be slower? There seems to be no relation (at least no simple relation) between the number of rows returned and the time needed to execute?

3. question
Is it that the data is skewed, and the case 'AA0%' is just luckily placed in the index so that the search in the index is faster than searching for '123%'?

4. question
how is the index search performed? it is a b-tree (with depth blevel = 2, clustering factor 600 000); so the maximum search time would be calculated log N, right?
Did I get it right that when searching the index, it'll start by looking at the very middle position of the index and deciding if the value it is looking for, rests in the lower half or rests in the upper half, picking the correct half and doing the same there; getting the middle value and deciding to go to lower half or upper half; until it reaches the value? binary search (binäre suche)?
Am I totally wrong?

5. question
how is the index sorted? Alphabetically?

I really appreciate every comment or suggestion and I thank you for your time,

SebastianR

Re: highly skewed data distribution [message #134576 is a reply to message #134574] Thu, 25 August 2005 11:03 Go to previous messageGo to next message
sebastianR
Messages: 33
Registered: August 2005
Member
by the way, i don't index the field KEY_VAL directly,

the index looks like:
CREATE INDEX IDX_OBJ_KEY_1 ON FDMT_OBJ_KEY
("FDMT"."ISDATENULL"("END_DATE"), UPPER("KEY_VAL"))
LOGGING
TABLESPACE FDMT_IDX
PCTFREE    10
INITRANS   2
MAXTRANS   255
STORAGE    (
            INITIAL          64K
            MINEXTENTS       1
            MAXEXTENTS       2147483645
            PCTINCREASE      0
            BUFFER_POOL      DEFAULT
           )

where isdatenull is just checking if the supplied DATE IS NULL OR NOT;

the selectivity of END_DATE: 18 / 870 000
the selectivity of UPPER(KEY_VAL): 640 000

So I think the index will first filter every rows where the date is not null and not > SYSDATE, and then filter using KEY_VAL.

Here is the used SQL Query:
  SELECT DISTINCT T1.OBJ_ID, 4141 
  FROM TABLE1 T1 
  WHERE T1.end_date>sysdate 
  UNION SELECT DISTINCT T1.OBJ_ID, 4141 
  FROM TABLE1 T1
  WHERE UPPER(T1.KEY_VAL) LIKE UPPER('123%') 
  AND isdatenull(T1.END_DATE) = 1
  UNION  SELECT DISTINCT T2.KEY_ID, 4141 
  FROM TABLE2 T2
  WHERE UPPER(T2.PUID) LIKE UPPER('123%')


Re: highly skewed data distribution [message #134586 is a reply to message #134574] Thu, 25 August 2005 12:01 Go to previous messageGo to next message
smartin
Messages: 1803
Registered: March 2005
Location: Jacksonville, Florida
Senior Member
Ok, I like the thought you have given this, and the information you have provided. Couple thoughts in random order:

The response time isn't determined necessarily by the number of rows returned, but where those blocks are located. It is the number of blocks accessed, and the method (single or multiblock) of that access.

One thing you can try is to sort your data in different ways, and see if your attempts using '123%' vs 'AA0%' have more closely the same times. The ways to try are a) by date then key value within date b) by key value then date within key value c) some other column d) random order.

But, your situation is complicated by the fact that you are probably using an index skip scan (it would have been helpful for you to post your explain plan also), instead of a normal range scan or unique scan. A skip scan treats the index as if it were several different indexes, one for each of the different leading column values in the index.

So, you'll want to test what happens if you reverse the order of your index columns (try it both ways). Also you can probably use index key compression with some success here.

For more on access paths (how data is read), you should read the performance tuning guide that comes with 10g (even if you aren't on 10g). Specifically part IV.

I like to make my histogram bucket size as big as possible, which I think is in the 250ish range (it is documented, but I never remember it, so I just use 250). I also always like to gather histograms, regardless of whether I think the data is skewed or not (and regardless of whether it actually is skewed). The more info for the CBO the better.

As far as whether your key_val column is skewed, it doesn't look like it is skewed that much, with 600,000 unique values out of 800,000 total values, but we don't have enough information to know. Your count() query wasn't so much showing skewness on the key_val column itself, because you were only showing the leading character of key_val, not a count of the occurances of the whole value. But don't worry with it, just let oracle gather histograms for all columns when you get stats.

As far as stats go, I like to use the cascade option of gather_table_stats to get both index and table stats at the same time, rather than doing a separate call to gather_index_stats. gather_table_stats(owner,table,cascade=>true,method_opt=>'for all columns size 250'). You can use degree and estimate_percent params as needed.

And why do you have the isdatenull function thing in there to begin with, and why is it indexed? Btree indexes (which are sorted alphabetically) do not store null values. Use that to your advantage and think about what that means. Also, why can't you just use where date_col is null to find out if it is null?
Re: highly skewed data distribution [message #134699 is a reply to message #134574] Fri, 26 August 2005 05:02 Go to previous messageGo to next message
sebastianR
Messages: 33
Registered: August 2005
Member
First of all, I dumped the isDateNull function because it didn't improve performance at all.

I used it because i have two clauses regarding the field END_DATE:
1) END_DATE IS NULL
OR
2) END_DATE > SYSDATE

I used the isdatenull function so I could make an index on UPPER(KEY_VAL) (that is the field I am searching on) and END_DATE, because I believed in performance improvement using a concatenated index. The OR condition was messing up that concept, therefore I selected the same table twice joining them using UNION ALL so at one table I selected all rows which matched END_DATE>SYSDATE and on the other I could use my concatenated index on isdatenull(END_DATE), UPPER(KEY_VAL).
Please feel free to judge this approach, I'd appreciate your thoughts on this in general.
Anyway, I was wrong, a simple index on upper(KEY_VAL) is equally good.
Therefore, please allow me to repost my actual select-query:
(SELECT DISTINCT T1.OBJ_ID,1228
FROM TABLE1 T1
WHERE UPPER(T1.KEY_VAL) LIKE UPPER('123%')'
AND (T1.END_DATE > SYSDATE OR T1.END_DATE IS NULL)
UNION SELECT DISTINCT T2.KEY_ID, 1228
FROM TABLE2 T2
WHERE UPPER(T2.PUID) LIKE UPPER('123%'));			


I am not sure if I understood what you meant by 'sort your data in different ways'. Do you speak of the possibility of poor clustering?

Do you want me to physically reorder my data, on different columns, or did you just mean I should change the order of my where-clauses?
I did the latter with no noticable change, but I've read this interesting white paper about indexing: download.oracle.com/oowsf2004/1272_wp.pdf

In there, the author points out 3 reasons for poor indexing, or non-choosen indexes by the CBO:
1. Selectivity
2. Uneven Data distribution
3. Poor Clustering
I am pretty sure now, that Selectivity and Data distribution are good, so I am looking now if maybe my clustering is bad.

And I think that might be it, because when sorted ASC on the primary key KEY_ID (which is not used when selecting) the values in the KEY_VAL fields are differing from row to row.

As this is not an exact determination for bad clustering I will post my index information:
table info:
num_rows 870 000
blocks 6254

num_rows: 871439
distinct_keys_: 640884
avg leaf blocks per key: 1
avg data blocks per key: 1
clustering factor: 605687 <--- ISN'T THIS VEEEERY HIGH???

As I read here (http://www.blacksheepnetworks.com/security/resources/oracle/faqdbapf.htm):
USER_INDEXES.CLUSTERING_FACTOR - This defines how ordered the rows are in the index. If CLUSTERING_FACTOR approaches the number of blocks in the table, the rows are ordered. If it approaches the number of rows in the table, the rows are randomly ordered. In such a case, it is unlikely that index entries in the same leaf block will point to rows in the same data blocks.

Okay, so 605687 nearly is 640884: bad clustering
What possibilities do I have to make this better? Please post any ideas and/or suggestions.

Thank you in advance,

SebastianR
Re: highly skewed data distribution [message #134702 is a reply to message #134574] Fri, 26 August 2005 05:05 Go to previous messageGo to next message
sebastianR
Messages: 33
Registered: August 2005
Member
For for the sake of completeness here the explain plan for the posted select query:

Operation	Object Name	Rows	Bytes	Cost	Object Node	In/Out	PStart	PStop

SELECT STATEMENT Optimizer Mode=CHOOSE		493  	 	372  	 	      	             	 
  SORT UNIQUE		493  	7 K	372  	 	      	             	 
    UNION-ALL		  	 	 	 	      	             	 
      TABLE ACCESS BY INDEX ROWID	TABLE1.TABLE1_OBJ_KEY	493  	7 K	364  	 	      	             	 
        INDEX RANGE SCAN	TABLE1.SR_OBJ_KEY_5	518  	 	4  	 	      	             	 
      TABLE ACCESS BY INDEX ROWID	TABLE1.TABLE1_ASSET	1  	17  	3  	 	      	             	 
        INDEX RANGE SCAN	TABLE1.IX_TABLE1_ASSET_2	1  	 	2  	 	      	             	 

Re: highly skewed data distribution [message #135016 is a reply to message #134574] Mon, 29 August 2005 08:21 Go to previous message
sebastianR
Messages: 33
Registered: August 2005
Member
I've now created a temporary Table with the exact same data like the one I am working with, except I added a
order by key_val

clause to it, so the rows are now in the order I like for the specific select-query.

I also created the same indexes and gathered statistics for the table and indexes.

Now, if I look on the index statistics, I get the following:

TABLE_NAME	BLEVEL	AVG_DATA_BLOCKS_PER_KEY	AVG_LEAF_BLOCKS_PER_KEY	CLUSTERING_FACTOR

SR_TEST_OBJ_KEY	SR_TEST_IDX_1	24	6	448
SR_TEST_OBJ_KEY	SR_TEST_IDX_2	1	1	9261
SR_TEST_OBJ_KEY	SR_TEST_IDX_3	1	1	641089



where the first index is on end_date, the second on UPPER(KEY_VAL) and the third on REVERSE(UPPER(KEY_VAL)).

Ignoring the first index (on end_date), the second one greatly improved on the clustering_factor while the third one remained the same (which is absolutly clear to me).

Testing the search with closing wildcard ('123%'), there is no significant difference in execution time.

So I guess the clustering_factor is not a too critical factor at all.

SebastianR
Previous Topic: clustering vs. reorganization
Next Topic: Update issue with high volume table
Goto Forum:
  


Current Time: Fri Dec 09 12:54:04 CST 2022