
RE: Statistical sampling and representative stats collection

From: John Kanagaraj <john.kanagaraj_at_hds.com>
Date: Tue, 28 May 2002 10:48:56 -0800
Message-ID: <F001.0046D0E0.20020528104856@fatcity.com>


Hi Larry,

I sure wasn't disappointed to receive this note from a CBO/SQL guru such as yourself. Thanks for the ideas.

> My question here is would this database and your findings be applicable
> to other databases and the nature of their data? I wouldn't

The nature of the data and the variance between databases are so great that I would never try to generalize everything. However, the number (or percentage) of generic cases is significant compared to the specific ones, and that is probably what I was trying to get at. In many cases, the act and cost of collecting the statistics is huge compared to how little of that collected data is actually used.

> since the characteristics can be so different, but, if I'm reading you
> correctly, you aren't saying that ESTIMATE is always the only way. But, the

That is right. I think it was a case of moving on from carpet bombing (either ESTIMATE only or COMPUTE only) to some sort of precision bombing. There is still some collateral damage, though (as seen below).

> Sunday. Someone accidentally analyzed the schema at 30% on Monday and a
> lot of things went down the toilet. Going back to COMPUTE fixed things.
> Then again, maybe a 10% ESTIMATE would have fixed things. Jack and I both
> work with a guy who has talked about COMPUTE resulting in undesired
> plans, 10% did as well. They got the desired plans by going to 1%. So,
> even if one agrees that we don't necessarily have to COMPUTE, and in many
> (probably most?) cases we don't, there is still a lot of testing to be
> done to find the "best" estimate percent, and this could very well be
> different for various objects. And I think that's the battle we all face
> -- what is the best sampling percentage. And right now, it still seems to
> be done on a trial and error basis. I have some ideas on how one might
> attack this in an automated fashion, but it's still a *very* rough idea
> that I need to bounce off a few cohorts.

I think you summarized it very well. How does one figure out the best percentage, given that we will NOT be able to complete a COMPUTE within the period allowed? When does one stop experimenting (a significant DBA cost), and how does one make sure that the apple cart is not upset by new data patterns? I would be very interested in estimating the value in some automated fashion. I would be honored to be part of that bunch of cohorts!
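
Just thinking aloud about what such automation might look like: one crude approach would be a one-time COMPUTE baseline, then re-gathering at increasing sample sizes while watching a sensitive statistic such as NUM_DISTINCT until it stabilizes near that baseline. A *very* rough sketch (the table TEST_TAB, column COL1, and the 10%-step schedule are all made up for illustration):

  -- Hypothetical sketch: re-gather TEST_TAB at rising sample sizes and
  -- print NUM_DISTINCT for one column, to be eyeballed against a
  -- COMPUTE baseline (estimate_percent => NULL) taken beforehand.
  DECLARE
    l_ndv NUMBER;
  BEGIN
    FOR i IN 1 .. 5 LOOP
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => USER,
        tabname          => 'TEST_TAB',
        estimate_percent => i * 10);   -- 10, 20, 30, 40, 50 percent
      SELECT num_distinct
        INTO l_ndv
        FROM user_tab_columns
       WHERE table_name  = 'TEST_TAB'
         AND column_name = 'COL1';
      DBMS_OUTPUT.PUT_LINE('pct=' || i * 10 || ' ndv=' || l_ndv);
    END LOOP;
  END;
  /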

Btw, there is an option in 9i, DBMS_STATS.AUTO_SAMPLE_SIZE. Has anyone tried this out, or does anyone know how it works?
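
If anyone wants to kick the tires, a minimal sketch would be something like this (SCOTT.EMP is just a placeholder):

  -- Let 9i pick the sample size itself instead of a hard-coded percent.
  BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'SCOTT',
      tabname          => 'EMP',
      estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE);
  END;
  /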

> for these stats on non-indexed columns? If so, I disagree on this point,
> even if it is general advice and not a rule. Stats on non-indexed columns
> can play a *large* role in CBO decisions. I'm not going to go into
> details and examples here illustrating that, but those stats can still
> help decide the driving table, the join methods between tables, etc. I
> built a sample case some time back to illustrate the importance of
> gathering these non-indexed column stats. Now, it might not be important
> for all systems, but if you are ever using indexed columns, and still
> specifying criteria on non-indexed columns, the gathering of stats on the
> non-indexed columns could be *very* important. I can send you more
> details back-channel if you are interested.

I agree with you, although I do have to contend that the effect is not equally pronounced in all databases. This was explored in some depth in Wolfgang's paper at IOUG, where he was able to actually see this in a 10053 trace, and it was an eye opener for me. The issue I have with this is that the default number of buckets is 2, and that is nowhere near what is needed. On the other hand, indiscriminately increasing the bucket count on every column would leave you with a _large_ number of histograms and result in the 'row cache lock' being taken out more than it should be (as well as add to the cost of parsing).
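
One safer route is probably to histogram only the known-skewed columns rather than raise SIZE across the board; something like the following sketch (ORDERS, STATUS, and SIZE 254 are purely illustrative):

  -- Histogram only where the skew is known to matter; all other
  -- columns keep the default endpoints.
  BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname    => USER,
      tabname    => 'ORDERS',
      method_opt => 'FOR COLUMNS STATUS SIZE 254');
  END;
  /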

> And your approach very well could take care of most cases for many
> people. It's an interesting idea and something certainly worth playing
> around with.

Yep - and I did add a YMMV :)

I would love to see this thread grow.

John Kanagaraj
Oracle Applications DBA
DBSoft Inc
(W): 408-970-7002

The manuals for Oracle are here: http://tahiti.oracle.com
The manual for Life is here: http://www.gospelcom.net


