Re: Collecting statistics for large established database

From: Mark D Powell <Mark.Powell_at_eds.com>
Date: Wed, 19 Aug 2009 07:45:37 -0700 (PDT)
Message-ID: <411e9562-20bb-40f6-94fe-fbdfec9ac396_at_o6g2000yqj.googlegroups.com>



On Aug 19, 7:25 am, vsevolod afanassiev
<vsevolod.afanass..._at_gmail.com> wrote:
> Let's say we have a relatively large database (2 TB or more). It is
> being used by an application that went live more than 3 years ago, so
> historic (transactional) tables contain more than 1000 days' worth of
> data. There have been no changes to the table structure, i.e. all
> tables are the same as 3 years ago. The tables are not partitioned.
> This is classic OLTP application where most tables get populated by
> online users (call centre staff). One year we may have 1000 people in
> call centres, another year we may have 2000, but we are not going to
> have one million. So transactional tables experience linear growth.
>
> Question: why do we need to collect table/index statistics on a
> regular basis (i.e. daily or weekly)? At the end of the day the
> optimizer does not care about the number of records in a table; only
> relative values are important. Let's say we have table A with
> 1,000,000 rows and table B with 1,000 rows. One year later table A
> may have 1,500,000 rows and table B may have 2,000 rows. It shouldn't
> have any impact on the plan. Yes, it is possible that as a result of
> table growth a different plan may become slightly more efficient, but
> in a real-life scenario plan stability is more important.
>
> I think the only situation where regular statistics collection is
> important is when a plan depends on the MIN or MAX value in a column.
> A purge job may be executing something like
>
> DELETE FROM TABLEA WHERE CREATE_DATE < SYSDATE - 180
>
> If column CREATE_DATE is indexed then Oracle needs to know MIN
> (CREATE_DATE) to decide whether to use this index or not. But this is
> a rare situation. It may be simpler to add hints to the few statements
> that rely on MIN / MAX values than to collect statistics every day and
> risk plan instability.
>
> In 10g the default job for statistics collection relies on table
> monitoring. Tables get analyzed when the total number of inserted /
> updated / deleted rows exceeds 10%. There are two problems with this
> approach: (1) it makes statistics collection unpredictable from the
> DBA's perspective; (2) the statement updating the internal table
> OBJECT_USAGE is responsible for a significant percentage (around 30%)
> of soft parses.
>
> It seems to me that for established applications the best approach is
> to stop statistics collection (except for partitioned tables where new
> partitions are created every day/week/month etc.). It will guarantee
> plan stability. It is possible that the performance of some statements
> will deteriorate, but it will happen gradually. With regular
> statistics collection there is always a threat of the optimizer
> choosing a sub-optimal plan. I have seen quite a few of them;
> typically it involves a switch from INDEX RANGE SCAN to FAST FULL
> INDEX SCAN, a switch from nested loops to a hash join, etc. - often
> much slower than the original plan.
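
For what it is worth, the hinting approach described above would look
roughly like this; the index name here is hypothetical:

  -- Force the CREATE_DATE index even when stale statistics would make
  -- the optimizer prefer a full table scan for the purge.
  DELETE /*+ INDEX(a TABLEA_CREATE_DATE_IX) */
    FROM tablea a
   WHERE a.create_date < SYSDATE - 180;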

The following Oracle support documents discuss the need for and timing of cost-based optimizer statistics collection. The first, 605439.1, tells you to collect statistics once and not bother to do so again. The second, 44961.1, tells you to use a small table sample size and 100% index samples regularly.
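
A collection in the style the second note describes would look roughly
like this; the schema, table, and index names are placeholders:

  BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'APP',       -- placeholder schema
      tabname          => 'TABLEA',    -- placeholder table
      estimate_percent => 5,           -- small table sample
      cascade          => FALSE);      -- indexes gathered separately below
    DBMS_STATS.GATHER_INDEX_STATS(
      ownname          => 'APP',
      indname          => 'TABLEA_CREATE_DATE_IX',  -- placeholder index
      estimate_percent => 100);        -- full compute on the index
  END;
  /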

Both are wrong. First, there is no one sample size that works well for all tables in all applications.

Some sites can collect statistics once when the tables reach a steady state, never update the statistics again, and still have well-performing SQL. Other sites will discover that this approach quickly leads to several queries using bad SQL plans.

It depends on the application design, the data, the data usage pattern, and which columns are indexed. With the first approach, tables whose indexed access depends on advancing key values such as sequences and dates may run into issues where, over time, more and more of the data no longer fits within the frozen end-point values. The net effect is worsening performance over time.
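
As a rough illustration, using the poster's TABLEA/CREATE_DATE example:
once the stored high value for CREATE_DATE is a year or two behind, a
predicate like the one below lies entirely past the known end point, the
cardinality estimate collapses toward 1, and the plan can flip:

  SELECT COUNT(*)
    FROM tablea
   WHERE create_date >= SYSDATE - 7;  -- range sits above the stale high value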

Other sites will find the 10g-and-up approach the one to follow, where Oracle automatically collects statistics daily and uses monitoring of how much of the data has changed to determine when to update statistics for a specific table. However, for some sites the daily updating will result in some queries going to lunch.
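
To see the change volumes the 10g automatic task works from, the
in-memory monitoring data can be flushed and queried (the schema name is
a placeholder):

  BEGIN
    DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;  -- push counters to the dictionary
  END;
  /

  SELECT table_owner, table_name, inserts, updates, deletes, timestamp
    FROM dba_tab_modifications
   WHERE table_owner = 'APP'                    -- placeholder schema
   ORDER BY inserts + updates + deletes DESC;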

The fix is probably to manually tune the statistics for the affected tables, lock them in place, and then let Oracle handle the rest as it wants.
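
One possible shape of that, using DBMS_STATS (schema and table names are
placeholders):

  BEGIN
    -- Gather representative statistics once, at a quiet time
    DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'APP',
      tabname          => 'TABLEA',
      estimate_percent => 100,
      method_opt       => 'FOR ALL COLUMNS SIZE AUTO',
      cascade          => TRUE);
    -- Lock them so the nightly auto task leaves this table alone
    DBMS_STATS.LOCK_TABLE_STATS(ownname => 'APP', tabname => 'TABLEA');
  END;
  /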

There is no one perfect solution, so my suggestion on 10g+ is in fact to let Oracle handle the statistics as it wants, and then manually override statistics collection for those objects where Oracle seems to get it wrong. I believe this is the best compromise approach: let Oracle auto-tune most of the statistics and then just tune what Oracle gets wrong. This should be a manageable portion of the total.

11g introduces a couple of very important statistics collection changes that work to eliminate the worst issues: the distinctness (number of distinct values) of column data being wrong due to sample size, and the cost of gathering statistics on partitioned tables.
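
In 11g those show up as the approximate-NDV algorithm used with
AUTO_SAMPLE_SIZE and incremental statistics for partitioned tables,
which can be enabled per table roughly like this (schema and table names
are placeholders):

  BEGIN
    DBMS_STATS.SET_TABLE_PREFS('APP', 'BIG_PART_TAB', 'ESTIMATE_PERCENT',
                               'DBMS_STATS.AUTO_SAMPLE_SIZE');  -- new NDV algorithm
    DBMS_STATS.SET_TABLE_PREFS('APP', 'BIG_PART_TAB', 'INCREMENTAL', 'TRUE');
  END;
  /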

HTH -- Mark D Powell -- Received on Wed Aug 19 2009 - 09:45:37 CDT
