Re: 10gR1 - Checkpoint not complete
p.santos000_at_gmail.com wrote:
> Joel,
>
> Here is what v$sgainfo looks like.
>
> NAME                            |          BYTES
> --------------------------------|---------------
> Fixed SGA Size                  |      1,309,712
> Redo Buffers                    |        524,288
> Buffer Cache Size               |  2,147,483,648
> Shared Pool Size                |  1,073,741,824
> Large Pool Size                 |    218,103,808
> Java Pool Size                  |     33,554,432
> Streams Pool Size               |              0
> Granule Size                    |     16,777,216
> Maximum SGA Size                |  6,442,450,944
> Startup overhead in Shared Pool |    100,663,296
> Free SGA Memory Available       |  2,952,790,016
>
> Our application does a lot of very quirky things that are painful for
> a DBA, but until we redesign our new application next year, I just
> have to find ways to deal with the thrashing of the database.
Unnecessary committing in the app, perhaps?
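One quick way to sanity-check that (just a sketch against the standard v$sysstat and v$instance views):

```sql
-- Rough commit rate since instance startup: lots of tiny commits from
-- the app will show up here long before they show up in the wait events.
SELECT s.value AS user_commits,
       ROUND(s.value / ((SYSDATE - i.startup_time) * 86400), 1)
         AS commits_per_sec
  FROM v$sysstat s, v$instance i
 WHERE s.name = 'user commits';
```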
>
> For example, our customers at any point during the day can update any
> column in their tables in batches of 500,000 rows. At the same time,
> users can go online and update their own information... so the batch
> updates block these single-record updates, etc. Customers have their
> own tables, causing lots of similar SQL... this is the reason for a
> 2GB buffer cache.
This is where V$BH may really help you out. Those large values for db file sequential read mean that your SGA is being thrashed. See the performance tuning manual for how to put those tables or indices in a recycle pool. In a nutshell, any table that is taking 500,000 index lookups followed by table lookups oughta be removed from the default pool, because its blocks are unnecessarily forcing themselves to the head of the LRU line. Then the flushing of dirty blocks to disk can't finish before the next time it needs to start. That's why the usual cure of simply making the redo logs bigger, to give the checkpoint time to complete, doesn't work here: you have to fix the thrashing SGA too.
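For what it's worth, a sketch of both steps. The table and index names here (CUSTOMER_DATA, CUSTOMER_DATA_IX) are made up, and the recycle pool has to be sized (DB_RECYCLE_CACHE_SIZE) before the ALTERs do anything useful:

```sql
-- See which objects are hogging the buffer cache: count of cached
-- blocks per object (V$BH.OBJD joins to DBA_OBJECTS.DATA_OBJECT_ID).
SELECT o.object_name, COUNT(*) AS cached_blocks
  FROM v$bh b JOIN dba_objects o
    ON b.objd = o.data_object_id
 GROUP BY o.object_name
 ORDER BY cached_blocks DESC;

-- Move a heavily batch-updated table (and its index) to the recycle
-- pool so its blocks age out quickly instead of flooding the LRU.
ALTER TABLE customer_data STORAGE (BUFFER_POOL RECYCLE);
ALTER INDEX customer_data_ix STORAGE (BUFFER_POOL RECYCLE);
```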
On the other hand, it's possible that the number of files and amount of transactional data you have are just plain overwhelming the db writer and log writer. (File headers need to be locked for checkpoint updates). Are you RAID-5, perhaps?
What's the deal with the shared pool size?
>
> So anytime the top 5 wait events change... we can be really slow.
> Then all of a sudden 3-5 customers start batch-updating columns in
> tables with anywhere from 15MM to 40MM records each... this is part
> of the reason we generate so much redo.
>
> Our top five events this morning were these.
>
> Top 5 Timed Events
> ~~~~~~~~~~~~~~~~~~
>                                                         % Total
> Event                            Waits      Time (s)    DB Time  Wait Class
> ------------------------------  ----------  ----------  -------  ------------
> db file sequential read          4,598,287      28,970    35.84  User I/O
> CPU time                                        20,952    25.92
> buffer busy waits                   16,131      15,356    19.00  Concurrency
> db file parallel write              34,904      13,730    16.99  System I/O
> db file scattered read             292,277       6,568     8.13  User I/O
> -------------------------------------------------------------
>
> 'buffer busy waits' was because of a batch update running in
> parallel... probably updating records in the same or similar blocks.
>
> I understand that the LOG_CHECKPOINT parameters changed and although
> the recommendation is to use FAST_START_MTTR_TARGET, when I upgraded
> from 8.1.7 to 10.1.0.4 .. I had other battles to fight.
Those LOG_CHECKPOINT parameters control how fast you can recover. See
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14237/dynviews_1133.htm
. Keep in mind that it may be more important not to screw up current
operations than to recover quickly. If you have an impossible Service
Level Agreement, that needs to be resolved, too. So you should turn on
FAST_START_MTTR_TARGET to find out the OPTIMAL_LOGFILE_SIZE according
to v$instance_recovery.
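Roughly something like this (the 300 is just an example target, in seconds; OPTIMAL_LOGFILE_SIZE comes back in megabytes):

```sql
-- Set a recovery target, let the instance run a while, then ask
-- Oracle what redo log size it thinks matches that target.
ALTER SYSTEM SET fast_start_mttr_target = 300;

SELECT target_mttr,           -- effective MTTR in use (seconds)
       estimated_mttr,        -- current estimated recovery time (seconds)
       optimal_logfile_size   -- suggested redo log size (MB)
  FROM v$instance_recovery;
```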
Metalink Note:147468.1 can help.
jg
--
@home.com is bogus. http://developers.slashdot.org/developers/06/10/30/1138257.shtml

Received on Mon Oct 30 2006 - 17:47:50 CST