Re: 10gR1 - Checkpoint not complete -

From: <p.santos000_at_gmail.com>
Date: 27 Oct 2006 13:53:04 -0700
Message-ID: <1161982384.192816.296860@k70g2000cwa.googlegroups.com>

Joel,

Here is what v$sgainfo looks like.

NAME                            |           BYTES
--------------------------------|----------------
Fixed SGA Size                  |       1,309,712
Redo Buffers                    |         524,288
Buffer Cache Size               |   2,147,483,648
Shared Pool Size                |   1,073,741,824
Large Pool Size                 |     218,103,808
Java Pool Size                  |      33,554,432
Streams Pool Size               |               0
Granule Size                    |      16,777,216
Maximum SGA Size                |   6,442,450,944
Startup overhead in Shared Pool |     100,663,296
Free SGA Memory Available       |   2,952,790,016

Our application does a lot of very quirky things that are painful for a DBA, but
until we redesign our new application next year, I just have to find ways to
deal with the thrashing of the database.

For example, our customers at any point during the day can update in batches of 500,000 any column in their tables. While at the same time users can go only and update their own information... so the batch updates blocks these single record updates etc ... customers have their own tables ..causing lots of similar sql ... this is the reason for a 2GB buffer cache.

So anytime the top 5 wait events change .. we can be really slow, then all of the sudden 3-5 customers start batch updating columns with anywhere from 15MM to 40MM records per table. .. this is in part of the reason we generate so much redo.

Our top five events this morning were these.

Top 5 Timed Events

~~~~~~~~~~~~~~~~~~                                        % Total
Event                                 Waits    Time (s)   DB Time

Wait Class

------------------------------ ------------ ----------- ---------
--------------
db file sequential read           4,598,287      28,970     35.84
User I/O
CPU time                                         20,952     25.92
buffer busy waits                    16,131      15,356     19.00
Concurrency
db file parallel write               34,904      13,730     16.99
System I/O
db file scattered read              292,277       6,568      8.13
User I/O
          -------------------------------------------------------------

'buffer busy waits' was because of a batch update running in parallel... probably updating records in same/similar blocks.

I understand that the LOG_CHECKPOINT parameters changed and although the recommendation is to use FAST_START_MTTR_TARGET, when I upgraded from 8.1.7 to 10.1.0.4 .. I had other battles to fight.

-peter

joel garry wrote:
> p.santos000_at_gmail.com wrote:
> > Hey guys,
> > I thought I try to get some help in trying to get this problem
> > resolved.
> > I've struggled with the 'Checkpoint not complete' error message
> > before, and thought I had it
> > resolved, but I can't seem to make it go way.
> >
> > I'm currently running 10.1.0.4 on Solaris 64bit.
> > From my analysis of this issue, I get the "Checkpoint not complete"
> > message nearly evertime
> > during a log switch. From looking at the checkpoint error timestamps
> > and the log switches,
> > 99% of the time the CKPT process fails during the log swith.
> >
> > I've always been reluctant to checkpoint too much, but maybe I need
> > to increase checkpoints,
> > sot that the entire process is faster during log switches ?????
> >
> > Here is the data. (log siwtches are occurring around 20+ minutes)
> >
> > V$LOG
> > =================
> >
> > GROUP#| SEQUENCE#| MBYTES|ARC|STATUS |FIRST_TIME
> > |LASTED_MIN
> >
> > ----------|----------|----------|---|---------------|--------------|----------
> > 1| 32198| 600|YES|INACTIVE |10/27/06 09:24|
> > 21
> > 2| 32199| 600|YES|INACTIVE |10/27/06 09:44|
> > 26
> > 3| 32200| 600|NO |CURRENT |10/27/06 10:11|
> > 6
> > 4| 32193| 600|YES|INACTIVE |10/27/06 07:12|
> > 27
> > 5| 32194| 600|YES|INACTIVE |10/27/06 07:39|
> > 26
> > 6| 32195| 600|YES|INACTIVE |10/27/06 08:05|
> > 40
> > 7| 32196| 600|YES|INACTIVE |10/27/06 08:45|
> > 29
> > 8| 32197| 600|YES|INACTIVE |10/27/06 09:14|
> > 10
> >
> >
> > ALERT_LOG (timestamps match log swith)
> > =================
> > --- These all happen during log switch...
> >
> > Fri Oct 27 08:05:31 2006
> > Thread 1 cannot allocate new log, sequence 32195
> > Checkpoint not complete
> > Current log# 5 seq# 32194 mem# 0: /z1/oradata/mail/redo_5_5.log
> >
> >
> > Fri Oct 27 08:45:18 2006
> > Thread 1 cannot allocate new log, sequence 32196
> > Checkpoint not complete
> > Current log# 6 seq# 32195 mem# 0: /z0/oradata/mail/redo_6_6.log
> >
> >
> > Fri Oct 27 09:24:21 2006
> > Thread 1 cannot allocate new log, sequence 32198
> > Checkpoint not complete
> > Current log# 8 seq# 32197 mem# 0: /z0/oradata/mail/redo_8_8.log
> >
> >
> > Fri Oct 27 09:44:59 2006
> > Thread 1 cannot allocate new log, sequence 32199
> > Checkpoint not complete
> > Current log# 1 seq# 32198 mem# 0: /z1/oradata/mail/redo_1_1.log
> >
> >
> > Fri Oct 27 10:11:13 2006
> > Thread 1 cannot allocate new log, sequence 32200
> > Checkpoint not complete
> > Current log# 2 seq# 32199 mem# 0: /z0/oradata/mail/redo_2_2.log
> >
> >
> > KEY PARAMETERS
> > =====================
> >
> > NAME |TYPE |VALUE
> > -------------------------------------------|----------------|--------
> > log_checkpoint_interval |integer |80000
> > log_checkpoint_timeout |integer |1800
> > fast_start_mttr_target |integer |0
> > log_buffer |integer |524288
> > disk_asynch_io |boolean |TRUE
> >
> > ** Haven't configured fast_start_mttr_target because log_checK*
> > parameters
> > were already set when I upgrade from 8.1.7 and so It was just one less
> > thing to
> > worry about configuring...

> Are you saying you used 817 parameters for 10g? I hope I'm misreading
> that. See the last sentence in
> http://www.jlcomp.demon.co.uk/faq/log_checkpoint.html

> In addition to what Mark said, you should post what your top waits are,
> and show SGA. I expect some issues on the size of your SGA and your
> log_buffer.

> Also you may want to research the use of async i/o on your particular
> system.

>


> >

> > AWR report from 9-10  AM this morning

> > ========================

> > Statistic                                             |Total

> > ----------------------------------------------------------- | ......

> > DBWR checkpoints                               | 4

> > background checkpoints completed  | 3

> > background checkpoints started        | 2

> >

> > - the way  I see it I'm probably not checkpointing enough .. which

> > causes the ckpt

> >  process during log switches to take too long .. is this accurate ?

> >

> > BTW - fixing application to generate less redo, has been explored and

> > we can't do any more on that front .... We generate anywhere between

> > 200K - 1MB of redo per second ... averaging out around 600-700K per

> > second.

> >
> > as always any feedback is helpful...

> You may need more and smaller logs with a bigger log buffer.

> 45 minutes for a checkpoint not to complete is rather worrisome. Your
> application code may be doing something that keeps fighting over dirty
> buffers - do you have something that repeatedly updates a small number
> of rows?

> jg
> --
> @home.com is bogus.
> So which DS should I buy the kids?
> http://www.signonsandiego.com/uniontrib/20061027/news_1b27sony.html
Received on Fri Oct 27 2006 - 15:53:04 CDT