RE: Application of archived logs on physical standby very slow

From: Luca Canali <Luca.Canali_at_cern.ch>
Date: Fri, 29 Jan 2010 11:38:37 +0100
Message-ID: <E0C8B02740B9074FAF5257A265F32C4A09D3737C_at_cernxchg70.cern.ch>



Hi,

What do you see in V$RECOVERY_PROGRESS as "Active Apply Rate"? (see also discussion in http://jarneil.wordpress.com/2008/07/15/monitoring-your-standby-with-vrecovery_progress/) I have 35655 KB/s on one of my systems for example.

I had a look at the discussion on redo apply tuning in Larry Carpenter's (et al.) book at page 119 [one can find it on google books too in case paper in not available :)] they mention several diagnostic and tuning techniques, but on the subject of MRP scan/read of logs they say "This part of the recovery is easily handled by the single MRP process".

Cheers,
L.

PS: in terms of further investigations http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gRecoveryBestPractices.pdf seems to suggest setting Log_Archive_Trace=8192 for some 'advance troubleshooting'

-----Original Message-----
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Laimutis.Nedzinskas_at_seb.lt Sent: 29 January 2010 10:15
To: ORACLE-L
Subject: Re: Application of archived logs on physical standby very slow

I have the same issue.

vmstat shows low CPU, iostat is low but standby starts lagging by 50GB of redo when primary generates redo at some 15-20MB/second rate.

redo validation (alter system dump logfile validate...) shows 150-200 MB/sec read rate for redo.

My primary suspect is redo reader process. As far as I know it is only one process which reads and naturally it is a bottleneck.

vmstat may show next to nothing because our OS sees 64 CPUs (which is not but todays multicore, multithreaded, multi-whatever architectures translates into that many "CPU's" )

If one and the only reader takes 100% CPU then vmstat must show ~100/64 percent which is exactly what I get.

Is the reader tunable ? I doubt it....


Please consider the environment before printing this e-mail

                                                                           
             Niall Litchfield                                              
             <niall.litchfield                                             
             _at_gmail.com>                                                To 
             Sent by:                  Martin Bach                         
             oracle-l-bounce_at_f         <development_at_the-playground.de>     
             reelists.org                                               cc 
                                       ORACLE-L <oracle-l_at_freelists.org>   
                                                                   Subject 
             2010.01.28 15:00          Re: Application of archived logs on 
                                       physical standby very slow          
                                                                           
             Please respond to                                             
             niall.litchfield_at_                                             
                 gmail.com                                                 
                                                                           
                                                                           




What does the 10046 trace for the MRP process show up? You may well also endup doing os level tracing with strace or similar.

On Thu, Jan 28, 2010 at 9:15 AM, Martin Bach <development_at_the-playground.de
> wrote:

      Hi Niall!

      couldn't agree more. I seem to have the dilemma where my storage
      array seems to be in good working order and my OS doesn't report any
      unusually high IO service times. I'm wondering where else I could
      look for clues as to why, trying to follow a methodical approach to
      troubleshooting.

      The load is clearly caused by Oracle processes, shutting the standby
      down reduces the iowaits to 0 (but that doesn't really solve my
      problem :)

      Regards,

      Martin!


      On 01/27/2010 07:06 PM, Niall Litchfield wrote:
            I'd have thought this problem was eminently suited to method-r.

            On 1/27/10, Johnson, William L (TEIS)<
            WLJohnson_at_tycoelectronics.com>  wrote:
                  What do you have your parallel_execution_message_size set
                  to on the standby
                  database?

                  -----Original Message-----
                  From: oracle-l-bounce_at_freelists.org [mailto:
                  oracle-l-bounce_at_freelists.org]
                  On Behalf Of Martin Bach
                  Sent: Wednesday, January 27, 2010 12:13 PM
                  To: ORACLE-L
                  Subject: Application of archived logs on physical standby
                  very slow

                  Dear listers,

                  I'd like to add another "interesting" problem to today's
                  interesting
                  RMAN problem. I have quite a bit of output from
                  diagnostic tools,
                  hopefully this doesn't got too badly mangled...

                  Our alerting system fired when my physical standby
                  database went out of
                  sync with production. DR and production share the same
                  storage (an entry
                  level MSA) and are multipathed. I could observe this on
                  the hosts (both
                  RHEL 5.2 32bit, don't ask):

                  Wed Jan 27 10:42:42 2010
                  RFS[2]: Archived Log:
                  '/u03/oradata/dr/arch/1_49264_596779339.dbf'
                  Primary database is in MAXIMUM PERFORMANCE mode
                  Wed Jan 27 10:47:55 2010
                  Media Recovery
                  Log /u03/oradata/dr/arch/1_49252_596779339.dbf
                  Wed Jan 27 10:55:27 2010
                  Media Recovery
                  Log /u03/oradata/dr/arch/1_49253_596779339.dbf
                  Wed Jan 27 11:01:46 2010

                  Online redo logs are 500M in size and so are most
                  archived logs, and as
                  you can see it takes 5-10 minutes to apply just one,
                  which is consistent.

                  top - 11:25:37 up 57 days, 17:34, 13 users,  load
                  average: 0.55, 1.05, 1.19
                  Tasks: 209 total,   1 running, 208 sleeping,   0 stopped,
                  0 zombie
                  Cpu0  :  2.8%us,  1.9%sy,  0.0%ni, 51.9%id, 43.5%wa,
                  0.0%hi,  0.0%si,
                  0.0%st
                  Cpu1  :  1.9%us,  0.9%sy,  0.0%ni, 42.6%id, 54.6%wa,
                  0.0%hi,  0.0%si,
                  0.0%st
                  Cpu2  :  0.9%us,  0.0%sy,  0.0%ni, 98.1%id,  0.9%wa,
                  0.0%hi,  0.0%si,
                  0.0%st
                  Cpu3  :  0.0%us,  0.9%sy,  0.0%ni, 99.1%id,  0.0%wa,
                  0.0%hi,  0.0%si,
                  0.0%st
                  Mem:   8310064k total,  8019968k used,   290096k free,
                  36456k buffers
                  Swap:  8152948k total,    70124k used,  8082824k free,
                  7568912k cached

                     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM
                  TIME+  COMMAND

                  21647 oracle    15   0 2178m  48m  31m S    3  0.6
                  0:00.41 ora_mrp0_dr

                    7496 oracle    15   0 2145m 1.5g 1.5g S    2 18.4
                  190:47.10
                  ora_p000_dr
                    7500 oracle    15   0 2145m 1.5g 1.5g S    2 18.4
                  199:38.12
                  ora_p002_dr
                    2580 root      10  -5     0    0    0 S    1  0.0
                  17:42.07
                  [kjournald]

                    7498 oracle    16   0 2145m 1.5g 1.5g D    1 18.4
                  189:40.17
                  ora_p001_dr

                  The only processes on cpu are MRP0 and its parallel
                  slaves. Funny
                  enough, iostat doesn't seem to see a problem:

                  avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                              0.00    0.00    0.25   24.75    0.00   75.00

                  Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s
                  wkB/s avgrq-sz
                  avgqu-sz   await  svctm  %util
                  cciss/c0d0        0.00     4.00  0.00  8.00     0.00
                  48.00    12.00
                      0.01    0.88   0.50   0.40
                  cciss/c0d1        0.00     5.00  0.00 18.00     0.00
                  92.00    10.22
                      0.02    0.89   0.06   0.10
                  sda               0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  sdb               0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  sdc               0.00   211.00  0.00 189.00     0.00
                  1600.00    16.93
                       0.49    2.57   2.57  48.60
                  sdd               0.00   203.00  0.00 183.00     0.00
                  1544.00    16.87
                       0.50    2.71   2.74  50.10
                  sde               0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  sdf               0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  dm-0              0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  dm-1              0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  dm-2              0.00     0.00  0.00 786.00     0.00
                  3144.00     8.00
                       2.09    2.64   1.26  98.70
                  dm-3              0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  dm-4              0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  dm-5              0.00     0.00  0.00 786.00     0.00
                  3144.00     8.00
                       2.09    2.65   1.26  98.80
                  dm-6              0.00     0.00  0.00 23.00     0.00
                  92.00     8.00
                      0.02    0.83   0.04   0.10
                  dm-7              0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00
                  dm-8              0.00     0.00  0.00  0.00     0.00
                  0.00     0.00
                      0.00    0.00   0.00   0.00

                  Service time is pretty decent I'd say and the SAN
                  administrators tell me
                  that everything on the array is fine. I checked for AIO
                  and that's also
                  used:

                  [oracle]$ cat /proc/slabinfo | grep ^k
                  kioctx                25     60    192   20    1 :
                  tunables  120   60
                    8 : slabdata      3      3      0
                  kiocb                 64     90    128   30    1 :
                  tunables  120   60
                    8 : slabdata      3      3      0

                  Data files are scattered over mount points /u02-/u06, and
                  there are no
                  multipathing (or other I/O related errors) reported in
                  dmesg and
                  /var/log/messages. Archive logs are received on the DR
                  box in time, and
                  there are no gaps. Our sys admin tells me that device
                  mapper multipath
                  is happily distributing IO round robin. Oh yes, and it's
                  Oracle 10.2.0.4
                  32bit.

                  Now does anyone have an idea on how to proceed?

                  Regards,

                  Martin
                  --
                  Martin Bach
                  OCM 10g
                  http://martincarstenbach.wordpress.com.
                  http://www.linkedin.com/in/martincarstenbach
                  --
                  http://www.freelists.org/webpage/oracle-l


                  --
                  http://www.freelists.org/webpage/oracle-l





      --
      Martin Bach
      OCM 10g
      http://martincarstenbach.wordpress.com.
      http://www.linkedin.com/in/martincarstenbach



--
Niall Litchfield
Oracle DBA
http://www.orawin.info

--
http://www.freelists.org/webpage/oracle-l


--
http://www.freelists.org/webpage/oracle-l
Received on Fri Jan 29 2010 - 04:38:37 CST

Original text of this message