Re: Application of archived logs on physical standby very slow

From: <Laimutis.Nedzinskas_at_seb.lt>
Date: Fri, 29 Jan 2010 11:15:26 +0200
Message-ID: <OFAAC39E57.D9EF40A1-ONC22576BA.00318DE4-C22576BA.0032DE25_at_seb.lt>



I have the same issue.

vmstat shows low CPU and iostat shows little I/O, but the standby starts lagging by 50GB of redo when the primary generates redo at some 15-20 MB/second.
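
For what it is worth, the lag and the apply rate can be put into numbers on the
standby itself with something along these lines (item names as of 10gR2, adjust
for your release):

select item, units, sofar, total
  from v$recovery_progress;

-- highest sequence received vs. highest sequence applied
select max(sequence#) last_received,
       max(decode(applied, 'YES', sequence#)) last_applied
  from v$archived_log;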

Redo validation (alter system dump logfile ... validate) shows a 150-200 MB/sec read rate for the redo.
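
For reference, the full form of the command is along these lines, with one of
the standby's archived logs substituted:

alter system dump logfile '/u03/oradata/dr/arch/1_49252_596779339.dbf' validate;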

My primary suspect is the redo reader process. As far as I know there is only one process reading the redo, and naturally it is a bottleneck.
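
At least what the single coordinator is doing can be watched from
V$MANAGED_STANDBY, something like:

select process, status, thread#, sequence#, block#, blocks
  from v$managed_standby;

MRP0 is the row to watch; BLOCK# creeping slowly through the current SEQUENCE#
would point the same way.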

vmstat may show next to nothing because our OS sees 64 CPUs (it does not really have 64, but today's multicore, multithreaded, multi-whatever architectures translate into that many "CPUs").

If the one and only reader takes 100% of a single CPU, then vmstat must show about 100/64, i.e. roughly 1.6 percent overall, which is exactly what I get.

Is the reader tunable? I doubt it...
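
(The closest thing to a knob I am aware of is the degree of apply parallelism
rather than the reader itself; in 10g managed recovery can be restarted with an
explicit PARALLEL clause, roughly:)

alter database recover managed standby database cancel;
alter database recover managed standby database disconnect from session parallel 8;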



                                                                           
From: Niall Litchfield <niall.litchfield_at_gmail.com>
Sent by: oracle-l-bounce_at_freelists.org
Reply-To: niall.litchfield_at_gmail.com
Date: 2010.01.28 15:00
To: Martin Bach <development_at_the-playground.de>
Cc: ORACLE-L <oracle-l_at_freelists.org>
Subject: Re: Application of archived logs on physical standby very slow




What does a 10046 trace of the MRP process show? You may well also end up doing OS-level tracing with strace or similar.
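
A minimal sketch of that from SQL*Plus as sysdba, attaching to MRP0 by its OS
pid (21647 in the top output you posted):

oradebug setospid 21647
oradebug unlimit
oradebug event 10046 trace name context forever, level 12
-- let a log or two get applied, then
oradebug event 10046 trace name context off
oradebug tracefile_name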

On Thu, Jan 28, 2010 at 9:15 AM, Martin Bach <development_at_the-playground.de> wrote:

      Hi Niall!

      couldn't agree more. I seem to have the dilemma where my storage
      array seems to be in good working order and my OS doesn't report any
      unusually high I/O service times. I'm wondering where else I could
      look for clues as to why this happens, while trying to follow a
      methodical approach to troubleshooting.

      The load is clearly caused by Oracle processes: shutting the standby
      down reduces the iowaits to 0 (but that doesn't really solve my
      problem :)

      Regards,

      Martin!


      On 01/27/2010 07:06 PM, Niall Litchfield wrote:
            I'd have thought this problem was eminently suited to method-r.

            On 1/27/10, Johnson, William L (TEIS)
            <WLJohnson_at_tycoelectronics.com> wrote:
                  What do you have your parallel_execution_message_size set
                  to on the standby
                  database?
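
                  (For what it is worth, it can be checked with a plain
                  "show parameter"; it is a static parameter, so changing it
                  means an spfile change and an instance restart, e.g.:)

                  show parameter parallel_execution_message_size
                  -- assuming an spfile is in use
                  alter system set parallel_execution_message_size = 16384 scope = spfile;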

                  -----Original Message-----
                  From: oracle-l-bounce_at_freelists.org
                  [mailto:oracle-l-bounce_at_freelists.org]
                  On Behalf Of Martin Bach
                  Sent: Wednesday, January 27, 2010 12:13 PM
                  To: ORACLE-L
                  Subject: Application of archived logs on physical standby
                  very slow

                  Dear listers,

                  I'd like to add another "interesting" problem to today's
                  interesting RMAN problem. I have quite a bit of output from
                  diagnostic tools; hopefully this doesn't get too badly
                  mangled...

                  Our alerting system fired when my physical standby
                  database went out of
                  sync with production. DR and production share the same
                  storage (an entry
                  level MSA) and are multipathed. I could observe this on
                  the hosts (both
                  RHEL 5.2 32bit, don't ask):

                  Wed Jan 27 10:42:42 2010
                  RFS[2]: Archived Log:
                  '/u03/oradata/dr/arch/1_49264_596779339.dbf'
                  Primary database is in MAXIMUM PERFORMANCE mode
                  Wed Jan 27 10:47:55 2010
                  Media Recovery
                  Log /u03/oradata/dr/arch/1_49252_596779339.dbf
                  Wed Jan 27 10:55:27 2010
                  Media Recovery
                  Log /u03/oradata/dr/arch/1_49253_596779339.dbf
                  Wed Jan 27 11:01:46 2010

                  Online redo logs are 500M in size and so are most
                  archived logs, and as
                  you can see it takes 5-10 minutes to apply just one,
                  which is consistent.

                  top - 11:25:37 up 57 days, 17:34, 13 users,  load average: 0.55, 1.05, 1.19
                  Tasks: 209 total,   1 running, 208 sleeping,   0 stopped,   0 zombie
                  Cpu0  :  2.8%us,  1.9%sy,  0.0%ni, 51.9%id, 43.5%wa,  0.0%hi,  0.0%si,  0.0%st
                  Cpu1  :  1.9%us,  0.9%sy,  0.0%ni, 42.6%id, 54.6%wa,  0.0%hi,  0.0%si,  0.0%st
                  Cpu2  :  0.9%us,  0.0%sy,  0.0%ni, 98.1%id,  0.9%wa,  0.0%hi,  0.0%si,  0.0%st
                  Cpu3  :  0.0%us,  0.9%sy,  0.0%ni, 99.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
                  Mem:   8310064k total,  8019968k used,   290096k free,    36456k buffers
                  Swap:  8152948k total,    70124k used,  8082824k free,  7568912k cached


   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
 21647 oracle    15   0 2178m  48m  31m S    3  0.6   0:00.41 ora_mrp0_dr
  7496 oracle    15   0 2145m 1.5g 1.5g S    2 18.4 190:47.10 ora_p000_dr
  7500 oracle    15   0 2145m 1.5g 1.5g S    2 18.4 199:38.12 ora_p002_dr
  2580 root      10  -5     0    0    0 S    1  0.0  17:42.07 [kjournald]
  7498 oracle    16   0 2145m 1.5g 1.5g D    1 18.4 189:40.17 ora_p001_dr

The only processes on cpu are MRP0 and its parallel slaves. Funny enough,
iostat doesn't seem to see a problem:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.25   24.75    0.00   75.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00     4.00    0.00    8.00     0.00    48.00    12.00     0.01    0.88   0.50   0.40
cciss/c0d1        0.00     5.00    0.00   18.00     0.00    92.00    10.22     0.02    0.89   0.06   0.10
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00   211.00    0.00  189.00     0.00  1600.00    16.93     0.49    2.57   2.57  48.60
sdd               0.00   203.00    0.00  183.00     0.00  1544.00    16.87     0.50    2.71   2.74  50.10
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00  786.00     0.00  3144.00     8.00     2.09    2.64   1.26  98.70
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00  786.00     0.00  3144.00     8.00     2.09    2.65   1.26  98.80
dm-6              0.00     0.00    0.00   23.00     0.00    92.00     8.00     0.02    0.83   0.04   0.10
dm-7              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-8              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Service time is pretty decent I'd say and the SAN administrators tell me
that everything on the array is fine. I checked for AIO and that's also
used:

[oracle]$ cat /proc/slabinfo | grep ^k
kioctx                25     60    192   20    1 : tunables  120   60    8 : slabdata      3      3      0
kiocb                 64     90    128   30    1 : tunables  120   60    8 : slabdata      3      3      0
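
(The database-side counterpart of that check is just the standard init
parameters; from SQL*Plus, something like:)

show parameter filesystemio_options
show parameter disk_asynch_io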

Data files are scattered over mount points /u02-/u06, and there are no
multipathing (or other I/O-related) errors reported in dmesg and
/var/log/messages. Archive logs are received on the DR box in time, and
there are no gaps. Our sysadmin tells me that device mapper multipath is
happily distributing I/O round robin. Oh yes, and it's Oracle 10.2.0.4
32bit.

Now does anyone have an idea on how to proceed?

Regards,

Martin

--
Martin Bach
OCM 10g
http://martincarstenbach.wordpress.com
http://www.linkedin.com/in/martincarstenbach

--
http://www.freelists.org/webpage/oracle-l
--
Niall Litchfield
Oracle DBA
http://www.orawin.info

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Jan 29 2010 - 03:15:26 CST
