RE: Application of archived logs on physical standby very slow
Date: Mon, 1 Feb 2010 17:07:42 +0200
Message-ID: <OF538CE2CB.8D3DF0CD-ONC22576BD.0052F3B4-C22576BD.0053211C_at_seb.lt>
A followup:
Event "PX Deq: Test for msg" is actually the top-most event, but I suspect it is a kind of idle event. It looks like it is just polling: are you done, are you done, ...
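One way to check is the event's wait class in v$event_name (the column exists from 10g onwards); a quick sketch:

```sql
-- If WAIT_CLASS comes back as 'Idle', the event can usually be discounted
SELECT name, wait_class
FROM   v$event_name
WHERE  name = 'PX Deq: Test for msg';
```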
Please consider the environment before printing this e-mail
From: Laimutis.Nedzinskas_at_seb.lt
Sent by: oracle-l-bounce_at_freelists.org
To: ORACLE-L <oracle-l_at_freelists.org>
Reply-To: Laimutis.Nedzinskas_at_seb.lt
Date: 2010.02.01 17:03
Subject: RE: Application of archived logs on physical standby very slow
Thank you very much for your input.
>but on the subject of MRP scan/read of logs they say "This part of the recovery is easily handled by the single MRP process".
I tried to generate some redo and checked again in a controlled environment, and yes, it may be so. The top events are "checkpoint completed" and "db file parallel write".
Looks like it's time to search for hot disks, etc.
Brgds, Laimis
From: Luca Canali <Luca.Canali_at_cern.ch>
Sent by: oracle-l-bounce_at_freelists.org
To: ORACLE-L <oracle-l_at_freelists.org>
Reply-To: Luca.Canali_at_cern.ch
Date: 2010.01.29 12:39
Subject: RE: Application of archived logs on physical standby very slow
Hi,
What do you see in V$RECOVERY_PROGRESS as "Active Apply Rate"? (see also
discussion in
http://jarneil.wordpress.com/2008/07/15/monitoring-your-standby-with-vrecovery_progress/
)
I have 35655 KB/s on one of my systems for example.
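For reference, a query along these lines (run against the standby while managed recovery is active) returns the rate; item names as of 10.2:

```sql
-- Apply-rate metrics exposed by managed recovery on the standby
SELECT item, units, sofar, total
FROM   v$recovery_progress
WHERE  item IN ('Active Apply Rate', 'Average Apply Rate', 'Apply Time per Log');
```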
I had a look at the discussion of redo apply tuning in Larry Carpenter's (et al.) book, page 119 [one can find it on Google Books too, in case the paper copy is not available :)]. They mention several diagnostic and tuning techniques, but on the subject of MRP scan/read of logs they say "This part of the recovery is easily handled by the single MRP process".
Cheers,
L.
PS: in terms of further investigations,
http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gRecoveryBestPractices.pdf
seems to suggest setting log_archive_trace=8192 for some 'advanced troubleshooting'
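Something like this, presumably (the trace ends up in background_dump_dest on the standby):

```sql
-- Level 8192 per the MAA paper; set it back to 0 when finished
ALTER SYSTEM SET log_archive_trace = 8192;
ALTER SYSTEM SET log_archive_trace = 0;
```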
-----Original Message-----
From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
On Behalf Of Laimutis.Nedzinskas_at_seb.lt
Sent: 29 January 2010 10:15
To: ORACLE-L
Subject: Re: Application of archived logs on physical standby very slow
I have the same issue.
vmstat shows low CPU, iostat is low but standby starts lagging by 50GB of redo when primary generates redo at some 15-20MB/second rate.
redo validation (alter system dump logfile validate...) shows 150-200 MB/sec read rate for redo.
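The validation run was along these lines (the file name below is just a placeholder):

```sql
-- Reads and validates an archived log without applying it, which gives
-- a feel for the raw read rate achievable by a single process
ALTER SYSTEM DUMP LOGFILE '/path/to/archived_log.dbf' VALIDATE;
```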
My primary suspect is the redo reader process. As far as I know there is only one process which reads, so it is naturally a bottleneck.
vmstat may show next to nothing because our OS sees 64 CPUs (it does not really have 64, but today's multicore, multithreaded, multi-whatever architectures translate into that many "CPUs").
If the one and only reader process takes 100% of a single CPU, then vmstat should show roughly 100/64 ≈ 1.6 percent, which is exactly what I get.
Is the reader tunable? I doubt it...
From: Niall Litchfield <niall.litchfield_at_gmail.com>
Sent by: oracle-l-bounce_at_freelists.org
To: Martin Bach <development_at_the-playground.de>
Cc: ORACLE-L <oracle-l_at_freelists.org>
Reply-To: niall.litchfield_at_gmail.com
Date: 2010.01.28 15:00
Subject: Re: Application of archived logs on physical standby very slow
What does a 10046 trace of the MRP process show? You may well also end up doing OS-level tracing with strace or similar.
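A sketch of how one might attach the trace to MRP0 from SQL*Plus (the OS pid here is only an example; take the real one from ps or v$process):

```sql
-- connect / as sysdba, then:
oradebug setospid 21647
oradebug unlimit
oradebug event 10046 trace name context forever, level 8
-- ...let recovery run for a while, then:
oradebug event 10046 trace name context off
oradebug tracefile_name
```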
On Thu, Jan 28, 2010 at 9:15 AM, Martin Bach <development_at_the-playground.de> wrote:
Hi Niall!
couldn't agree more. I have the dilemma that my storage array seems to be in good working order and my OS doesn't report any unusually high IO service times. I'm wondering where else I could look for clues as to why, trying to follow a methodical approach to troubleshooting.
The load is clearly caused by Oracle processes; shutting the standby down reduces the iowaits to 0 (but that doesn't really solve my problem :))
Regards,
Martin!
On 01/27/2010 07:06 PM, Niall Litchfield wrote:
I'd have thought this problem was eminently suited to Method R.
On 1/27/10, Johnson, William L (TEIS) <WLJohnson_at_tycoelectronics.com> wrote:
What do you have your parallel_execution_message_size set to on the standby database?
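For anyone following along, checking it is straightforward (defaults vary by platform and release):

```sql
-- Run on the standby; a larger-than-default value (e.g. 16K) is often
-- recommended when parallel recovery is in use
SELECT name, value, isdefault
FROM   v$parameter
WHERE  name = 'parallel_execution_message_size';
```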
-----Original Message-----
From: oracle-l-bounce_at_freelists.org [mailto:
oracle-l-bounce_at_freelists.org]
On Behalf Of Martin Bach
Sent: Wednesday, January 27, 2010 12:13 PM
To: ORACLE-L
Subject: Application of archived logs on physical standby
very slow
Dear listers,
I'd like to add another "interesting" problem to today's interesting RMAN problem. I have quite a bit of output from diagnostic tools; hopefully this doesn't get too badly mangled...
Our alerting system fired when my physical standby database went out of sync with production. DR and production share the same storage (an entry-level MSA) and are multipathed. I could observe this on the hosts (both RHEL 5.2 32bit, don't ask):
Wed Jan 27 10:42:42 2010
RFS[2]: Archived Log: '/u03/oradata/dr/arch/1_49264_596779339.dbf'
Primary database is in MAXIMUM PERFORMANCE mode
Wed Jan 27 10:47:55 2010
Media Recovery Log /u03/oradata/dr/arch/1_49252_596779339.dbf
Wed Jan 27 10:55:27 2010
Media Recovery Log /u03/oradata/dr/arch/1_49253_596779339.dbf
Wed Jan 27 11:01:46 2010
Online redo logs are 500M in size and so are most archived logs, and as you can see it takes 5-10 minutes to apply just one, which is consistent.
top - 11:25:37 up 57 days, 17:34, 13 users,  load average: 0.55, 1.05, 1.19
Tasks: 209 total, 1 running, 208 sleeping, 0 stopped, 0 zombie
Cpu0 : 2.8%us, 1.9%sy, 0.0%ni, 51.9%id, 43.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 1.9%us, 0.9%sy, 0.0%ni, 42.6%id, 54.6%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.9%us, 0.0%sy, 0.0%ni, 98.1%id,  0.9%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.9%sy, 0.0%ni, 99.1%id,  0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  8310064k total, 8019968k used,  290096k free,   36456k buffers
Swap: 8152948k total,   70124k used, 8082824k free, 7568912k cached

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
21647 oracle 15  0 2178m  48m  31m S    3  0.6   0:00.41 ora_mrp0_dr
 7496 oracle 15  0 2145m 1.5g 1.5g S    2 18.4 190:47.10 ora_p000_dr
 7500 oracle 15  0 2145m 1.5g 1.5g S    2 18.4 199:38.12 ora_p002_dr
 2580 root   10 -5     0    0    0 S    1  0.0  17:42.07 [kjournald]
 7498 oracle 16  0 2145m 1.5g 1.5g D    1 18.4 189:40.17 ora_p001_dr
The only processes on CPU are MRP0 and its parallel slaves. Funnily enough, iostat doesn't seem to see a problem:
avg-cpu:  %user %nice %system %iowait %steal %idle
           0.00  0.00    0.25   24.75   0.00 75.00

Device:    rrqm/s wrqm/s  r/s    w/s  rkB/s   wkB/s avgrq-sz avgqu-sz await svctm %util
cciss/c0d0   0.00   4.00 0.00   8.00   0.00   48.00    12.00     0.01  0.88  0.50  0.40
cciss/c0d1   0.00   5.00 0.00  18.00   0.00   92.00    10.22     0.02  0.89  0.06  0.10
sda          0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
sdb          0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
sdc          0.00 211.00 0.00 189.00   0.00 1600.00    16.93     0.49  2.57  2.57 48.60
sdd          0.00 203.00 0.00 183.00   0.00 1544.00    16.87     0.50  2.71  2.74 50.10
sde          0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
sdf          0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
dm-0         0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
dm-1         0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
dm-2         0.00   0.00 0.00 786.00   0.00 3144.00     8.00     2.09  2.64  1.26 98.70
dm-3         0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
dm-4         0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
dm-5         0.00   0.00 0.00 786.00   0.00 3144.00     8.00     2.09  2.65  1.26 98.80
dm-6         0.00   0.00 0.00  23.00   0.00   92.00     8.00     0.02  0.83  0.04  0.10
dm-7         0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
dm-8         0.00   0.00 0.00   0.00   0.00    0.00     0.00     0.00  0.00  0.00  0.00
Service time is pretty decent, I'd say, and the SAN administrators tell me that everything on the array is fine. I checked for AIO and that's also in use:
[oracle]$ cat /proc/slabinfo | grep ^k
kioctx  25 60 192 20 1 : tunables 120 60 8 : slabdata 3 3 0
kiocb   64 90 128 30 1 : tunables 120 60 8 : slabdata 3 3 0
Data files are scattered over mount points /u02-/u06, and there are no multipathing (or other I/O-related) errors reported in dmesg or /var/log/messages. Archive logs are received on the DR box in time, and there are no gaps. Our sysadmin tells me that device-mapper multipath is happily distributing IO round-robin. Oh yes, and it's Oracle 10.2.0.4 32bit.
Now, does anyone have an idea on how to proceed?
Regards,
Martin
--
Martin Bach
OCM 10g
http://martincarstenbach.wordpress.com
http://www.linkedin.com/in/martincarstenbach
--
http://www.freelists.org/webpage/oracle-l
--
Niall Litchfield
Oracle DBA
http://www.orawin.info

Received on Mon Feb 01 2010 - 09:07:42 CST
