Re: Application of archived logs on physical standby very slow
Date: Fri, 29 Jan 2010 11:15:26 +0200
Message-ID: <OFAAC39E57.D9EF40A1-ONC22576BA.00318DE4-C22576BA.0032DE25_at_seb.lt>
I have the same issue.
vmstat shows low CPU, iostat is low but standby starts lagging by 50GB of redo when primary generates redo at some 15-20MB/second rate.
redo validation (alter system dump logfile validate...) shows 150-200 MB/sec read rate for redo.
My primary suspect is redo reader process. As far as I know it is only one process which reads and naturally it is a bottleneck.
vmstat may show next to nothing because our OS sees 64 CPUs (which is not but todays multicore, multithreaded, multi-whatever architectures translates into that many "CPU's" )
If one and the only reader takes 100% CPU then vmstat must show ~100/64 percent which is exactly what I get.
Is the reader tunable ? I doubt it....
Please consider the environment before printing this e-mail
Niall Litchfield
<niall.litchfield
_at_gmail.com> To
Sent by: Martin Bach
oracle-l-bounce_at_f <development_at_the-playground.de>
reelists.org cc
ORACLE-L <oracle-l_at_freelists.org>
Subject
2010.01.28 15:00 Re: Application of archived logs on
physical standby very slow
Please respond to
niall.litchfield_at_
gmail.com
What does the 10046 trace for the MRP process show up? You may well also endup doing os level tracing with strace or similar.
On Thu, Jan 28, 2010 at 9:15 AM, Martin Bach <development_at_the-playground.de > wrote:
Hi Niall!
couldn't agree more. I seem to have the dilemma where my storage
array seems to be in good working order and my OS doesn't report any
unusually high IO service times. I'm wondering where else I could
look for clues as to why, trying to follow a methodical approach to
troubleshooting.
The load is clearly caused by Oracle processes, shutting the standby
down reduces the iowaits to 0 (but that doesn't really solve my
problem :)
Regards,
Martin!
On 01/27/2010 07:06 PM, Niall Litchfield wrote:
I'd have thought this problem was eminently suited to method-r.
On 1/27/10, Johnson, William L (TEIS)<
WLJohnson_at_tycoelectronics.com> wrote:
What do you have your parallel_execution_message_size set
to on the standby
database?
-----Original Message-----
From: oracle-l-bounce_at_freelists.org [mailto:
oracle-l-bounce_at_freelists.org]
On Behalf Of Martin Bach
Sent: Wednesday, January 27, 2010 12:13 PM
To: ORACLE-L
Subject: Application of archived logs on physical standby
very slow
Dear listers,
I'd like to add another "interesting" problem to today's
interesting
RMAN problem. I have quite a bit of output from
diagnostic tools,
hopefully this doesn't got too badly mangled...
Our alerting system fired when my physical standby
database went out of
sync with production. DR and production share the same
storage (an entry
level MSA) and are multipathed. I could observe this on
the hosts (both
RHEL 5.2 32bit, don't ask):
Wed Jan 27 10:42:42 2010
RFS[2]: Archived Log:
'/u03/oradata/dr/arch/1_49264_596779339.dbf'
Primary database is in MAXIMUM PERFORMANCE mode
Wed Jan 27 10:47:55 2010
Media Recovery
Log /u03/oradata/dr/arch/1_49252_596779339.dbf
Wed Jan 27 10:55:27 2010
Media Recovery
Log /u03/oradata/dr/arch/1_49253_596779339.dbf
Wed Jan 27 11:01:46 2010
Online redo logs are 500M in size and so are most
archived logs, and as
you can see it takes 5-10 minutes to apply just one,
which is consistent.
top - 11:25:37 up 57 days, 17:34, 13 users, load
average: 0.55, 1.05, 1.19
Tasks: 209 total, 1 running, 208 sleeping, 0 stopped,
0 zombie
Cpu0 : 2.8%us, 1.9%sy, 0.0%ni, 51.9%id, 43.5%wa,
0.0%hi, 0.0%si,
0.0%st
Cpu1 : 1.9%us, 0.9%sy, 0.0%ni, 42.6%id, 54.6%wa,
0.0%hi, 0.0%si,
0.0%st
Cpu2 : 0.9%us, 0.0%sy, 0.0%ni, 98.1%id, 0.9%wa,
0.0%hi, 0.0%si,
0.0%st
Cpu3 : 0.0%us, 0.9%sy, 0.0%ni, 99.1%id, 0.0%wa,
0.0%hi, 0.0%si,
0.0%st
Mem: 8310064k total, 8019968k used, 290096k free,
36456k buffers
Swap: 8152948k total, 70124k used, 8082824k free,
7568912k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM
TIME+ COMMAND
21647 oracle 15 0 2178m 48m 31m S 3 0.6
0:00.41 ora_mrp0_dr
7496 oracle 15 0 2145m 1.5g 1.5g S 2 18.4
190:47.10
ora_p000_dr
7500 oracle 15 0 2145m 1.5g 1.5g S 2 18.4
199:38.12
ora_p002_dr
2580 root 10 -5 0 0 0 S 1 0.0
17:42.07
[kjournald]
7498 oracle 16 0 2145m 1.5g 1.5g D 1 18.4
189:40.17
ora_p001_dr
The only processes on cpu are MRP0 and its parallel
slaves. Funny
enough, iostat doesn't seem to see a problem:
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.25 24.75 0.00 75.00
Device: rrqm/s wrqm/s r/s w/s rkB/s
wkB/s avgrq-sz
avgqu-sz await svctm %util
cciss/c0d0 0.00 4.00 0.00 8.00 0.00
48.00 12.00
0.01 0.88 0.50 0.40
cciss/c0d1 0.00 5.00 0.00 18.00 0.00
92.00 10.22
0.02 0.89 0.06 0.10
sda 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
sdc 0.00 211.00 0.00 189.00 0.00
1600.00 16.93
0.49 2.57 2.57 48.60
sdd 0.00 203.00 0.00 183.00 0.00
1544.00 16.87
0.50 2.71 2.74 50.10
sde 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 786.00 0.00
3144.00 8.00
2.09 2.64 1.26 98.70
dm-3 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 786.00 0.00
3144.00 8.00
2.09 2.65 1.26 98.80
dm-6 0.00 0.00 0.00 23.00 0.00
92.00 8.00
0.02 0.83 0.04 0.10
dm-7 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
dm-8 0.00 0.00 0.00 0.00 0.00
0.00 0.00
0.00 0.00 0.00 0.00
Service time is pretty decent I'd say and the SAN
administrators tell me
that everything on the array is fine. I checked for AIO
and that's also
used:
[oracle]$ cat /proc/slabinfo | grep ^k
kioctx 25 60 192 20 1 :
tunables 120 60
8 : slabdata 3 3 0
kiocb 64 90 128 30 1 :
tunables 120 60
8 : slabdata 3 3 0
Data files are scattered over mount points /u02-/u06, and
there are no
multipathing (or other I/O related errors) reported in
dmesg and
/var/log/messages. Archive logs are received on the DR
box in time, and
there are no gaps. Our sys admin tells me that device
mapper multipath
is happily distributing IO round robin. Oh yes, and it's
Oracle 10.2.0.4
32bit.
Now does anyone have an idea on how to proceed?
Regards,
Martin
--
Martin Bach
OCM 10g
http://martincarstenbach.wordpress.com.
http://www.linkedin.com/in/martincarstenbach
--
http://www.freelists.org/webpage/oracle-l
--
http://www.freelists.org/webpage/oracle-l
--
Martin Bach
OCM 10g
http://martincarstenbach.wordpress.com.
http://www.linkedin.com/in/martincarstenbach
-- Niall Litchfield Oracle DBA http://www.orawin.info -- http://www.freelists.org/webpage/oracle-lReceived on Fri Jan 29 2010 - 03:15:26 CST
