Re: Interesting RMAN induced "hang" behavior

From: hpuxrac <johnbhurley_at_sbcglobal.net>
Date: Sat, 28 Nov 2009 18:18:25 -0800 (PST)
Message-ID: <6f1d6244-5530-4f54-b8db-b918fba61715_at_r5g2000yqb.googlegroups.com>



On Nov 27, 11:59 am, Pat <pat.ca..._at_service-now.com> wrote:

snip

> So I recently ran down a really weird problem we were having on one of
> our production 10.2.0.4 servers.
>
> To make a long story short, recently during the RMAN backup window,
> the database would go into a "locked" state and all transactions would
> pause. Lock durations ranged from a few seconds (which nobody
> noticed), to 12 minutes (which was a p1 outage and got me up out of
> bed) last Friday.
>
> All the "stuck" processes reported they were waiting on latch: library
> cache, which made me scratch my head. During the "hangs" IO and load
> on the server dropped to zero. CPU was 99-100% idle. Blocks in/out was
> in the mid double digits, etc.
>
> I wasted a LOT of time fiddling with the REDO logs since I was sure
> that was the problem e.g. overnight backups = busy SAN = poor REDO
> flush = waits. I put the REDO on new luns, I brought in the SAN guys,
> I pulled 30 days worth of sar data, etc. That didn't turn out to be
> the problem at all though.
>
> We eventually ran the problem down to the RMAN backup *target* being
> backlogged. RMAN was reading off the SAN and sending it out over the
> wire to an IBM XIV where the backup set was being written. The XIV was
> getting more and more overloaded though, and since it's a "backup"
> device the SAN guys more or less shrugged about it.
>
> From what I could observe though, RMAN was reading blocks off the SAN
> nice and quickly, then pinning them in memory. Then it'd turn around
> and flush them through to the XIV. The XIV though was so backlogged
> that it was taking minutes to flush the blocks through, and during
> that time the blocks were still pinned in memory. Pin the wrong blocks
> and you effectively put a global lock on the database, which is what
> we were running into.
>
> Moving the backups off the XIV and onto another, unloaded, backup
> device fixed the symptom and had the beneficial side effect of making
> my backups complete in 1/3 the time.
>
> The question I've got for the group though is, has anybody seen
> anything like this before? I feel like I've got to be missing
> something here; I mean I used to use RMAN to back up to *tape* for
> heaven's sake which is slower than even a highly loaded XIV, and I
> never saw this problem.
>
> What am I missing here? Does anybody have a good theory on why a slow
> backup target would cause these sorts of symptoms?
>
> I'm relatively confident I fixed the "problem" here, but I really want
> to understand what went on here in case I run into it somewhere else.

What was the lock? Was it always the same lock?

What does your backup script look like exactly?

Received on Sat Nov 28 2009 - 20:18:25 CST
