Re: Interesting RMAN induced "hang" behavior

From: Pat <pat.casey_at_service-now.com>
Date: Sat, 28 Nov 2009 18:26:50 -0800 (PST)
Message-ID: <716bb417-81f3-47b8-8937-7831fe058a9c_at_g27g2000yqn.googlegroups.com>



On Nov 28, 6:18 pm, hpuxrac <johnbhur..._at_sbcglobal.net> wrote:
> On Nov 27, 11:59 am, Pat <pat.ca..._at_service-now.com> wrote:
>
> snip
>
>
>
> > So I recently ran down a really weird problem we were having on one of
> > our production 10.2.0.4 servers.
>
> > To make a long story short, recently during the RMAN backup window,
> > the database would go into a "locked" state and all transactions would
> > pause. Lock durations ranged from a few seconds (which nobody
> > noticed), to 12 minutes (which was a p1 outage and got me up out of
> > bed) last Friday.
>
> > All the "stuck" processes reported they were waiting on latch: library
> > cache, which made me scratch my head. During the "hangs" IO and load
> > on the server dropped to zero. CPU was 99-100% idle. Blocks in/out was
> > in the mid double digits, etc.
>
> > I wasted a LOT of time fiddling with the REDO logs since I was sure
> > that was the problem e.g. overnight backups = busy SAN = poor REDO
> > flush = waits. I put the REDO on new luns, I brought in the SAN guys,
> > I pulled 30 days worth of sar data, etc. That didn't turn out to be
> > the problem at all though.
>
> > We eventually ran the problem down to the RMAN backup *target* being
> > backlogged. RMAN was reading off the SAN and sending it out over the
> > wire to an IBM XIV where the backup set was being written. The XIV was
> > getting more and more overloaded though, and since it's a "backup"
> > device the SAN guys more or less shrugged about it.
>
> > From what I could observe though, RMAN was reading blocks off the SAN
> > nice and quickly, then pinning them in memory. Then it'd turn around
> > and flush them through to the XIV. The XIV though was so backlogged
> > that it was taking minutes to flush the blocks through, and during
> > that time the blocks were still pinned in memory. Pin the wrong blocks
> > and you effectively put a global lock on the database, which is what
> > we were running into.
>
> > Moving the backups off the XIV and onto another, unloaded, backup
> > device fixed the symptom and had the beneficial side effect of making
> > my backups complete in 1/3 the time.
>
> > The question I've got for the group though is, has anybody seen
> > anything like this before? I feel like I've got to be missing
> > something here; I mean I used to use RMAN to back up to *tape* for
> > heaven's sake which is slower than even a highly loaded XIV, and I
> > never saw this problem.
>
> > What am I missing here? Does anybody have a good theory on why a slow
> > backup target would cause these sorts of symptoms?
>
> > I'm relatively confident I fixed the "problem" here, but I really want
> > to understand what went on here in case I run into it somewhere else.
>
> What was the lock?  Was it always the same lock?
>
> What does your backup script look like exactly?

I don't have the backup script handy (I'm not VPN'd into the office at the moment), but I'll pull it for the curious on Monday.

Blocked sessions reported they were waiting on: latch: library cache

Locks were EXCLUSIVE TABLE TX on a couple of my big logging tables (which naturally stopped the world since the logging was part of most every transaction). Locks eventually clear, but I've seen them held for up to 12 minutes.

Received on Sat Nov 28 2009 - 20:26:50 CST
