Interesting RMAN induced "hang" behavior

From: Pat <pat.casey_at_service-now.com>
Date: Fri, 27 Nov 2009 08:59:52 -0800 (PST)
Message-ID: <402a8c13-6308-43ff-9aad-d6fd49a4b242_at_m7g2000prd.googlegroups.com>



So I recently ran down a really weird problem we were having on one of our production 10.2.0.4 servers.

To make a long story short, during recent RMAN backup windows the database would go into a "locked" state and all transactions would pause. Lock durations ranged from a few seconds (which nobody noticed) to 12 minutes last Friday (which was a p1 outage and got me up out of bed).

All the "stuck" processes reported they were waiting on latch: library cache, which made me scratch my head. During the "hangs" IO and load on the server dropped to zero. CPU was 99-100% idle. Blocks in/out were in the mid double digits, etc.
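For anyone wanting to do the same kind of triage, here is a minimal Python sketch of tallying which wait event dominates among the stuck sessions. The rows are made-up sample data, not output from the system described above, and the function name `dominant_wait` is hypothetical. The query in the comment is only an assumption about how such rows might be collected.

```python
from collections import Counter

# Assumption: rows shaped like (sid, event, seconds_in_wait), e.g. gathered with
# something along the lines of:
#   SELECT sid, event, seconds_in_wait FROM v$session WHERE state = 'WAITING'
# The values below are invented purely for illustration.
SAMPLE_ROWS = [
    (101, "latch: library cache", 42),
    (102, "latch: library cache", 40),
    (103, "log file sync", 1),
    (104, "latch: library cache", 39),
]


def dominant_wait(rows):
    """Return (event, session_count) for the most common wait event, or None."""
    if not rows:
        return None
    counts = Counter(event for _sid, event, _secs in rows)
    return counts.most_common(1)[0]


if __name__ == "__main__":
    print(dominant_wait(SAMPLE_ROWS))
```

If one event accounts for nearly every waiting session while the host is idle, that points at a serialization point inside the instance rather than at raw IO or CPU pressure.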

I wasted a LOT of time fiddling with the REDO logs since I was sure that was the problem, i.e. overnight backups = busy SAN = poor REDO flush = waits. I put the REDO on new LUNs, I brought in the SAN guys, I pulled 30 days' worth of sar data, etc. That didn't turn out to be the problem at all though.

We eventually ran the problem down to the RMAN backup *target* being backlogged. RMAN was reading data off the SAN and sending it out over the wire to an IBM XIV where the backup set was being written. The XIV was getting more and more overloaded though, and since it's a "backup" device the SAN guys more or less shrugged about it.

From what I could observe though, RMAN was reading blocks off the SAN nice and quickly, then pinning them in memory. Then it'd turn around and flush them through to the XIV. The XIV though was so backlogged that it was taking minutes to flush the blocks through, and during that time the blocks were still pinned in memory. Pin the wrong blocks and you effectively put a global lock on the database, which is what we were running into.
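To make the suspected mechanism concrete, here is a toy Python model of a fast reader and a slow writer sharing a small, fixed pool of buffers. It is only a sketch of the general "slow consumer holds buffers longer" effect, not a description of how RMAN manages memory internally. The function name, the pool size, and the timings are all assumptions chosen for illustration.

```python
def simulate_hold_times(num_blocks, pool_size, read_time, write_time):
    """Toy model (NOT RMAN internals): one reader, one FIFO writer, fixed buffer pool.

    Returns, for each block, how long its buffer stays occupied between the
    moment the read finishes and the moment the write to the target finishes.
    """
    read_done = []
    write_done = []
    hold_times = []
    for i in range(num_blocks):
        # The reader is serial, and it needs a buffer that the writer has released.
        prev_read_done = read_done[i - 1] if i > 0 else 0
        buffer_free_at = write_done[i - pool_size] if i >= pool_size else 0
        read_start = max(prev_read_done, buffer_free_at)
        read_done.append(read_start + read_time)

        # The writer is serial too, and can only flush a block once it is read.
        prev_write_done = write_done[i - 1] if i > 0 else 0
        write_start = max(read_done[i], prev_write_done)
        write_done.append(write_start + write_time)

        hold_times.append(write_done[i] - read_done[i])
    return hold_times


if __name__ == "__main__":
    # Hypothetical timings: reads always take 1 unit, writes take 1 or 10 units.
    for label, write_time in (("healthy target", 1), ("backlogged target", 10)):
        holds = simulate_hold_times(8, 4, 1, write_time)
        print(label, "max hold:", max(holds))
```

In this model the hold time stays at one write when the target keeps up, but once the writer is the bottleneck each buffer is held for roughly pool_size times the write time. That is consistent with the observation above that a backlogged target stretches how long blocks stay pinned, though it says nothing about why a pinned block would stall other sessions.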

Moving the backups off the XIV and onto another, unloaded, backup device fixed the symptom and had the beneficial side effect of making my backups complete in 1/3 the time.

The question I've got for the group though is, has anybody seen anything like this before? I feel like I've got to be missing something here; I mean, I used to use RMAN to back up to *tape* for heaven's sake, which is slower than even a highly loaded XIV, and I never saw this problem.

What am I missing here? Does anybody have a good theory on why a slow backup target would cause these sorts of symptoms?

I'm relatively confident I fixed the "problem" here, but I really want to understand what went on in case I run into it somewhere else.

Received on Fri Nov 27 2009 - 10:59:52 CST
