Re: really slow RMAN backups

From: Mark Brinsmead <pythianbrinsmead_at_gmail.com>
Date: Mon, 21 Aug 2006 20:45:44 -0600
Message-ID: <cf3341710608211945h3a36ee9ar871d803de84f5cf9@mail.gmail.com>

Steve,

Here are a few thoughts -- for what they're worth. I'm sure others on this list can offer much better feedback.

You did not say how your NetApp storage is connected. I presume NFS, but there are other options...
Aside from mentioning the MTU and suggesting that there are multiple networks in play, you haven't said much about your network config. Bandwidth? NFS mount options? And so on... Is your NFS storage distributed across multiple networks (NICs)? (Probably not...)
You mentioned "tape channels", but you haven't told us anything about your backup hardware, media manager, etc. I would imagine that the backups are performed across a network (sadly, the norm these days); is it the same network used for NFS? Are you sure?
You're using 10gR2. And a flash recovery area. Where is it stored? On the NetApp, perhaps?
Are you backing up "directly" to tape, or moving data through the FRA and then to tape? What about your media manager? Does it stage data to disk first, then "destage" to tape? If so, where is the staging area? Maybe on the NetApp?
What is the actual bandwidth of your TCP network(s)? I don't just mean "are you using 100Mbit or 1000Mbit?", but rather, what kind of actual throughput are you able to achieve, for example using 'dd' to read or write a file on the NetApp filer? (You could find that something like a duplex mismatch or traffic congestion is cutting your actual bandwidth to much less than you would think it should be.)
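
If 'dd' is not handy, the same kind of measurement can be sketched in a few lines of Python. This is only a rough illustration: the function name, file name and sizes are made up for the example, and the target path has to be pointed at a directory on the NFS mount to mean anything. The read figure is likely to be inflated by the OS page cache unless the file is much larger than RAM.

```python
import os
import tempfile
import time


def measure_throughput(path, size_mb=64, block_kb=1024):
    """Write, then read back, size_mb megabytes at path.

    Returns (write_mb_per_sec, read_mb_per_sec). The test file is removed.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # wait until the data has really left the host
    write_secs = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    read_secs = time.perf_counter() - start

    os.remove(path)
    return size_mb / write_secs, size_mb / read_secs


if __name__ == "__main__":
    # Hypothetical target: replace with a file on the NetApp-backed mount.
    target = os.path.join(tempfile.gettempdir(), "throughput_test.dat")
    w, r = measure_throughput(target, size_mb=16)
    print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")
```

Run it from each RAC node in turn; a large difference between nodes, or a figure far below the nominal link speed, points at the network rather than at RMAN.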

Okay, so, silly questions out of the way, here are the observations I promised earlier...

You said:
> I don't have any experience with netapp and want to see if there are
> some known issues with it.

One comes to mind. With (Red Hat) Linux, it is not possible to do asynchronous I/O against NetApp storage. Not if you're using NFS, anyway. This can have huge implications for I/O performance, especially if you happen to be assuming that you are (capable of) doing Async I/O...

You said:
> I don't know why they chose directio (1 dbwr) instead of async. they
> may not have anything to do with it, but it's the first time I saw
> them set on a RAC database.

Lack of async I/O could be a major factor here. Here's the bad news: "they" chose not to use Async I/O because it is not available (i.e. not possible) with NFS-on-redhat-linux. Not much of a choice, really...

All of your I/O is being done synchronously. And this can lead to serious bottlenecks. (Mostly on writes, though.)

You said:
> I ran an awr report and "RMAN backup & recovery I/O" was the top
> waiter with an avg wait of 134 ms.

Average wait of 134 ms? That's about 7 (synchronous) I/Os per second. At 8KB per I/O (you didn't tell us DB_BLOCK_SIZE) that's about 56KB/s, or around 200MB/hr. Obviously, you're not bottlenecked (completely) on this all of the time -- your backups would take 2,000+ hours, not 20+ hours.
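
For anyone who wants to check the arithmetic, here it is as a small Python sketch. The 8KB block size is the same guess as above, and the assumption that every wait corresponds to exactly one synchronous single-block I/O is a deliberate oversimplification.

```python
# Back-of-the-envelope estimate from the AWR average wait.
avg_wait_s = 0.134   # "RMAN backup & recovery I/O" average wait, from the AWR report
block_kb = 8         # assumed DB_BLOCK_SIZE (not stated in the original post)
db_gb = 500          # database size from the original post

ios_per_sec = 1 / avg_wait_s                      # one synchronous I/O at a time
kb_per_sec = ios_per_sec * block_kb
mb_per_hour = kb_per_sec * 3600 / 1024
hours_for_db = db_gb * 1024 / mb_per_hour

print(f"{ios_per_sec:.1f} I/Os per second")
print(f"{kb_per_sec:.0f} KB/s, {mb_per_hour:.0f} MB/hr")
print(f"{hours_for_db:.0f} hours to move {db_gb} GB at that rate")
```

The unrounded numbers come out a little higher than the figures quoted above, but the conclusion is the same: a single stream of 134 ms synchronous I/Os would need thousands of hours, so this wait cannot be the whole story.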

I don't know much about this particular wait (obviously). I would want to understand what it means a lot better before really running with this, but that 134ms average wait does not sound (at all) promising.

So, you're backing up a 500GB database. To do it in 10 hours (that's a lot) you need to sustain 50GB/hr -- end to end -- just for the backups. That's around 15MB per second. That could mean (something vaguely like) reading from the NetApp at 15MB/s, writing to the flash recovery area (also on the NetApp?) at 15MB/s, reading again from the flash recovery area at 15MB/s, transmitting the backup over the network to the media manager at 15MB/s, staging the backup data to disk at 15MB/s, destaging the backup data from disk at 15MB/s, and (finally!) writing to tape at 15MB/s. All concurrently!
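
The required rate is easy to recompute for other backup windows. A minimal sketch (the function name is just for illustration, and 1GB is taken as 1024MB):

```python
def required_mb_per_sec(db_gb, window_hours):
    """Sustained end-to-end rate (MB/s) needed to move db_gb in window_hours."""
    return db_gb * 1024 / (window_hours * 3600)


for hours in (10, 20):
    rate = required_mb_per_sec(500, hours)
    print(f"{hours} hour window: {rate:.1f} MB/s")
```

For a 10 hour window this gives a little over 14MB/s, which is the "around 15MB/s" used above. Every stage in the chain has to sustain that rate at the same time.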

So, depending on the answers to the "silly" questions above, I count somewhere up to 6 or 7 traversals of your IP network, for a total of around 100MB/s (roughly the capacity of a 1000Mbit/s link), and total NetApp throughput (just for backups) of maybe 90MB/s. How much (sustained) I/O can it do?
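
To see how the totals move with the answers to those questions, the stages can be tallied in a short Python sketch. The True/False flags below are one possible set of assumptions (recovery area and staging area both on the NetApp, tape drive attached locally to the media server), not facts about your site, so the totals will differ somewhat from the rough figures above depending on which stages you count.

```python
RATE_MB_S = 15  # per-stage rate needed for a 10 hour backup window

# (stage, crosses the IP network?, touches the NetApp?)
# All flags are illustrative assumptions; adjust them to the real setup.
STAGES = [
    ("read datafiles from NetApp",        True,  True),
    ("write to flash recovery area",      True,  True),
    ("read back from flash recovery area", True,  True),
    ("send to media manager",             True,  False),
    ("stage backup data to disk",         True,  True),
    ("destage backup data from disk",     True,  True),
    ("write to tape",                     False, False),
]

network_hops = sum(1 for _, net, _ in STAGES if net)
netapp_hops = sum(1 for _, _, filer in STAGES if filer)

network_mb_s = network_hops * RATE_MB_S
netapp_mb_s = netapp_hops * RATE_MB_S

print(f"network: {network_hops} traversals, {network_mb_s} MB/s "
      f"({network_mb_s * 8} Mbit/s)")
print(f"NetApp:  {netapp_hops} stages, {netapp_mb_s} MB/s")
```

Flipping a single flag (for example, a staging area on local disk instead of the NetApp) removes 15MB/s from one or both totals, which is why the answers to the questions at the top matter so much.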

You may want to consider DBWR_IO_SLAVES for your database. This is probably not (directly) related to backups, but you didn't tell us what else your database has been waiting on. In any event, environments where ASYNC I/O is unavailable (yours is one) are the rare cases where DBWR_IO_SLAVES can be warranted.

And if you haven't already, you may want to look into BACKUP_TAPE_IO_SLAVES, too...

On 8/21/06, Steve Perry <sperry_at_sprynet.com> wrote:
>
> This was just passed to me, but I thought I'd check with the group to
> see if anyone else has experienced this slowness.
>
>
>
> RMAN backups (2 tape channels) take forever on this system. forever
> means 20+ hours.
>
> the view v$backup_sync_io shows the effective bytes per second at 2
> or 3 MB per second. nothing above 5MB per second.
> v$backup_async_io doesnt' show anything.
>
> Setup.
> 500GB database on a netapp filer (40+ disks, don't know the model)
> with ASM
> 32-bit 10.2.0.1
> 2 - node RAC EE cluster
> rhel3
> 2 cpu
> 1 GB swap
> 4GB ram
> 600 MB SGA (small and uses the automatic memory management)
> flash recovery area is on
> DG is setup for 2 different databases
> mtu sizes of all NICs are set to 1500 (since it's netapp, they might
> prefer something else)
> legato is the media manager
>
> I looked at the init.ora settings and besides the small sga,
> disk_asynch_io = false
> filesystemio_option = directIO
> large_pool_size = 52M
>
> I don't know why they chose directio (1 dbwr) instead of async. they
> may not have anything to do with it, but it's the first time I saw
> them set on a RAC database.
>
> I ran an awr report and "RMAN backup & recovery I/O" was the top
> waiter with an avg wait of 134 ms. the class is "system io".
> other things are an index with 19 million get buffs during 2 hour
> snap shot.
> I see a few slow access times 300ms avg. read time, but there are
> only 200 or so reads against it. Most of the access times are less
> than 20ms.
> I don't know if the problem is contention with other jobs, config
> parameter or hardware.
>
> I checked a similar system (db ver, 2 node rac, asm) that gets
> 80-90MB per second for it's backup.
> it's on the SAN and uses async.
> I haven't looked at the awr report from it.
>
> any suggestions?
> --
> http://www.freelists.org/webpage/oracle-l

-- 
Cheers,
-- Mark Brinsmead
   Staff DBA,
   The Pythian Group
   http://www.pythian.com/blogs

Received on Mon Aug 21 2006 - 21:45:44 CDT