Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 


Re: really slow RMAN backups

From: Steve Perry <sperry_at_sprynet.com>
Date: Tue, 22 Aug 2006 06:25:19 -0500
Message-Id: <20106A58-53C6-47FB-9493-54BB848AD74C@sprynet.com>


thanks Mark,
I'll try to get more information on the netapp config.

media mgr. is legato with a gigabit connection. it goes across the network to tape.
I'll reply back with the other information later today.

thanks again,
steve

On Aug 21, 2006, at 09:45 PM, Mark Brinsmead wrote:

> Steve,
>
> Here are a few thoughts -- for what they're worth. I'm sure
> others on this list can offer much better feedback.
>
> 1. You did not say how your NetApp storage is connected. I
> presume NFS, but there are other options...
>
> 2. Aside from mentioning the MTU and suggesting that there are
> multiple networks in play, you haven't said much about your
> network config. Bandwidth? NFS mount options? so on... Is your
> NFS storage distributed across multiple networks (NICs)? (Probably
> not...)
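[A quick way to answer the NFS/network questions above is to dump the mount options and interface MTUs on each RAC node. A minimal sketch, assuming a Linux host; paths under /sys and /proc are standard on RHEL:]

```shell
#!/bin/sh
# Show every NFS mount and its options (rsize/wsize, proto, hard/soft...).
# Prints nothing if there are no NFS mounts -- which is itself an answer.
grep ' nfs ' /proc/mounts || true

# Show the MTU of every interface, to compare against the filer's setting.
for dev in /sys/class/net/*; do
    printf '%s mtu %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
done
```

[Running this on both nodes and on the backup server shows at a glance whether the NFS traffic and the backup traffic share a NIC.]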
>
> 3. You mentioned "tape channels", but you haven't told us anything
> about your backup hardware, media manager, etc. I would imagine
> that the backups are performed across a network (sadly, the norm
> these days); is it the same network used for NFS? Are you sure?
>
> 4. You're using 10gR2. And a flashback recovery area. Where is
> it stored? On the NetApp, perhaps?
>
> 5. Are you backing up "directly" to tape, or moving data through
> the FBRA and then to tape? What about your media manager? Does it
> stage data to disk first, then "destage" to tape? If so, where is
> the staging area? Maybe in the NetApp?
>
> 6. What is the actual bandwidth of your TCP network(s)? I don't
> just mean "are you using 100Mbit or 1000Mbit?", but rather, what
> kind of actual throughput are you able to achieve, for example
> using 'dd' to read or write a file on the NetApp filer? (You could
> find that something like a duplex mismatch or traffic congestion
> are cutting your actual bandwidth to much less than you would think
> it should be.)
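[The dd test in point 6 can look something like the sketch below; /mnt/netapp is a placeholder for the real NFS mount point, and the 1 GB size is arbitrary -- use something big enough to swamp any caching:]

```shell
#!/bin/sh
# Time a large sequential write, then a sequential read, on the filer.
# /mnt/netapp is a hypothetical mount point -- substitute the real one.
TESTFILE=/mnt/netapp/dd_throughput_test

# Sequential write: 1024 x 1 MB blocks, fsync'd so the numbers are honest.
time dd if=/dev/zero of="$TESTFILE" bs=1M count=1024 conv=fsync

# Sequential read. (If running as root on Linux, drop the page cache
# first, otherwise the read may be served from memory, not the filer.)
time dd if="$TESTFILE" of=/dev/null bs=1M

rm -f "$TESTFILE"
```

[Compare the MB/s dd reports against the ~15MB/s the backup window demands; a duplex mismatch typically shows up here immediately.]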
>
> Okay, so, silly questions out of the way, here are the observations
> I promised earlier...
>
> You said:
> > I don't have any experience with netapp and want to see if there are
> > some known issues with it.
>
> One comes to mind. With (redhat) Linux, it is not possible to do
> asynchronous I/O against NetApp storage. Not if you're using NFS,
> anyway. This can have huge implications to I/O performance,
> especially if you happen to be assuming that you are (capable of)
> doing Async I/O...
>
> You said:
> > I don't know why they chose directio (1 dbwr) instead of async. they
> > may not have anything to do with it, but it's the first time I saw
> > them set on a RAC database.
>
> Lack of async I/O could be a major factor here. Here's the bad
> news: "they" chose not to use Async I/O because it is not
> available (i.e. not possible) with NFS-on-redhat-linux. Not much
> of a choice, really...
>
> All of your I/O is being done synchronously. And this can lead to
> serious bottlenecks. (Mostly on writes, though.)
>
> You said:
> > I ran an awr report and "RMAN backup & recovery I/O" was the top
> > waiter with an avg wait of 134 ms.
>
> Average wait of 134 ms? That's about 7 (synchronous) I/Os per
> second. At 8KB per I/O (you didn't tell us DB_BLOCK_SIZE) that's
> about 56KB/s, or around 200MB/hr. Obviously, you're not
> bottlenecked (completely) on this all of the time -- your backups
> would take 2,000+ hours, not 20+ hours.
>
> I don't know much about this particular wait (obviously). I would
> want to understand what it means a lot better before really running
> with this, but that 134ms average wait does not sound (at all)
> promising.
>
> So, you're backing up a 500GB database. To do it in 10 hours
> (that's a lot) you need to sustain 50GB/hr -- end to end -- just
> for the backups. That's around 15MB per second. That could mean
> (something vaguely like) reading from the NetApp at 15MB/s, writing
> to the flashback recovery area (also on the NetApp?) at 15MB/s,
> reading again from the flashback recovery area at 15MB/s,
> transmitting backup over the network to the media manager at 15MB/s,
> staging the backup data to disk at 15MB/s, destaging the backup
> data from disk at 15MB/s, and (finally!) writing to tape at 15MB/
> s. All concurrently!
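[The arithmetic above is easy to sanity-check with a throwaway calculation; numbers taken from this thread, 8KB block size assumed:]

```shell
#!/bin/sh
# 134 ms average synchronous wait -> I/Os per second, KB/s, MB/hr.
awk 'BEGIN {
    ios = 1000 / 134;              # ~7.5 synchronous I/Os per second
    kbs = ios * 8;                 # at 8 KB per I/O -> ~60 KB/s
    printf "%.1f IO/s, %.0f KB/s, %.0f MB/hr\n", ios, kbs, kbs * 3600 / 1024
}'

# 500 GB in 10 hours -> sustained MB/s required, end to end.
awk 'BEGIN { printf "%.1f MB/s\n", 500 * 1024 / (10 * 3600) }'
```

[That second figure is per traversal -- multiply by however many network hops and disk passes the data actually makes.]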
>
> So, depending on the answers to the "silly" questions above, I
> count somewhere up to 6 or 7 traversals of your IP network, for a
> total of 100MB/s (1000Mbits/s), total NetApp throughput (just for
> backups) of maybe 90MB/s. How much (sustained) I/O can it do?
>
> You may want to consider DBWR_IO_SLAVES for your database. This is
> probably not (directly) related to backups, but you didn't tell us
> what else your database has been waiting on. In any event,
> environments where ASYNC I/O is unavailable (yours is one) are the
> rare cases where DBWR_IO_SLAVES can be warranted.
>
> And if you haven't already, you may want to look into
> TAPE_IO_SLAVES, too...
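[For reference, the slave parameters Mark mentions are ordinary init.ora settings (or ALTER SYSTEM ... SCOPE=SPFILE on 10g). The values below are purely illustrative, not recommendations; note that the 10g parameter for tape is actually spelled BACKUP_TAPE_IO_SLAVES, and DBWR_IO_SLAVES only applies with a single DBWR process:]

```
# init.ora sketch -- illustrative values only; both require a restart.
dbwr_io_slaves        = 4
backup_tape_io_slaves = true
```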
>
>
> On 8/21/06, Steve Perry < sperry_at_sprynet.com> wrote:
> This was just passed to me, but I thought I'd check with the group to
> see if anyone else has experienced this slowness.
>
>
>
> RMAN backups (2 tape channels) take forever on this system. forever
> means 20+ hours.
>
> the view v$backup_sync_io shows the effective bytes per second at 2
> or 3 MB per second. nothing above 5MB per second.
> v$backup_async_io doesn't show anything.
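[The per-file detail behind those effective-bytes-per-second numbers is visible in the same view. A diagnostic query along these lines -- column names as I recall them from the 10g reference, so verify against your version -- shows which files are slow:]

```sql
-- Throughput per backup file on the synchronous I/O path.
-- Run while (or shortly after) the backup is active.
SELECT filename,
       type,
       io_count,
       effective_bytes_per_second,
       discrete_bytes_per_second
FROM   v$backup_sync_io
ORDER  BY effective_bytes_per_second;
```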
>
> Setup.
> 500GB database on a netapp filer (40+ disks, don't know the model)
> with ASM
> 32-bit 10.2.0.1
> 2-node RAC EE cluster
> rhel3
> 2 cpu
> 1 GB swap
> 4GB ram
> 600 MB SGA (small and uses the automatic memory management)
> flash recovery area is on
> DG is setup for 2 different databases
> mtu sizes of all NICs are set to 1500 (since it's netapp, they might
> prefer something else)
> legato is the media manager
>
> I looked at the init.ora settings and besides the small sga,
> disk_asynch_io = false
> filesystemio_options = directIO
> large_pool_size = 52M
>
> I don't know why they chose directio (1 dbwr) instead of async. they
> may not have anything to do with it, but it's the first time I saw
> them set on a RAC database.
>
> I ran an awr report and "RMAN backup & recovery I/O" was the top
> waiter with an avg wait of 134 ms. the class is "system io".
> other things are an index with 19 million buffer gets during a 2 hour
> snapshot.
> I see a few slow access times 300ms avg. read time, but there are
> only 200 or so reads against it. Most of the access times are less
> than 20ms.
> I don't know if the problem is contention with other jobs, config
> parameter or hardware.
>
> I checked a similar system (db ver, 2 node rac, asm) that gets
> 80-90MB per second for its backup.
> it's on the SAN and uses async.
> I haven't looked at the awr report from it.
>
> any suggestions?
> --
> http://www.freelists.org/webpage/oracle-l
>
>
>
>
>
> --
> Cheers,
> -- Mark Brinsmead
> Staff DBA,
> The Pythian Group
> http://www.pythian.com/blogs

Received on Tue Aug 22 2006 - 06:25:19 CDT

