Re: Long running backups - OK?

From: Tim Gorman <tim.evdbt_at_gmail.com>
Date: Fri, 26 Jan 2018 12:51:11 -0700
Message-ID: <ce24608a-d699-0575-247c-c10799f9a913_at_gmail.com>



Backups are just reads at the physical block level.  Killing a backup does not harm the database directly.

Again, the real risk is to restore/recovery.

A longer-running backup is clearly at higher risk of failure, from any origin.  To illustrate, express it in extremes: far more can go wrong during a backup that takes 100 hours to complete than during one that takes 100 nanoseconds.

But a failed backup by itself, regardless of origin, means nothing and harms nothing.  The database keeps running, and applications are not impacted.  The real impact of a failed backup is that a subsequent restore must start from an earlier successful backup, which means a much longer roll-forward recovery, a far longer total restore/recovery, and thus greater jeopardy for the business that the production database supports.


I've said it in the past and it bears repeating:  one of the best things about RMAN is its name.  The name is "Recovery Manager", not "Backup Manager", even though 99.9% of the time it is used for backup operations.  The focus of RMAN is to simplify restore and recovery (as much as something so complex can be simplified), and the syntax of RMAN commands for restore and recovery illustrates this.  With RMAN, backups should really be regarded as operations that populate the RMAN recovery catalog, and the recovery catalog is what provides the intelligence to automate restore and recovery operations.
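
To make that concrete, a complete restore and recovery conversation can be as terse as the following.  This is a minimal sketch of standard RMAN syntax; real sessions often add SET UNTIL clauses, channel allocation, and so on.

    RMAN> startup mount;
    RMAN> restore database;
    RMAN> recover database;
    RMAN> alter database open;

RMAN consults the recovery catalog (or the controlfile) to decide which backup pieces and archived logs it needs; the DBA never has to name them.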

On 1/26/18 12:00, Glenn Travis wrote:
>
> Thanks Tim. Good resources on performance.  And yes, we are all over
> that, with systems, networking, and admin staff working on all the
> points [and more] you mentioned.  I didn’t want this thread to turn
> into a performance question, though.  My primary question centers on
> the time it takes to run a backup, and obviously the shorter the
> better.  But why?  I said it can’t be good for a database backup to
> run for 12 hours out of every day.  The response was: Why not?  Or:
> Who cares?
>
> I understand that the longer the backup, the longer the recovery.  But
> is there any risk to the database itself during a backup?  If I am
> still doing hourly archivelog backups, and the server crashes in hour
> 8 of a 12 hour incremental backup - I can still recover to my last
> archivelog backup, regardless of being in the middle of an
> incremental, correct?
>
> I was just interested in the database's vulnerability during a long
> backup, as opposed to simply getting backup times as short as
> possible.
>
> Thanks all.
>
> *From:*Tim Gorman [mailto:tim.evdbt_at_gmail.com]
> *Sent:* Friday, January 26, 2018 11:39 AM
> *To:* Glenn Travis <Glenn.Travis_at_sas.com>; oracle-l_at_freelists.org
> *Subject:* Re: Long running backups - OK?
>
>
> Glenn,
>
> Any question about backups should really be converted into a question
> about restore and recovery, because backups don't matter;
> restore/recovery from those backups is what matters.
>
> So, to your point, longer-running backups result in longer-running
> recoveries.  An inconsistent or "hot" backup copies a baseline image
> of the datafiles, but it must also capture all redo generated during
> the datafile backup so that a roll-forward recovery after restore can
> produce a consistent image.  If the datafile backups run longer, then
> in most environments this means more redo must be captured.
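>
> This is why RMAN offers the PLUS ARCHIVELOG clause.  A minimal sketch
> (standard RMAN syntax; details vary by environment):
>
>     RMAN> backup incremental level 0 database plus archivelog;
>
> The archivelog passes bracket the datafile backup, so the redo
> generated while the datafiles were being copied lands in the same
> backup job and the result is recoverable on its own.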
>
> So, your question about whether longer-running backups matter really
> comes down to whether your organization can tolerate longer-running
> recoveries.
>
> ------
>
> To help determine why your backups to an NFS mount are taking longer,
> first check whether the NFS clients (i.e. the database servers) are
> configured appropriately for NFS activity.  Specifically, assuming
> your database servers are Linux, have you adjusted the TCP kernel
> settings to increase memory buffers for the increased data traffic
> across the network?
>
> Again, assuming you are on Linux, to determine if there is
> bottlenecking on memory buffers within the NFS client, please consider
> downloading the Python script nfsiostat.py HERE
> <https://fossies.org/linux/nfs-utils/tools/nfs-iostat/nfs-iostat.py>.
> This script simply calls the "nfsiostat" command from the Linux
> project "nfs-utils", but it reformats the output to be more useful and
> intuitive.  Specifically, it categorizes total NFS time into "average
> queue time" and "average RTT time".  Total NFS time is the average
> elapsed time the application sees for an NFS call.  Average queue
> time is time spent queuing the NFS request internally within the NFS
> client host.  Average RTT time is time spent on the network
> round-trip; this includes the time spent on the wire and the time
> spent on the NFS server performing the underlying I/O.
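>
> As a usage illustration (hypothetical invocation and mount point;
> check the script's help output for the exact options in your version):
>
>     $ python nfsiostat.py 5 /oraback
>
> This would report per-mount operation counts, average RTT time, and
> average queue time every 5 seconds for the mount point /oraback.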
>
> If Average Queue Time from nfsiostat.py shows as anything but an
> inconsequential component of total NFS time, then it might be useful
> to enlarge the TCP send and receive buffers, which by default are
> insufficient for the heavy volumes of network I/O resulting from NFS.
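>
> On Linux those are the sysctl settings governing socket buffer sizes,
> for example (illustrative values only; size them for your own
> network's bandwidth-delay product):
>
>     # enlarge TCP socket buffers for heavy NFS traffic
>     net.core.rmem_max = 16777216
>     net.core.wmem_max = 16777216
>     net.ipv4.tcp_rmem = 4096 1048576 16777216
>     net.ipv4.tcp_wmem = 4096 1048576 16777216
>
> Load them with "sysctl -p" after editing /etc/sysctl.conf, then
> remount the NFS filesystem so its TCP connection is re-established
> with the new buffers.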
>
> This article HERE
> <https://wwwx.cs.unc.edu/%7Esparkst/howto/network_tuning.php> provides
> a decent explanation of the TCP kernel settings.
>
> This Delphix documentation HERE
> <https://docs.delphix.com/docs/system-administration/performance-tuning-configuration-and-analytics/target-host-os-and-database-configuration-options>
> provides some good recommendations for optimizing NFS clients on
> various OS platforms, such as Solaris, Linux, AIX, and HP-UX.
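>
> For a Linux NFS client, recommendations of that kind generally come
> down to an /etc/fstab entry along these lines (illustrative only; the
> server name and paths are placeholders, and the right options depend
> on your platform and NFS version):
>
>     nfsserver:/export/oraback  /oraback  nfs  rw,hard,tcp,vers=3,rsize=1048576,wsize=1048576,timeo=600  0 0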
>
> If Average Queue Time from nfsiostat.py still shows up as a
> substantial portion of total NFS time even after increasing the TCP
> send and receive buffers, then there may be another problem within the
> OS, and it would be worthwhile to open a support case with your vendor.
>
> Average RTT time covers a great deal of territory, encompassing the
> entire network as well as the performance of the NFS server itself.
> Diagnosing RTT involves gathering information on the latency and
> throughput of the network, the number of network hops, and whether
> there are intermediate devices that can increase latency and/or reduce
> throughput (e.g. firewalls).  Diagnosing RTT may also involve
> diagnosing the performance of the NFS server and its underlying
> storage.
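>
> As a crude first pass on the network side, the standard tools go a
> long way (the hostname below is a placeholder):
>
>     $ ping -c 20 nfsserver       # baseline round-trip latency
>     $ traceroute nfsserver       # hop count and intermediate devices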
>
> I guess the message here is that tuning NFS involves understanding the
> components and rigorously diagnosing each step.  Obviously this email
> is long enough as it is, and I could go on for hours.
>
> Hope this helps...
>
> -Tim
>
>
> On 1/26/18 07:53, Glenn Travis wrote:
>
> Lively discussion among our team regarding backup run times.  We
> are using RMAN and recently migrated from tape-based to disk-based
> (NFS-mounted) backups.  The backups are taking 2-3 times longer
> (up to 5 times longer when concurrent), and throughput dropped from
> 200-300MB/sec to 50-70MB/sec.  We are investigating the performance
> issues, but the discussion changed to ‘Does it really
> matter?’
>
> So I wanted to throw out these questions for your opinions.  If a
> database backup is running and not adversely affecting system (and
> users’ applications’) performance, does it really matter how long
> it runs?
>
> Are there any negatives to having an Oracle backup run for over x
> hours? Say a 5 hour (or longer) backup on an active database?
> What are the ramifications of long-running Oracle database
> backups, if any?
>
> Note we have dozens of databases over 1 TB and run fulls weekly,
> cumulatives (incremental level 1) daily, and archivelogs hourly.
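>
> In RMAN terms that schedule is roughly the following (a sketch, not
> our actual scripts):
>
>     RMAN> backup incremental level 0 database;             # weekly full
>     RMAN> backup incremental level 1 cumulative database;  # daily cumulative
>     RMAN> backup archivelog all;                           # hourly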
>
> I just can’t get comfortable with a backup running for a quarter of
> the day.  It seems like a long window of exposure and vulnerability
> should a crash occur.
>
> Thoughts?
>
> *Glenn Travis*
>
> DBA ▪ Database Services
>
> IT Enterprise Solutions
>
> SAS Institute
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Jan 26 2018 - 20:51:11 CET
