RE: Application performance hit when performing archived log backups

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Tue, 17 May 2016 06:12:57 -0400
Message-ID: <025e01d1b024$b0a61280$11f23780$_at_rsiz.com>


  1. Are your applications chatty? If you trace an application, do you see an increase in the sum of the durations of sqlnet waits while the backup is running?
  2. Do the data persistence stacks (memory to i/o controller or network to disk) collide amongst archived log location, archive log destination, database files, and temp locations? If you trace an application, do you see an increase in the sum of durations of database reads?

I would rule those two out first, but please notice that both start with tracing an application that runs differently (faster when not doing backup, slower when doing backup).

So the trace will show what is different. When you have the luxury of ops staff having reported slowdowns of specific things at specific times, please enjoy the benefit that you can quite easily measure instead of guessing.
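As a rough illustration of that comparison, here is a minimal Python sketch that sums wait durations per event from two extended SQL trace (10046) captures, one taken while no backup is running and one taken during the backup. It assumes the usual `WAIT #...: nam='...' ela= N` line layout with `ela` in microseconds. The sample lines and the names `sum_waits` and `compare` are made up for illustration and do not come from the system being discussed.

```python
import re
from collections import defaultdict

# Matches WAIT lines of an extended SQL trace, for example:
# WAIT #140001: nam='db file sequential read' ela= 512 file#=5 ... tim=...
WAIT_RE = re.compile(r"^WAIT #\d+: nam='([^']+)' ela=\s*(\d+)")


def sum_waits(lines):
    """Return {event name: total ela in microseconds} for the WAIT lines."""
    totals = defaultdict(int)
    for line in lines:
        m = WAIT_RE.match(line)
        if m:
            totals[m.group(1)] += int(m.group(2))
    return dict(totals)


def compare(baseline, during_backup):
    """Return (event, baseline_us, backup_us) rows, largest increase first."""
    events = set(baseline) | set(during_backup)
    rows = [(e, baseline.get(e, 0), during_backup.get(e, 0)) for e in events]
    rows.sort(key=lambda r: r[2] - r[1], reverse=True)
    return rows


# Hypothetical sample lines, NOT taken from a real trace.
SAMPLE_GOOD = [
    "WAIT #140001: nam='SQL*Net message from client' ela= 1200 driver id=1 #bytes=1 p3=0 obj#=-1 tim=1000",
    "WAIT #140001: nam='cell single block physical read' ela= 400 cellhash#=1 diskhash#=2 bytes=8192 obj#=77 tim=2000",
    "WAIT #140001: nam='cell single block physical read' ela= 500 cellhash#=1 diskhash#=2 bytes=8192 obj#=77 tim=3000",
    "EXEC #140001:c=0,e=50,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=0,tim=4000",
]
SAMPLE_BAD = [
    "WAIT #140002: nam='SQL*Net message from client' ela= 1300 driver id=1 #bytes=1 p3=0 obj#=-1 tim=1000",
    "WAIT #140002: nam='cell single block physical read' ela= 9000 cellhash#=1 diskhash#=2 bytes=8192 obj#=77 tim=2000",
    "WAIT #140002: nam='cell single block physical read' ela= 11000 cellhash#=1 diskhash#=2 bytes=8192 obj#=77 tim=3000",
]

if __name__ == "__main__":
    rows = compare(sum_waits(SAMPLE_GOOD), sum_waits(SAMPLE_BAD))
    for event, good_us, bad_us in rows:
        print(f"{event:40s} {good_us:>10d} {bad_us:>10d} {bad_us - good_us:>+10d}")
```

For real traces, pass an open file object for each trace file to `sum_waits` instead of the sample lists. Whichever event climbs the most between the two runs is the place to look first.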

mwf

-----Original Message-----

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Ryan January
Sent: Monday, May 16, 2016 1:47 PM
To: Listserv Oracle
Subject: Application performance hit when performing archived log backups

I've got a strange one that, as of yet, I've not been able to resolve. 1/8 rack Exadata v4 (X4-2), 11.2.0.4, OEL 6. High capacity disks, flash configured as cache rather than as independent disks.

The DB is backing an in-house java/tomcat OLTP-ish application. Ops staff reported slowdowns at specific times. The app slowdown was verified via performance metrics collected with StatsD. Watching those metrics, we identified the issue to be common across the vast majority of DB calls and across roughly 30 separate application instances (separate app/web servers per instance, all pointing to differing sets of application schemas within the same DB).

Poking through ASH data the only commonality I found was an increase in system IO, which ultimately ended up being RMAN. We were performing very simple archived log backups via rman with a parallelism of 4. Backup sets are being pushed uncompressed across a 1Gb link to DataDomain mounted via NFS.
App call times went from 1s normally to over 30s at their worst. Backing that parallelism off to 1 resulted in a smaller spike with an increased duration. Application performance falls off very sharply as the backup begins, with a slower 2-3 minute exponential increase in performance as the backup completes. The same execution plan has been verified for frequent SQL statements during both good and bad times. Nothing specific to individual SQL statements appears to be problematic; all application calls during that time suffer. I've not yet found any spikes in metrics that explain such a profound performance impact. AWR reports during the time of slowdown seem similar to those when the backup is not being performed. We've not been gathering data long enough to determine if this is entirely new behavior, or if it's been getting progressively worse over time.

Next steps for me: integrating AWR snapshots before and after the backup completes to narrow the window AWR is focused on. I'm also working on automating 10046 trace for further analysis and comparison.

Short of that, has anyone seen similar behavior? I would have never anticipated archived log backups to have this impact, but the timing matches up perfectly with 100% repeatability. If not the root cause, it's at least a contributor that we've identified as a fact. I would appreciate any suggestions of where we should consider continuing our troubleshooting, or guidance on what metrics we should gather.

Thank you,
Ryan
--
http://www.freelists.org/webpage/oracle-l

Received on Tue May 17 2016 - 12:12:57 CEST
