Re: Application performance hit when performing archived log backups

From: Ryan January <rjanuary_at_gmail.com>
Date: Tue, 17 May 2016 07:28:37 -0500
Message-Id: <C8D391FA-E0A9-46EE-BD30-73ED45456790_at_gmail.com>



> 1) Are your applications chatty? If you trace an application, do you see an
> increase in the sum of the durations of sqlnet waits while the backup is
> running?
>

I wouldn't have anticipated they were, but I'm really starting to wonder. If that is the case, it's possible the increased network traffic, pushing the backup to NFS, is causing the delay at the network layer. IMO that's one of the only places the time could be "hiding" given what we've already looked at. Trace is the only way we're going to get anything.

> 2) Do the data persistence stacks (memory to i/o controller or network to
> disk) collide amongst archived log location, archive log destination,
> database files, and temp locations? If you trace an application, do you see
> an increase in the sum of durations of database reads?
>

It feels like there has to be a commonality in there, which is something I'm still working to address. From what I've seen so far the answer is no. Oddly, backing up data files only has a slight impact, where backing up archived logs is rather profound. If anything, I would have expected the other way around.

> I would rule those two out first, but please notice that both start with
> tracing an application that runs differently (faster when not doing backup,
> slower when doing backup).
>
> So the trace will show what is different. When you have the luxury of ops
> staff having reported slowdowns of specific things at specific times please
> enjoy benefit that you can measure instead of guessing quite easily.

Suffice it to say you're preaching to the choir ;) It's not an exaggeration to say that I've been beating them over the head for better logging and metric collection for the last few years. The only reason they have the perf data is that the infrastructure team got tired of having nothing to go on and we did it for them. The data has been a blessing and a curse. As soon as we began gathering data we've been surfacing odd trends.

Tracing is one of the next steps. Unfortunately the app is still very lightly instrumented from the DB perspective, JDBC connection pool, the home-grown ORM doesn't use binds in many cases. It all adds up to a great deal of my time to correlate traces. My biggest quandary is that we're not talking about a slowdown in only a few statements. It truly is across the entire app, with very few exceptions. If you have any tips and tricks for time savers there, I'm all ears. At the last job I had a copy of MR Tools, which spoiled me a bit. It may be a time to make an investment in more tooling here as well.

> On May 17, 2016, at 5:12 AM, Mark W. Farnham <mwf_at_rsiz.com> wrote:
>
> 1) Are your applications chatty? If you trace an application, do you see an
> increase in the sum of the durations of sqlnet waits while the backup is
> running?
>
> 2) Do the data persistence stacks (memory to i/o controller or network to
> disk) collide amongst archived log location, archive log destination,
> database files, and temp locations? If you trace an application, do you see
> an increase in the sum of durations of database reads?
>
> I would rule those two out first, but please notice that both start with
> tracing an application that runs differently (faster when not doing backup,
> slower when doing backup).
>
> So the trace will show what is different. When you have the luxury of ops
> staff having reported slowdowns of specific things at specific times please
> enjoy benefit that you can measure instead of guessing quite easily.
>
> mwf
>
> -----Original Message-----
> From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org]
> On Behalf Of Ryan January
> Sent: Monday, May 16, 2016 1:47 PM
> To: Listserv Oracle
> Subject: Application performance hit when performing archived log backups
>
> I've got a strange one that, as of yet, I've not been able to resolve. 1/8
> rack Exadata v4 (x4-2) , 11.2.0.4, OEL 6. High capacity disks, flash
> configured as cache, rather than independent disks.
>
> The DB is backing an in-house java/tomcat OLTP-ish application. Ops staff
> reported slowdowns at specific times. App slow down verified via StatsD
> collected performance metrics. Watching those metrics we identified the
> issue to be common across the vast majority of DB calls, and roughly 30
> separate application instances. (separate app/web servers per instance, all
> pointing to differing sets of application schemas within the same DB)
>
> Poking through ASH data the only commonality I found was an increase in
> system IO, which ultimately ended up being RMAN. We were performing very
> simple archived log backups via rman with a parallelism of 4. Backup sets
> are being pushed uncompressed across a 1Gb link to DataDomain mounted via
> NFS.
> App call times went from 1s normally to over 30s at their worst. Backing
> that parallelism off to 1 resulted in a smaller spike with an increased
> duration. Application performance falls off very sharply as the backup
> begins, with a slower 2-3 minute exponential increase in performance as the
> backup completes. The same execution plan has been verified for frequent
> SQL statements during both good and bad times.
> There is nothing specific to individual SQL statements which appears to be
> problematic, all application calls during the time suffer. I've not yet
> found any spikes in metrics that explain such a profound performance impact.
> AWR reports during the time of slowdown seem similar to those when the
> backup is not being performed. We've not been gathering data long enough to
> determine if this is entirely new behavior, or if it's been getting
> progressively worse over time.
>
> Next steps for me: integrating AWR snapshots before and after the backup
> completes to narrow the window AWR is focused on. I'm also working on
> automating 10046 trace for further analysis and comparison.
>
> Short of that, has anyone seen similar behavior? I would have never
> anticipated archived log backups to have this impact, but the timing matches
> up perfectly with 100% repeatability. If not the root cause, it's at least
> a contributor that we've identified as a fact. I would appreciate any
> suggestions of where we should consider continuing our troubleshooting, or
> guidance on what metrics we should gather.
>
> Thank you,
> Ryan--
> http://www.freelists.org/webpage/oracle-l
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Tue May 17 2016 - 14:28:37 CEST

Original text of this message