RE: log file sync

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Wed, 30 Sep 2015 18:46:40 -0400
Message-ID: <01f501d0fbd1$e0951e30$a1bf5a90$_at_rsiz.com>



In addition to what Tim wrote, a few more things might be interesting to know:  
  1. Are your CPUs in fact pegged?
  2. If you dump the ps to a file without the grep, what interesting and fun (and potentially harmful) mucking around has someone already done to your FX scheduler priorities? Sorting the output file on the priority column might prove interesting.
  3. If you set up some little program to repeatedly write a small amount of data to the same physical complex your redo logs live on:
    a. how do those writes perform?
    b. does it slow you down?

If 3a is “bad” and 3b is “true”, you probably do need to relieve some bottleneck in your I/O system. That can range from pegged controllers, to interrupted or overloaded communications “wiring” to the disks, to overloaded write-through RAM on a SAN, to competition for the media. You could have a VLAN with improperly configured Quality of Service, for example, so that the people watching movies on your network, who are not supposed to affect your network attached storage, are affecting it. (You might have NONE of these; they are just examples.)
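For what it's worth, the "little program" from item 3 could look something like the sketch below. It is only an illustration: the file path, the write size, and the use of fsync() to force each write to storage are my assumptions, and fsync() is only an approximation of how LGWR writes redo. Point it at a scratch file on the same storage as the redo logs, never at a redo log itself.

```python
import os
import statistics
import sys
import time


def probe_sync_writes(path, count=200, size=512):
    """Write `size` bytes `count` times, forcing each write to storage.

    Returns the latency of each write+fsync pair in milliseconds.
    """
    payload = b"x" * size
    latencies_ms = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # do not return until the data has been flushed
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return latencies_ms


def summarize(latencies_ms):
    """Reduce the raw latencies to a few numbers worth comparing over time."""
    ordered = sorted(latencies_ms)
    return {
        "count": len(ordered),
        "avg_ms": statistics.mean(ordered),
        "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical default path; pass a scratch file on the redo log storage.
    target = sys.argv[1] if len(sys.argv) > 1 else "write_probe.tmp"
    print(summarize(probe_sync_writes(target)))
```

Running it once while the database is quiet and once while it is busy gives a rough answer to 3a, and watching "log file sync" while it runs gives a rough answer to 3b.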

If 1 is “true” AND it corresponds to lgwr butting heads with other processes at the same priority, then you’re probably out of CPU and need to be looking for ways to use less.

2 is wide open: Priorities inverted compared to what they need to be can really screw the pooch.  
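If sorting the ps dump by hand is a nuisance, something along these lines would do it. This is a sketch under assumptions: it expects the full `ps -ecf` output including a header row naming the CLS and PRI columns (the grep in the original post strips that header), and the sample text below is made up apart from the lgwr line.

```python
# Hypothetical capture of `ps -ecf`; the header row and the last two
# processes are assumptions made for illustration.
SAMPLE = """\
     UID   PID  PPID  CLS PRI    STIME TTY         TIME CMD
  oracle 25393 25382   FX  60   Aug 29 ?         545:15 ora_lgwr_cg
  oracle 25395 25382   FX   0   Aug 29 ?          12:01 ora_dbw0_cg
  oracle 21338 21337   TS  59 15:22:17 pts/85      0:00 some_batch_job
"""


def parse_ps_ecf(text):
    """Extract scheduling class and priority from each process line.

    CLS and PRI come before STIME, so a whitespace split is safe for them
    even when STIME contains a space (for example "Aug 29").
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    cls_idx = header.index("CLS")
    pri_idx = header.index("PRI")
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) <= pri_idx:
            continue  # skip truncated lines
        try:
            pri = int(fields[pri_idx])
        except ValueError:
            continue  # skip lines whose priority is not numeric
        rows.append({"cls": fields[cls_idx], "pri": pri, "line": ln})
    return rows


def sort_by_priority(rows):
    """Group by scheduling class, highest priority first within each class."""
    return sorted(rows, key=lambda r: (r["cls"], -r["pri"]))


if __name__ == "__main__":
    for row in sort_by_priority(parse_ps_ecf(SAMPLE)):
        print(row["line"])
```

Anything sitting at the same class and priority as lgwr, or above it, is worth a second look.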

Good luck. Tim’s advice will probably get you directly to identification of your real problem.

mwf  

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Tim Gorman
Sent: Wednesday, September 30, 2015 5:23 PM
To: oracle-l_at_freelists.org
Subject: Re: log file sync  

Michael,

Regardless of what anyone else may have done, increasing the priority of LGWR indicates that the problem has been definitively diagnosed as lack of CPU.

Is that true? Has that diagnosis been made?

Or is it just a guess?

The LGWR process posts the "log file parallel write" event when it is trying to flush redo from the log buffer to the online redo log files. Is that event showing significant timings?

Be aware that at any point in time, there is only one process posting "log file parallel write" while many processes may be posting "log file sync". So, the cumulative timings for the latter may look higher on an AWR report, but if you correct for concurrency, you might see a strong correlation between "log file sync" and "log file parallel write", which indicates that waits for I/O are the real cause of "log file sync", not waits for CPU.

Please consider trying the following query in ASH...

select    sample_time, event, count(*) cnt
from      v$active_session_history
where     event in ('log file sync','log file parallel write')
group by  sample_time, event
order by  sample_time, event;

From this query, I would be looking for times when "log file sync" occurs *WITHOUT* an accompanying "log file parallel write".

Think of a "log file parallel write" wait by the LGWR as a "stimulus", and a "log file sync" wait by one or many foreground processes as a "response". This is another way of saying that if the LGWR is slow in writing redo from the Log Buffer to the online redo log files, then it is going to keep the Log Buffer locked for a longer period of time, and foreground server processes are going to wait, and while they wait they are going to post "log file sync". So "log file parallel write" and "log file sync" aren't supposed to occur at exactly the same instant, but maybe we can see them in the same "neighborhood" of time often enough to show correlation?

Now remember that ASH is only sampling every 1000 milliseconds (i.e. 1 second), and that it is possible for many waits on either of these events to have occurred during the course of a full second without being sampled on the 1000th millisecond by ASH. But this level of granularity is the information that we have, so we have to work with it.
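To put a rough number on how much a one-second sampler misses, here is a back-of-the-envelope sketch. It assumes each wait starts at a uniformly random moment relative to the sample instant; the function names and the example wait durations are mine, not measurements from any system.

```python
ASH_INTERVAL_MS = 1000.0  # sampling interval stated in the text above


def sample_probability(wait_ms, interval_ms=ASH_INTERVAL_MS):
    """Chance that a single wait of `wait_ms` spans at least one sample instant,
    assuming the wait starts at a uniformly random time."""
    return min(1.0, wait_ms / interval_ms)


def expected_samples(num_waits, wait_ms, interval_ms=ASH_INTERVAL_MS):
    """Expected number of sampled rows produced by `num_waits` waits
    of `wait_ms` each, under the same uniform-start assumption."""
    return num_waits * wait_ms / interval_ms


if __name__ == "__main__":
    for wait_ms in (1, 5, 50, 500):  # hypothetical wait durations
        print(wait_ms, sample_probability(wait_ms), expected_samples(1000, wait_ms))
```

Under that assumption short waits are caught only rarely, so an absence of "log file parallel write" rows in a handful of samples proves little; the pattern over many samples is what matters.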

So using this query, try to see if "log file sync" is occurring most often when "log file parallel write" is occurring, or not.

If so, then the problem may not be CPU starvation for LGWR, but slow I/O writing to the online redo log files. Granting higher CPU priority to LGWR is not going to help. In fact, you may hurt things by starving other processes of CPU priority.

If not, then it won't be definitive, but it will seem unlikely that slow I/O to the online redo log files is an issue, and perhaps granting higher CPU priority may help.
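If eyeballing the query output gets tedious, the "neighborhood" check could be sketched roughly as follows. The row layout (sample time, event, count), the one-second window, and the sample rows are assumptions for illustration; feed it whatever the ASH query actually returns.

```python
from datetime import datetime, timedelta

SYNC = "log file sync"
WRITE = "log file parallel write"


def classify_sync_samples(rows, window_seconds=1):
    """Split "log file sync" samples by whether a "log file parallel write"
    sample falls within `window_seconds` of them.

    rows: iterable of (sample_time: datetime, event: str, cnt: int).
    Returns (accompanied, unaccompanied), each a list of (sample_time, cnt).
    """
    rows = list(rows)
    window = timedelta(seconds=window_seconds)
    write_times = [t for t, event, _ in rows if event == WRITE]
    accompanied, unaccompanied = [], []
    for t, event, cnt in rows:
        if event != SYNC:
            continue
        near_write = any(abs(t - w) <= window for w in write_times)
        (accompanied if near_write else unaccompanied).append((t, cnt))
    return accompanied, unaccompanied


if __name__ == "__main__":
    base = datetime(2015, 9, 30, 15, 0, 0)  # made-up sample times
    rows = [
        (base, WRITE, 1),
        (base, SYNC, 4),
        (base + timedelta(seconds=1), SYNC, 2),
        (base + timedelta(seconds=10), SYNC, 3),
    ]
    with_write, without_write = classify_sync_samples(rows)
    print("near a write:", sum(c for _, c in with_write))
    print("no write nearby:", sum(c for _, c in without_write))
```

A large share of sync samples with no write nearby would point away from slow redo I/O; a small share would point toward it, with the sampling caveat above kept in mind.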

Anyway, please let us know what you find.

Happy hunting!

Thanks!

-Tim

On 9/30/15 13:26, Michael Calisi wrote:

I have been dealing with log file sync showing up in the top 10 of my AWR report.

I made changes to eliminate the log file sync; however, there was one additional recommendation that I don't feel comfortable with and can't seem to get clarification on from support.

They recommend promoting the LGWR background process to a critical thread: placing the thread in the FX scheduling class at priority 60 will cause it to be treated as critical.

When I checked our current setting, it was already at 60.

SQL> host ps -ecf | grep lgwr

  oracle 21338 21337   FX   0 15:22:17 pts/85      0:00 grep lgwr
  oracle 21337 18771   FX   0 15:22:17 pts/85      0:00 /bin/sh -c ps -ecf | grep lgwr
  oracle 25393 25382   FX  60   Aug 29 ?         545:15 ora_lgwr_cg

What I don't understand is how this value is determined, whether it can go higher, and what the consequences of raising it are. I'm not sure how increasing the value helps the I/O issue.

Has anyone ever made changes to this process?

--

http://www.freelists.org/webpage/oracle-l
Received on Thu Oct 01 2015 - 00:46:40 CEST
