Re: ASM of any significant value when switching to Direct NFS / NetApp / non-RAC?

From: Austin Hackett <hacketta_57_at_me.com>
Date: Mon, 20 Aug 2012 19:28:28 +0100
Message-id: <E284AA64-E3B8-4117-A4F8-2BFFF744EB17_at_me.com>



Hi Jon

I wasn't aware of that parameter, so it's good to hear about it.
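
For what it's worth, I'm assuming the quickest way to check whether LGWR is actually running at realtime priority on our Linux boxes is something along these lines (the process name assumes the default ora_lgwr_<SID> naming):

  $ ps -eo pid,class,pri,cmd | grep '[o]ra_lgwr'

If the CLS column shows TS rather than RR, LGWR is running timeshare rather than realtime.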

In our case, the log file sync waits are caused by slow disk I/O; they always correspond to massive log write elapsed warnings in the LGWR trace file - I've seen 90+ seconds to write just a couple of KB of redo. I've got a couple of leads: the misconfigured jumbo frames, and some nfsd.tcp.close.idle.notify:warning messages on the filer that correlate with the times the slow writes happen and reference the IP of the NIC on the db host that saw the spike.
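
In case it's useful to anyone following along, this is roughly how I've been pulling the warnings out of the LGWR trace and double-checking the MTU on the storage NIC - the diag path and interface name below are just placeholders for ours:

  $ grep -i "log write elapsed" $ORACLE_BASE/diag/rdbms/<dbname>/<SID>/trace/<SID>_lgwr_*.trc
  $ ip link show eth2    # first line of output reports the interface's current MTU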

We're working on getting some tcpdumps the next time the issue occurs; those should let me validate what the LGWR trace file is telling me.
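
For anyone interested, the capture we have in mind is along these lines - the interface name and filer IP are placeholders, and the rotating files mean it can be left running until the next spike without filling a filesystem:

  $ tcpdump -i eth2 -s 0 -C 100 -W 20 -w /tmp/lgwr_nfs.pcap host <filer-ip> and port 2049

The -C/-W options cycle through twenty ~100 MB files, and port 2049 limits the capture to the NFS traffic.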

Thanks

Austin

On 20 Aug 2012, at 17:04, CRISLER, JON A wrote:

> What is your setting for this parameter ?
>
> SQL> alter system set "_high_priority_processes"='LMS*|VKTM|LGWR'
> scope=spfile sid='*';
>
> System altered.
>
> If LGWR is not set to RT priority it might be the reason behind
> higher log file sync times.
>
> -----Original Message-----
> From: Austin Hackett [mailto:hacketta_57_at_me.com]
> Sent: Sunday, August 19, 2012 5:59 AM
> To: CRISLER, JON A
> Cc: Oracle-L_at_freelists.org
> Subject: Re: ASM of any significant value when switching to Direct
> NFS / NetApp / non-RAC?
>
> Hi Jon
>
> Interesting - thanks for the info.
>
> Yes, we also see those symptoms - a big spike in log file sync,
> accompanied by some GCS waits. When the spikes occurred, we checked
> CPU utilization on the storage controller, and it was less than 50%.
> Write latencies, IOPS, and throughput were all within acceptable
> limits, and actually much lower than in other periods when
> performance had been fine.
>
> We're using dNFS, so we aren't using DM-multipath. In fact, there is
> only a single storage NIC - a decision that precedes me and one we're
> working to address. We are on OEL 5.4, which is interesting.
>
> One idea is that this could be caused by an incorrect MTU on the
> storage NIC. It's currently set to 8000 (a setting I'm told was
> inherited when they switched from Solaris to Linux a while back),
> whereas it's 9000 on the filer and the switch.
>
> Out of curiosity, what's the biggest log write elapsed warning you've
> seen? We see 1 or 2 spikes a week and the biggest has been 92
> seconds - yes, 92 seconds!
>
> On 17 Aug 2012, at 04:35, CRISLER, JON A wrote:
>
>> Austin - we have observed the exact same behavior, and it appears to
>> be periodic spikes on the NetApp controller / CPU utilization in a
>> RAC environment. The info is fuzzy right now, but if you have an
>> LGWR delay, it also causes a GCS delay in passing the dirty block to
>> another node that needs it. In our case it's a SAN-ASM-RAC
>> environment, and the NetApp CPU is always churning above 80%. We
>> found that RH tuning and multipath issues contributed to the cause,
>> and this seems to have been mostly addressed with RH 5.8 (we were on
>> 5.4). In an FC SAN environment, something like Sanscreen that can
>> measure end-to-end FC response time helped to narrow down some of
>> the contributing factors. You can set an undocumented parameter to
>> allow the GCS dirty block to be passed over to the other nodes while
>> an LGWR wait occurs, but you risk data corruption in the event of a
>> node crash (hence we passed on that tip).
>

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Aug 20 2012 - 13:28:28 CDT
