Re: ASM of any significant value when switching to Direct NFS / NetApp / non-RAC?

From: Austin Hackett <hacketta_57_at_me.com>
Date: Sun, 19 Aug 2012 10:58:58 +0100
Message-id: <19317444-0942-40F2-BAA5-E9B53CB6784C_at_me.com>



Hi Jon

Interesting - thanks for the info.

Yes, we also see those symptoms - a big spike in log file sync, accompanied by some GCS waits. When the spikes occurred, we checked CPU utilization on the storage controller and it was less than 50%. Write latencies, IOPS, and throughput were all within acceptable limits, and in fact much lower than during other periods when performance had been fine.

We're using dNFS, so we aren't using DM-multipath. Indeed, there is only a single storage NIC - a decision that predates me and one we're working to address. We are on OEL 5.4, which is interesting given your move to RH 5.8.
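For what it's worth, once the second NIC is in, my understanding is that dNFS can do its own load balancing and failover across both paths via oranfstab, so DM-multipath still shouldn't be needed. Something along these lines (server name, addresses, and paths below are just placeholders - check the dNFS docs for your version):

   server: filer1
   local: 192.168.10.11
   path: 192.168.10.51
   local: 192.168.11.11
   path: 192.168.11.51
   export: /vol/oradata mount: /u02/oradata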

One idea is that this could be caused by a mismatched MTU on the storage NIC. It's currently set to 8000 (a setting I'm told was inherited when we switched from Solaris to Linux a while back), whereas the filer and the switch are both set to 9000.
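Before changing anything we're going to double-check what actually gets through end to end; something like the following should do it (eth2 and the filer address are placeholders):

   # confirm the NIC's current MTU
   ip link show eth2

   # test a 9000-byte path with DF set: 8972 = 9000 - 20 (IP) - 8 (ICMP)
   ping -M do -s 8972 -c 5 <filer_ip>

If the ping fails with "message too long", something in the path isn't really passing jumbo frames.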

Out of curiosity, what's the biggest "log write elapsed time" warning you've seen? We see one or two spikes a week, and the biggest has been 92 seconds - yes, 92 seconds!
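In case you want to compare like for like, we just grep the warnings out of the LGWR trace files (the diag path below assumes the standard 11g layout, and I believe the warning only fires for writes over 500ms by default):

   grep "Warning: log write elapsed time" \
       $ORACLE_BASE/diag/rdbms/*/*/trace/*_lgwr_*.trc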

On 17 Aug 2012, at 04:35, CRISLER, JON A wrote:

> Austin- we have observed the exact same behavior, and it appears to
> be periodic spikes on the NetApp controller / cpu utilization in a
> RAC environment. The info is fuzzy right now but if you have a LGWR
> delay, it also causes a GCS delay in passing the dirty block to
> another node that needs it. In our case it's a SAN-ASM-RAC
> environment, and the NetApp cpu is always churning above 80%. In
> our case we found that RH tuning and multipath issues contributed
> to the cause, and these seem to have been mostly addressed with
> RH 5.8 (was 5.4). In an FC SAN environment, something like
> SANscreen that can
> measure end to end FC response time helped to narrow down some of
> the contributing factors. You can set an undocumented parameter to
> allow the gcs dirty block to be passed over to the other nodes while
> a lgwr wait occurs, but you risk data corruption in the event of a
> node crash (hence we passed on that tip).

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Aug 19 2012 - 04:58:58 CDT
