RE: ASM of any significant value when switching to Direct NFS / NetApp / non-RAC?

From: CRISLER, JON A <JC1706_at_att.com>
Date: Fri, 17 Aug 2012 03:35:13 +0000
Message-ID: <9F15274DDC89C24387BE933E68BE3FD33933FF_at_MISOUT7MSGUSR9E.ITServices.sbc.com>



Austin - we have observed the exact same behavior, and it appears to be caused by periodic spikes in NetApp controller CPU utilization in a RAC environment. The info is fuzzy right now, but if you have an LGWR delay, it also causes a GCS delay in passing the dirty block to another node that needs it. In our case it's a SAN-ASM-RAC environment, and the NetApp CPU is always churning above 80%. We found that RH tuning and multipath issues contributed to the cause, and this seems to have been mostly addressed with RH 5.8 (was 5.4). In an FC SAN environment, something like SANscreen that can measure end-to-end FC response time helped narrow down some of the contributing factors. You can set an undocumented parameter to allow the GCS dirty block to be passed over to the other nodes while an LGWR wait occurs, but you risk data corruption in the event of a node crash (hence we passed on that tip).
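
A quick way to see whether the two line up is the wait interface. Just a rough sketch (event names assume 11g-style naming, e.g. 'gc buffer busy acquire'; adjust for your version):

   select inst_id, event, total_waits,
          round(time_waited_micro / nullif(total_waits, 0) / 1000, 2) avg_ms
   from   gv$system_event
   where  event in ('log file parallel write',
                    'gcs log flush sync',
                    'gc buffer busy acquire')
   order  by inst_id, event;

If 'gcs log flush sync' tracks 'log file parallel write', the GCS stall is almost certainly the same LGWR stall seen from the other node's side.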

-----Original Message-----

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Austin Hackett
Sent: Wednesday, August 15, 2012 4:52 PM
To: Oracle-L_at_freelists.org
Subject: Re: ASM of any significant value when switching to Direct NFS / NetApp / non-RAC?

Hi Dana
This info doesn't exactly relate to ASM, but hopefully it'll be of use to you in the future...

I've recently started a new role at a shop that uses Linux, Direct NFS and NetApp (no ASM), and as others have suggested, the solution does have a number of nice management features.

However, I am finding the apparent lack of read and write latency stats frustrating.

Something I'm currently looking into is occasional spikes in redo log write times. I know these are happening because there are "log write elapsed time" warnings in the LGWR trace file. When these spikes occur, NetApp Ops Manager reports 2 - 3 millisecond write latencies for the volume in question.
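
The wait histogram at least quantifies how often the spikes happen from the database side (a rough sketch; bucket boundaries are whatever v$event_histogram gives you):

   select event, wait_time_milli, wait_count
   from   v$event_histogram
   where  event = 'log file parallel write'
   order  by wait_time_milli;

That tells you how often writes land in the higher buckets, but not whether the time went on the host, the network or the filer.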

What I'd like to be able to do is cross-check these warnings against host-level I/O stats, but there doesn't seem to be a way of achieving this.

Using the standard Linux NFS client, iostat can show you the number of reads, writes, etc., but not latencies. With 2.6.17 and later kernels it seems that counters are available to report latency information, and there are scripts like nfs-iostat.py out there which will display this info: http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=blob;f=tools/nfs-iostat/nfs-iostat.py;h=9626d42609b9485c7fda0c9ef69d698f9fa929fd;hb=HEAD

However, because Direct NFS bypasses the host's NFS mount points (the Oracle database processes mount the files directly), it's my understanding that the above tools won't include any operations performed by Direct NFS in their output. There is a post about this here: http://glennfawcett.wordpress.com/2009/11/25/monitoring-direct-nfs-with-oracle-11g-and-solaris-pealing-back-the-layers-of-the-onion/
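
As a sanity check that I/O really is going through Direct NFS rather than the kernel client, the v$dnfs_servers and v$dnfs_files views list the mounts and files the database processes have open. Something along these lines (column names from memory, so verify against your version):

   select svrname, dirname from v$dnfs_servers;
   select filename from v$dnfs_files;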

Now, whilst the v$dnfs_stats view does record the number of reads, writes, etc., it doesn't have any latency data, which just leaves you with the usual v$ and AWR views like v$eventmetric, v$system_event, etc. And if you're trying to confirm that the issue is at the host level and not in Oracle, this doesn't help you much. So, at the moment I'm missing being able to run iostat and see svc_t and the like...
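
The nearest stand-in for iostat's svc_t from inside the database is just dividing time waited by waits for the I/O events, e.g. (rough sketch):

   select event, total_waits,
          round(time_waited_micro / nullif(total_waits, 0) / 1000, 2) avg_ms
   from   v$system_event
   where  event in ('db file sequential read', 'db file scattered read',
                    'direct path read', 'direct path write',
                    'log file parallel write')
   order  by avg_ms desc;

...but of course that's the Oracle-side view, which is exactly what I'm trying to cross-check against the host.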

Incidentally, Glenn Fawcett has a nice script here for capturing v$dnfs_stats output: http://glennfawcett.wordpress.com/2010/02/18/simple-script-to-monitor-dnfs-activity

It's also worth being aware of bugs 13043012 and 13647945.

Hope that helps

Austin

--

http://www.freelists.org/webpage/oracle-l
