Making Oracle on NAS more resilient to network failures

From: Daniel Nichols <daniel_at_NOSPAMrdnichols.com>
Date: Tue, 16 Sep 2003 19:54:57 +0000 (UTC)
Message-ID: <gjqemv4qk6j9srhli0lc7f93rh2otqpkdf@4ax.com>

Hi,

I'm using Oracle 8i on Windows NT 4 with NAS (Network Attached Storage) over the CIFS network protocol.

We've made some interesting findings while testing network failures against this NAS database. They concern how the Microsoft OS 59 error can be held off for more than 20 seconds during a network outage. (This is the error you get when Oracle can't see its disks across a network.)

We have found that by changing Windows NT TCP/IP settings (in the registry) we can extend the time before Oracle encounters the OS 59 error. This means that if the outage lasts only a few seconds, Oracle remains up.

If we increase the SESSTimeOut key from 300 to 600, the number of packet losses required to fail Oracle goes up from 13 to 15. See Q102067: http://support.microsoft.com/default.aspx?scid=kb;en-us;102067

If we increase TcpMaxDataRetransmissions from 5 to 6 (with SESSTimeOut unchanged), the number of packet losses required to fail Oracle goes up from 11 (which is 13 seconds) to 20 (26 seconds). See Q170359: http://support.microsoft.com/?kbid=170359
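
As a rough cross-check of the 13-second and 26-second figures, here is a minimal Python sketch. It assumes that the TCP retransmission timer doubles after every retry and that the connection is abandoned once the wait following the last retransmission expires. The initial timeout is not something we measured; it is fitted from the 13-second observation, and the function names are made up for illustration.

```python
def seconds_until_give_up(initial_rto: float, max_retransmissions: int) -> float:
    """Total wait before TCP abandons a segment, assuming the timer doubles per retry."""
    # One wait after the original send plus one after each retransmission.
    return sum(initial_rto * 2 ** attempt for attempt in range(max_retransmissions + 1))


def fit_initial_rto(observed_seconds: float, max_retransmissions: int) -> float:
    """Back out the initial timeout implied by an observed give-up time."""
    return observed_seconds / (2 ** (max_retransmissions + 1) - 1)


# Assumption: 13 seconds is the give-up time observed with 5 retransmissions.
initial_rto = fit_initial_rto(13.0, 5)

for retries in (5, 6, 7):
    print(retries, round(seconds_until_give_up(initial_rto, retries), 1))
```

Under these assumptions one extra retransmission roughly doubles the time to failure, which predicts about 26 seconds for a value of 6 and so agrees with what we measured. It is only a model of the doubling, not a statement of how the NT stack computes its timers.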

In addition, if we move the NetApp Filer and the Oracle server into the same VLAN, we can decrease the impact of layer 3 network changes or faults.

My question is: how can Oracle tolerate a period of 20 seconds without writing to its disks? It does in our testing! I was wondering if Oracle is "stalling" the database. I'm aware of this behaviour when Oracle cannot archive redo logs and therefore cannot reuse an online log (though I don't know where this is stated in the manuals). Is this what Oracle is doing when it makes its call (a network call in this case) to write to disk? Does it hold all processes on a lock until the I/O call succeeds before allowing more work? If we adapted the OS so that it waited an age to report its network problems, would Oracle really sit and wait? Does Oracle have no timeouts of its own? Is it possible to tell, via an online view, whether the database is in this stalled state?

I have a TAR raised on this but have had no news on these questions for a while.

Thank you,
Daniel.

Received on Tue Sep 16 2003 - 14:54:57 CDT