RE: Oracle clusterware related question
Date: Thu, 10 May 2012 07:08:12 +0000
The tests were like this:
- CRS, CRS logs on SAN disks -> unplug cables from HBA -> node does not reset
- CRS, CRS logs on local disks -> unplug cables from HBA -> node does reset
In Amir's setup the CRS and CRS logs are on NFS, but the problem is the same. And you probably will not find anything in the logfiles if the cluster processes cannot write to them. :-)
As you stated, a node cannot commit suicide if its processes are hanging in an I/O path. This is of course different in configurations with IPMI.
From: Martin Berger [mailto:martin.a.berger_at_gmail.com]
Sent: Tuesday, May 08, 2012 8:03 PM
Cc: tim_at_evdbt.com; Mathias Zarick; oracle-l_at_freelists.org
Subject: Re: Oracle clusterware related question
In Oracle Clusterware, no node can be evicted by the remote nodes. The 'others' can only exclude a node and hope that it commits suicide.
The problem here is that on your hanging node the clusterware processes are hanging in I/O to the logfiles. As your NFS mount does not disappear, the filehandles are still open. It seems that writing to the logfiles is a synchronous task, so when the processes hang in file I/O, they cannot perform higher-priority tasks such as killing the node.
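To illustrate the point about synchronous log writes, here is a minimal Python sketch. It is only an analogy and not how the clusterware is actually implemented: a pipe that nobody reads stands in for the hung NFS mount, and the function and variable names are made up for this example. The worker blocks in its "log write" and therefore never reaches the step that would stand for killing the node.

```python
import os
import threading


def fencing_reached_after_hung_write(timeout_s: float = 0.5) -> bool:
    """Return True if the worker got past its log write within timeout_s."""
    # Nobody ever reads from read_fd; it only has to stay open so that the
    # writer blocks instead of failing with a broken pipe.
    read_fd, write_fd = os.pipe()
    reached_fencing = threading.Event()

    def worker() -> None:
        # Stand-in for a synchronous log write to a hung mount: the payload
        # is far larger than the pipe buffer, so os.write() blocks.
        os.write(write_fd, b"x" * (16 * 1024 * 1024))
        # Stand-in for the higher-priority task (killing the node).
        # It is never reached while the write above is stuck.
        reached_fencing.set()

    # Daemon thread, so the stuck worker does not keep the script alive.
    threading.Thread(target=worker, daemon=True).start()
    return reached_fencing.wait(timeout_s)


if __name__ == "__main__":
    print("fencing step reached:", fencing_reached_after_hung_write())
```

A write that fails (for example because the mountpoint disappears) would return an error and let the worker continue, which may be why the SAN and NFS cases behave differently.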
You can try to mount your log-directories 'soft' - maybe this solves the hanging issue. But I don't know which side-effects this might cause!
I am not sure whether CRS shows the same behavior when a logfile write hangs (as on NFS) as when a logfile write fails (as when the mountpoint disappears because the SAN network is removed) - Mathias, do you remember the details? But as those tests were done back in 18.104.22.168, I should probably run the test case again.
I second Mathias: grid logs (and also grid binaries) should be local! Everything else, like RDBMS binaries and logs, can be on any remote system.
On Tue, May 8, 2012 at 6:11 PM, Hameed, Amir <Amir.Hameed_at_xerox.com> wrote:
> So, if voting disks are not updated by a certain node for any reason
> for an extended period of time, that node would not be evicted by the
> remote nodes from the cluster?
> From: Tim Gorman [mailto:tim_at_evdbt.com]
> Sent: Tuesday, May 08, 2012 12:05 PM
> To: Mathias.Zarick_at_trivadis.com; Hameed, Amir
> Cc: oracle-l_at_freelists.org
> Subject: Re: Oracle clusterware related question
> Mathias hit the nail on the head. Think about it this way: NFS
> errors and disconnects typically do not kill running programs, but
> cause them to hang. If the binaries for the clusterware are
> themselves on NFS, then clearly they are going to hang also.