
RE: ocssd

From: Kevin Closson <kevinc_at_polyserve.com>
Date: Fri, 19 May 2006 15:27:06 -0700
Message-ID: <5D2570CAFC98974F9B6A759D1C74BAD0E5A4AE@ex2.ms.polyserve.com>

 >>>

>>>That's what I was told today during our ASM training (or
>>>what I assumed based on information I got). For Oracle, IO
>>>fencing requires at least one voting disk and this is what's
>>>configured with CRS (voting disks). In non-RAC setup, you
>>>don't specify voting disks, thus, no IO fencing. Right? From
>>>my point of view, IO fencing is only needed to support split
>>>brain resolution for clustered setup and to evict nodes
>>>(anything I missed?).

OK, this makes sense, Alex. But applying the term "I/O fencing" to the thing that CRS does is a little off base. Not to get crazy about semantics, but "I/O Fencing" is a term applied to technology that isolates a server from I/O. That is, the server remains alive, but CANNOT touch storage.

What CRS does is referred to as "Server Fencing". The approach CRS takes to Server Fencing is routinely mislabeled STONITH (Shoot The Other Node In The Head). CRS does not, in fact, implement STONITH. What CRS implements is something that still has not been given a term. I call it ATONTRI (Ask The Other Node To Reboot Itself). You can read /etc/init.d/init.cssd on a Linux RAC system to see what I mean. I'm not casting stones by any means. How could I, since I'm a nobody? However, I assert that it is of the UTMOST importance that IT professionals involved with a RAC implementation be keenly aware of what they are actually running. It is lack of knowledge regarding the underpinnings that will come back and bite you.

I know everyone out there builds these Linux RAC systems and is generally happy. But that is the extent of it. Generally speaking, people are not harsh enough on the technology. To foster assurance that your RAC kit is going to hold together, you need to load a test cluster and torture test it. Yes, that means you will need to physically touch the servers, switches, cables, GBICs, all of it. Inject faults. Inject multiple cascading faults. Observe what happens. Do you see ancillary failures (e.g., you sever an I/O path from node 3 and boom, node 2 goes down too)? Do you ever see any total outages? Etc. Fault injection testing should be a natural when you shell out for such expensive software (RAC).
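
To make the "inject multiple cascading faults" idea concrete, here is a minimal sketch of how one might enumerate a fault-injection test plan. The component names and the fault_plan helper are hypothetical placeholders, not taken from any real cluster or tool; substitute your own inventory.

```python
from itertools import combinations

# Hypothetical inventory of a small test cluster; every name is a placeholder.
COMPONENTS = [
    "node1-hba", "node2-hba", "node3-hba",
    "fc-switch-a", "interconnect-switch", "node2-nic",
]

def fault_plan(components, max_simultaneous=2):
    """Yield every set of 1..max_simultaneous components to fail together."""
    for k in range(1, max_simultaneous + 1):
        for combo in combinations(components, k):
            yield combo

plan = list(fault_plan(COMPONENTS))
for step, combo in enumerate(plan, 1):
    # For each step, break the listed parts and record which nodes stay up.
    print(f"test {step:02d}: fail {' + '.join(combo)}")
```

Even six components with at most two simultaneous faults gives 21 test cases, which is why this kind of testing needs a written plan and a cluster you are allowed to break.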

Why is ATONTRI interesting? Well, consider what happens when the reason a node is "being fenced" is a catatonic situation. That is, a node in the RAC cluster is not performing its check-ins to the CSS disk. OK, fine. But what if it isn't checking in because the kernel is cranky? Have you ever seen a system so overloaded you can't execute a command? Ever seen a system in desperation VM code? Of course we all have. So ask yourself how in the world /etc/init.d/init.cssd is going to successfully execute the reboot(8) command. How many fork calls is that? The shell forks and execs reboot, and reboot is a dynamically linked binary. That means the overloaded or catatonic server needs to be able to allow the reboot command to mmap shared libraries, get file descriptors, etc. Have you ever seen a system where that sort of processing cannot get through? Of course you have.
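
As a rough illustration of why "ask the node to run a command" can fail, here is a sketch that spawns a child process and reports each way the attempt can go wrong. It is not what init.cssd does; ask_node_to_run and STAND_IN are hypothetical names, and the stand-in command is a harmless Python child rather than reboot.

```python
import subprocess
import sys

def ask_node_to_run(cmd, timeout_s=5.0):
    """Try to spawn cmd as a child process; return (ok, reason)."""
    try:
        # The OS must create a process, exec the binary, map its shared
        # libraries and hand out descriptors before the child does anything.
        done = subprocess.run(cmd, timeout=timeout_s)
    except OSError as exc:
        # Process creation or exec failed (e.g. EAGAIN, ENOMEM, ENOENT).
        return False, f"could not start: {exc}"
    except subprocess.TimeoutExpired:
        return False, "started but did not finish in time"
    return done.returncode == 0, f"exit status {done.returncode}"

# Harmless stand-in for the reboot command: a child that exits at once.
STAND_IN = [sys.executable, "-c", "pass"]
ok, reason = ask_node_to_run(STAND_IN)
print(ok, reason)
```

On a healthy box this succeeds every time. The point is that each of the failure branches is reachable on a server that is starved for memory or process slots, and a self-fencing scheme depends on none of them being hit.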

If you've made it this far, ponder for a moment what happens if a RAC node has been told to ATONTRI and it didn't because it couldn't. It is no longer a viable member of the cluster but it sure has a path to storage and it has electricity. What happens if the catatonic state was transient? Maybe, 2 minutes, who knows? Are there I/O requests queued in the SCSI midlayer? Do you think those might be I/Os headed for an Oracle datafile?
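
The hazard can be sketched with a toy model of a shared disk. The timeline, node names, and the replay function below are all hypothetical; this models only the ordering of writes, not any real SCSI or Oracle behavior.

```python
def replay(writes, fenced=()):
    """Apply (time, node, block, value) writes in time order to a shared disk.

    Writes from nodes listed in `fenced` never reach storage.
    """
    disk = {}
    for _time, node, block, value in sorted(writes):
        if node in fenced:
            continue
        disk[block] = (node, value)
    return disk

# Hypothetical timeline, in seconds. node2 queued its write before it went
# catatonic and was evicted; the write only lands once node2 thaws.
writes = [
    (130, "node1", "blk7", "v2 written after recovery"),
    (180, "node2", "blk7", "v1 stale, queued before eviction"),
]

print(replay(writes)["blk7"])                    # node2 was asked to reboot but could not
print(replay(writes, fenced={"node2"})["blk7"])  # node2 was really cut off
```

In the first case the stale write from the evicted node lands last and overwrites what the survivor wrote. In the second case the survivor's data stands, which is the whole purpose of fencing.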

This is not FUD. These are real clustering concerns. This is why PolyServe implements assured peer-fencing. There will never be a "missed fencing operation" with our stuff...not down to the very last 2 nodes...and then there is no split-brain because we use a much more sophisticated membership algorithm than simple "who's got more".

Oracle instituted the Clusterware Compatibility Program (http://www.oracle.com/technology/software/oce/oce_fact_sheet.htm), under which we have been certified, to make sure that host clusterware doesn't in fact weaken ATONTRI. In our case, we add value because we run in kernel mode and nodes do not fence themselves. I'm writing a piece to describe the relationship between CRS and non-integrated (compatible) host clusterware and how our two fencing options are more reliable than other fencing technologies. I'll post a URL here when I finish it. I never bothered writing it before because Oracle had no certification program for us, so we only sold to shops that had a mandate to go live and be successful at the high end with Linux clustering and had generally failed with other clustering solutions.

>>>
>>>Today, I did a test - I created a second ASM instance on the
>>>same host and databases were not able to register with this
>>>instance unless I register this new ASM instance with CRS
>>>(and consequently CSS is aware). There was some problem with
>>>CRS behaving strangely but this is another story.
>>>
>>>In the end, I won't tell you "yes, I am 100% sure" unless I
>>>trace the processes involved. I am positive that this is the
>>>case and I actually can try to trace it down (if I have
>>>enough spare time).
>>>
>>>2006/5/19, Kevin Closson <kevinc_at_polyserve.com>:
>>>> >>>
>>>> >>>It was quite a while since this thread posted but in the
>>>> >>>meantime I figured out that ASM needs ocssd daemon because
>>>> >>>this is the way of establishing communications between
>>>> >>>database instances and ASM instance.
>>>>
>>>> Alex, are you sure that is why it needs ocssd and not just for
>>>> "fencing" functionality?
>>>> --
>>>> http://www.freelists.org/webpage/oracle-l
>>>>
>>>>
>>>>
>>>
>>>
>>>--
>>>Best regards,
>>>Alex Gorbachev
>>>
>>>http://oracloid.blogspot.com
>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri May 19 2006 - 17:27:06 CDT
