Oracle-L: RE: Can't fire up 2nd ORACM of 2-node Linux RAC

From: Nick Wagner <NWagner_at_goldengate.com>
Date: Fri, 29 Aug 2003 09:59:27 -0800
Message-ID: <F001.005CE016.20030829095927@fatcity.com>

By any chance, do you have an Oracle 7 or 8 instance (or even a listener) running on that second machine? I think I've seen something similar to this when we had the wrong libraries linked to the database on one of the nodes.

Nick

-----Original Message-----
Sent: Friday, August 29, 2003 10:39 AM
To: Multiple recipients of list ORACLE-L

Hey all,

We're testing the usability of 9iRAC. Our test hardware is a 2-node Intel RedHat9 "cluster" (AS2.1 won't recognize our hardware) with OCFS on a shared SCSI drive, set up using the "How to Build a $1000 RAC" from www.bradmark.com. Following the "Step-By-Step Installation of 9.2.0.4 RAC on Linux" MetaLink doc (with adjustments for doc inaccuracies), I'm about ready to create the DB. However, when I try to fire up the ORACM on the second node, it errors out. The OCFS mount point is "/spice", and all seems to be OK with it from the OS level of either node. The start of the cm.log is:

oracm, version[ 9.2.0.2.0.47 ] started {Fri Aug 29 09:43:35 2003 }
KernelModuleName is hangcheck-timer {Fri Aug 29 09:43:35 2003 }

OemNodeConfig(): Network Address of node0: 192.168.1.241 (port 9998)
 {Fri Aug 29 09:43:35 2003 }
OemNodeConfig(): Network Address of node1: 192.168.1.242 (port 9998)
 {Fri Aug 29 09:43:35 2003 }

>WARNING: OemInit2: Opened file(/spice/RAC_quorum.dbf 8), tid = main:16384
file = oem.c, line = 491 {Fri Aug 29 09:43:35 2003 }
Debug Hang : ClusterListener (PID=22194) Registered with watchdog daemon. {Fri Aug 29 09:43:37 2003 }
InitializeCM: ModuleName = hangcheck-timer {Fri Aug 29 09:43:37 2003 }
InitializeCM: Kernel module hangcheck-timer is already loaded {Fri Aug 29 09:43:37 2003 }
Debug Hang : CmConnectListener (PID=22195):Registered with watchdog daemon. {Fri Aug 29 09:43:37 2003 }
Debug Hang :StartNMMon (PID=22187) Registered with watchdog daemon. {Fri Aug 29 09:43:37 2003 }
CreateLocalEndpoint(): Network Address: 192.168.1.242 {Fri Aug 29 09:43:37 2003 }
Debug Hang :PollingThread (PID=135159137): Registered with {Fri Aug 29 09:43:37 2003 }
Debug Hang : DiskPingThread (PID=135159137): Registered with {Fri Aug 29 09:43:37 2003 }
Debug Hang :SendingThread (PID=135159137): Registered with {Fri Aug 29 09:43:37 2003 }
--- DUMP GROUP STATE DB ---
--- END OF GROUP STATE DUMP ---

All looks OK there. At least it looks the same as the first node that was successful in starting. The trace part is long and wouldn't look nice here, but here's the end of it (hopefully the pertinent part):

>TRACE: SendingThread: Spawned with tid 0x1c008, 0x0., tid = 114696 file
= nmmember.c, line = 511 {Fri Aug 29 09:43:37 2003 }
Debug Hang :SendingThread (PID=135159137): Registered with {Fri Aug 29 09:43:37 2003 }
>TRACE: SendingThread (pid=22198, tid=114696): Registered with watchdog
daemon., tid = 114696 file = nmmember.c, line = 576 {Fri Aug 29 09:43:37 2003 }
>TRACE: HandleJoin(): src[1] dest[1] dom[0] seq[1] sync[0], tid =
ClusterListener:49156 file = nmlisten.c, line = 346 {Fri Aug 29 09:43:37 2003 }
>TRACE: HandleJoin(): JOIN from node(1)->(1), tid = ClusterListener:49156
file = nmlisten.c, line = 362 {Fri Aug 29 09:43:37 2003 }
>TRACE: HandleSync(): src[0] dest[1] dom[0] seq[6] sync[1], tid =
ClusterListener:49156 file = nmlisten.c, line = 506 {Fri Aug 29 09:43:37 2003 }
>TRACE: SendAck(): node(0) domain(0) syncSeqNo(1) type(11), tid =
ClusterListener:49156 file = nmmember.c, line = 1913 {Fri Aug 29 09:43:37 2003 }
>TRACE: HandleVote(): src[0] dest[1] dom[0] seq[7] sync[1], tid =
ClusterListener:49156 file = nmlisten.c, line = 643 {Fri Aug 29 09:43:38 2003 }
>TRACE: SendVoteInfo(): node(0) domain(0) syncSeqNo(1), tid =
ClusterListener:49156 file = nmmember.c, line = 1727 {Fri Aug 29 09:43:38 2003 }
>TRACE: HandleShutdown(): src[0] dest[1] dom[0] seq[0] sync[1] type[4],
tid = ClusterListener:49156 file = nmlisten.c, line = 1087 {Fri Aug 29 09:43:39 2003 }
>TRACE: IncrementEventValue: *(80f2900) = (1, 1), tid =
ClusterListener:49156 file = unixinc.c, line = 253 {Fri Aug 29 09:43:39 2003 }
--- End Dump ---
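
For anyone who wants to scan a long trace like this without reading every entry, here is a minimal Python sketch that pulls out the Handle* entries together with the src and dest indexes they print. The regular expression and the helper name `handler_events` are my own; nothing here comes from Oracle tooling, and the sketch only summarizes what the trace already shows rather than interpreting it.

```python
import re

# Excerpt copied from the trace above; wrapped entries are kept as posted.
TRACE = """\
>TRACE: HandleJoin(): src[1] dest[1] dom[0] seq[1] sync[0], tid =
ClusterListener:49156 file = nmlisten.c, line = 346 {Fri Aug 29 09:43:37 2003 }
>TRACE: HandleSync(): src[0] dest[1] dom[0] seq[6] sync[1], tid =
ClusterListener:49156 file = nmlisten.c, line = 506 {Fri Aug 29 09:43:37 2003 }
>TRACE: HandleVote(): src[0] dest[1] dom[0] seq[7] sync[1], tid =
ClusterListener:49156 file = nmlisten.c, line = 643 {Fri Aug 29 09:43:38 2003 }
>TRACE: HandleShutdown(): src[0] dest[1] dom[0] seq[0] sync[1] type[4],
tid = ClusterListener:49156 file = nmlisten.c, line = 1087 {Fri Aug 29 09:43:39 2003 }
"""

# Matches entries of the form ">TRACE: HandleXxx(): src[N] dest[M] ...".
HANDLER = re.compile(r">TRACE:\s+(Handle\w+)\(\):\s+src\[(\d+)\]\s+dest\[(\d+)\]")


def handler_events(text):
    """Return (handler, src, dest) tuples in the order they appear."""
    return [(name, int(src), int(dest)) for name, src, dest in HANDLER.findall(text)]


EVENTS = handler_events(TRACE)
for name, src, dest in EVENTS:
    print(f"{name:15s} src={src} dest={dest}")
```

Run against the excerpt, this lists the join, sync, vote, and shutdown entries in order, which makes it easier to compare the sequence with the cm.log from the node that started successfully.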

There's no ERROR or WARNING listed in the trace part. Hmmmm. Also, here's my cmcfg.ora:

HeartBeat=15000
KernelModuleName=hangcheck-timer
ClusterName=Oracle Cluster Manager, version 9i
PollInterval=1000
MissCount=210
PrivateNodeNames=rac1-private rac2-private
PublicNodeNames=rac1 rac2
ServicePort=9998
#WatchdogSafetyMargin=5000
#WatchdogTimerMargin=60000
CmDiskFile=/spice/RAC_quorum.dbf
HostName=rac1
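
A quick way to sanity-check a file like this on each node is to parse it and compare the node-name settings. The following is a minimal Python sketch, assuming the simple key=value layout shown above with # marking comments. The two consistency checks (equal counts of private and public names, and HostName appearing in one of the lists) are my own assumptions about what is worth verifying, not rules taken from the Oracle documentation.

```python
# The cmcfg.ora settings as posted above, one setting per line.
CMCFG = """\
HeartBeat=15000
KernelModuleName=hangcheck-timer
ClusterName=Oracle Cluster Manager, version 9i
PollInterval=1000
MissCount=210
PrivateNodeNames=rac1-private rac2-private
PublicNodeNames=rac1 rac2
ServicePort=9998
#WatchdogSafetyMargin=5000
#WatchdogTimerMargin=60000
CmDiskFile=/spice/RAC_quorum.dbf
HostName=rac1
"""


def parse_cmcfg(text):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def node_name_problems(settings):
    """Return a list of human-readable inconsistencies (hypothetical checks)."""
    problems = []
    private = settings.get("PrivateNodeNames", "").split()
    public = settings.get("PublicNodeNames", "").split()
    if len(private) != len(public):
        problems.append("PrivateNodeNames and PublicNodeNames differ in length")
    host = settings.get("HostName")
    if host not in private + public:
        problems.append(f"HostName {host!r} is not in either node list")
    return problems


SETTINGS = parse_cmcfg(CMCFG)
PROBLEMS = node_name_problems(SETTINGS)
print(SETTINGS["PrivateNodeNames"].split(), PROBLEMS)
```

Running the same script against the copy of the file on each node, and comparing the results, would show at a glance whether the two nodes disagree about names or ports.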

I've installed and verified the hangcheck-timer kernel module in favor of the Watchdog timer, as the docs say to do. I've tried blowing away the shared quorum file, recreating it with touch, recreating it with dd, and opening up security on the file and directory, all to no avail. The one problem I know I had was that I had the local node aliased to localhost in my /etc/hosts. Everything seemed to work, but instead of having a "cluster", I had two separate nodes sharing a disk. Once I changed /etc/hosts, I started getting this problem. There are a few MetaLink Forum posts just like this, but none have been resolved (and I just lost my net connection).
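
The /etc/hosts aliasing problem described above can be checked for mechanically. Here is a minimal Python sketch that reports any cluster node name sitting on a loopback address. The sample hosts content is hypothetical (the real file was not posted), and pairing the 192.168.1.x addresses from the log with the rac*-private names is an assumption made only for illustration.

```python
import ipaddress


def loopback_names(hosts_text):
    """Collect every hostname or alias listed on a loopback address."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) < 2:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a parseable address, skip the line
        if address.is_loopback:
            names.update(fields[1:])
    return names


def misaliased_nodes(hosts_text, node_names):
    """Return the node names that resolve to loopback in this hosts text."""
    on_loopback = loopback_names(hosts_text)
    return sorted(name for name in node_names if name in on_loopback)


# Hypothetical hosts file showing the kind of aliasing described above.
SAMPLE_HOSTS = """\
127.0.0.1      localhost rac1
192.168.1.241  rac1-private
192.168.1.242  rac2-private
"""
NODE_NAMES = ["rac1", "rac2", "rac1-private", "rac2-private"]

FLAGGED = misaliased_nodes(SAMPLE_HOSTS, NODE_NAMES)
print(FLAGGED)
```

To run it against a real node, read /etc/hosts into a string and pass it in place of SAMPLE_HOSTS; an empty result means none of the listed node names are on loopback.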

Anyone care to take a stab at it? TIA!
Rich

Rich Jesse                           System/Database Administrator
rjesse_at_qtiworld.com                  Quad/Tech Inc, Sussex, WI USA

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.net
-- 
Author: Jesse, Rich
  INET: Rich.Jesse_at_qtiworld.com

Fat City Network Services    -- 858-538-5051 http://www.fatcity.com
San Diego, California        -- Mailing list and web hosting services
---------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Author: Nick Wagner
  INET: NWagner_at_goldengate.com

Received on Fri Aug 29 2003 - 12:59:27 CDT