Need Expert Help - dbca fails (ORA-27047) on raw vol. 9201 SuSE SLES8

From: gdb_dsr <ge_2000_2001_at_yahoo.com>
Date: 27 Sep 2003 20:46:52 -0700
Message-ID: <6299d152.0309271946.66e52644@posting.google.com>

Hi,
I have done this many times on AIX, Solaris, Reliant UNIX, HP-UX and Linux,
almost always with third-party clusterware, but this time I have to work with
Oracle's own cluster manager, and it does not look stable. I am trying to
find the root cause.

The problem is that dbca fails at 37%.

create/cloneDBCreation.log:
ORACLE instance started.
Total System Global Area 252776588 bytes

Fixed Size                   450700 bytes
Variable Size             218103808 bytes
Database Buffers           33554432 bytes
Redo Buffers                 667648 bytes
Create controlfile reuse set database rac *
ERROR at line 1:
ORA-01503: CREATE CONTROLFILE failed
ORA-01565: error in identifying file
'/opt/oracle/oradata/rac/cwmlite01.dbf'
ORA-27047: unable to read the header block of file
Linux Error: 4: Interrupted system call

alert_rac1.log:

'/opt/oracle/oradata/rac/undotbs01.dbf' ,
'/opt/oracle/oradata/rac/users01.dbf' ,
'/opt/oracle/oradata/rac/xdb01.dbf'

LOGFILE GROUP 1 ('/opt/oracle/oradata/rac/redo01.log') SIZE 102400K REUSE,
GROUP 2 ('/opt/oracle/oradata/rac/redo02.log') SIZE 102400K REUSE RESETLOGS
Fri Sep 26 13:21:13 2003
lmon registered with NM - instance id 1 (internal mem no 0)
Fri Sep 26 13:21:13 2003
Reconfiguration started
List of nodes: 0,
 Global Resource Directory frozen
one node partition
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps

The list of nodes is always just "0", irrespective of how many nodes are up.

rac1_diag_18951.trc:
*** SESSION ID:(2.1) 2003-09-26 13:21:09.847
CMCLI WARNING: CMInitContext: init ctx(0xabba1fc)
kjzcprt:rcv port created
Node id: 0
List of nodes: 0,
*** 2003-09-26 13:21:09.853
Reconfiguration starts [incarn=0]
I'm the master node
*** 2003-09-26 13:21:09.853
Reconfiguration completes [incarn=1]
CMCLI WARNING: ReadCommPort: received error=104 on recv().
kjzmpoll: slos err[12 CMGroupGetList 2 RPC failed status(-1) respMsg->status(0) 0]
[kjzmpoll1]: Error [category=12] is encountered
CMCLI ERROR: OpenCommPort: connect failed with error 111.
kjzmdreg1: slos err[12 CMGroupExit 2 RPC failed status(1) 0]
[kjzmleave1]: Error [category=12] is encountered
error 32700 detected in background process
OPIRIP: Uncaught error 447. Error stack:

ORA-00447: fatal error in background process
ORA-32700: error occurred in DIAG Group Service
ORA-27300: OS system dependent operation:CMGroupExit failed with
status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: 2 
ORA-27303: additional information: RPC failed status(1)
ORA-32700: error occurred in DIAG Group Service 
ORA-27300: OS system dependent operation:CMGroupGetList failed with
status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: 2
ORA-27303: additional information: RPC failed status(-1)
respMsg->status(0)

According to the posting
http://lists.suse.com/archive/suse-oracle/2002-Feb/0058.html, logical volumes
should not start at 0, but according to yast2 all my LVs start at 0 and end
at 124. Could this by any chance have something to do with my shared hard
drive setup? I really appreciate any of your comments and ideas.
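
Since ORA-27047 says the header block of the file cannot even be read, one
simple thing to try is reading the first blocks of every raw device directly
with dd. This is only a sketch and assumes the 25 /dev/raw/raw* bindings shown
further down; if dd also errors out or hangs, the problem is in the
LVM/raw/SCSI layer rather than in dbca itself:

$ for n in $(seq 1 25); do
>   # read the first 1 MB of each bound raw device
>   dd if=/dev/raw/raw$n of=/dev/null bs=8192 count=128 && echo "raw$n readable"
> done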

Here is my 9.2.0.1 RAC setup:
HW: two Dell PowerEdge 1600SC (two Xeon 2.40GHz CPUs each), 1Gbit Ethernet interconnect.
Shared disk: Adaptec 29160 Ultra160 SCSI adapters in both nodes, connected
externally to a SEAGATE ST373307LW. (I don't know of any tools to make sure
this setup is OK - or must I go for certified shared storage? A crude read
check is sketched right after this setup summary.)
SW: SuSE SLES8 (/etc/SuSE-release: SuSE SLES-8 (i386) VERSION = 8.1), kernel
2.4.19-64GB-SMP #1 SMP, with the k_smp-2.4.19-196.i586.rpm patch applied
(Oracle is certified on this OS).
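
Lacking a dedicated tool, a crude read-only consistency check (an
assumption-laden sketch, not a replacement for certified storage validation)
is to checksum the same blocks of one shared logical volume from both nodes
and compare the results; /dev/oracle/lvol1 is simply the first LV from the
listing below:

$ # run the same command on nd1 and on nd2; the two checksums should match
$ # if both hosts really see the same physical disk (reads go through the
$ # buffer cache, so treat this only as a rough sanity check)
$ dd if=/dev/oracle/lvol1 bs=8192 count=128 2>/dev/null | md5sum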

I have set up raw partitions with LVM and bound them to /dev/raw/raw* (the
binding commands are sketched after the listing below):
$ ls -dl /dev/oracle

drwxrwxrwx 2 root root 4096 2003-09-27 14:33 /dev/oracle
$ ls -dl /dev/oracle/lvol1

brw-rw---- 1 oracle dba 58, 0 2003-09-27 14:33 /dev/oracle/lvol1
... up to 25 vols
$ ls -ld /dev/raw

drwxrwxrwx 2 root root 4096 2003-09-19 17:39 /dev/raw
$ ls -ld /dev/raw/raw1

crw------- 1 oracle dba 162, 1 2003-09-19 17:39 /dev/raw/raw1
... up to 25 volumes bound the same way.
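
For reference, this is roughly how the bindings above were created and how
they can be verified (a sketch, run as root; the lvol1..lvol25 / raw1..raw25
naming simply mirrors the layout shown above):

$ for n in $(seq 1 25); do raw /dev/raw/raw$n /dev/oracle/lvol$n; done
$ raw -qa    # list all current bindings and check they point at the right LVs

Raw bindings do not survive a reboot by themselves; on SuSE they are normally
re-established from a boot script (the exact mechanism depends on the setup),
so it is worth re-checking raw -qa after every reboot.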

The cluster manager is the first thing that uses one of the shared disks. I
started getting errors when I ran dbca, and oracm seemed to be the buggy part.
I completely deleted the previous install, installed the 9.2.0.1 cluster
manager, applied the 9.2.0.3 cluster manager patch, installed Oracle
successfully, and started ocmstart.sh.
cm.log:
oracm, version[ 9.2.0.2.0.41 ] started {Fri Sep 26 14:51:10 2003 }
KernelModuleName is hangcheck-timer {Fri Sep 26 14:51:10 2003 }

OemNodeConfig(): Network Address of node0: 192.168.1.1 (port 9998)
 {Fri Sep 26 14:51:10 2003 }
OemNodeConfig(): Network Address of node1: 192.168.1.2 (port 9998)
 {Fri Sep 26 14:51:10 2003 }

>WARNING: OemInit2: Opened file(/dev/raw/raw1 8), tid = main:1024 file = oem.c, line = 491 {Fri Sep 26 14:51:10 2003 }
InitializeCM: ModuleName = hangcheck-timer {Fri Sep 26 14:51:10 2003 }
InitializeCM: Kernel module hangcheck-timer is already loaded {Fri Sep 26 14:51:10 2003 }
ClusterListener (pid=1553, tid=3076): Registered with watchdog daemon. {Fri Sep 26 14:51:10 2003 }
CreateLocalEndpoint(): Network Address: 192.168.1.1  {Fri Sep 26 14:51:10 2003 }
UpdateNodeState(): node(0) added udpated {Fri Sep 26 14:51:13 2003 }
HandleUpdate(): SYNC(1) from node(1) completed {Fri Sep 26 14:51:13 2003 }
HandleUpdate(): NODE(0) IS ACTIVE MEMBER OF CLUSTER, INCARNATION(2)
{Fri Sep 26 14:51:13 2003 }
HandleUpdate(): NODE(1) IS ACTIVE MEMBER OF CLUSTER, INCARNATION(1)
{Fri Sep 26 14:51:13 2003 }

NMEVENT_RECONFIG [00][00][00][00][00][00][00][03] {Fri Sep 26 14:51:13 2003 }
Successful reconfiguration, 2 active node(s) node 1 is the master, my node num is 0 (reconfig 2) {Fri Sep 26 14:51:14 2003 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:11273 file = unixinc.c, line = 754 {Fri Sep 26 14:51:32 2003 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:12297 file = unixinc.c, line = 754 {Fri Sep 26 14:51:32 2003 }

lsmod shows:
hangcheck-timer 1248 0 (unused)
That use count of 0 doesn't seem right. Also, even after the 9.2.0.3 patch,
watchdogd is still started from ocmstart.sh; I am not sure whether I have to
comment it out.
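
With the 9.2.0.2+ cluster manager on Linux, hangcheck-timer is supposed to
take over the role watchdogd used to play, and the module is normally loaded
with explicit tick/margin parameters. The values below are the ones commonly
quoted in the 9.2 RAC-on-Linux notes; treat them as an assumption and check
them against your own install documentation:

$ # if the module is already loaded with no parameters, rmmod it first
$ su -c 'modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180'
$ dmesg | grep -i hangcheck   # the module should report its tick/margin here
$ lsmod | grep hangcheck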

The lsnodes output is wrong (oracm is started on only one node); a direct
check on each node is sketched after the output below.
$ lsnodes -l

nd1
$ lsnodes    (this is wrong; oracm is not running on the other node)
nd1
nd2
$ lsnodes -n

nd1 0
nd2 1
$ lsnodes -v

CMCLI WARNING: CMInitContext: init ctx(0x804ad00)

nd1
nd2
CMCLI WARNING: CommonContextCleanup: closing comm port
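
To settle whether oracm really is up on both nodes (which is what lsnodes is
disagreeing about), a plain process and log check on each node is more direct
than trusting lsnodes. Nothing special is assumed here beyond the default
$ORACLE_HOME/oracm/log location:

$ ps -ef | grep '[o]racm'                  # run on nd1 and nd2; should list the oracm processes
$ tail -20 $ORACLE_HOME/oracm/log/cm.log   # recent cluster manager messages on that node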
