
Re: Need Expert Help - dbca fails (ORA-27047) on raw vol. 9201 SuSE SLES8

From: gdb_dsr <ge_2000_2001_at_yahoo.com>
Date: 1 Oct 2003 12:56:39 -0700
Message-ID: <6299d152.0310011156.7c8c39dc@posting.google.com>


ge_2000_2001_at_yahoo.com (gdb_dsr) wrote in message news:<6299d152.0309271946.66e52644_at_posting.google.com>...
> Hi,
> I have done this many times on AIX, Solaris, Reliant Unix, HP-UX and
> Linux, almost always with third-party clusterware, but this time I
> have to work with Oracle's own cluster manager, which does not look
> stable, and I am trying to find the root cause.
>
> The problem is that dbca fails at 37%.
>
> create/cloneDBCreation.log:
> ORACLE instance started.
> Total System Global Area 252776588 bytes
> Fixed Size 450700 bytes
> Variable Size 218103808 bytes
> Database Buffers 33554432 bytes
> Redo Buffers 667648 bytes
> Create controlfile reuse set database rac
> *
> ERROR at line 1:
> ORA-01503: CREATE CONTROLFILE failed
> ORA-01565: error in identifying file
> '/opt/oracle/oradata/rac/cwmlite01.dbf'
> ORA-27047: unable to read the header block of file
> Linux Error: 4: Interrupted system call
>
>
> alert_rac1.log:
> '/opt/oracle/oradata/rac/undotbs01.dbf' ,
> '/opt/oracle/oradata/rac/users01.dbf' ,
> '/opt/oracle/oradata/rac/xdb01.dbf'
> LOGFILE GROUP 1 ('/opt/oracle/oradata/rac/redo01.log') SIZE 102400K
> REUSE,
> GROUP 2 ('/opt/oracle/oradata/rac/redo02.log') SIZE 102400K REUSE
> RESETLOGS
> Fri Sep 26 13:21:13 2003
> lmon registered with NM - instance id 1 (internal mem no 0)
> Fri Sep 26 13:21:13 2003
> Reconfiguration started
> List of nodes: 0,
> Global Resource Directory frozen
> one node partition
> Communication channels reestablished
> Master broadcasted resource hash value bitmaps
>
> The list of nodes is always 0, irrespective of the number of nodes
> that are up.
>
>
> rac1_diag_18951.trc:
> *** SESSION ID:(2.1) 2003-09-26 13:21:09.847
> CMCLI WARNING: CMInitContext: init ctx(0xabba1fc)
> kjzcprt:rcv port created
> Node id: 0
> List of nodes: 0,
> *** 2003-09-26 13:21:09.853
> Reconfiguration starts [incarn=0]
> I'm the master node
> *** 2003-09-26 13:21:09.853
> Reconfiguration completes [incarn=1]
> CMCLI WARNING: ReadCommPort: received error=104 on recv().
> kjzmpoll: slos err[12 CMGroupGetList 2 RPC failed status(-1)
> respMsg->status(0) 0]
> [kjzmpoll1]: Error [category=12] is encountered
> CMCLI ERROR: OpenCommPort: connect failed with error 111.
> kjzmdreg1: slos err[12 CMGroupExit 2 RPC failed status(1) 0]
> [kjzmleave1]: Error [category=12] is encountered
> error 32700 detected in background process
> OPIRIP: Uncaught error 447. Error stack:
> ORA-00447: fatal error in background process
> ORA-32700: error occurred in DIAG Group Service
> ORA-27300: OS system dependent operation:CMGroupExit failed with
> status: 0
> ORA-27301: OS failure message: Error 0
> ORA-27302: failure occurred at: 2
> ORA-27303: additional information: RPC failed status(1)
> ORA-32700: error occurred in DIAG Group Service
> ORA-27300: OS system dependent operation:CMGroupGetList failed with
> status: 0
> ORA-27301: OS failure message: Error 0
> ORA-27302: failure occurred at: 2
> ORA-27303: additional information: RPC failed status(-1)
> respMsg->status(0)
>
> According to the posting
> http://lists.suse.com/archive/suse-oracle/2002-Feb/0058.html
> logical volumes should not start at 0, but all my LVs start at 0 and
> end at 124 according to yast2.
> Could it by any chance be something to do with my shared hard drive
> setup? I really appreciate any of your comments and ideas.
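
One way to narrow down the ORA-27047 before worrying about the LVM
extent layout is to check that the oracle user can read the first block
of each raw device directly. A minimal sketch, assuming the datafiles
sit on /dev/raw/raw2 and upwards (the exact raw device numbers are only
an example and depend on the bindings shown further down):

$ dd if=/dev/raw/raw2 of=/dev/null bs=8192 count=1

If dd itself fails or hangs with an interrupted system call, the
problem is in the raw binding or the SCSI path rather than in Oracle's
control file creation.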
>
>
> Here is my 9201 RAC setup:
> HW: two Dell PowerEdge 1600SC (two Xeon CPUs, 2.40GHz), 1Gbit
> Ethernet interconnect.
> Shared disk: Adaptec 29160 Ultra160 SCSI adapters in both nodes,
> connected externally to a SEAGATE ST373307LW (I don't know of any
> tools to make sure this setup is OK - or must I go for certified
> shared storage?)
> SW: SuSE SLES8 (/etc/SuSE-release: SuSE SLES-8 (i386) VERSION = 8.1),
> kernel 2.4.19-64GB-SMP #1 SMP, with the k_smp-2.4.19-196.i586.rpm
> patch applied (Oracle certified on this OS).
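
On the question of how to verify the shared disk cabling, a simple
sanity check (not a substitute for certified storage) is to confirm
that both nodes see the same external drive and can read from it. A
sketch, assuming the Seagate appears as /dev/sdb on each node (the
device name is an assumption):

$ cat /proc/scsi/scsi
(should list the SEAGATE ST373307LW on both nodes)
$ dd if=/dev/sdb of=/dev/null bs=512 count=1
(a read-only test of the shared drive, run from each node)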
>
> I have set up raw partitions with LVM and bound them to /dev/raw/raw*:
> $ ls -dl /dev/oracle
> drwxrwxrwx 2 root root 4096 2003-09-27 14:33 /dev/oracle
> $ ls -dl /dev/oracle/lvol1
> brw-rw---- 1 oracle dba 58, 0 2003-09-27 14:33 /dev/oracle/lvol1
> ... up to 25 volumes
> $ ls -ld /dev/raw
> drwxrwxrwx 2 root root 4096 2003-09-19 17:39 /dev/raw
> $ ls -ld /dev/raw/raw1
> crw------- 1 oracle dba 162, 1 2003-09-19 17:39 /dev/raw/raw1
> ... up to 25 volumes bound the same way
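
For reference, bindings like the above are normally created with the
raw(8) utility and, on SuSE, are usually listed in /etc/raw so the boot
scripts recreate them after a reboot (that file name is the usual SuSE
convention, not taken from the post). A minimal sketch of one binding,
assuming lvol1 backs raw1 as the listings suggest:

# raw /dev/raw/raw1 /dev/oracle/lvol1
# raw -qa
# chown oracle:dba /dev/raw/raw1

raw -qa queries all current bindings, and the raw device nodes have to
be readable and writable by the oracle user, as in the ls output above.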
>
> The cluster manager is the first thing that uses one of the shared
> disks. I started getting errors when I started dbca; oracm seems to
> be buggy. I deleted the previous install completely, installed the
> 9201 cluster manager, applied the 9203 cluster manager patch, and
> installed Oracle successfully.
> I started ocmstart.sh.
> cm.log:
> oracm, version[ 9.2.0.2.0.41 ] started {Fri Sep 26 14:51:10 2003 }
> KernelModuleName is hangcheck-timer {Fri Sep 26 14:51:10 2003 }
> OemNodeConfig(): Network Address of node0: 192.168.1.1 (port 9998)
> {Fri Sep 26 14:51:10 2003 }
> OemNodeConfig(): Network Address of node1: 192.168.1.2 (port 9998)
> {Fri Sep 26 14:51:10 2003 }
> >WARNING: OemInit2: Opened file(/dev/raw/raw1 8), tid = main:1024
> file = oem.c, line = 491 {Fri Sep 26 14:51:10 2003 }
> InitializeCM: ModuleName = hangcheck-timer {Fri Sep 26 14:51:10 2003
> }
> InitializeCM: Kernel module hangcheck-timer is already loaded {Fri Sep
> 26 14:51:10 2003 }
> ClusterListener (pid=1553, tid=3076): Registered with watchdog daemon.
> {Fri Sep 26 14:51:10 2003 }
> CreateLocalEndpoint(): Network Address: 192.168.1.1
> {Fri Sep 26 14:51:10 2003 }
> UpdateNodeState(): node(0) added udpated {Fri Sep 26 14:51:13 2003 }
> HandleUpdate(): SYNC(1) from node(1) completed {Fri Sep 26 14:51:13
> 2003 }
> HandleUpdate(): NODE(0) IS ACTIVE MEMBER OF CLUSTER, INCARNATION(2)
> {Fri Sep 26 14:51:13 2003 }
> HandleUpdate(): NODE(1) IS ACTIVE MEMBER OF CLUSTER, INCARNATION(1)
> {Fri Sep 26 14:51:13 2003 }
> NMEVENT_RECONFIG [00][00][00][00][00][00][00][03] {Fri Sep 26 14:51:13
> 2003 }
> Successful reconfiguration, 2 active node(s) node 1 is the master, my
> node num is 0 (reconfig 2) {Fri Sep 26 14:51:14 2003 }
> >WARNING: ReadCommPort: socket closed by peer on recv()., tid =
> ClientProcListener:11273 file = unixinc.c, line = 754 {Fri Sep 26
> 14:51:32 2003 }
> >WARNING: ReadCommPort: socket closed by peer on recv()., tid =
> ClientProcListener:12297 file = unixinc.c, line = 754 {Fri Sep 26
> 14:51:32 2003 }
>
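
For reference, the node addresses, port 9998, the quorum file
/dev/raw/raw1 and the hangcheck-timer module shown in cm.log all come
from $ORACLE_HOME/oracm/admin/cmcfg.ora. A sketch of what that file
typically looks like with the 9.2.0.2 cluster manager; only the values
visible in the log above are taken from the post, the rest (MissCount,
ClusterName, the use of IP addresses as private node names) are
assumptions:

ClusterName=Oracle Cluster Manager, version 9i
PrivateNodeNames=192.168.1.1 192.168.1.2
PublicNodeNames=nd1 nd2
ServicePort=9998
CmDiskFile=/dev/raw/raw1
MissCount=210
KernelModuleName=hangcheck-timer
HostName=192.168.1.1

HostName is per node: 192.168.1.1 on node 0 and 192.168.1.2 on node 1.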
> lsmod shows:
> hangcheck-timer 1248 0 (unused)
> That 0 doesn't seem right.
> Even after the 9203 patch, watchdogd is still part of ocmstart.sh; I
> am not sure whether I have to comment it out.
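
On the lsmod output: the 0 use count and "(unused)" flag are normal for
hangcheck-timer, since no other module or process holds a reference to
it once it is loaded. What matters is that it was loaded with the
timing parameters Oracle documents for 9.2.0.2; a sketch, assuming the
commonly documented values:

# modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180

or permanently, in /etc/modules.conf:

options hangcheck-timer hangcheck_tick=30 hangcheck_margin=180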
>
> The lsnodes output is wrong (oracm is started on only one node):
> $ lsnodes -l
> nd1
> $ lsnodes (this is wrong, oracm is not running on the other node)
> nd1
> nd2
> $ lsnodes -n
> nd1 0
> nd2 1
> $ lsnodes -v
> CMCLI WARNING: CMInitContext: init ctx(0x804ad00)
>
> nd1
> nd2
> CMCLI WARNING: CommonContextCleanup: closing comm port
> $
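
One way to cross-check the lsnodes output is to verify on each node
whether oracm is actually running and what its log says; a sketch,
assuming the default oracm log location:

$ ps -ef | grep [o]racm
$ tail $ORACLE_HOME/oracm/log/cm.log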

Applied the 9203 cluster patch and created the database manually.

Received on Wed Oct 01 2003 - 14:56:39 CDT
