99% IOWAIT with Oracle RAC 10g (10.1.0.4) on Linux

From: <mccmx_at_hotmail.com>
Date: 10 Jun 2005 07:25:49 -0700
Message-ID: <1118413549.907628.251850@o13g2000cwo.googlegroups.com>

Oracle 10.1.0.4 EE running on 2 node RHEL 3 cluster (Oracle Firewire Kernel)
Shared Storage : Maxtor One Touch II

Periodically, I/O to the shared device appears to hang (i.e. 99% I/O wait in 'top') for exactly 1 minute whenever both instances are up.

At first I suspected that this was just a 'top' reporting anomaly, so I traced a SQL statement which runs for approx 30 seconds with only one instance started.

I then traced the session with both instances running and the execution time jumped to 90 seconds, which corresponds to the normal 30 secs plus this strange 60 second timeout. When I tkprof'd the trace file, I could see that, of the 90 seconds response time, a single 'db file scattered read' took 59.8 seconds. This is highly unusual for one multiblock read:

Elapsed times include waiting on following events:

  Event waited on                  Times   Max. Wait  Total Waited
-------------------------------   Waited  ----------  ------------
  SQL*Net message to client            2       0.00         0.00
  db file scattered read            6954       59.8        82.42
  SQL*Net message from client          2     276.12       276.12
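
The tkprof summary only shows the maximum wait, not where in the trace it occurred. To find the individual slow read, the raw trace file could be scanned with something like the sketch below. It assumes the raw trace contains wait lines of the form `WAIT #n: nam='...' ela= <microseconds> ...`; the function name `long_waits`, the one-second threshold, and the p1/p2/p3 values in the sample lines are illustrative and not taken from the actual trace.

```python
import re

# Assumed shape of a wait line in the raw trace file:
#   WAIT #1: nam='db file scattered read' ela= 59800000 p1=5 p2=116 p3=16
WAIT_RE = re.compile(
    r"^WAIT #\d+: nam='(?P<event>[^']+)' ela=\s*(?P<ela>\d+)(?P<rest>.*)$"
)

def long_waits(lines, threshold_s=1.0):
    """Return (line_number, event, seconds, remainder) for waits over threshold_s."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        m = WAIT_RE.match(line.strip())
        if m is None:
            continue  # not a wait line
        seconds = int(m.group("ela")) / 1_000_000  # ela assumed to be microseconds
        if seconds > threshold_s:
            hits.append((lineno, m.group("event"), seconds, m.group("rest").strip()))
    return hits

# Made-up sample lines; only the 59.8 s duration mirrors the tkprof figure above.
sample = [
    "WAIT #1: nam='db file scattered read' ela= 1200 p1=5 p2=100 p3=16",
    "WAIT #1: nam='db file scattered read' ela= 59800000 p1=5 p2=116 p3=16",
]
for lineno, event, seconds, rest in long_waits(sample):
    print(f"line {lineno}: {event} waited {seconds:.2f}s ({rest})")
```

The remainder of a matching line (p1, p2, p3) should identify the file and block range being read when the hang occurred, which can then be compared across repeated runs.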

This issue is easily repeatable.

What makes me think that this is an I/O problem to the shared disk is that we had to increase the CSS misscount to 120 seconds because of repeated "Voting Disk timeout" errors which used to crash CRS on one of the nodes.

Does anyone have any idea how to diagnose the source of this I/O hang?

When I run iostat during this period of 99% IOWAIT, there is no activity to the shared disk at all. 0 bytes read, 0 bytes written.
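
To pin down exactly when each stall starts and ends, the iowait share could be logged with timestamps and compared against the iostat output. The sketch below computes the iowait percentage between two samples of the aggregate `cpu` line from /proc/stat. It assumes the field order documented in proc(5) (user, nice, system, idle, iowait, ...); whether this kernel exposes iowait in that position is an assumption, and the function name and sample values are illustrative.

```python
def iowait_percent(before, after):
    """Percent of CPU time spent in iowait between two aggregate 'cpu' lines.

    Assumes the proc(5) layout: cpu user nice system idle iowait irq softirq ...
    """
    a = [int(x) for x in before.split()[1:8]]
    b = [int(x) for x in after.split()[1:8]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if len(deltas) < 5 or total <= 0:
        return 0.0  # no iowait field, or no time elapsed between samples
    return 100.0 * deltas[4] / total

def read_cpu_line(path="/proc/stat"):
    """Return the first (aggregate) cpu line from /proc/stat."""
    with open(path) as f:
        return f.readline()

# Made-up samples showing a period dominated by iowait.
before = "cpu 100 0 50 1000 10 0 0"
after = "cpu 101 0 51 1001 1000 0 0"
print(f"iowait: {iowait_percent(before, after):.1f}%")
```

Calling `read_cpu_line()` once a second and printing the result with a timestamp would give a log that can be lined up against the traced session and any CSS messages.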

Matt

Received on Fri Jun 10 2005 - 09:25:49 CDT