Re: CPU waiting for... what? (mystery)

From: David Sharples <david.sharples3_at_ntlworld.com>
Date: Fri, 11 Apr 2003 00:05:36 +0100
Message-ID: <5Nmla.595$vG4.269@newsfep4-glfd.server.ntli.net>

Give Statspack a whirl.
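
A minimal sketch of what that could look like in SQL*Plus, assuming Statspack is already installed (on 8.1.7 typically via ?/rdbms/admin/spcreate.sql, which creates the PERFSTAT user) and that you are connected as PERFSTAT. The job in the middle is whatever you want to measure, for example the index rebuild from the quoted post:

```sql
-- Take a snapshot before the slow job starts.
execute statspack.snap;

-- ... run the job to be measured here (e.g. the index rebuild) ...

-- Take a second snapshot once the job has finished.
execute statspack.snap;

-- Generate the report; the script prompts for the begin and end snapshot IDs.
@?/rdbms/admin/spreport.sql
```

The wait event sections of the resulting report should show where the time between the two snapshots went.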

"Rick Denoire" <100.17706_at_germanynet.de> wrote in message news:ogpb9vkamsftl1apont1bq7j0ceb9l4s09_at_4ax.com...
> Today, I recreated an index of our Oracle DB (8.1.7):
> alter index <index_name> rebuild tablespace X storage (initial 32M
> next 32M) nologging;
>
> I have been struggling with slow performance on this server (Sun
> E3500/Solaris 2.7) for a long time now, so I thoroughly looked at
> this process, using a lot of tools: sar, iostat, Adrian Cockcroft's
> zoom tool, proctool, Oracle Performance Manager (Diagnostics Pack of
> the Oracle Enterprise Manager), and the built-in software
> ("Navisphere") of the Clariion Disk Array (EMC CX-400). Believe me, I
> spent hours investigating a lot of different metrics.
>
> I got desperate. I just don't understand what happens when the CPUs
> (4x) work at about 5%, and the disks are almost idle! Shouldn't any
> process run at maximum *possible* speed, if it is not being
> artificially slowed down? Well, the CPU state was seldom less than 20%
> wait, at times even 70% wait, and that is usually the case when heavy
> I/O operations take place. But on the Raid side, the storage
> processors were saying: Almost nothing to do here.
>
> In this case, sort operations were taking place for quite a long time.
> I found out that the outstanding wait event for this session was
> "direct path write". File systems containing the DB files are mounted
> with the option "forcedirectio" to avoid OS buffering.
>
> When reading sequentially, a transfer rate of up to 45 MB/s has been
> observed here on other occasions, but in this case, I got confused
> because the Performance Manager was showing 100% full table scans.
> Buffer Cache Hit Rate was less than 1% over almost the whole time
> period (it is usually at about 99% otherwise), which is typical for
> random I/O.
>
> The problem is that I can't identify the I/O bottleneck. This storage
> device is quite a modern one, connected redundantly with two 2 Gbit
> Fibre Channel cables, has a battery-powered write cache (almost 400
> MB, and about 100 MB read cache), uses 33 GB HDs with 15000 RPM, which
> is the best available for DB work. The service time shown by iostat
> is a few milliseconds at most. But sar is constantly showing 1 or 2
> processes in the wait queue (which?).
>
> In short: How can I find out which process is responsible for the CPU
> wait states and why? I tried to use the utility "etruss" from System
> Internals. The result was that the process to be traced stopped and
> had to be killed; it does not seem to work right with Solaris 2.7.
>
> Perhaps I can at last remove the brakes in this machine. I've always
> suspected that something is plain wrong with this host, I just don't
> know how to identify the cause and correct it. A kernel parameter or
> something.
>
> By the way, the system was not swapping, the slowdown was not due to
> network activity (the job was started on the host), and there was no
> other active user connected at the time this job was running.
>
> As I said, mystery.
>
> Bye
> Rick Denoire
Received on Thu Apr 10 2003 - 18:05:36 CDT