Re: CPU waiting for... what? (mystery)

From: Darren Dunham <ddunham_at_redwood.taos.com>
Date: Thu, 10 Apr 2003 23:13:19 GMT
Message-ID: <jUmla.1393$Ma6.120681674@newssvr14.news.prodigy.com>

In comp.unix.solaris Rick Denoire <100.17706_at_germanynet.de> wrote:

> I got desperate. I just don't understand what happens when the CPUs
> (4x) work at about 5%, and the disks are almost idle!

That says to me that the box is idle...

> Shouldn't any process run at maximum *possible* speed, if it is not
> being artificially slowed down?

In general, yes. You need to identify what you mean by a "process" though. An individual database process may not require (much) CPU. It may be waiting on other processes or connections.

> Well, the CPU state was seldom less than 20% wait, at times even 70%
> wait, and that is usually the case when heavy I/O operations take
> place. But on the Raid side, the storage processors were saying:
> Almost nothing to do here.

CPU wait states say almost nothing about I/O. All they really say is that the CPU is somewhat idle. I/O wait + idle equals true CPU idle, so a box with 70% I/O wait has lots of CPU left for processing.

All I/O wait means is that at least one I/O is outstanding, and a CPU is idle. Since a database tends to be more I/O intensive than CPU intensive, this is normal.
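
To make the arithmetic concrete, here is a minimal sketch in Python. The function name and the sample percentages are made up for illustration and are not figures from the system under discussion; the four inputs are meant to correspond to the %usr, %sys, %wio and %idle columns that "sar -u" prints.

```python
def true_idle_pct(usr, sys_, wio, idle):
    """Return the share of CPU time with nothing runnable, in percent.

    I/O wait is counted as idle time: it only records that a CPU had
    nothing to run while at least one I/O was outstanding.
    """
    total = usr + sys_ + wio + idle
    if total <= 0:
        raise ValueError("percentages must sum to a positive value")
    return 100.0 * (wio + idle) / total


if __name__ == "__main__":
    # Hypothetical sample: 5% busy, 70% I/O wait, 25% idle.
    print(true_idle_pct(usr=3, sys_=2, wio=70, idle=25))
```

With those sample numbers, 95% of the CPU time is available even though the wait figure alone looks alarming.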

> In this case, sort operations were taking place for quite a long time.
> I found out that the outstanding wait event for this session was
> "direct path write". File systems containing the DB files are mounted
> with the option "forcedirectio" to avoid OS buffering.

> When reading sequentially, a transfer rate of up to 45 MB/s has been
> observed here on other occasions, but in this case, I got confused
> because the Performance Manager was showing 100% full table scans.
> Buffer Cache Hit Rate was less than 1% over almost the whole time
> period (it is usually at about 99% otherwise), which is typical for
> random I/O.

> The problem is that I can't identify the I/O bottleneck. This storage
> device is quite a modern one, connected redundantly with two 2 GB
> Fibre Channel cables, has a battery powered write cache (almost 400
> MB, and about 100 MB read cache), uses 33 GB HDs with 15000 RPM, which
> is the best available for DB work. The service time shown by iostat
> is a few milliseconds at most. But sar is constantly showing 1 or 2
> processes in the wait queue (which?).

"wait queue"? What figures are you using that show you that? Are you talking about the "b" column from vmstat?

How do you know you have an I/O bottleneck at all? You mention that you've seen 45MB/s transfer rates. That sounds good. As long as iostat shows small service times, I think you've shown that the I/O is okay.
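
One way to keep an eye on the service times is to scan saved iostat output for devices that exceed a limit. The following is only a sketch: it assumes the extended "iostat -xn" column layout (r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device), the function name is hypothetical, the 20 ms threshold is an arbitrary assumption rather than a recommendation, and the sample lines are invented.

```python
def slow_devices(iostat_text, max_asvc_ms=20.0):
    """Return (device, asvc_t) pairs whose active service time exceeds the limit.

    Expects data lines in the assumed `iostat -xn` layout:
    r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
    """
    slow = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if len(fields) != 11:
            continue  # banner or blank line
        try:
            asvc_t = float(fields[7])
        except ValueError:
            continue  # column-header line
        if asvc_t > max_asvc_ms:
            slow.append((fields[10], asvc_t))
    return slow


# Invented sample output, for illustration only.
SAMPLE = """\
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   12.0   30.5   96.0  244.0  0.0  0.1    0.0    3.2   0   9 c1t0d0
    1.0  110.0    8.0  880.0  0.0  4.8    0.0   43.5   0  97 c2t1d0
"""

if __name__ == "__main__":
    for device, asvc_t in slow_devices(SAMPLE):
        print(device, asvc_t)
```

If nothing is ever flagged while the sort is running, that supports the view that the disks are not the bottleneck.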

> In short: How can I find out which process is responsible for the CPU
> wait states and why? I tried to use the utility "etruss" from System
> Internals. The result was that the process to be traced stopped and
> had to be killed; does not seem to work right with Solaris 2.7.

Difficult. I think you'd have to write a utility that looked at the status of all the processes on the machine and recorded if any of them were in the "blocked" state. I have some information from someone else about the general aspects, but I've never attempted to code it up.
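
A rough sketch of such a utility, in Python, could work from the text that "ps -eo pid,s,comm" prints. This is not the approach I was given information about, just an illustration under stated assumptions: the function name is hypothetical, the sample listing is invented, and which state letters count as "blocked" is platform-specific. Linux uses D for uninterruptible sleep; Solaris ps, as far as I know, shows only the coarser O/S/R/Z/T states, so on Solaris this could at best narrow down the candidates.

```python
from collections import Counter


def tally_states(ps_text, blocked_states=("D",)):
    """Count processes per state letter and list those treated as blocked.

    ps_text: output of a command such as `ps -eo pid,s,comm`
    blocked_states: state letters to treat as blocked (an assumption
                    that must be adapted to the platform)
    """
    counts = Counter()
    blocked = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # blank or truncated line
        pid, state, comm = parts
        counts[state] += 1
        if state in blocked_states:
            blocked.append((int(pid), comm.strip()))
    return counts, blocked


# Invented sample listing, for illustration only.
SAMPLE_PS = """\
  PID S COMMAND
    1 S /sbin/init
  412 D oracle
  500 O ps
"""

if __name__ == "__main__":
    counts, blocked = tally_states(SAMPLE_PS)
    print(dict(counts))
    print(blocked)
```

Run every second or so and logged with a timestamp, the list of blocked processes could then be lined up against the periods of high wait time.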

-- 
Darren Dunham                                           ddunham_at_taos.com
Unix System Administrator                    Taos - The SysAdmin Company
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >

Received on Thu Apr 10 2003 - 18:13:19 CDT