Re: Responsiveness of Server at high CPU load

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Wed, 17 Dec 2003 08:24:59 -0000
Message-ID: <brp3sr$hco$1$830fa7a5@news.demon.co.uk>

Since you have two CPUs, you can only run at 99% if you have at least two statements (or two copies of the same statement) running concurrently. Are you running with parallel execution?

I believe Oracle 9.2 has introduced some new strategies (on some platforms) for reducing the loss of response time due to latching. Instead of sleeping after a failed spin, your session yields on (probably just) the first failure - in other words it stays runnable but puts itself at the bottom of the run queue. A consequence of this is that processes which used to leave spare CPU capacity can suddenly soak the CPU, because they come flying back to the top of the run queue as soon as every other runnable process has given up, rather than waiting for the normal 'oracle sleep' time. I have a test case on a 2-CPU box where the figures for two concurrent 'bad' queries are (roughly, and doctored to make the point):

    8.1.7        2 mins 45 sec - of which 15 seconds sleep time
                    box running at peak 92% CPU
                    wait time - latch waits 15 seconds per session
                    latch sleeps        1,500 per session
                    CPU used by this session 2 mins 30 sec.

    9.2.0.4    2 mins 45 sec - of which 0.04 sec sleep
                    box running at 99% CPU
                    wait time - latch waits 0.04 seconds per session
                    latch sleeps        4 or 5 per session
                    CPU used by this session 2 mins 45 sec.
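
The difference between the two strategies can be sketched roughly as follows. This is an illustrative Python model, not Oracle's actual latch code: the function name get_latch, the parameters spin_count, yields_before_sleep and sleep_secs, and their default values are all assumptions made for the sketch, and time.sleep(0) is used only as a portable stand-in for giving up the time-slice.

```python
import threading
import time


def get_latch(latch, spin_count=2000, yields_before_sleep=0, sleep_secs=0.01):
    """Spin on `latch` until it is acquired.

    Returns (failed_spins, sleeps). All names and defaults are hypothetical;
    this models the idea described above, not Oracle's implementation.
    """
    failed_spins = 0
    sleeps = 0
    while True:
        # Spin phase: retry without giving up the CPU.
        for _ in range(spin_count):
            if latch.acquire(blocking=False):
                return failed_spins, sleeps
        failed_spins += 1
        if failed_spins <= yields_before_sleep:
            # Yield: stay runnable, go to the back of the run queue.
            time.sleep(0)
        else:
            # Sleep: leave the run queue for a fixed time.
            time.sleep(sleep_secs)
            sleeps += 1


# Uncontended latch: acquired on the first attempt.
print(get_latch(threading.Lock()))  # (0, 0)
```

With yields_before_sleep=0 every failed spin costs a timed sleep, which is meant to mimic the 8.1.7 pattern of many latch sleeps and some idle CPU. With yields_before_sleep=1 the first failure costs no sleep at all, but the process stays runnable and keeps consuming CPU, which is the pattern suggested by the 9.2.0.4 figures.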

I guess that Oracle is doing something differently with latches on your box - and there is a reason why the O/S is then not scheduling time for other processes to connect. (Remember that a simple SQL*Plus connect takes around 20 round-trips to log on, so any queuing problem becomes a major headache when you can't get a time-slice - 20 seconds would not be surprising, 1 hour is extreme).
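
To put numbers on the round-trip point, here is a small back-of-the-envelope calculation. It assumes (purely for illustration) that every round-trip waits the same length of time for a time-slice; the function names are made up for the example, and 20 is the approximate round-trip count quoted above.

```python
ROUND_TRIPS = 20  # approximate round-trips for a SQL*Plus logon, as quoted above


def logon_time(delay_per_round_trip_secs, round_trips=ROUND_TRIPS):
    """Total logon time if each round-trip queues for the same delay."""
    return round_trips * delay_per_round_trip_secs


def implied_delay(total_secs, round_trips=ROUND_TRIPS):
    """Average queuing delay per round-trip implied by a total logon time."""
    return total_secs / round_trips


print(logon_time(1.0))      # 20.0 seconds for a 1-second wait per round-trip
print(implied_delay(3600))  # 180.0 seconds per round-trip for a 1-hour logon
```

On that simple model, a one-hour connect would mean each round-trip waiting about three minutes for CPU, which is why it looks like more than ordinary run-queue contention.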

One other detail - are you running with resource manager activated? This gave me a lot of trouble in earlier versions of 9 (but not 9.2, and I thought it was an HP issue at the time).

Finally - The documentation for Oracle 9 installs (the last time I read it) said something about letting Oracle handle the scheduling by giving the oracle account the rtprio and a couple of other o/s privileges. If you have done this, AND have a bad SQL problem AND a latch-related bug then it might explain why things are grinding to a halt.

-- 
Regards

Jonathan Lewis
http://www.jlcomp.demon.co.uk

  The educated person is not the person
  who can answer the questions, but the
  person who can question the answers -- T. Schick Jr


One-day tutorials:
http://www.jlcomp.demon.co.uk/tutorial.html


Three-day seminar:
see http://www.jlcomp.demon.co.uk/seminar.html
____UK___November


The Co-operative Oracle Users' FAQ
http://www.jlcomp.demon.co.uk/faq/ind_faq.html


"Rick Denoire" <100.17706_at_germanynet.de> wrote in message
news:bptutv4qmvps97fbo234irlh7dhd8pqg4m_at_4ax.com...


> (Excerpt from a TAR - still open)
>
> From time to time, our Oracle test server (9.2.0.4 on Intel/Linux, 2
> CPUs) got unusable at CPU load of 99% as shown by top; in this state,
> nothing else could be done with Oracle, even trying to connect via
> sqlplus took about 1 hour (assuming one would wait that long).
> Processes running were Oracle processes and kswap (meaning that
> swapping was heavily taking place).
>
> Users complain in such a situation and my only remedy has been to
> reboot the server. pstack and oradebug could not be used. After
> analyzing lots of things we found out that nothing seems to be wrong
> with the database - it is just that a very inefficient query is
> running which blocks the Oracle server and prevents any other activity.
> Well, one message was found in the alert log, saying
> ksbsrv: No startup acknowledgement from forked process after 30
> seconds
> but no ORA- error appears.
> Statspack Reports revealed an unusually high "process startup" wait
> time.
>
> According to my experience under the Sun/Solaris platform, even if the
> 4 CPUs of our E3500 are at maximum load (showing an average idle of
> 0%), the Oracle (8.1.7) server is still available for new sessions
> (which run of course slower than usual). This happens quite often by
> the way, so it is a reliable experience.
>
> Assuming that the situation is caused by a bad query, I am concerned
> about the limited responsiveness of the server, since most of our
> queries are of batch type and run hours in the production platform,
> which is Sun/Solaris 7. If we transfer the production DB to the new,
> much faster Intel/Linux platform, we could have heavy trouble when
> such batch jobs run. They would be served on a first-in first-out basis,
> serialized one after another (limited by the number of CPUs available).
>
> Is there a way to adjust priorities or something to guarantee an even
> distribution of computing power of the Oracle server? Is this more an
> operating system problem than it is an Oracle one? (Note: at the OS
> level, reactivity is much better). We use RedHat Linux AS 2.1 with
> asynch_io=true. This is supposed to be a certified environment (Dell
> Power Edge 2650) for enterprise use of Oracle.
>
> Oracle Corp. is quite clueless until now, so my question to the forum.
>
> Thanks in advance
>
> Rick Denoire
>

Received on Wed Dec 17 2003 - 02:24:59 CST