Oracle-L: Re: HIGH CPU WITH MULTIPLE CONCURRENT USERS (long)

From: Paul Vallee <dbalist_at_pythian.com>
Date: Thu, 18 Apr 2002 06:43:27 -0800
Message-ID: <F001.00447DA1.20020418064327@fatcity.com>

Hi Richard,

Consider hiring Steve Adams (Ixora) to work for you. His rates are reasonable and he has worked problems exactly like this dozens of times. He is the most competent troubleshooter of these types of problems I have ever worked with and I highly recommend you consider retaining his services.

That being said... :-)

Your shared pool size of 1G is ridiculous. Try 150M, unset spin_count, throw the rest in db_block_buffers, and please show us the statspack output after that.

Also, go to www.ixora.com.au, and download a few scripts: latch_sleeps.sql, latch_where.sql. The output might be interesting.

Finally, run a select * from v$sqlarea where version_count > 100. Any sql statement in that list is probably part of your problem, and finding out why it's getting invalidated will be part of your solution.

HTH,
Paul

---
www.pythian.com -- vallee_at_pythian.com -- 877-PYTHIAN
Smarter than adding another team member, Pythian has new services for
supplementing DBAs: get our help with monitoring, 24x7 on-call, daily
verifications, storage management, performance and more.

----- Original Message -----
From: Paul Troiano
To: Multiple recipients of list ORACLE-L
Sent: Wednesday, April 17, 2002 11:48 PM
Subject: Re: HIGH CPU WITH MULTIPLE CONCURRENT USERS (long)

Never mind. I just saw that oracle was able to reproduce it internally.
----- Original Message -----
From: Richard Eastham
To: Multiple recipients of list ORACLE-L
Sent: Tuesday, April 16, 2002 11:58 PM
Subject: HIGH CPU WITH MULTIPLE CONCURRENT USERS (long)

A co-worker is having a fairly serious issue with performance tuning of a system. The system is in the stress testing phase prior to rolling out into production. I have not included all the information as so far they have exceeded three TARs and are working on the fourth one right now. Oracle has become fairly heavily involved and is sending in the Advanced services team is now involved. He has identified that the main issue is a wait after the parsing of the SQL and during the fetch portion of the execution. The short version is running the same SQL statement ( basically nothing more than a simple query against a single table) the machine starts bogging down with a simulated 20+ users sessions and the system starts to choke at 100+ user sessions. We are talking a fairly decent midrange system. The query is a select with 5 columns extracted and a where clause that uses the in clause to select the same rows for each query. The question is has anyone seen this type of behavior before? If you have seen this before what was the root cause? Did you find a solution?

Oracle acknowledges that the scenario is reproducible within their test environment, but the core team is stating that it is working as designed. Oracle is working with us, but why not check with other sources.

----------------------------------------------------------------------------

A summary of where we are at: (4th TAR)

1.. We tried to simulate the same performance degradation on an entirely different environment. We have been able to do the same.
2.. We had requested Oracle to simulate the test case in their environment. They have been able simulate the performance degradation. Their analysis is also provided in this attachment.
3.. To summarize, they have simulated where 1 user query runs in 2 seconds and 10-user query takes 7 seconds on a 4-processor server.
4.. The development team of Oracle has answered to this degradation as normal and as designed. However, the degradation is very high and is in contrast with their alleged benchmark results (67000 transactions per minute on a 8 processor hardware). For us the degradation is so high that we are not able to run 150 transactions per minute on a 4-processor server. The simulation within oracle also supports this degradation

----------------------------------------------------------------------------

15-APR-02 22:09:08 GMT

Pasting information into the tar on bug:2321553 <Abstract: HIGH CPU WITH MULTIPLE CONCURRENT USERS> since currently unavailable on MetaLink:

"PROBLEM:
--------
Customer has a production database that was installed on a Sun Solaris 2.8.
The Solaris was a fresh install. The database was a fresh install. Customer
is having the following problems:

1. Performance problems with multiple users - more users more performance
problems
2. The query runs fine, explain plan runs fine, query just takes moer time
with more users - same query
3. Customer tested multi-user connection from the box via sqlplus ( no
network ) - same issue
4. Customer removed the application from the env and ran multi-user test -
same problem.
5. Customer loaded data in another 8.1.7 database on Win 2000 - same
performance problem with more users.

DIAGNOSTIC ANALYSIS:
--------------------
Analysis by NSAWYER.US
-----------------------
I have looked over the RDA and have found nothing that would cause the CPU to
run high. - the more I review the issue the more it seems to be less a tuning
issue and more a memory leak problem. But cannot find any report of it in the
trace.

Looking for any bugs that could be related to this type of behavior. - I have
found none.

Analysis
------------------------
Reviewed the statpack and noticed that a query parses and retrieves for a split second, but the data is not return for 10 to 20 seconds. What is going during that extra time. This might be the place to concentrate on.

WORKAROUND:
-----------
None

INFORMATION:
SERVER: E4500, 4 CPU 400MHZ, 4GB RAM (Oracle SGA 1.6GB).

PROBLEM STATEMENT: Per customer,
There is a definite overload on the CPU when we run 10, 20, 100 user loads. In

fact during this test of 20 simultaneous users executing this simple one table

query, the processes waiting went way high. I can see the CPU trashing and 0%
idle, 99% user, 1% kernel almost all the way.

IMPORTANT: Customer observation is that when 100 connection started exactly at the same time, response varies, but recursive connections show same resp.

----------------------------------------------------------------------------

Environment

The server is E4500, 4CPU of 450MHZ, 4GB memory with Hitachi SAN Storage array 9500 and 2GB cache.
Sun Solaris 2.8 32 bit although this has also been tried on Intel Windows 2000
Oracle 8.1.7.3.0 although this has been tried on several other versions ranging from Oracle 8.1.7 - 9.0.1

11-APR-02 12:27:03 GMT

Can you easily recover from, bypass or work around the problem? = NO
Does your system or application continue normally after the problem occurs? =
NO
Are the standard features of the system or application still available; is the
loss of service minor? = NO

### Platform and O/S version, including patchset or service pack level? ###
Sun Solaris 2.8 32 bit

### What version and patchset level of the database are you running? ###
Oracle 8.1.7.3.0

### Please describe your problem: ###
We have a database that has been developed and configured. We are in the stage
of performing stress and volume testing. The database is an OLTP database with
at least 30 transactions (inserts, updates and deletes) coming from the various
application servers. These are very discreet transactions.
We also have a web server that accesses this database for reporting. The
reporting are huge select statements that run against these live tables.
When we run around 5 users, everything is fine and no problems. As we increase
the users to 50 and 100 and 300, we are seeing an appreciable degradation in
performance. The timings are like for 5 users the web select queries take under
5 seconds, while at 100 users it takes 80 seconds and at 300 users it takes 240
seconds.
I have stressed the database both in dedicated and MTS mode and the effect is
the same.
We have to complete stress testing at 1000 users and at this rate we will not
be able to make it. I am reading all the documentation in Metalink and I see
similar tars, but no answers.
I would appreciate it if you can provide us with some guidance as this will make
or break the project. I am attaching files of sar, vmstat, top, init and
statspack report for idle and load activity.
As you will be able to see, the CPU utlization is tremendous and even
unrealistic.

### Did the error generate a trace file? ###
Does not apply

### Please list all files that you plan to upload: ###
initedbcip1.ora - Init file
idle_top.txt - top at idle time.
idle_vmstat.txt - vmstat at idle time.
idle_saru.txt - sar - at idle time.
idle_sarb.txt - sar -b at idle time.
idle_sarw.txt - sar -w at idle time.
stats_edbcip1.2002-04-14.0500 - statspack report during load.
load_vmstat.txt - vmstat at load
load_top.txt - top on load
load_saru.txt - sar on load.
load_sarb.txt - sar on load.

### What was being done at time of error? Any changes since this last worked? ##
We were simulating and increasing the number of users connecting and accessing the database. This was part of the stress testing. The server is E4500, 4CPU of 450MHZ, 4GB memory with Hitachi SAN Storage array 9500 and 2GB cache.

### What is the frequency of the error? ###
Consistently

### What is the impact to your business because of this problem? ###
This will make or break the project. I am sure there are some configurations
that I have set, that might need to be tweaked. We can also buy additional CPU
and memory, only after we tweak the configuration. This seems to be a common
issue and no clear cut solution. Please help us debug as I am running out of
ideas.

11-APR-02 16:44:10 GMT

Update:
-----------

Init.ora file shows:

db_files = 1024
open_cursors = 100
max_enabled_roles = 30
db_block_buffers = 76800
shared_pool_size = 1073741824
shared_pool_reserved_size = 107374182
cursor_sharing = EXACT
session_cached_cursors = 0
_sqlexec_progression_cost = 0
large_pool_size = 209715200
java_pool_size = 20971520
buffer_pool_keep = (buffers:9600, lru_latches:32)
log_checkpoint_interval = 10000
log_checkpoint_timeout = 7200
processes = 1000
log_buffer = 2097152
audit_trail = db
audit_file_dest = /edbcip1/db14/security
timed_statistics = true # if you want timed statistics
max_dump_file_size = 100000 # limit trace file size to 5M each
oracle_trace_enable = true
oracle_trace_collection_path=/db/app/oracle/product/8.1.7/otrace/admin/cdf
oracle_trace_facility_path=/db/app/oracle/product/8.1.7/otrace/admin/fdf
db_block_size = 16384
job_queue_processes = 20
job_queue_interval = 60
distributed_transactions = 10
open_links = 4
compatible = "8.1.7"
sort_area_size = 65536
sort_area_retained_size = 65536
disk_asynch_io = FALSE
db_writer_processes = 4 #Set to the number of CPU and less than db_block_lru
db_block_lru_latches = 288 # 2 * no of cpu as per oracle. did not work with 8
_spin_count = 40 # Used in latch contention, multiple of cpu.

mts_servers = 300
mts_max_servers = 800
mts_max_dispatchers = 300
mts_dispatchers = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=cipedb1)
(QUEUESIZE=32)))
(DISPATCHERS=100)
(POOL=ON)
(TICK=1)
(CONNECTIONS=10)
(SESSIONS=10)"
mts_listener_address = "(ADDRESS=(PROTOCOL=TCP)(HOST=cipedb1)(PORT=1521))"

11-APR-02 16:48:16 GMT

Update:
------------

STATSPACK snapshot shows the following:

Top 5 Wait Events
~~~~~~~~~~~~~~~~~ Wait % Total
Event Waits Time (cs) Wt Time
-------------------------------------------- ------------ ------------ -------
latch free 49 286 26.05
log file parallel write 12 252 22.95
db file sequential read 122 181 16.48
sort segment request 1 109 9.93
refresh controlfile command 10 75 6.83

--------------------------------------------------------------------------

and second snapshot shows:

Top 5 Wait Events
~~~~~~~~~~~~~~~~~ Wait % Total
Event Waits Time (cs) Wt Time
-------------------------------------------- ------------ ------------ -------
latch free 60 379 29.06
log file parallel write 21 280 21.47
db file sequential read 123 183 14.03
sort segment request 1 109 8.36
control file parallel write 95 95 7.29

11-APR-02 19:49:47 GMT

Update:
-------------

1) latch free
A wait for a latch free occurs when a server requests a latch and is unable to immediately acquire that latch.
Review Note # 34576.1 titled:
WAITEVENT: "latch free" Reference Note

2) log file parallel write
Writing redo records to the redo log files from the log buffer.

Wait Time: Time it takes for the I/Os to complete. Even though redo records are written in parallel, the parallel write is not complete until the last I/O is on disk.

3) db file sequential read
This indicates a single block read is occurring
The session waits while a sequential read from the database is performed. This event is also used for rebuilding the controlfile, dumping datafile headers, and getting the database file headers.

Wait Time: The wait time is the actual time it takes to do the I/O.

11-APR-02 19:54:44 GMT

ACTION PLAN:
-----------------------

1) To rule out the possibility that MTS is causing a problem,
please comment out ALL of the mts parameters in the init.ora file.
We do not want to use MTS.

Then run your stress tests.

Generate a new statspack snapshot if performance times are bad.

2) Are you using connection pooling on these web servers?

3) What type of application/web servers are you using?

4) If you run these queries via sqlplus locally on the database server,
what is performance like?

12-APR-02 20:05:49 GMT

Ct called in live.

They are now at the point where this issue has caused them to stop all forward progress.
Ct has requested that the severity be raised to 1.
Ct needs a resolution as soon as possible.

SUMMARY:
-----------------

They are using a third party app that uses oci to access oracle.
This app makes a sqlplus connection to the oracle database.
They are firing a simple select statement, select * from <tablename>.
They notice that the response time increases tremondously, after
running this select statement 50 times consecutively.

They will be using 1000+ connections, so this increase in response time
is going to hurt them.

They are, at the moment, NOT using MTS.

12-APR-02 20:18:14 GMT

Ct was running 10 threads on the db server using sqlplus
He saw 8 processes waiting, and the response time was 2.94 sec.

When he ran the query individually, it is taking 1.01seconds.

Ct also wanted to mention that TOP shows:
Memory: 4096M real, 63M free, 3198M swap in use, 3895M swap free

PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
11640 root 8 59 -20 7608K 5816K cpu/5 56:06 19.01% caiLogA2
10586 oracle 1 48 0 2512M 2489M run 0:22 0.94% oracle
10453 oracle 1 49 0 2513M 2493M run 0:25 0.87% oracle
10480 oracle 1 58 0 2513M 2493M run 0:23 0.87% oracle
10406 oracle 1 58 0 2512M 2490M run 0:26 0.87% oracle
10571 oracle 1 58 0 2513M 2493M run 0:21 0.87% oracle
10458 oracle 1 49 0 2513M 2492M run 0:24 0.86% oracle
10413 oracle 1 54 0 2513M 2492M run 0:26 0.82% oracle
10465 oracle 1 54 0 2513M 2492M run 0:24 0.79% oracle
10441 oracle 1 39 0 2512M 2489M run 0:24 0.78% oracle
10485 oracle 1 58 0 2513M 2493M run 0:24 0.77% oracle
10473 oracle 1 58 0 2513M 2492M run 0:22 0.77% oracle
10550 oracle 1 58 0 2513M 2492M run 0:23 0.76% oracle
10455 oracle 1 58 0 2512M 2490M run 0:24 0.75% oracle
10468 oracle 1 490 2513M 2492M run 0:23 0.74% oracle

12-APR-02 21:08:16 GMT

New info : Please find attached the statspack report for 1 user and 50 user. We
also executed 10 simultaneous user requests from sql plus on the database
server and the response time jumped from 1 second for 1 sql plus session to 3
seconds for 10 simultaneous sessions.

12-APR-02 23:58:15 GMT

For starters, the 50 users report was run for 6.35 minutes while the other report was only run for 3.67 minutes. So ratio numbers and percentage should still compare but raw numbers may not.

Per transaction 50 1
Logical reads 322,555.50 48,387
block changes 514 593

Interesting to note that the above is per transaction, many more reads with more users but same number of block changes.

Wait event info:
Buffer busy waits 5,264 (66%) 440

The client also mentioned cpu usage. The cpu user per session is substantially higher in the 50 user test.

There is so much information in the trace files it is impossible to summarize here.

The client has simplified this to a question:

If you run the same select statement in 50 different sessions at the exact same time, should the first one be substantially faster then the last?

Considering the last one does not have to parse the statement and the answer is all in memory at this point, the client believes that it should not be slower, or at least not much.

Is there some other resource being waited on when 50 sessions are running identical statements at the same time?

We should attempt to reduce the buffer busy waits and go from there I think.

They also want to know why running this 50 user tests causes one of their cpus to peg at 100% usage.

Problem
=======
By reviewing the statspack report (stats_50user.txt), buffer busy waits, was reported as the culprit (highest) wait time.

Diagnostic Analysis
===================
Buffer busy waits for data blocks are often due to several processes repeatedly reading the same blocks (eg: if lots of people scan the same index) - the first session processes the blocks that are in the buffer cache quickly but then a block has to be read from disk - the other sessions (scanning the same index) quickly 'catch up' and want the block which is currently being read from disk - they wait for the buffer as someone is already reading the block in.

A large DB_BLOCK_SIZE with default initrans can be a problem.

Please review the following document, which has a section on buffer busy waits, and potential solutions. This should provide some assistance.

Doc ID: 62161.1
Subject: Systemwide Tuning using UTLESTAT Reports in Oracle7/8

Next plan of action
===================
Determine why there is high cpu utilization on the machine that is caused by Oracle. Also, provide recommendations of what values to set to reduce the number of buffer busy waits, when the 50 users execute the same select statement.

@WIP

13-APR-02 04:31:39 GMT

Having Vivek provide sql_trace for the batch job. One for one user, one for
multiple users (20).

Export table and send to us with sql so we may attempt a test on our side.
This may be useful so we may see if we can reproduce the problem here.

13-APR-02 05:08:24 GMT

Spoke with RHACHEM on the issue and he mentioned that the
db_block_buffers: 28400 (is puny)

Recommend increasing this value to at least 40k, if not doubling it.
This is what is probably causing the:
buffer is not pinned count: 884,397

By increasing the number of db_block_buffers, we should see a
significant performance increase.

13-APR-02 05:44:47 GMT

Already discussed the above traces with Vivek.

Bottom line is change the value of the DB_BLOCK_BUFFERS
to a value much higher than the current setting of 28400.
I recommend doubling it.

13-APR-02 05:54:33 GMT

Will begin putting together a test case, but truly, the number
of DB_BLOCK_BUFFERS is much too low at this point.

Please increase the value to at least 50000.

13-APR-02 06:29:55 GMT

Total buffer size was DB_BLOCK_BUFFERS = 76800
SORT_AREA_SIZE was set to 1000000000 (1G)

The values have subsequentlly been decreased to 28400
and 500(meg) respectively.

Having Vivek alter the values to 76800 and 300(meg)
respectively.

He is bouncing the DB and will attempt the execution again.

13-APR-02 07:29:01 GMT

New info : Please note that SORT_AREA_SIZE was set to 1M and not 1G as
mentioned.

13-APR-02 08:04:01 GMT

New info : We have provided you with table exports (oracle.dmp). Please use
these table and do some testing on your side. I have the same database
structures configured on an NT server and I ran 1 thread and 20 threads. I saw
the same performance degradation. 1 thread came back in 11 sec and 20
simultaneous threads came back in 83 secs.

13-APR-02 08:10:47 GMT

If you monitor the session level waits, then you are focussing on what the problem is. Statspack is too general.
I take it that the ct runs 1 piece of sql and it takes X time, but when he runs several sessions (with the same SQL??) , it takes ages. In that case, we need to get 10046 level 8 on the session(s). That way we will see what each one is waiting for. Get timed_statistics set on at DB level so that we can see the wait times. I think the ct mentioned something like 20 sessions, but really he needs to minimize that until he thinks there is a problem. What we will see , is waits appearing in the 'many' sessions which do not appear when we run a single session. Whatever those waits are will lead us to the resolution.

13-APR-02 15:04:01 GMT

New info : Have I been switchd analyst again. Just top keep count, following
are the analysts I have been bounced to:
1: Gus
2: Peter
3: Taylor
4: Michelle.
Have I been bounced again. Looks like it is taking each analyst to review the
tar and by the time any suggestions are due, it is time for the next analyst.
Is there some analysts with performance experience who can address this issue.
The scenario is as follows:
We have a E4550, 4CPU and 4GB memory database server. We have not been able to
process 20 concurrent requests without degradation in performance. It looks
like that oracle is serializing the requests even though they are the same
requests issued multiple times and they are simple select queries. Oracle does
not seem to be reading from data buffer cache in memory or is locking even the
memory segment. This server was supposed to run 1000 users.
After various stats pack reports and trace files, none of the suggestions have
helped us so far. Suggestions include
1: Increase SORT_AREA_SIZE.
2: INCREASE INITRANS on the table and index.
3: Reduce the number of rows in the block.
4: Increase the data_block_Buffers.
5: Pin the objects (table and index) into memory (buffer pool keep).
The last suggestion was to set events 10046 for 1 user and 20 users. The trace
files was supposed to provide clues of what wait events are there in addition
to the ones for the 1 user. Not heard any feedback on the same.
I have been able simulate the same situation in a Windows environment. I had
uploaded the structures of the tables involved. Have you been able to simulate
the same scenario in your lap.

Richard Eastham

--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author: Paul Vallee
INET: dbalist_at_pythian.com

Fat City Network Services -- (858) 538-5051 FAX: (858) 538-5051
San Diego, California -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from). You may
also send the HELP command for other information (like subscribing).

Received on Thu Apr 18 2002 - 09:43:27 CDT