Oracle FAQ Your Portal to the Oracle Knowledge Grid
Home -> Community -> Mailing Lists -> Oracle-L -> RE: SOS Alert

RE: SOS Alert

From: VIVEK_SHARMA <VIVEK_SHARMA_at_infy.com>
Date: Wed, 20 Dec 2000 12:54:35 +0530
Message-Id: <10716.124987@fatcity.com>


CASE: In a bank, the interest calculation batch processing is unable to go beyond 500 A/cs (accounts) per minute.

ACTION: Moved the database to another IBM machine (4 CPUs, 4 GB RAM) with hardware mirroring & a 64 K stripe size.

NOTE - On the under-performing DB server (S80 model of IBM) there exists software mirroring (by the AIX O.S.) & a stripe size of 4K.

Firing off the interest calculation runs from 2 APP servers now gives an overall total of 3000 A/cs processed per minute (1500 from each).

Qs. Which of the 2, i.e. hardware mirroring OR 64 K striping, do you think is the cause of the increase in performance, as we put BOTH features together at the SAME time? Or do both contribute equally?

Thanks to Vivek for everything

-----Original Message-----

	From:	VIVEK_SHARMA 
	Sent:	Thursday, December 07, 2000 6:35 PM
	Subject:	RE: SOS Alert 



	ACTION: Identified the 2 indexes having the waits of "db file sequential read".
	Created 2 NEW tablespaces, each containing 1 index.
	Also specified parameter log_small_entry_max_size = 0
	NOTE - ORA 7.3.4.5

	RESULT: Performance throughput increased to 1500 A/cs per minute from 1000 A/cs per minute.
	Allow me to begin :-
	There were 2 routes in between the APP & DB servers.
	One had a bandwidth of 10 Mbps while the other had a bandwidth of 100 Mbps.
	The 10 Mbps route was being used all this while. On enabling the 100 Mbps route
	(by specifying the I.P. address instead of the hostname in the tnsnames.ora & listener.ora)
	the performance jumped to 1100 records per minute (from 500 records per minute),
	i.e. almost doubled.
	100 Mbps is the MAX bandwidth possible as per the SA.

	Qs. Your views on this please

	Additional responses follow your views BELOW :-


-----Original Message-----
From:	Vivek [SMTP:Vivek_at_1800FLOWERS.com]
Sent:	Tuesday, November 28, 2000 6:30 PM
To:	'VIVEK_SHARMA'
Cc:	Arun Chakrapani
Subject:	RE: SOS Alert

Vivek:

Sorry for not replying to you yesterday itself. A busy day. Took a look at your report.txt. They were having 2 different profiles. Since you had mentioned that it also stalled in the Test DB, I concentrated more on the test report.txt. One thing which stood out on the other one (live) on opening your report.txt is that the SQL_AREA GET_HIT_RATIO is very poor. I don't know whether you have already pinned your SYS objects and other objects that are executed often in the shared pool. Please use DBMS_SHARED_POOL.KEEP to pin packages, procedures, functions, sequences, cursors and triggers that are used often.

        Ans OK will do so
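	The pinning suggested above is done with the DBMS_SHARED_POOL package (installed as SYS from $ORACLE_HOME/rdbms/admin if not already present). A minimal sketch; the application package name below is illustrative:

```sql
-- Install the package as SYS if not already present:
--   @dbmspool.sql
--   @prvtpool.plb

-- Pin a frequently executed SYS package so it is never aged
-- out of the shared pool (STANDARD is a typical candidate):
EXECUTE DBMS_SHARED_POOL.KEEP('SYS.STANDARD', 'P');

-- Pin an application package (name is hypothetical):
EXECUTE DBMS_SHARED_POOL.KEEP('APPUSER.PKG_INTEREST_CALC', 'P');

-- Verify what is currently kept:
SELECT owner, name, type
  FROM v$db_object_cache
 WHERE kept = 'YES';
```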

        So now I will concentrate on the Test DB.

  1. Even though the SQL_AREA GET_HIT_RATIO is higher (82%) when compared to live, this is not good enough. It should be close to 100% (somewhere like .98 at least). Please pin the SYS objects to start with.
	OK
	Allow me to say that the shared_pool_size has by experience been kept at the bare minimum required,
	for we did experience performance problems with a shared_pool_size of 300 MB
	(though on another machine, on SunOS).
	It is recommended to keep shared_pool_size at rock bottom, just above the limit
	where ORA-4031 (out of shared pool segments) occurs.
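	One rough way to watch how close the shared pool is running to that ORA-4031 limit is to track its free memory over the course of a batch run (an indicator only, since free memory can be fragmented):

```sql
-- Free memory remaining in the shared pool; a value that keeps
-- shrinking under load suggests the pool is near its limit:
SELECT name, bytes
  FROM v$sgastat
 WHERE name = 'free memory';
```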


	2. On seeing your system-wide wait events, the following stand out:

	    1. log file sync                   2863        83922      29.31
	    2. write complete waits             905        71401      78.9
	    3. free buffer waits                 60         4718      78.63

	The first event, log file sync, mostly comes when there is a disk I/O bottleneck for the online redo logs. This happens when the processes have to wait for the LGWR to sync a particular redo log. The p1 parameter in a log file sync wait is the corresponding log buffer block.


	The write complete waits come when the user process is waiting for the LGWR to complete the writing of the redo information. So please take a look at the placing of the online redo log: on which disk have you placed this? Also, even if it has been placed in its own file system, that doesn't mean it is not going to a busy disk. Since you have 0+1, the slices may belong to a HOT disk. What you need to do is ask your SA to monitor the disk I/O usage and see what's happening to the disk containing the redo log file. Also you can use iostat -xPnce 5 5 (I think it's available in AIX). The output will look something like:

	us sy wt id
	23  7 49 21
	                       extended device statistics ----              errors ---
	 r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
	 0.0  0.0    0.0    0.0  0.0  0.0    0.0   13.4   0   0   0   0   0   0 c1t10d0s0
	 0.0  0.0    0.0    0.0  0.0  0.0    0.0    2.4   0   0   0   0   0   0 c1t10d0s1
	 0.0  0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c1t10d0s2
	 0.0  1.3    0.3   10.1  0.0  0.1    4.6  105.8   0   1   0   0   0   0 c1t10d0s3

	Watch out for the %b column. If it's somewhere around 60 then you may be experiencing contention on those slices. Ask your SA, by giving him/her the device name, and they will be able to tell you what the FS is.

	NOTE - 4 datafiles containing a particular table & its corresponding indexes (to which INSERTS happen) were moved out of the RAID onto a local disk, whereafter performance jumped from 850 per minute to 1100 per minute.

	Ans - OK. Will issue the command.

	Also I see free buffer waits. This is caused because there are no free buffers when they are requested - pointing directly to the inefficiency of the DBWR. Again, the DBWR may be slow because you have an I/O bottleneck for the DB files, and indirectly because of the write complete waits.

	Qs. Would you recommend setting multiple db_writer processes even though async_io is set to true?
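	For reference, on Oracle 7.x the relevant init.ora parameter is db_writers; a sketch only, with illustrative values (whether it helps alongside AIX async I/O is exactly the open question above):

```
# init.ora fragment (values illustrative, Oracle 7.x):
db_writers = 4        # start multiple DBWR processes
# async_io = true     # platform-specific; on AIX requires the OS
                      # async I/O subsystem to be enabled
```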


		" Thus a Volume Group Contains 8 Disks with 4 Disks 
		> Stripe Size = 4K "


	[Vivek] Please confirm that the stripe size is 4K. Usually you will have a stripe size of 32K, 64K, 128K, 256K ... If you are having a stripe size of 4K and your db_block_size is 8K, it is going to take 2 reads to satisfy one request for a db file sequential read (which usually requests 1 Oracle block). This doesn't sound good if that's true.

	NOTE - The "test" database has a 4K DB_BLOCK_SIZE while the Live database has an 8K DB_BLOCK_SIZE, but the SAME performance limits were observed on both.

        So if possible could you please do the following.

  1. Please shut down the Test DB and start fresh.
  2. Run the same process again and also run iostat -xPnce 60 10000 for the duration of the run (output this to a file).
  3. Run vmstat and see what the CPU usage is for USER and SYSTEM.
  4. When the rate tops out at 500 A/cs per minute, please run:

	set lines 132
	select sid, substr(event,1,40), p1, p2, p3
	  from v$session_wait
	 where event in ('db file scattered read', 'db file sequential read',
	                 'latch free', 'buffer busy waits', 'log file sync',
	                 'enqueue');

	and give me the output.

	OK - Will Action & revert to you 

	5. If many waits are showing as "latch free" in the event column of v$session_wait, please query v$latch for latch#, name, level, GETS, MISSES, SLEEPS, IMMEDIATE_GETS, WAITS_HOLDING_LATCH, SPIN_GETS, SLEEP1.

	Ans OK 
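	A sketch of the v$latch query suggested in step 5, with the columns as listed (note the column is named LEVEL# in v$latch, since LEVEL is a reserved word):

```sql
-- Latches that have seen misses, worst sleepers first:
SELECT latch#, name, level#, gets, misses, sleeps,
       immediate_gets, waits_holding_latch, spin_gets, sleep1
  FROM v$latch
 WHERE misses > 0
 ORDER BY sleeps DESC;
```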

	6. Preserve the output for v$latch_children.

	Ans OK

	Qs. Are there any IBM OS-specific parameters which are considered beneficial for performance?

        I will take a look at both the report.txt in more detail today and get back to you.

	Thanks again so much.
	Where are you located / phone no.? if you please
	We are currently located in Bangalore & addressing the problem at Mumbai over a WAN.
	Phone 91 80 6588668

	HTH

	Vivek

	> -----Original Message-----
	> From:	VIVEK_SHARMA [SMTP:VIVEK_SHARMA_at_infy.com]
	> Sent:	Monday, November 27, 2000 11:16 AM
	> To:	'Vivek'
	> Cc:	Arun Chakrapani
	> Subject:	RE: SOS Alert 
	> 
	> THANKS so much for the association in this issue 
	> 
	> Attached 2 report.txt, each taken for 1 hour approx. Is it sufficient?
	> 
	> Attached report.txt of the problem on the "Live" database 
	> 
	>  <<report.txt_Live>> 
	> NOTE - A "test" database was created on the DB server itself, in the same partitions as the "Live" database.
	> Runs done on this "test" database also fail to go beyond 500 A/cs per minute.
	> 
	> 
	>  <<report.txt_TEST>> 
	> 
	> NOTE - Interest runs were done on the "Test" database during down-time periods of the "Live" database.
	> 
	> You wrote :-
	> 	Since there are some write complete waits, please take a look at where the online redo logs are placed and what the read/write activity on those disks is.
	> 
	> Ans Online redo logs are ALSO present on the RAID 0+1 disks & are multiplexed.
	> Size of online redo logfile on the "Live" database = 30 MB
	> 
	> NOTE - On the test database we increased the size of the online redo logfile to 100 MB, but with NO benefit.
	> 
	> You wrote :-
	> Please give me the DB version as well
	> 
	> Ans Database Oracle version = 7.3.4.5.0
	> 
	> You wrote :-
	> 	I see some waits on More data from client. 
	> Qs. What is to be done about the same?
	> 
	> NOTE - Will run the 2 SQL scripts suggested by you & revert back
	> 
	> Thanks again
	> 
	> 
	> > -----Original Message-----
	> > From:	Vivek [SMTP:Vivek_at_1800FLOWERS.com]
	> > Sent:	Monday, November 27, 2000 7:44 PM
	> > To:	'VIVEK_SHARMA_at_infy.com'
	> > 
	> > Cc:	Arun Chakrapani
	> > Subject:	SOS Alert 
	> > 
	> > Hi Vivek:
	> > 
	> > Will it be possible to send the whole report.txt to me? Since you have not given for how long this was taken (the time between utlbstat and utlestat), it is more difficult to say anything. Based on the report.txt snippet you have given, there seem to be some issues with the enqueue and latch free waits. Please run the following SQL as SYS at 2 different times (say 1/2 hour apart) and send me both the values (it will be great if you could run this when the batch processing is running). 
	> > 
	> > select KSQSTTYP, KSQSTGET, KSQSTWAT from x$ksqst
	> > where KSQSTGET > 0 and KSQSTWAT > 0;
	> > 
	> > The delta between the 2 results will give you information on which enqueue we see more waits on. Also please run the SQL given below in a cron every minute or so and output this to a file:
	> > 
	> > SELECT p1 "File", p2 "Block", p3 "Reason" FROM v$session_wait
	> > WHERE event like 'buffer busy waits';
	> > 
	> > This will give you an idea about the buffer busy waits. Since there are some write complete waits, please take a look at where the online redo logs are placed and what the read/write activity on those disks is. Also please give me information on the redo log file size. Please give me the DB version as well. I see some waits on More data from client. Again, since I don't know for how long the stats were run, I can't say whether they are significant or not.
	> > 
	> > HTH
	> > 
	> > Vivek
	> > 
	> > 
	> > 
	> > From: VIVEK_SHARMA <VIVEK_SHARMA_at_infy.com>
	> >  Date: Mon, 27 Nov 2000 12:11:59 +0530
	> >  Subject: SOS Alert 
	> > 
	> > CASE: In a bank, interest calculation batch processing is unable to go beyond 500 A/cs per minute 
	> > 
	> > CAUSE of the problem is UNKNOWN 
	> > 
	> > ORACLE 7.3.4.5 on AIX 4.3.3
	> > DB Server - IBM S80 Model - 12 CPUs, 3 GB RAM 
	> > APP Server 1 - IBM S80 Model - 6 CPUs, 2 GB RAM
	> > APP Servers 2 & 3 - IBM Thin Nodes model - 4 CPUs, 1 GB RAM
	> > 
	> > Storage Box :-
	> > ===========
	> > SS Class Storage 
	> > RAID 0+1 - (First Striped & then Mirrored) 
	> > NOTE - 2 storage boxes exist, one being the mirror of the other.
	> > Striping exists across a set of 4 disks (in one box), with another 4 being its mirror (in the other box).
	> > Thus a volume group contains 8 disks, with 4 disks striped (and 4 mirroring them).
	> > Stripe Size = 4K
	> >  
	> > NOTE - Runs tried from BOTH the APP Servers 1 OR 2, but with the SAME MAX of 500 A/cs processed per minute.
	> > 
	> > CPU utilization on BOTH APP & DB Server = 40%
	> > wio% on BOTH APP & DB Servers = 35% 
	> > No paging happening on both APP & DB servers
	> > 
	> > - Oracle contention values seem small to us, as shown below :-
	> > 
	> > SVRMGR> Rem System wide wait events for non-background processes (PMON,
	> > SVRMGR> Rem SMON, etc).  Times are in hundredths of seconds.  Each one of
	> > SVRMGR> Rem these is a context switch which costs CPU time.  By looking at
	> > SVRMGR> Rem the Total Time you can often determine what is the bottleneck
	> > SVRMGR> Rem that processes are waiting for.  This shows the total time spent
	> > SVRMGR> Rem waiting for a specific event and the average time per wait on
	> > SVRMGR> Rem that event.
	> > SVRMGR> select 	n1.event "Event Name", 
	> >      2>        	n1.event_count "Count",
	> >      3> 	n1.time_waited "Total Time",
	> >      4> 	round(n1.time_waited/n1.event_count, 2) "Avg Time"
	> >      5>    from stats$event n1
	> >      6>    where n1.event_count > 0
	> >      7>    order by n1.time_waited desc;
	> > Event Name                       Count         Total Time      Avg Time
	> > -------------------------------- ------------- ------------- -------------
	> > SQL*Net message from client           10856276      31977110          2.95
	> > enqueue                                   1295        374980        289.56
	> > db file sequential read                3614044        303848           .08
	> > write complete waits                      5812        295937         50.92
	> > latch free                             5045060        242170           .05
	> > SQL*Net more data from client            13939        165275         11.86
	> > log file sync                            12794        146409         11.44
	> > buffer busy waits                       100443         92477           .92
	> > 
	> > 
	> > 
	> > - ALL literal SQLs were converted to using bind variables 
	> > - ALL tables running on indexes, without any FULL scans happening.
	> > 
	> > - All the literal SQLs (dynamic) converted to using bind variables (static queries)
	> > 
	> > - event="10181 trace name context forever, level 1000" 
	> > NOTE - Set nevertheless; none of the SQLs is taking excessive time to parse though 
	> > 
	> > - NO statistics are analyzed 
Received on Wed Dec 20 2000 - 01:24:35 CST
