Oracle FAQ Your Portal to the Oracle Knowledge Grid
 

Home -> Community -> Mailing Lists -> Oracle-L -> RE: SOS Alert

RE: SOS Alert

From: VIVEK_SHARMA <VIVEK_SHARMA_at_infy.com>
Date: Thu, 7 Dec 2000 18:35:24 +0530
Message-Id: <10703.123922@fatcity.com>


ACTION - Identified the 2 indexes having the waits of "db file sequential read".
Created 2 NEW tablespaces, each containing 1 index. Also specified the parameter log_small_entry_max_size = 0.
NOTE - Oracle 7.3.4.5
RESULT - Throughput increased to 1500 A/cs per minute from 1000 A/cs per minute.



> Allow me to begin :-
> There were 2 routes between the APP & DB servers.
> One had a bandwidth of 10 Mbps while the other had a bandwidth of 100 Mbps.
> The 10 Mbps route was being used all this while. On enabling the 100 Mbps
> route
> (by specifying the IP address instead of the hostname in tnsnames.ora &
> listener.ora)
> the performance jumped to 1100 records per minute (from 500 records
> per minute),
> i.e. almost doubled.
> 100 Mbps is the MAX bandwidth possible as per the SA.
>
> Qs. Your Views on this please
>
> Additional responses Follow your Views BELOW :-
>
> -----Original Message-----
> From: Vivek [SMTP:Vivek_at_1800FLOWERS.com]
> Sent: Tuesday, November 28, 2000 6:30 PM
> To: 'VIVEK_SHARMA'
> Cc: Arun Chakrapani
> Subject: RE: SOS Alert
>
> Vivek:
>
> Sorry for not replying to you yesterday itself. A busy day. Took a look at
> your report.txt. They had 2 different profiles. Since you had
> mentioned that it also stalled in the test DB, I concentrated more on the
> test report.txt. One thing which stood out on the other one (live) on
> opening your report.txt is that the SQL_AREA GET_HIT_RATIO is very poor.
> I don't know whether you have already pinned your SYS objects and other
> objects that are executed often in the shared pool. Please use
> DBMS_SHARED_POOL.KEEP to pin packages, procedures, functions, sequences,
> cursors and triggers that are used often.
>
> Ans OK will do so
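For reference, the pinning advice above might be sketched as follows (run as SYS; the package is created by dbmspool.sql if not already installed, and the application package name below is purely illustrative):

```sql
-- Create the package first if absent: @?/rdbms/admin/dbmspool.sql (as SYS)
-- Pin SYS.STANDARD plus an often-used application package (illustrative name):
EXECUTE DBMS_SHARED_POOL.KEEP('SYS.STANDARD', 'P');
EXECUTE DBMS_SHARED_POOL.KEEP('SCOTT.INTEREST_CALC_PKG', 'P');

-- Confirm what is pinned:
SELECT owner, name, type FROM v$db_object_cache WHERE kept = 'YES';
```

Pinning is best done right after instance startup, before the shared pool fragments.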
>
> So now I will concentrate on the Test DB.
>
> 1. Even though the SQL_AREA GET_HIT_RATIO is higher (82%) when compared to
> live, this is not good enough. It should be close to 100% (somewhere like
> 0.98 at least). Please pin the SYS objects to start with.
>
> OK
> Allow me to say that the shared_pool_size has by experience been kept at
> the bare minimum required,
> for we did experience performance problems with a shared_pool_size of 300 MB
> (though on another machine, on SunOS).
> It is recommended to keep shared_pool_size at rock bottom, just above the
> limit
> where ORA-4031 (out of shared pool segments) occurs.
>
>
> 2. On seeing you system wide wait events the following stands out.
>
> 1. log file sync           2863   83922   29.31
> 2. write complete waits     905   71401   78.90
> 3. free buffer waits         60    4718   78.63
>
> The first event, log file sync, mostly comes when there is a disk
> i/o bottleneck for the online redo logs. This happens when the processes
> have to wait for the LGWR to sync a particular redo log. The p1 parameter
> of a log file sync wait is the corresponding log buffer block#.
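A quick way to catch this in the act, as a sketch (p1 of a log file sync wait is the log buffer block# being synced):

```sql
-- Sessions currently stuck waiting on log file sync, run while the batch is active.
SELECT sid, event, p1, p2, p3
  FROM v$session_wait
 WHERE event = 'log file sync';
```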
>
>
> The write complete waits come
> when the user process is waiting for the LGWR to complete the writing of
> the redo information. So please take a look at the placement of the online
> redo logs. On which disk have you placed them? Also, even if they have been
> placed in their own file system, that doesn't mean they are not going to a
> busy disk. Since you have 0+1, the slices may belong to a HOT disk. What
> you need to do is ask your SA to monitor the disk I/O usage and see what's
> happening to the disk containing the redo log file. Also you can use
> iostat -xPnce 5 5 (I think it's available in AIX). The output will look
> something like:
>
> us sy wt id
> 23  7 49 21
>               extended device statistics                ---- errors ----
> r/s  w/s  kr/s  kw/s  wait actv wsvc_t asvc_t %w %b  s/w h/w trn tot device
> 0.0  0.0   0.0   0.0   0.0  0.0    0.0   13.4  0  0    0   0   0   0 c1t10d0s0
> 0.0  0.0   0.0   0.0   0.0  0.0    0.0    2.4  0  0    0   0   0   0 c1t10d0s1
> 0.0  0.0   0.0   0.0   0.0  0.0    0.0    0.0  0  0    0   0   0   0 c1t10d0s2
> 0.0  1.3   0.3  10.1   0.0  0.1    4.6  105.8  0  1    0   0   0   0 c1t10d0s3
> Watch out for the %b column. If it's somewhere around 60 then you may be
> experiencing contention on those slices. Ask your SA by giving him/her the
> device name and they will be able to tell you what the FS is.
>
> NOTE - 4 datafiles containing a particular table & its corresponding
> indexes (to which INSERTs happen) were moved out of the RAID onto a
> local disk, whereafter performance jumped from 850 per minute to
> 1100 per minute
>
> Ans - OK. Will issue the command.
>
> Also I see
> free buffer waits. This is caused because there are no free buffers when
> they are requested - pointing directly to the inefficiency of the DBWR.
> Again, the DBWR may be slow because you have an I/O bottleneck for the DB
> files, and indirectly because of the write complete waits.
>
> Qs. Would you recommend Setting Multiple db_writer processes
> even though async_io is set to true ?
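For what it's worth on that question: on 7.3 the relevant init.ora knob is, as far as I recall, db_writers (db_writer_processes only arrived in later releases), and the usual advice is that multiple DB writers are an alternative to async I/O rather than a complement. A hedged illustration:

```
# init.ora fragment (illustrative value only).
# On platforms where async I/O works, db_writers is normally left at 1;
# raise it mainly when async I/O is unavailable or disabled.
db_writers = 4
```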
>
>
> " Thus a Volume Group Contains 8 Disks with 4 Disks
> > Stripe Size = 4K "
>
>
> [Vivek] Please confirm that the stripe size is 4K. Usually you will have
> a stripe size of 32K, 64K, 128K, 256K .... If you are having a stripe size
> of 4K and your db_block_size is 8K, it is going to take 2 reads to satisfy
> one request for a db file sequential read (which usually requests 1 Oracle
> block)
> [Vivek] . This doesn't sound good if that's true.
>
> NOTE - The "test" Database uses a 4K DB_BLOCK_SIZE while the Live Database
> uses an 8K DB_BLOCK_SIZE, but the SAME performance limits were observed on
> both.
>
> So if possible could you please do the following.
>
> 1. Please shutdown the Test db and start fresh.
> 2. Run the same process again and also run the iostat -xPnce 60 10000 for
> the duration of the run (output this to a file)
> 3. Run vmstat and see whats the CPU usage for USER and SYSTEM
> 4. When the 500 A/cs per minute mark is reached - please run
>
> set lines 132
> select sid, substr(event,1,40), p1, p2, p3 from v$session_wait
> where event in ('db file scattered read','db file sequential read',
> 'latch free','buffer busy waits','log file sync','enqueue');
>
> and give me the output.
>
> OK - Will Action & revert to you
>
> 5. If many waits are showing as "latch free" in the event column of
> v$session_wait, please query v$latch for
> latch#, name, level#, GETS, MISSES, SLEEPS, IMMEDIATE_GETS,
> WAITS_HOLDING_LATCH, SPIN_GETS, SLEEP1.
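Spelled out as a query, item 5 might look like this (a sketch; note the v$latch column is level#, since LEVEL is a reserved word, and SLEEP1 is the first of the per-latch sleep buckets):

```sql
-- Top latches by sleeps; run while the "latch free" waits are occurring.
SELECT latch#, name, level#, gets, misses, sleeps,
       immediate_gets, waits_holding_latch, spin_gets, sleep1
  FROM v$latch
 WHERE misses > 0
 ORDER BY sleeps DESC;
```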
>
> Ans OK
>
> 6. Preserve the output for v$latch_children.
>
> Ans OK
>
> Qs. Are there any IBM OS-specific parameters which are considered
> beneficial for performance ?
>
>
> I will take a look at both the report.txt in more detail today and get
> back
> to you.
>
> Thanks Again so much
> Where are you Located / Phone No ? if you please
> We are Currently Located in Bangalore & Addressing The Problem at Mumbai
> over a WAN
> Phone 91 80 6588668
>
> HTH
>
> Vivek
>
> > -----Original Message-----
> > From: VIVEK_SHARMA [SMTP:VIVEK_SHARMA_at_infy.com]
> > Sent: Monday, November 27, 2000 11:16 AM
> > To: 'Vivek'
> > Cc: Arun Chakrapani
> > Subject: RE: SOS Alert
> >
> > THANKS so much for the association in this issue
> >
> > Attached 2 report.txt each taken for 1 Hour approx . Is it Sufficient ?
> >
> > Attached report.txt of Problem on the "Live" Database
> >
> > <<report.txt_Live>>
> > NOTE - A "test" Database Created on the DB Server Itself in the Same
> > Partitions as the "Live" Database
> > Runs Done on this "test" Database Also failing to go beyond 500 A/cs per
> > minute
> >
> >
> > <<report.txt_TEST>>
> >
> > NOTE - Interest Runs were Done on the "Test" Database During Down-Time
> > Periods of the "Live" Database
> >
> > You Wrote :-
> > Since there is some write complete waits Please take a look at the
> > place the on line redo logs are and what is the read, write on these
> > disks.
> >
> > Ans Online redo logs ALSO present on the RAID 0+1 disks & are multiplexed.
> > Size of online redo logfile on the "Live" Database = 30 MB
> >
> > NOTE - On the test Database we increased the Size of the Online Redo
> > logfile
> > to 100MB but with NO benefit
> >
> > You wrote :-
> > Please give me the DB Version as well
> >
> > Ans Database Oracle version = 7.3.4.5.0
> >
> > You wrote :-
> > I see some waits on More data from client.
> > Qs. What is to be done about the Same ?
> >
> > NOTE - Will Run the 2 SQL Scripts Suggested by you & revert back
> >
> > Thanks Again
> >
> >
> > > -----Original Message-----
> > > From: Vivek [SMTP:Vivek_at_1800FLOWERS.com]
> > > Sent: Monday, November 27, 2000 7:44 PM
> > > To: 'VIVEK_SHARMA_at_infy.com'
> > >
> > > Cc: Arun Chakrapani
> > > Subject: SOS Alert
> > >
> > > Hi Vivek:
> > >
> > > Will it be possible to send the whole report.txt to me? Since you have
> > > not given for how long this was taken (time from utlbstat to utlestat),
> > > it is more difficult to say anything. Based on the report.txt snippet
> > > you have given, it seems to me there are some issues with the enqueue
> > > and latch free waits. Please run the following SQL as SYS at 2 different
> > > times (say 1/2 hour apart) and send me both the values (it will be great
> > > if you could run this when the batch processing is running).
> > >
> > > select KSQSTTYP, KSQSTGET, KSQSTWAT from x$ksqst
> > > where KSQSTGET > 0 and KSQSTWAT > 0;
> > >
> > > The delta between the 2 results will give you information on which
> > > enqueue we are seeing more waits on. Also please run the SQL given
> > > below in a cron every minute or so and output this to a file:
> > >
> > > SELECT p1 "File", p2 "Block", p3 "Reason" FROM v$session_wait
> > > WHERE event like 'buffer busy waits' ;
> > >
> > > This will give you an idea about the buffer busy waits. Since there
> > > are some write complete waits, please take a look at where the online
> > > redo logs are placed and what the read/write rates on those disks are.
> > > Also please give me information on the redo log file size. Please give
> > > me the DB version as well. I see some waits on "more data from client".
> > > Again, since I don't know for how long the stats were run, I can't say
> > > whether they are significant or not.
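Once that cron query accumulates File/Block pairs, they can be mapped to the owning segments with something like this (a sketch; &file and &blk are SQL*Plus substitution variables, and the dba_extents scan can be slow on a large database):

```sql
-- Map one (file#, block#) pair from the buffer busy waits output to a segment.
SELECT owner, segment_name, segment_type
  FROM dba_extents
 WHERE file_id = &file
   AND &blk BETWEEN block_id AND block_id + blocks - 1;
```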
> > >
> > > HTH
> > >
> > > Vivek
> > >
> > >
> > >
> > > From: VIVEK_SHARMA <VIVEK_SHARMA_at_infy.com>
> > > Date: Mon, 27 Nov 2000 12:11:59 +0530
> > > Subject: SOS Alert
> > >
> > > CASE In a Bank , Interest Calculation Batch Processing Unable to go
> > beyond
> > > 500 A/cs per minute
> > >
> > > CAUSE of the Problem is UNKNOWN
> > >
> > > ORACLE 7.3.4.5 on AIX 4.3.3
> > > DB Server - IBM S80 Model - 12 CPUs , 3 GB RAM
> > > APP Server 1 - IBM S80 Model - 6 CPUs , 2 GB RAM
> > > APP Servers 2 & 3 - IBM Thin Nodes model - 4 CPUs , 1 GB RAM
> > >
> > > Storage Box :-
> > > ===========
> > > SS Class Storage
> > > RAID 0+1 - (first striped & then mirrored)
> > > NOTE - 2 storage boxes exist, one being the mirror of the other.
> > > Striping exists across a set of 4 disks (in one box), with another 4
> > > being its mirror (in the other box).
> > > Thus a volume group contains 8 disks, with 4 disks striped and 4
> > > mirroring them.
> > > Stripe Size = 4K
> > >
> > > NOTE - Runs tried from BOTH APP Servers 1 and 2, but with the SAME MAX
> > > of 500 A/cs processed per minute
> > >
> > > CPU Utilizations on BOTH APP & DB Server = 40 %
> > > wio% on BOTH APP & DB Servers = 35 %
> > > No paging happening on Both APP & DB Servers
> > >
> > > - Oracle contention values seem small to us, as shown below :-
> > >
> > > SVRMGR> Rem System wide wait events for non-background processes (PMON,
> > > SVRMGR> Rem SMON, etc). Times are in hundredths of seconds. Each one of
> > > SVRMGR> Rem these is a context switch which costs CPU time. By looking at
> > > SVRMGR> Rem the Total Time you can often determine what is the bottleneck
> > > SVRMGR> Rem that processes are waiting for. This shows the total time spent
> > > SVRMGR> Rem waiting for a specific event and the average time per wait on
> > > SVRMGR> Rem that event.
> > > SVRMGR> select n1.event "Event Name",
> > > 2> n1.event_count "Count",
> > > 3> n1.time_waited "Total Time",
> > > 4> round(n1.time_waited/n1.event_count, 2) "Avg Time"
> > > 5> from stats$event n1
> > > 6> where n1.event_count > 0
> > > 7> order by n1.time_waited desc;
> > > Event Name                         Count    Total Time   Avg Time
> > > -------------------------------- --------- ------------ ----------
> > > SQL*Net message from client       10856276     31977110       2.95
> > > enqueue                               1295       374980     289.56
> > > db file sequential read            3614044       303848        .08
> > > write complete waits                  5812       295937      50.92
> > > latch free                         5045060       242170        .05
> > > SQL*Net more data from client        13939       165275      11.86
> > > log file sync                        12794       146409      11.44
> > > buffer busy waits                   100443        92477        .92
> > >
> > >
> > >
> > > - ALL literal SQLs (dynamic) were converted to use bind variables
> > > (static queries)
> > > - All tables running on indexes, without any FULL scans happening.
> > >
> > > - event="10181 trace name context forever, level 1000"
> > > NOTE - Set nevertheless, though none of the SQLs is taking excessive
> > > time to parse
> > >
> > > - NO statistics are Analyzed
> > > - 7 tables involved in the interest Calc.
> > > inserts to 2 Tables
Received on Thu Dec 07 2000 - 07:05:24 CST

