Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 

Home -> Community -> Mailing Lists -> Oracle-L -> RE: SOS Alert

RE: SOS Alert

From: VIVEK_SHARMA <VIVEK_SHARMA_at_infy.com>
Date: Fri, 22 Dec 2000 03:12:19 +0530
Message-Id: <10718.125239@fatcity.com>


BIG THANKS TO VIVEK, ROSS MOHAN, DAVID & ALL
> -----Original Message-----
> From: Mohan, Ross [SMTP:MohanR_at_STARS-SMI.com]
> Sent: Thursday, December 21, 2000 1:51 AM
> To: Multiple recipients of list ORACLE-L
> Subject: RE: SOS Alert
>
> 4K striping is a major Major MAJOR mistake.
>
> Doing it in software makes it at least 3x as
> bad.
>
> Either fix alone would have provided at least a
> 2-4x improvement.
>
> Given a choice, I'd fix the stripe size first, then
> convert the mirroring to hardware.
>
> hth
>
> imho
>
> ymmv
>
> afaik,
>
> etc.
>
> - Ross Mohan
>
> -----Original Message-----
> From: VIVEK_SHARMA [ <mailto:VIVEK_SHARMA_at_infy.com>]
> Sent: Wednesday, December 20, 2000 11:13 AM
> To: Multiple recipients of list ORACLE-L
> Subject: RE: SOS Alert
>
>
> CASE In a bank, interest-calculation batch processing unable to go beyond
> 500 A/cs per minute.
>
> ACTION Moved the database to another IBM machine (4 CPUs, 4 GB RAM) with
> HARDWARE mirroring & a 64K stripe size.
>
> NOTE - On the under-performing DB server (S80 model of IBM) there exists
> software mirroring (by the AIX O.S.) & a stripe size of 4K.
>
> Firing off the interest-calculation runs from 2 APP servers gives an
> overall total processing rate of 3000 A/cs per minute (1500 from each).
>
> Qs. Which of the 2, i.e. hardware mirroring OR 64K striping, do you think
> is the cause of the increase in performance, as we put BOTH features
> together at the SAME time? Or do both contribute equally?
>
> Thanks to Vivek for everything
>
>
> -----Original Message-----
> From: VIVEK_SHARMA
> Sent: Thursday, December 07, 2000 6:35 PM
> Subject: RE: SOS Alert
>
>
>
> ACTION Identified the 2 indexes having the waits of "db file
> sequential read"
> Created 2 NEW Tablespaces Each containing 1 Index
> Also Specified parameter log_small_entry_max_size = 0
> NOTE - ORA 7.3.4.5
>
> RESULT Performance thruput increased to 1500 A/cs per minute from
> 1000 A/cs per minute
>
>
> ==============================================
> Allow me to begin :-
> There were 2 routes between the APP & DB servers.
> One had a bandwidth of 10 Mbps while the other had a bandwidth of 100 Mbps.
> The 10 Mbps route was being used all this while. On enabling the 100 Mbps
> route (by specifying the I.P. address instead of the hostname in
> tnsnames.ora & listener.ora) the performance has jumped to 1100 records
> per minute (from 500 records per minute), i.e. almost doubled.
> 100 Mbps is the MAX bandwidth possible as per the SA.
>
> Qs. Your views on this please
>
> Additional responses Follow your Views BELOW :-
>
> -----Original Message-----
> From: Vivek [SMTP:Vivek_at_1800FLOWERS.com]
> Sent: Tuesday, November 28, 2000 6:30 PM
> To: 'VIVEK_SHARMA'
> Cc: Arun Chakrapani
> Subject: RE: SOS Alert
>
> Vivek:
>
> Sorry for not replying to you yesterday itself. A busy day. Took a look
> at your report.txt. They were having 2 different profiles. Since you had
> mentioned that it also stalled in the test DB, I concentrated more on the
> test report.txt. One thing which stood out on the other one (live) on
> opening your report.txt is that the SQL_AREA GET_HIT_RATIO is very poor.
> I don't know whether you have already pinned your SYS objects and other
> objects that are executed often in the shared pool. Please use
> DBMS_SHARED_POOL.KEEP to pin packages, procedures, functions, sequences,
> cursors and triggers that are used often.
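A minimal sketch of the pinning step suggested above (the application package name is hypothetical; DBMS_SHARED_POOL must already be installed from the scripts shipped with the database):

```sql
-- Sketch only: pin frequently executed objects in the shared pool.
-- SYS.STANDARD is a real SYS package; APP_OWNER.INTEREST_CALC is a
-- hypothetical application package used here for illustration.
EXECUTE DBMS_SHARED_POOL.KEEP('SYS.STANDARD');
EXECUTE DBMS_SHARED_POOL.KEEP('APP_OWNER.INTEREST_CALC', 'P');
```

The second argument 'P' (the default) marks the object as a package/procedure/function; pinned objects stay in the shared pool until the instance is bounced or they are explicitly unkept.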
>
> Ans OK will do so
>
> So now I will concentrate on the Test DB.
>
> 1. Even though the SQL_AREA GET_HIT_RATIO is higher (82%) when compared
> to live, this is not good enough. It should be close to 100% (somewhere
> like .98 at least). Please pin the SYS objects to start with.
>
> OK
> Allow me to say that the shared_pool_size has by experience been kept at
> the bare minimum required, for we did experience performance problems
> with a shared_pool_size of 300 MB (though on another machine, on SunOS).
> It is recommended to keep shared_pool_size at rock bottom, just above the
> limit where ORA-4031 (out of shared pool segments) occurs.
>
>
> 2. On seeing your system-wide wait events, the following stand out:
>
>    1. log file sync            2863   83922   29.31
>    2. write complete waits      905   71401   78.9
>    3. free buffer waits          60    4718   78.63
>
> The first event, log file sync, mostly comes when there is a disk I/O
> bottleneck for the online redo logs. This happens when the processes have
> to wait for the LGWR to sync a particular redo log. The p1 parameter of a
> log file sync wait is the corresponding log buffer.
>
> The write complete waits come when the user process is waiting for the
> LGWR to complete the writing of the redo information. So please take a
> look at the placing of the online redo logs. On which disk have you
> placed them? Also, even if they have been placed in their own file
> system, it doesn't mean the I/O is not going to a busy disk. Since you
> have 0+1, the slices may belong to a HOT disk. What you need to do is ask
> your SA to monitor the disk I/O usage and see what's happening to the
> disk containing the redo log file. Also you can use iostat -xPnce 5 5
> (I think it's available in AIX). The output will look something like:
>
>      us sy wt id
>      23  7 49 21
>                  ---- extended device statistics ----          ---- errors ----
>  r/s  w/s kr/s  kw/s wait actv wsvc_t asvc_t  %w  %b  s/w h/w trn tot  device
>  0.0  0.0  0.0   0.0  0.0  0.0    0.0   13.4   0   0    0   0   0   0  c1t10d0s0
>  0.0  0.0  0.0   0.0  0.0  0.0    0.0    2.4   0   0    0   0   0   0  c1t10d0s1
>  0.0  0.0  0.0   0.0  0.0  0.0    0.0    0.0   0   0    0   0   0   0  c1t10d0s2
>  0.0  1.3  0.3  10.1  0.0  0.1    4.6  105.8   0   1    0   0   0   0  c1t10d0s3
>
> Watch out for the %b column. If it's somewhere around 60 then you may be
> experiencing contention on those slices. Give your SA the device name and
> they will be able to tell you what the FS is.
>
> NOTE - 4 datafiles containing a particular table & its corresponding
> indexes (to which INSERTs happen) were moved out of the RAID onto a local
> disk, whereafter performance jumped from 850 per minute to 1100 per
> minute.
>
> Ans - OK. Will issue the command.
>
> Also I see free buffer waits. This is caused because there are no free
> buffers when they are requested - pointing directly to the inefficiency
> of the DBWR. Again, the DBWR may be slow because you have an I/O
> bottleneck for the DB files, and indirectly because of the write complete
> waits.
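A quick way to watch all three waits discussed above while the batch is running (a sketch; these v$session_wait columns do exist on 7.3):

```sql
-- Sketch: sample the redo- and buffer-related waits during the batch run.
SELECT sid, event, p1, p2, p3
  FROM v$session_wait
 WHERE event IN ('log file sync', 'write complete waits',
                 'free buffer waits');
```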
>
> Qs. Would you recommend Setting Multiple db_writer processes
> even though async_io is set to true ?
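For reference, a hedged init.ora sketch: on Oracle 7.x the multiple-writer parameter is db_writers (split into other parameters in later releases); whether it actually helps alongside async I/O on AIX should be verified against the platform notes, and the value below is purely illustrative.

```
# init.ora sketch (7.x parameter name; value illustrative only)
db_writers = 4
```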
>
>
> " Thus a Volume Group Contains 8 Disks with 4 Disks
> > Stripe Size = 4K "
>
>
> [Vivek] Please confirm that the stripe size is 4K. Usually you will have
> a stripe size of 32K, 64K, 128K, 256K.... If you are having a stripe size
> of 4K and your db_block_size is 8K, it is going to take 2 reads to
> satisfy one request for a db file sequential read (which usually requests
> 1 Oracle block).
> [Vivek] This doesn't sound good if that's true.
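The read-amplification arithmetic above can be sketched as (numbers illustrative):

```sql
-- An 8K Oracle block laid over 4K stripe units always spans 2 units,
-- so one single-block read can cost 2 physical reads.
SELECT CEIL(8192 / 4096) AS stripe_units_per_block FROM dual;  -- = 2
```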
>
> NOTE - The "test" database has a 4K DB_BLOCK_SIZE while the live database
> has an 8K DB_BLOCK_SIZE, but the SAME performance limits were observed on
> both.
>
> So if possible could you please do the following.
>
> 1. Please shut down the test DB and start fresh.
> 2. Run the same process again and also run iostat -xPnce 60 10000 for
> the duration of the run (output this to a file).
> 3. Run vmstat and see what the CPU usage is for USER and SYSTEM.
> 4. When the 500 A/cs per minute limit is reached - please run
>
> set lines 132
> select sid, substr(event,1,40), p1, p2, p3 from v$session_wait
> where event in ('db file scattered read','db file sequential read',
> 'latch free','buffer busy waits','log file sync','enqueue');
>
> and give me the output.
>
> OK - Will Action & revert to you
>
> 5. If many waits are showing as "latch free" in the event column of
> v$session_wait, please query v$latch for
> latch#, name, level#, gets, misses, sleeps, immediate_gets,
> waits_holding_latch, spin_gets, sleep1.
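A sketch of that latch query (note the v$latch column is level#, since LEVEL is a reserved word; column set as in older 7.x v$latch):

```sql
-- Sketch: latches ordered by sleeps, usually the strongest
-- contention signal.
SELECT latch#, name, level#, gets, misses, sleeps,
       immediate_gets, waits_holding_latch, spin_gets, sleep1
  FROM v$latch
 ORDER BY sleeps DESC;
```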
>
> Ans OK
>
> 6. Preserve the output for v$latch_children.
>
> Ans OK
>
> Qs. Are there any IBM OS-specific parameters which are considered
> beneficial for performance ?
>
>
> I will take a look at both the report.txt in more detail today and get
> back to you.
>
> Thanks again so much.
> Where are you located / phone no., if you please?
> We are currently located in Bangalore & addressing the problem at Mumbai
> over a WAN.
> Phone 91 80 6588668
>
> HTH
>
> Vivek
>
> > -----Original Message-----
> > From: VIVEK_SHARMA [SMTP:VIVEK_SHARMA_at_infy.com]
> > Sent: Monday, November 27, 2000 11:16 AM
> > To: 'Vivek'
> > Cc: Arun Chakrapani
> > Subject: RE: SOS Alert
> >
> > THANKS so much for the association in this issue
> >
> > Attached 2 report.txt, each taken for 1 hour approx. Is it sufficient ?
> >
> > Attached report.txt of Problem on the "Live" Database
> >
> > <<report.txt_Live>>
> > NOTE - A "test" database created on the DB server itself in the same
> > partitions as the "live" database.
> > Runs done on this "test" database also failing to go beyond 500 A/cs
> > per minute.
> >
> >
> > <<report.txt_TEST>>
> >
> > NOTE - Interest runs were done on the "test" database during down-time
> > periods of the "live" database.
> >
> > You wrote :-
> > Since there are some write complete waits, please take a look at where
> > the online redo logs are placed and what the read/write activity on
> > these disks is.
> >
> > Ans Online redo logs ALSO present on the RAID 0+1 disks & are
> > multiplexed.
> > Size of online redo logfile on the "live" database = 30 MB
> >
> > NOTE - On the test database we increased the size of the online redo
> > logfile to 100 MB, but with NO benefit.
> >
> > You wrote :-
> > Please give me the DB Version as well
> >
> > Ans Database Oracle version = 7.3.4.5.0
> >
> > You wrote :-
> > I see some waits on more data from client.
> > Qs. What is to be done about the same ?
>
> > NOTE - Will run the 2 SQL scripts suggested by you & revert back.
>
> >
> > Thanks Again
> >
> >
> > > -----Original Message-----
> > > From: Vivek [SMTP:Vivek_at_1800FLOWERS.com]
> > > Sent: Monday, November 27, 2000 7:44 PM
> > > To: 'VIVEK_SHARMA_at_infy.com'
> > >
> > > Cc: Arun Chakrapani
> > > Subject: SOS Alert
> > >
> > > Hi Vivek:
> > >
> > > Will it be possible to send the whole report.txt to me? Since you
> > > have not given for how long this was taken (time from utlbstat to
> > > utlestat), it is more difficult to say anything. Based on the
> > > report.txt snippet you have given, there seem to be some issues with
> > > the enqueue and latch free waits. Please run the following SQL as SYS
> > > at 2 different times (say 1/2 hour apart) and send me both the values
> > > (it will be great if you could run this when the batch processing is
> > > running).
> > >
> > > select KSQSTTYP, KSQSTGET, KSQSTWAT from x$ksqst
> > > where KSQSTGET > 0 and KSQSTWAT > 0;
> > >
> > > The delta between the 2 results will give you information on which
> > > enqueue we see more waits on. Also please run the SQL given below in
> > > a cron every minute or so and output this to a file.
> > >
> > > SELECT p1 "File", p2 "Block", p3 "Reason" FROM v$session_wait
> > > WHERE event like 'buffer busy waits' ;
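One way to schedule that query every minute (a sketch only; the paths, login and script name are hypothetical, and the script would need a trailing exit):

```
# crontab sketch - append the buffer busy waits sample to a log file
* * * * * sqlplus -s system/manager @/tmp/bbw.sql >> /tmp/bbw.log 2>&1
```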
> > >
> > > This will give you an idea about the buffer busy waits. Since there
> > > are some write complete waits, please take a look at where the online
> > > redo logs are placed and what the read/write activity on these disks
> > > is. Also please give me information on the redo log file size. Please
> > > give me the DB version as well. I see some waits on more data from
> > > client. Again, since I don't know for how long the stats were run, I
> > > can't say whether they are significant or not.
> > >
> > > HTH
> > >
> > > Vivek
> > >
> > >
> > >
> > > From: VIVEK_SHARMA <VIVEK_SHARMA_at_infy.com>
> > > Date: Mon, 27 Nov 2000 12:11:59 +0530
> > > Subject: SOS Alert
> > >
> > > CASE In a bank, interest-calculation batch processing unable to go
> > > beyond 500 A/cs per minute.
> > >
> > > CAUSE of the Problem is UNKNOWN
> > >
> > > ORACLE 7.3.4.5 on AIX 4.3.3
> > > DB Server - IBM S80 Model - 12 CPUs , 3 GB RAM
> > > APP Server 1 - IBM S80 Model - 6 CPUs , 2 GB RAM
> > > APP Servers 2 & 3 - IBM Thin Nodes model - 4 CPUs , 1 GB RAM
> > >
> > > Storage Box :-
> > > ===========
> > > SS Class Storage
> > > RAID 0+1 - (First Striped & then Mirrored)
> > > NOTE - 2 storage boxes exist, one being the mirror of the other.
> > > Striping exists across a set of 4 disks (in one box), with another 4
> > > being its mirror (in another box).
> > > Thus a volume group contains 8 disks (4 disks plus their 4 mirrors).
> > > Stripe Size = 4K
> > >
> > > NOTE - Runs tried from BOTH the APP servers 1 OR 2, but with the
> > > SAME MAX of 500 A/cs processed per minute.
> > >
> > > CPU Utilizations on BOTH APP & DB Server = 40 %
> > > wio% on BOTH APP & DB Servers = 35 %
> > > No paging happening on Both APP & DB Servers
> > >
> > > - Oracle contention values seem small to us, as shown below :-
> > >
> > > SVRMGR> Rem System wide wait events for non-background processes (PMON,
> > > SVRMGR> Rem SMON, etc). Times are in hundredths of seconds. Each one of
> > > SVRMGR> Rem these is a context switch which costs CPU time. By looking at
> > > SVRMGR> Rem the Total Time you can often determine what is the bottleneck
> > > SVRMGR> Rem that processes are waiting for. This shows the total time spent
> > > SVRMGR> Rem waiting for a specific event and the average time per wait on
> > > SVRMGR> Rem that event.
> > > SVRMGR> select n1.event "Event Name",
> > >      2>        n1.event_count "Count",
> > >      3>        n1.time_waited "Total Time",
> > >      4>        round(n1.time_waited/n1.event_count, 2) "Avg Time"
> > >      5> from stats$event n1
> > >      6> where n1.event_count > 0
> > >      7> order by n1.time_waited desc;
> > > Event Name                               Count    Total Time      Avg Time
> > > -------------------------------- ------------- ------------- -------------
> > > SQL*Net message from client           10856276      31977110          2.95
> > > enqueue                                   1295        374980        289.56
> > > db file sequential read                3614044        303848           .08
> > > write complete waits                      5812        295937         50.92
> > > latch free                             5045060        242170           .05
> > > SQL*Net more data from client            13939        165275         11.86
> > > log file sync                            12794        146409         11.44
> > > buffer busy waits                       100443         92477           .92
> > >
> > >
> > >
> > > - ALL literal SQLs (dynamic) were converted to using bind variables
> > > (static queries).
> > > - ALL tables running on indexes, without any FULL scans happening.
> > >
> > > - event="10181 trace name context forever, level 1000"
> > > NOTE - Set nevertheless; none of the SQLs taking excessive time to
> > > parse though.
> > >
> > > - NO statistics are analyzed.
> > > - 7 tables involved in the interest calc.:
> > >   inserts to 2 tables
> > >   selects, updates in 5 tables
> > >
> > > Qs. Are there any BASIC O.S./hardware features to enhance performance
> > > on IBM AIX systems ?
> > >
> > > Qs. Is any other portion of the report.txt required for perusal ?
> > >
> > > Please revert to me for any further info / clarifications.
> > >
> > >
> > >
> > > << File: report.txt_Live >> << File: report.txt_TEST >>
>
> --
> Please see the official ORACLE-L FAQ: <http://www.orafaq.com>
> --
> Author: VIVEK_SHARMA
Received on Thu Dec 21 2000 - 15:42:19 CST

