
RE: SOS Alert

From: VIVEK_SHARMA <VIVEK_SHARMA_at_infy.com>
Date: Wed, 29 Nov 2000 20:16:27 +0530
Message-Id: <10695.123189@fatcity.com>


> My replies follow your points below.
>
> -----Original Message-----
> From: Vivek [SMTP:Vivek_at_1800FLOWERS.com]
> Sent: Wednesday, November 29, 2000 11:00 AM
> To: 'VIVEK_SHARMA'
> Cc: Arun Chakrapani
> Subject: RE: SOS Alert
>
> Hi Vivek:
>
> My mistake in assuming the batch process was running on the server
> itself. Yes, that would explain a lot of things in terms of both
> servers topping out at 500 per minute. We had some issues like this
> at NY when we had to transfer some archivelogs for a hot standby: we
> found that we were using the 10 Mbps link, and after a lot of
> pestering the network guys did something and it came back to normal.
> Only some time later did we find we had been using the wrong link.
> But specifying the IP instead of the hostname should not make a
> difference unless the IP corresponding to the host in /etc/hosts is
> different. I hope you have 2 NIC cards in your server and that they
> are connected to the network at the same bandwidth.
>
> Qs. Lost you. What does "2 NIC cards in the server being connected to
> the network with the same bandwidth" mean?
>
> Qs. In such a setup, with 3 APP & 1 DB server (2 IBM S80 machines &
> 2 Thin Nodes), what link bandwidth would you consider advisable?
>
>
> Allow me to say that the shared_pool_size has, by experience, been
> kept at the bare minimum required, for we did experience performance
> problems with a shared_pool_size of 300 MB (though on another
> machine, on SunOS). It is recommended to keep shared_pool_size at
> rock bottom, just above the limit where ORA-4031 (out of shared pool
> memory) occurs.
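>
> (For reference, a minimal sketch of one way to watch how close you
> are to that limit; v$sgastat is available in 7.3:)
>
> select name, bytes
> from v$sgastat
> where name = 'free memory';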
>
>
> The shared pool problem could have been caused by fragmentation. This
> happens when there is a lot of literal SQL (little use of bind
> variables) and frequent reloads and invalidations. It also happens
> when you use lots of cursors and the cursor bodies are aged out and
> back in frequently. We flush the shared pool once a night (midnight)
> as part of a cron job, even though we are on a higher version.
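>
> (The nightly flush itself is a one-liner, run as a DBA user; the
> cron job is assumed to wrap it in svrmgrl or a similar client:)
>
> alter system flush shared_pool;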
>
> Please look at v$db_object_cache to identify the objects that need to
> be pinned. You can load the objects immediately after the DB comes
> up, but you can even pin them now; if the DB has been up for some
> time you will also be able to identify the objects that need pinning.
>
> Qs. Which field(s)/values in v$db_object_cache indicate that an
> object requires pinning?
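>
> (A sketch of one such query; the thresholds are only illustrative.
> Frequently loaded or executed objects that are not yet kept are the
> usual pinning candidates:)
>
> select owner, name, type, sharable_mem, loads, executions, kept
> from v$db_object_cache
> where kept = 'NO'
> and (loads > 10 or executions > 1000)
> order by loads desc;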
>
>
> This can be done every time the DB is bounced. Also, I don't know
> whether you use sequences. If you do, please cache enough based on
> your usage, and pin them as well.
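>
> (A sketch, with a hypothetical sequence name; 'Q' is the
> dbms_shared_pool flag for sequences:)
>
> alter sequence app.txn_seq cache 1000;
> execute dbms_shared_pool.keep('APP.TXN_SEQ', 'Q');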
>
> NOTE - NO FULL scans are occurring, hence caching of any table may
> not be useful.
>
>
>
> We are on SunOS (3500 users, 120 GB OLTP system with 200 tps; we also
> have a 700 GB DW here). Our shared pool in OLTP is 300M. We pin very
> aggressively, PL/SQL all over the place. The get hit ratio will be
> terribly low if you are running lots of unreused SQL. You have to
> isolate them one by one; a good starting point is pinning SYS
> objects. We have some home-grown SQL for finding out a whole bunch of
> things. There are a couple of us here (another gentleman named Arun
> Chakrapani, from Bangalore, who does research on Oracle as well).
>
> NOTE - The banking application product is in Pro*C.
> Qs. Could you possibly pass on any home-grown SQL that could help us
> find the same?
>
>
> Regarding ASYNC_IO and multiple DB writers: no, I wouldn't recommend
> multiple DB writer processes under 7.3.4 versus ASYNC_IO. One thing
> which you have not mentioned (or I overlooked) is whether you are on
> raw or cooked filesystems. If you are on cooked filesystems, are you
> using any filesystem software like Veritas (VxFS) or something else,
> or good old UFS? You would want to make sure true asynchronous I/O is
> supported by IBM on cooked filesystems (if you are on one).
>
> Ans. We are using cooked filesystems (JFS). The storage box is using
> a cluster belonging to IBM itself.
> One of the APP servers (also an S80 machine) will function as the
> failover server in case the DB server fails.
>
>
> According to my recollection (2 years old), IBM was not supporting
> true async I/O; they were simulating it. I think you should have a
> command called truss (on SunOS it is used to debug a process's system
> calls). At least on Sun, async I/O on cooked filesystems is a
> simulation via the libaio libraries. If IBM doesn't support it, or if
> truss or some such command shows that async I/O is not happening, I
> would definitely disable ASYNC_IO and enable multiple DB writer
> processes.
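>
> (A sketch of the Solaris-style check; the exact tool and system call
> names on AIX may differ:)
>
> truss -c -p <pid of DBWR>   # count system calls for a while, then ^C
> # kaio()/aioread()/aiowrite() in the counts suggests true kernel
> # async I/O; plain read()/write() suggests it is being simulated.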
>
> Qs. Would there be any concern if BOTH async_io & multiple DB writers
> are enabled?
>
>
> Regarding the stripe size: as I had mentioned, the stripe size is
> usually 64K or 128K in OLTP DBs. With an 8K block and a multiblock
> read count of 16, the data would be read from disk in one read access
> for a db file scattered read.
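>
> (The arithmetic behind that claim, as a worked example:
> db_block_size x db_file_multiblock_read_count = 8K x 16 = 128K,
> so with a 128K stripe one multiblock read maps onto a single stripe
> unit, i.e. one disk I/O; with a 4K stripe the same read would touch
> 32 stripe units.)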
>
> Qs. Lost you. Could you explain your point above in some more detail?
> NOTE - As of now, both the OS block size & stripe size are 4K.
> Also, the SA claims that a 4K OS block size is the MAX possible on
> IBM.
>
>
> Currently I am working more on Sun Solaris systems. I should be able
> to give more specifics on the minute details of this one, but let me
> go and search some of my notes first.
>
> To be brutally honest with you, I seldom look at the report.txt file;
> I am more into the finer details, digging and digging forever. But I
> will definitely take a closer look in a couple of days and get back
> to you. Please send me a report.txt taken during the peak of your
> processing timings.
>
> Attached 2 report.txt Files for your perusal
>
> << File: report.txt_test_28Nov00_1 >> << File:
> report.txt_test_IDTBackonRaid_28Nov00_2 >>
>
>
>
> Thanks
>
> Vivek
>
>
> > -----Original Message-----
> > From: VIVEK_SHARMA [SMTP:VIVEK_SHARMA_at_infy.com]
> > Sent: Tuesday, November 28, 2000 9:15 PM
> > To: 'Vivek'
> > Cc: 'ArunC_at_1800FLOWERS.com'
> > Subject: RE: SOS Alert
> >
> >
> > Allow me to begin :-
> > There were 2 routes between the APP & DB servers. One had a
> > bandwidth of 10 Mbps while the other had a bandwidth of 100 Mbps.
> > The 10 Mbps route was being used all this while. On enabling the
> > 100 Mbps route (by specifying the IP address instead of the
> > hostname in the tnsnames.ora & listener.ora), performance has
> > jumped to 1100 records per minute (from 500 records per minute),
> > i.e. almost doubled.
> > 100 Mbps is the MAX possible bandwidth, as per the SA.
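> >
> > (A sketch of the tnsnames.ora change, with hypothetical alias, SID
> > and address:)
> >
> > BANKDB =
> >   (DESCRIPTION =
> >     (ADDRESS = (PROTOCOL = TCP)(HOST = 10.1.1.5)(PORT = 1521))
> >     (CONNECT_DATA = (SID = BANK))
> >   )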
> >
> > Qs. Your views on this please.
> >
> > Additional responses follow your views BELOW :-
> >
> > > -----Original Message-----
> > > From: Vivek [SMTP:Vivek_at_1800FLOWERS.com]
> > > Sent: Tuesday, November 28, 2000 6:30 PM
> > > To: 'VIVEK_SHARMA'
> > > Cc: Arun Chakrapani
> > > Subject: RE: SOS Alert
> > >
> > > Vivek:
> > >
> > > Sorry for not replying to you yesterday itself; a busy day. Took
> > > a look at your report.txt files. They showed 2 different
> > > profiles. Since you had mentioned that it also stalled in the
> > > test DB, I concentrated more on the test report.txt. One thing
> > > which stood out on opening the other one (live) is that the
> > > SQL_AREA GET_HIT_RATIO is very poor. I don't know whether you
> > > have already pinned your SYS objects and the other often-executed
> > > objects in the shared pool. Please use DBMS_SHARED_POOL.KEEP to
> > > pin the packages, procedures, functions, sequences, cursors and
> > > triggers that are used often.
> > >
> > Ans. OK, will do so.
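> >
> > (A sketch of such a pin; the application package name is
> > hypothetical, and DBMS_SHARED_POOL is created by dbmspool.sql if
> > it is not already installed:)
> >
> > execute dbms_shared_pool.keep('SYS.STANDARD', 'P');
> > execute dbms_shared_pool.keep('APP.INTEREST_CALC_PKG', 'P');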
> >
> > > So now I will concentrate on the test DB.
> > >
> > > 1. Even though the SQL_AREA GET_HIT_RATIO is higher (82%) when
> > > compared to live, this is not good enough. It should be close to
> > > 100% (something like 0.98 at least). Please pin the SYS objects
> > > to start with.
> > >
> > OK.
> > Allow me to say that the shared_pool_size has, by experience, been
> > kept at the bare minimum required, for we did experience
> > performance problems with a shared_pool_size of 300 MB (though on
> > another machine, on SunOS). It is recommended to keep
> > shared_pool_size at rock bottom, just above the limit where
> > ORA-4031 (out of shared pool memory) occurs.
> >
> >
> > > 2. On seeing your system-wide wait events, the following stand
> > > out:
> > >
> > >    1. log file sync            2863    83922   29.31
> > >    2. write complete waits      905    71401   78.9
> > >    3. free buffer waits          60     4718   78.63
> > >
> > > The first event, log file sync, mostly comes when there is a
> > > disk I/O bottleneck for the online redo logs. This happens when
> > > processes have to wait for the LGWR to sync a particular redo
> > > log. The p1 parameter of a log file sync wait is the
> > > corresponding log buffer.
> > >
> > >
> > > The write complete waits come when the user process is waiting
> > > for the LGWR to complete the writing of the redo information. So
> > > please take a look at the placement of the online redo logs: on
> > > which disk have you placed them? Even if they have been placed in
> > > their own filesystem, that doesn't mean they are not going to a
> > > busy disk. Since you have 0+1, the slices may belong to a HOT
> > > disk. What you need to do is ask your SA to monitor the disk I/O
> > > usage and see what's happening to the disk containing the redo
> > > log files. You can also use iostat -xPnce 5 5 (I think it's
> > > available in AIX). The output will look something like:
> > >
> > >     us sy wt id
> > >     23  7 49 21
> > >                  extended device statistics ----      errors ---
> > >  r/s  w/s  kr/s  kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
> > >  0.0  0.0   0.0   0.0  0.0  0.0    0.0   13.4   0   0   0   0   0   0 c1t10d0s0
> > >  0.0  0.0   0.0   0.0  0.0  0.0    0.0    2.4   0   0   0   0   0   0 c1t10d0s1
> > >  0.0  0.0   0.0   0.0  0.0  0.0    0.0    0.0   0   0   0   0   0   0 c1t10d0s2
> > >  0.0  1.3   0.3  10.1  0.0  0.1    4.6  105.8   0   1   0   0   0   0 c1t10d0s3
> > > Watch out for the %b column. If it's somewhere around 60 then you
> > > may be experiencing contention on those slices. Give your SA the
> > > device name, and they will be able to tell you what the FS is.
> > >
> > NOTE - 4 datafiles containing a particular table & its
> > corresponding indexes (to which INSERTs happen) were moved out of
> > the RAID onto a local disk, whereafter performance jumped from 850
> > per minute to 1100 per minute.
> >
> > Ans - OK. Will issue the command.
> >
> > > Also I see free buffer waits. This is caused because there are no
> > > free buffers when they are requested, pointing directly to the
> > > inefficiency of the DBWR. Again, the DBWR may be slow because you
> > > have an I/O bottleneck for the DB files, and indirectly because
> > > of the write complete waits.
> > >
> > Qs. Would you recommend setting multiple DB writer processes even
> > though async_io is set to true?
> >
> >
> > > " Thus a Volume Group Contains 8 Disks with 4 Disks
> > > > Stripe Size = 4K "
> > >
> > >
> > > [Vivek] Please confirm me that the Stripe size is 4K. Usually
> you will
> > > have
> > > a stripe size of 32K,64K,128K,256K .... If you are having a
> stripe size
> > of
> > > 4K and your db_block_size as 8K, It is going to take 2 reads to
> satisfy
> > > one
> > > request for a db_file_sequential_read (Usually requests 1 Oracle
> block)
> > > [Vivek] . This doesn't sound good if thats true.
> > >
> > NOTE - The "test" Database is a 4K DB_BLOCK_SIZE while the
> Live
> > Database is
> > an 8K DB_BLOCK_SIZE , But the SAME performance Limits were
> Observed
> > on Both
> >
> > > So if possible, could you please do the following:
> > >
> > > 1. Please shut down the test DB and start fresh.
> > > 2. Run the same process again, and also run iostat -xPnce 60
> > > 10000 for the duration of the run (output this to a file).
> > > 3. Run vmstat and see what the CPU usage is for USER and SYSTEM.
> > > 4. When the 500 A/cs per minute ceiling is reached, please run
> > >
> > > set lines 132
> > > select sid, substr(event,1,40), p1, p2, p3
> > > from v$session_wait
> > > where event in ('db file scattered read','db file sequential read',
> > > 'latch free','buffer busy waits','log file sync','enqueue');
> > >
> > > and give me the output.
> > >
> > OK - Will Action & revert to you
> >
> > > 5. If many waits are showing as "latch free" in the event column
> > > of v$session_wait, please query v$latch for latch#, name, level,
> > > gets, misses, sleeps, immediate_gets, waits_holding_latch,
> > > spin_gets, sleep1.
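> > >
> > > (A sketch of that query; note the column is named LEVEL# in
> > > v$latch:)
> > >
> > > select latch#, name, level#, gets, misses, sleeps,
> > >        immediate_gets, waits_holding_latch, spin_gets, sleep1
> > > from v$latch
> > > order by sleeps desc;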
> > >
> > Ans OK
> >
> > > 6. Preserve the output for v$latch_children.
> > >
> > Ans OK
> >
> > Qs. Are there any IBM OS-specific parameters which are considered
> > beneficial for performance?
> >
> >
> > > I will take a look at both report.txt files in more detail today
> > > and get back to you.
> > >
> > Thanks again so much.
> > Where are you located / phone no., if you please?
> > We are currently located in Bangalore & addressing the problem at
> > Mumbai over a WAN.
> > Phone 91 80 6588668
> >
> > > HTH
> > >
> > > Vivek
> > >
> > > > -----Original Message-----
> > > > From: VIVEK_SHARMA [SMTP:VIVEK_SHARMA_at_infy.com]
> > > > Sent: Monday, November 27, 2000 11:16 AM
> > > > To: 'Vivek'
> > > > Cc: Arun Chakrapani
> > > > Subject: RE: SOS Alert
> > > >
> > > > THANKS so much for the assistance with this issue.
> > > >
> > > > Attached are 2 report.txt files, each taken over approx 1 hour.
> > > > Is that sufficient?
> > > >
> > > > Attached: report.txt of the problem on the "Live" database
> > > >
> > > > <<report.txt_Live>>
> > > > NOTE - A "test" Database Created on the DB Server Itself in
> the Same
> > > > Partitions as the "Live" Database
> > > > Runs Done on this "test" Database Also failing to go beyond
> 500 A/cs
> > per
> > > > minute
> > > >
> > > >
> > > > <<report.txt_TEST>>
> > > >
> > > > NOTE - Interest runs were done on the "test" database during
> > > > down-time periods of the "Live" database.
> > > >
> > > > You wrote :-
> > > > Since there are some write complete waits, please take a look
> > > > at where the online redo logs are placed and what the
> > > > read/write rates on those disks are.
> > > >
> > > > Ans. Online redo logs are ALSO present on the RAID 0+1 disks,
> > > > and are multiplexed.
> > > > Size of the online redo logfiles on the "Live" database = 30 MB
> > > >
> > > > NOTE - On the test database we increased the size of the online
> > > > redo logfiles to 100 MB, but with NO benefit.
> > > >
> > > > You wrote :-
> > > > Please give me the DB version as well.
> > > >
> > > > Ans. Database Oracle version = 7.3.4.5.0
> > > >
> > > > You wrote :-
> > > > I see some waits on more data from client.
> > > > Qs. What is to be done about the same?
> > > >
> > > > NOTE - Will run the 2 SQL scripts suggested by you & revert.
> > > >
> > > > Thanks Again
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Vivek [SMTP:Vivek_at_1800FLOWERS.com]
> > > > > Sent: Monday, November 27, 2000 7:44 PM
> > > > > To: 'VIVEK_SHARMA_at_infy.com'
> > > > >
> > > > > Cc: Arun Chakrapani
> > > > > Subject: SOS Alert
> > > > >
> > > > > Hi Vivek:
> > > > >
> > > > > Will it be possible to send the whole report.txt to me?
> > > > > Since you have not given for how long it was taken (the time
> > > > > between utlbstat and utlestat), it is difficult to say
> > > > > anything. Based on the report.txt snippet you have given, it
> > > > > seems to me there are some issues with enqueue and latch free
> > > > > waits. Please run the following SQL as SYS at 2 different
> > > > > times (say 1/2 hour apart) and send me both outputs (it would
> > > > > be great if you could run this while the batch processing is
> > > > > running).
> > > > >
> > > > > select ksqsttyp, ksqstget, ksqstwat
> > > > > from x$ksqst
> > > > > where ksqstget > 0 and ksqstwat > 0;
> > > > >
> > > > > The delta between the 2 results will tell you which enqueue
> > > > > we are seeing the most waits on. Also, please run the SQL
> > > > > given below from a cron job every minute or so, and output it
> > > > > to a file:
> > > > >
> > > > > SELECT p1 "File", p2 "Block", p3 "Reason" FROM
> v$session_wait WHERE
> > > > event
> > > > > like 'buffer busy waits' ;
> > > > >
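> > > > > (A sketch of such a cron entry; the path and wrapper script
> > > > > are hypothetical:)
> > > > >
> > > > > * * * * * /u01/scripts/bbw.sh
> > > > >
> > > > > #!/bin/sh
> > > > > # bbw.sh: append buffer busy waits detail to a log file
> > > > > sqlplus -s / <<'EOF' >> /tmp/bbw.log
> > > > > SELECT p1 "File", p2 "Block", p3 "Reason"
> > > > > FROM v$session_wait WHERE event like 'buffer busy waits';
> > > > > EOF
> > > > >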
> > > > > This will give you an idea about the buffer busy waits.
> > > > > Since there are some write complete waits, please take a look
> > > > > at where the online redo logs are placed and what the
> > > > > read/write rates on those disks are. Also please give me
> > > > > information on the redo log file size, and please give me the
> > > > > DB version as well. I see some waits on more data from
> > > > > client; again, since I don't know for how long the stats were
> > > > > run, I can't say whether they are significant or not.
> > > > >
> > > > > HTH
> > > > >
> > > > > Vivek
> > > > >
> > > > >
> > > > >
> > > > > From: VIVEK_SHARMA <VIVEK_SHARMA_at_infy.com>
> > > > > Date: Mon, 27 Nov 2000 12:11:59 +0530
> > > > > Subject: SOS Alert
> > > > >
> > > > > CASE: In a bank, interest calculation batch processing is
> > > > > unable to go beyond 500 A/cs per minute.
> > > > >
> > > > > CAUSE of the Problem is UNKNOWN
> > > > >
> > > > > ORACLE 7.3.4.5 on AIX 4.3.3
> > > > > DB Server - IBM S80 Model - 12 CPUs , 3 GB RAM
> > > > > APP Server 1 - IBM S80 Model - 6 CPUs , 2 GB RAM
> > > > > APP Servers 2 & 3 - IBM Thin Nodes model - 4 CPUs , 1 GB RAM
> > > > >
> > > > > Storage Box :-
> > > > > ===========
> > > > > SS Class Storage
> > > > > RAID 0+1 (first striped & then mirrored)
> > > > > NOTE - 2 storage boxes exist, one being the mirror of the
> > > > > other. Striping exists across a set of 4 disks (in one box),
> > > > > with another 4 (in the other box) being its mirror.
> > > > > Thus a volume group contains 8 disks, striped across 4.
> > > > > Stripe Size = 4K
> > > > >
> > > > > NOTE - Runs were tried from BOTH APP servers 1 and 2, but
> > > > > with the SAME MAX of 500 A/cs processed per minute.
> > > > >
> > > > > CPU utilization on BOTH APP & DB servers = 40%
> > > > > wio% on BOTH APP & DB servers = 35%
> > > > > No paging happening on either the APP or DB servers
> > > > >
> > > > > - Oracle contention values seem small to us, as shown
> > > > > below :-
> > > > >
> > > > > SVRMGR> Rem System wide wait events for non-background processes (PMON,
> > > > > SVRMGR> Rem SMON, etc). Times are in hundreths of seconds. Each one of
> > > > > SVRMGR> Rem these is a context switch which costs CPU time. By looking at
> > > > > SVRMGR> Rem the Total Time you can often determine what is the bottleneck
> > > > > SVRMGR> Rem that processes are waiting for. This shows the total time spent
> > > > > SVRMGR> Rem waiting for a specific event and the average time per wait on
> > > > > SVRMGR> Rem that event.
> > > > > SVRMGR> select n1.event "Event Name",
> > > > >      2>        n1.event_count "Count",
> > > > >      3>        n1.time_waited "Total Time",
> > > > >      4>        round(n1.time_waited/n1.event_count, 2) "Avg Time"
> > > > >      5> from stats$event n1
> > > > >      6> where n1.event_count > 0
> > > > >      7> order by n1.time_waited desc;
> > > > > Event Name                          Count    Total Time    Avg Time
> > > > > -------------------------------- ------------- ------------- ----------
> > > > > SQL*Net message from client         10856276      31977110        2.95
> > > > > enqueue                                 1295        374980      289.56
> > > > > db file sequential read              3614044        303848         .08
> > > > > write complete waits                    5812        295937       50.92
> > > > > latch free                           5045060        242170         .05
> > > > > SQL*Net more data from client          13939        165275       11.86
> > > > > log file sync                          12794        146409       11.44
> > > > > buffer busy waits                     100443         92477         .92
> > > > >
> > > > >
> > > > >
> > > > > - ALL literal (dynamic) SQLs were converted to using bind
> > > > >   variables (static queries)
> > > > > - ALL tables are accessed via indexes, without any FULL scans
> > > > >   happening
> > > > >
> > > > > - event="10181 trace name context forever, level 1000"
> > > > > NOTE - Set nevertheless; none of the SQLs taking Excessive
Received on Wed Nov 29 2000 - 08:46:27 CST
