Oracle-L: RE: Houston, do I have a problem?

From: Cary Millsap <cary.millsap_at_hotsos.com>
Date: Sun, 11 Aug 2002 22:38:18 -0800
Message-ID: <F001.004B1ADE.20020811223818@fatcity.com>

This is an interesting report. I think the responses to it are even more interesting. One response admits confusion (which I think is a completely fair reaction). Another zeroes in knowingly on some specific details. If everyone had time to respond, I would expect a rash of differing opinions about what you should do first to fix this system... This kind of game is a fundamental part of using system-wide performance data. (The various ratio problems are just as relevant for system-wide data collected from the "wait interface" as they are from v$sys<anything-else>.)

Don't lose hope if you look at Kirti's note and wonder, "so what's the point?" You cannot see everything that's wrong with a system from a report like this. I think in fact that you can know only two things from a v$system_event report:

If you know the "secret constants" (see www.hotsos.com/dnloads/1/constants), then you can see whether the database is spending heinously longer than "normal systems" at doing things. In this report, I would propose that an average single-block read latency of 9.7 seconds (977.107332 centiseconds), for example, is "heinously longer than normal."
If you know the "secret list" of things that databases should and shouldn't do, then you can see whether a database is doing a lot of things that it "shouldn't be doing." Databases, for example, shouldn't need to wait very often for 'buffer busy waits' waits, 'enqueue' waits, or 'latch free' waits. (Where's the url for *this* secret list? It's so simple that you don't really need one. Database should spend most of their time either idle, providing CPU service, or doing physical I/O. Not much else.)

Sure, knowing these two things is worth something, but it leaves lots of good questions unanswered (*essential* questions, actually):

Even if an Oracle kernel event is consuming "heinously longer-than-normal" elapsed times, or even if it is called "heinously too often," does it really matter? What if the event is called predominantly by unimportant business processes, and the long latencies don't impact anything important? Then you would be wasting your time fixing it (instead of fixing something important first). If you assume an event is important because it's prominent in a system-wide data collection and you then fix a huge performance problem, then you were actually just lucky. It won't happen this way every time.
What if the database is providing the "right kinds of service in the right proportions?" How can you tell whether it's spending more time than it *could* have spent? For example, just because a program spends 90% of its response time on the CPU and 10% on a disk (kind of a "normal, healthy" profile), it is *not* okay if the response time is 10 hours when it should be 6 seconds. It's not the proportions that are important; it's the absolute response time.

So... Is the HDS disk array a problem? Probably. But, it's possible--*likely*, actually--that an analyst could fix all the problems shown here and still have really slow applications. Why? Because several essential-but-slow programs on this system might not spend significant amounts of their response time waiting on any of the top 10 events in this list. We see it pretty often: people fix their system's "worst performance problems" and then find out that their work really didn't make a noticeable end-user impact. (I'm confident that Kirti won't end up in this trap, but that's because I trust him to exercise intuition and experience far beyond the scope of what can be learned from his v$system_event data.)

The "wait interface" is an important tool, because it "finally" (well, since over ten years ago) allows us to see where a program spends its time. But to use that tool to see how a whole system has spent its time since instance startup has the same limitations as any other method that relies upon system-wide aggregated data.

So, what should you look at to avoid performance improvement project ambiguities? Session-level data. Which session? As I mentioned last week, I believe the analyst should focus first upon sessions whose performance improvement would most significantly improve the business. That, in my opinion, is The Big Secret.

Cary Millsap
Hotsos Enterprises, Ltd.
http://www.hotsos.com

Upcoming events:
- Hotsos Clinic, Oct 1-3 San Francisco, Oct 15-17 Dallas, Dec 9-11 Honolulu
- 2003 Hotsos Symposium on OracleR System Performance, Feb 9-12 Dallas - Next event: NCOAUG Training Day, Aug 16 Chicago

-----Original Message-----
Kirti
Sent: Sunday, August 11, 2002 5:43 PM
To: Multiple recipients of list ORACLE-L

This is not a joke.....!!!

This is from a business critical production database that I was asked to 'review' past Friday.

The report is from v$system_event taken at 10:30am, Aug 9, 2002. The server (and database) was bounced on Aug 4, 2002 at 9:20am.

This was the 1st time I was logging into this database.

SQL> /

EVENT                               TOTAL_WAITS TOTAL_TIMEOUTS
TIME_WAITED
AVERAGE_WAIT                                             
----------------------------------- ----------- --------------
-----------
------------                                             
control file parallel write              143933              0
4080356626
28349.0001                                             
db file scattered read                 12540695              0
1.2254E+10
977.107332                                             
buffer busy waits                      10740450             36
8193235928
762.839167                                             
SQL*Net message from client           180769027              0
9.9561E+10
550.761199                                             
db file sequential read               298968127              0
1.1839E+11
395.99129                                             
enqueue                                   13500           6435
2036785
150.872963                                             
SQL*Net more data from client          52227948              0
4093231165
78.3724294                                             
free buffer waits                            16              4
795
49.6875                                             
log file switch completion                  804             43
16263
20.2276119                                             
log buffer space                            977              0
5409
5.53633572                                             
control file single write                    17              0
51
3                                             
db file parallel write                  1749695              0
2935317
1.67761638                                             
db file parallel read                      8149              0
13484
1.65468156                                             
log file single write                      1024              0
701


.684570313                                             

latch free                              2007034        1616763
1054137


.525221297                                             

log file sync                           1366242            560
526049


.385033545                                             

SQL*Net message from dblink             1514480              0
451351


.298023744                                             

log file sequential read                 405415              0
82877


.204425095                                             

SQL*Net break/reset to dblink                10              0
2


.2                                             

log file parallel write                 2025192              7
293332


.144841576                                             

SQL*Net break/reset to client             28113              0
3221


.114573329                                             

db file single write                        320              0
36


.1125                                             

SQL*Net more data from dblink            447044              0
11375


.025444923                                             

SQL*Net more data to client            11770996              0
75680


.006429362                                             

control file sequential read             554851              0
3261


.005877254                                             

SQL*Net more data to dblink                1076              0
5


.00464684                                             

buffer deadlock                            1045           1029
1


.000956938                                             

SQL*Net message to dblink               1514485              0
456


.000301092                                             

SQL*Net message to client             180769119              0
48736


.000269604

29 rows selected.

SQL> Here is the environment:
1)all the file systems for the database, including dump directories are in a
single disk volume group, 2) all redo logs and control files are spread among all the other database files, 3) Hitachi array is in use with nothing
but RAID-5 for all files (redo as well), 4) the real hard drives within the
array are either shared with other databases on the same server or with other servers, 5) redo logs are of 100MB size and switch 20+ times/hour when
some of the batch processes run in the evening, 6) no changes are allowed to
any SQL code, Pro*COBOL code that use 'COPYBOOKs' (Remember those?) to interact with tables at single row level (no array processing) using routines with bunch of parameters (call insert... call update... call delete...), 7) the array has 32GB of NV cache and that's the max it can have
(the DB is 180GB, there are 3 other similar ones from just this server).

Now the 'icing on the cake':
The server has 3 other critical databases. All 4 running in archive log mode. All share the *same* archive log destination. And all databases are
expected to have same amount of batch processing. The archive log destination is 8GB in size on the 2nd VG. The DB in question, generated 1.8GB to 2+GB of logs in less than an hour during batch processing. At times
our automated archived log siphoning process encounters some bottlenecks from our single IBM/Tivoli TSM Server where the logs are deposited before
those are purged from archive log destination...

I was also informed that I will not have much chance to bring about any changes in the environment described above. Because, I was told, ...it is
the corporate decision to use RAID-5 with HDS array and it is 'the most cost
effective way to address our storage needs'.... and a single VG per database
helps UNIX support to implement HACMP with much ease... and we can not meet
our published deadlines if we made any changes and spent time in testing those unscheduled changes...... yadi yadi yada....

Kirti -- Please see the official ORACLE-L FAQ: http://www.orafaq.com -- Author: Deshpande, Kirti INET: kirti.deshpande_at_verizon.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------

To REMOVE yourself from this mailing list, send an E-Mail message to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in the message BODY, include a line containing: UNSUB ORACLE-L (or the name of mailing list you want to be removed from). You may also send the HELP command for other information (like subscribing).

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Cary Millsap
  INET: cary.millsap_at_hotsos.com

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Received on Mon Aug 12 2002 - 01:38:18 CDT