
Is it really necessary to reduce ITL Waits to near zero?

From: Taft, David <TaftD_at_saic-dc.com>
Date: Tue, 8 May 2007 14:53:04 -0400
Message-ID: <DCE76463749C64499892A0DB3AF05AC60EF45FFD@challenger.vta.saic-dc.com>


All,

Recently I had to investigate whether the frequent deadlocks in one database could be due to insufficient ITL slots. It was determined that none of the deadlocks were due to ITL Waits. However, since there are ITL Waits in the database, this raised the question of whether they are a problem worth fixing. I tried to figure out what an acceptable number of ITL Waits is before a table or index should be reorged/rebuilt with a larger INITRANS value. I couldn't find any hard answers on the net, so I decided to do some analysis using the YAPP (Yet Another Performance Profiling) method, i.e. response time = service time + wait time. The final analysis was that eliminating all ITL Waits would only have reduced response time by 0.03%, which doesn't justify reorging tables or rebuilding indexes.

The system is 9.2.0.7 with RAC on AIX5L and HACMP 5.2. What follows is the breakdown of that analysis. My main concern is whether the analysis is sound. I think it is, but would appreciate a second opinion before submitting my report. Thanks.

David Taft

Begin analysis:

Checked the lmd trace file. No deadlocks were due to insufficient ITL slots, i.e. no TX mode-4 (share) lock requests; all were due to TX mode-5 locks. Example:

As an FYI, I have seen deadlocks on this system where both TX mode 3 and TX mode 5 were involved. I mention this because I have seen threads where others wondered whether the LMD trace ever lists anything other than mode 5.

Since no deadlocks were due to insufficient ITL slots, on to the question of whether the ITL Waits that do exist are enough to justify reorging tables and/or rebuilding indexes.

First I searched the alert log to find a time period with multiple deadlocks:

  # $1: path to the alert log.  For each "Deadlock detected" entry,
  # print the timestamp line immediately preceding it, reformatted
  # as ddMonyyyy.hh:mi:ss.
  alert_log=$1
  lines=`grep -n "Deadlock detected" $alert_log | awk -F: '{print $1}'`
  for line in $lines
  do
    beforeline=`expr $line - 1`
    datestamp=`sed -n "${beforeline}p" $alert_log | awk '{print $3 $2 $5"."$4}'`
    echo $datestamp
  done

  30Apr2007.09:02:09
  30Apr2007.09:02:21
  30Apr2007.09:18:02
  30Apr2007.09:18:14
  30Apr2007.09:19:16

There were five between the hours of 9am and 10am. This system is set up to take a level 7 statspack snapshot at the top of every hour, so I pulled the report covering that time period and broke down the response time as follows:

  CPU used by this session: 240937

Convert from centiseconds to seconds to get Service Time.

  Service Time: 240937/100 = 2409

Note, the top five timed events are reported in seconds, while "CPU used by this session" appears to be in centiseconds.
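
For what it's worth, the statistic behind that line can be pulled straight from v$sysstat (statspack just reports the delta between the two snapshots). A quick check, as a sketch run from a privileged account (the "/ as sysdba" login is only a placeholder for whatever privileged connection you use):

  sqlplus -s "/ as sysdba" <<'EOF'
  select name, value
    from v$sysstat
   where name = 'CPU used by this session';
  EOF

The value returned is cumulative since instance startup, in centiseconds.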

To estimate the total wait time, divide the "Time (s)" by the "% Total Ela Time" for the top event in the "Top 5 Wait Events" section.

  Event                                             Waits    Time (s)  % Total Ela Time

  Wait Time: 30837 / .6855 = 44985

Add "Wait Time" to "Service Time" to estimate "Response Time".

  Response Time: 2409 + 44985 = 47394

Subtract "Wait Time" from "Response Time", to estimate "CPU Time".

  CPU Time: 47394 - 44985 = 2409

Not sure why the above step was necessary, since it matches Service Time, but I seem to recall that was not always the case when using the YAPP method with 8i databases. I think 8i reported the top 5 wait events in centiseconds rather than seconds, so some sort of rounding during the conversion may account for the difference.

Calculate the percentage of "CPU Time" and "Wait Time" by dividing the value of each by "Response Time".

  "%CPU Time": 2409 / 47394 = .0508 ( 5.08%)   "%Wait Time": 44985 / 47394 = .9491 (94.91%)

Note, 95% of total response time was due to waits. This system is heavy on IO waits and the application needs to reduce its overall IO.
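
For anyone wanting to double-check me, here is the response time arithmetic so far as a small awk sketch (the three inputs are the report values used above; the output should match the figures in this post, give or take rounding):

  # report inputs
  cpu_cs=240937        # "CPU used by this session" (centiseconds)
  top_wait_s=30837     # top wait event "Time (s)"
  top_pct=.6855        # its "% Total Ela Time" as a fraction

  awk -v c="$cpu_cs" -v w="$top_wait_s" -v p="$top_pct" 'BEGIN {
    service  = c / 100             # centiseconds -> seconds
    wait     = w / p               # extrapolate total wait time
    response = service + wait
    printf "Service Time: %.0f s\n", service
    printf "Wait Time:    %.0f s\n", wait
    printf "Response:     %.0f s\n", response
    printf "%%CPU Time:    %.2f%%\n", 100 * service / response
    printf "%%Wait Time:   %.2f%%\n", 100 * wait / response
  }'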

Divide the enqueue time by "Response Time" to get estimated percent of enqueue time.

                                                       Total Wait   wait   Waits
  Event                             Waits   Timeouts     Time (s)   (ms)    /txn

  %Enqueue Time: 721 / 47394 = .0152 (1.52%)
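
As a cross-check on the statspack numbers, 9.2 also exposes since-startup totals per enqueue type in v$enqueue_stat. A quick look at the TX enqueues (a sketch, per instance; run from a privileged account):

  sqlplus -s "/ as sysdba" <<'EOF'
  select eq_type, total_req#, total_wait#, cum_wait_time
    from v$enqueue_stat
   where eq_type = 'TX';
  EOF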

Enqueues account for only 1.52% of the total response time. Some tiny percentage of those enqueue waits were due to ITL Waits in TX mode-4. OK, so this is where I begin to reduce the analysis to the ridiculous.

Estimate the total ITL Waits by dividing the reported "ITL Waits" by the %Total for the top segment in the "Top 5 ITL Waits per Segment" section.

                                             Subobject  Obj.           ITL
  Owner      Tablespace Object Name          Name       Type         Waits  %Total

  ITL Waits: 26 / .0903 = 288
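
Rather than extrapolating from the top five, the per-segment counts that statspack samples can also be read directly from v$segment_statistics (values are cumulative since instance startup, and per-instance in RAC). A sketch:

  sqlplus -s "/ as sysdba" <<'EOF'
  select owner, object_name, subobject_name, object_type, value itl_waits
    from v$segment_statistics
   where statistic_name = 'ITL waits'
     and value > 0
   order by value desc;
  EOF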

Average the time that each enqueue wait represents, i.e. total enqueue wait time in seconds divided by the total number of enqueue waits.

  Average Enqueue Time: 721 / 13371 = .054 seconds per wait

Multiply the "ITL Waits" by the "Average Enqueue Time" to get the ITL wait time.

  ITL Wait Time: 288 * .054 = 15.552 (s)

Divide "ITL Wait Time" by "Response Time" to get percent of ITL waits as part of the overall response time.

  %ITL Wait Time: 15.552 / 47394 = .0003 (0.03%)
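
And the last three steps chained together as an awk sketch, again using the figures above (the result should come out at roughly 0.03%):

  awk 'BEGIN {
    itl_waits = 26 / .0903            # extrapolated total ITL Waits (~288)
    avg_enq   = 721 / 13371           # average seconds per enqueue wait (~.054)
    itl_secs  = itl_waits * avg_enq   # estimated ITL wait time (~15.5 s)
    printf "%%ITL Wait Time: %.2f%%\n", 100 * itl_secs / 47394
  }'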

So if I had eliminated all ITL Waits, I would have reduced the average response time by only 0.03%, with no perceptible change. In the absence of an extended SQL trace showing that ITL Waits were the largest contributor to some particular job's or report's response time, reorging or rebuilding is not justified.

End Analysis.

--
http://www.freelists.org/webpage/oracle-l
Received on Tue May 08 2007 - 13:53:04 CDT
