Re: Deadlock ITL Waits

From: kyle Hailey <kylelf_at_gmail.com>
Date: Thu, 21 Jul 2011 08:43:56 -0700
Message-ID: <CADsdiQh49pn1CcgLJjB+OnN97+FQQSTCqAY8CKBiVZZhA_sk5Q_at_mail.gmail.com>



Yes, bad log write times:
Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s

>1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- -----
-----
log file parallel write      38K  72.5  15.4   5.4   2.0    .8    .4   1.3
2.2

You've got over 2% over a second! That's pretty bad. There is something wrong with your I/O subsystem.

I did some lgwr I/O write time distribution graphs on a couple of systems here

http://dboptimizer.com/2011/07/14/lgwr-redo-write-times-log-file-parallel-write/

Even at 6MB log writes , I was still getting under 300ms over NFS mounted file systems .
Typically log writes are much smaller and on the order of 3ms.

On Thu, Jul 21, 2011 at 12:16 AM, Stalin <stalinsk_at_gmail.com> wrote:

> Hah, totally forgot about event_histograms :)
>
> i think either the storage is pushed to its limits or something terribly
> wrong on the storage side.
>
> Wait Event Histogram DB/Inst: NBBSDB/nbbsdb Snaps: 11203-11204
> -> Total Waits - units: K is 1000, M is 1000000, G is 1000000000
> -> % of Waits - column heading: <=1s is truly <1024ms, >1s is truly
> >=1024ms
> -> % of Waits - value: .0 indicates value was <.05%, null is truly 0
> -> Ordered by Event (idle events last)
>
> Total ----------------- % of Waits
> ------------------
> Event Waits <1ms <2ms <4ms <8ms <16ms <32ms <=1s
> >1s
> -------------------------- ----- ----- ----- ----- ----- ----- ----- -----
> -----
> ADR block file read 13 69.2 15.4 7.7 7.7
> ADR block file write 5 80.0 20.0
> ADR file lock 6 100.0
> ARCH wait for archivelog l 8 100.0
> Data Pump slave startup 1 100.0
> Datapump dump file I/O 18 72.2 27.8
> Disk file operations I/O 1567 92.5 .1 .3 .1 .1 2.1
> 4.9
> LGWR wait for redo copy 562 98.9 .9 .2
> Log archive I/O 2 100.0
> PL/SQL lock timer 28 3.6 3.6 92.9
> Parameter File I/O 104 76.0 9.6 1.9 1.9 1.9 3.8 4.8
> SGA: allocation forcing co 19 5.3 94.7
> SQL*Net break/reset to cli 56 94.6 3.6 1.8
> SQL*Net more data to clien 3818 91.9 .0 8.1
> Streams AQ: enqueue blocke 1
> 100.0
> asynch descriptor resize 269 100.0
> buffer busy waits 7556 87.7 .8 .3 .2 .2 .2 2.5
> 8.2
> buffer deadlock 212 100.0
> buffer exterminate 92K .0 .0 .0 99.1 .5 .4
> change tracking file synch 58 17.2 13.8 10.3 1.7 43.1
> 13.8
> change tracking file synch 58 39.7 3.4 5.2 43.1
> 8.6
> control file parallel writ 1474 32.2 6.8 2.0 1.6 2.1 2.8 44.1
> 8.3
> control file sequential re 8078 55.9 .8 1.1 3.6 2.0 1.6 30.0
> 5.0
> cursor: mutex S 7 100.0
> cursor: pin S 1363 29.2 7.7 7.9 2.4 2.5 6.4 43.9
> cursor: pin S wait on X 378 9.8 6.6 82.3
> 1.3
> db file parallel read 7 14.3 14.3 28.6
> 42.9
> db file parallel write 276K 13.5 7.9 11.8 13.2 12.4 10.8 30.0
> .3
> db file scattered read 32K 11.3 2.7 10.2 20.2 44.3 8.9 2.3
> .0
> db file sequential read 308K 30.2 .9 2.3 4.1 1.6 .6 38.6
> 21.7
> db file single write 5 60.0 40.0
> direct path read 143 31.5 5.6 17.5 30.1 6.3 .7 7.0
> 1.4
> direct path read temp 9032 7.8 5.9 14.2 28.2 10.1 2.0 25.5
> 6.2
> direct path sync 2 100.0
> direct path write 4 25.0 25.0 50.0
> direct path write temp 3441 2.2 2.6 6.4 22.2 58.8 6.2 1.6
> .1
> enq: CF - contention 22 4.5 22.7
> 72.7
> enq: CR - block range reus 6 83.3 16.7
> enq: FB - contention 2 100.0
> enq: HW - contention 17 35.3 5.9 11.8 5.9 5.9
> 35.3
> enq: KO - fast object chec 1 100.0
> enq: RO - fast object reus 7 42.9 28.6 14.3 14.3
> enq: SQ - contention 11 27.3 72.7
> enq: TQ - DDL contention 1 100.0
> enq: TQ - TM contention 1 100.0
> enq: TX - allocate ITL ent 1716 .9 .1 .2 .6 1.9 2.2 30.7
> 63.4
> enq: TX - contention 1 100.0
> enq: TX - index contention 592 37.2 6.6 .2 .3 1.7 3.0 38.0
> 13.0
> enq: TX - row lock content 4 25.0 25.0 25.0 25.0
> enq: UL - contention 1 100.0
> kksfbc child completion 1 100.0
> kupp process wait 2 100.0
> latch free 33K 26.8 10.7 12.3 8.2 17.0 19.5 5.5
> latch: In memory undo latc 1046 98.6 .7 .5 .3
> latch: cache buffers chain 4750 96.3 1.1 1.1 .9 .2 .4 .0
> latch: cache buffers lru c 51 37.3 7.8 5.9 21.6 5.9 19.6 2.0
> latch: enqueue hash chains 1927 6.8 3.3 3.2 2.5 3.0 .9 80.3
> latch: messages 333 100.0
> latch: object queue header 39 76.9 5.1 5.1 12.8
> latch: redo allocation 5 20.0 20.0 20.0 20.0 20.0
> latch: redo writing 12 83.3 8.3 8.3
> latch: row cache objects 173 30.1 6.9 .6 4.0 36.4 22.0
> latch: session allocation 240 17.1 11.7 18.3 49.6 3.3
> latch: shared pool 2 50.0 50.0
> latch: undo global data 9 77.8 22.2
> library cache load lock 1 100.0
> library cache lock 31 64.5 9.7 6.5 19.4
> library cache: mutex X 649 98.9 .6 .2 .3
> log file parallel write 38K 72.5 15.4 5.4 2.0 .8 .4 1.3
> 2.2
> log file sequential read 470 2.8 .2 .9 .6 .2 72.6
> 22.8
> log file single write 8 62.5 12.5 12.5
> 12.5
> log file switch (private s 2303 .0 .2 6.2
> 93.6
> log file sync 838K 2.9 1.0 .5 1.7 1.7 .8 7.6
> 83.8
> os thread startup 21 100.0
> rdbms ipc reply 29 89.7 6.9
> 3.4
> read by other session 96K 73.6 2.0 3.7 3.5 1.4 .4 10.7
> 4.6
> reliable message 58 62.1 37.9
> row cache lock 4 100.0
> undo segment extension 20 15.0 15.0 10.0 60.0
> utl_file I/O 4602K 99.7 .0 .0 .3 .0 .0 .0
> wait list latch free 1 100.0
> DIAG idle wait 6675 100.0
> .0
> KSV master wait 3 66.7 33.3
> SQL*Net message from clien 2871K 33.7 2.8 4.4 3.8 10.9 18.3 22.2
> 3.8
> SQL*Net message to client 2811K 100.0 .0 .0 .0 .0 .0 .0
> SQL*Net more data from cli 26K 99.8 .0 .0 .0 .0 .1
> Space Manager: slave idle 690
> 100.0
> Streams AQ: qmn coordinato 269 51.3 .4
> 48.3
> Streams AQ: qmn slave idle 251 2.4
> 97.6
> Streams AQ: waiting for ti 2 50.0
> 50.0
> class slave wait 30 100.0
> jobq slave wait 120 100.0
> pmon timer 1243 8.5
> 91.5
> rdbms ipc message 53K 13.0 30.5 8.7 1.8 2.1 2.5 14.6
> 26.8
> smon timer 18 5.6
> 94.4
> wait for unread message on 81 1.2 1.2 2.5 18.5 8.6 6.2 61.7
> -------------------------------------------------------------
>
>
> On Wed, Jul 20, 2011 at 11:50 PM, Stalin <stalinsk_at_gmail.com> wrote:
>
>> Thanks Kyle.
>>
>> Apparently we don't have diagnostic pack licenses to use DBA_HIST tables.
>> Is this something i can get from statspack tables ?
>>
>>
>> On Wed, Jul 20, 2011 at 10:20 PM, kyle Hailey <kylelf_at_gmail.com> wrote:
>>
>>>
>>> Yeah, the massive log waits could exacerbate the ITL waits as things will
>>> be slow commiting their changes.
>>> The db file sequential reads are huge as well. Looks like you have a
>>> problem on your storage.
>>>
>>> Check the log write times ( I blogged some of this at
>>> http://dboptimizer.com/2011/07/20/wait-event-and-wait-class-metrics-vs-vsystem_event/)
>>>
>>>
>>> Latencies in the past minute
>>>
>>> col name for a25
>>> select m.intsize_csec,
>>> n.name ,
>>> m.time_waited,
>>> m.wait_count,
>>> 10*m.time_waited/nullif(m.wait_count,0) avgms
>>> from v$eventmetric m,
>>> v$event_name n
>>> where m.event_id=n.event_id
>>> and n.name in (
>>> 'log file sync',
>>> 'log file parallel write'
>>> );
>>>
>>>
>>> Latencies averaged over each hour
>>>
>>> select
>>> btime,
>>> (time_ms_end-time_ms_beg)/nullif(count_end-count_beg,0) avg_ms
>>> from (
>>> select
>>> to_char(s.BEGIN_INTERVAL_TIME,'DD-MON-YY HH24:MI') btime,
>>> total_waits count_end,
>>> time_waited_micro/1000 time_ms_end,
>>> Lag (e.time_waited_micro/1000)
>>> OVER( PARTITION BY e.event_name ORDER BY s.snap_id) time_ms_beg,
>>> Lag (e.total_waits)
>>> OVER( PARTITION BY e.event_name ORDER BY s.snap_id) count_beg
>>> from
>>> DBA_HIST_SYSTEM_EVENT e,
>>> DBA_HIST_SNAPSHOT s
>>> where
>>> s.snap_id=e.snap_id
>>> and e.event_name in (
>>> 'log file sync',
>>> 'log file parallel write'
>>> )
>>> order by begin_interval_time
>>> )
>>> order by btime
>>> /
>>>
>>> If the log file parallel writes are bad then probably something wrong on
>>> the storage system.
>>>
>>> - Kyle Hailey
>>> http://dboptimizer.com
>>>
>>>
>>>
>>> On Wed, Jul 20, 2011 at 3:37 PM, Stalin <stalinsk_at_gmail.com> wrote:
>>>
>>>> We have been seeing lots of deadlock errors lately in load testing
>>>> environments and they all have been due to enq: TX - allocate ITL entry. In
>>>> reviewing the statspack report for the periods of deadlock, i see that, log
>>>> file sync wait being the top consumer with a terrible wait time. That makes
>>>> to me think the deadlock, is just a symptom of high log file sync wait
>>>> times. Below is the snippet from statspack and looking at these numbers,
>>>> especially CPU not being heavily loaded, wondering if this could be a case
>>>> of storage issue. Sys Admins are checking the storage layer but thought
>>>> would check here get any opinions/feedback.
>>>>
>>>> Top 5 Timed Events
>>>> Avg %Total
>>>> ~~~~~~~~~~~~~~~~~~
>>>> wait Call
>>>> Event Waits Time (s)
>>>> (ms) Time
>>>> ----------------------------------------- ------------ -----------
>>>> ------ ------
>>>> log file sync 1,400,773 4,357,902
>>>> 3111 91.4
>>>> db file sequential read 457,568 334,834
>>>> 732 7.0
>>>> db file parallel write 565,843 27,573
>>>> 49 .6
>>>> read by other session 16,168 7,395
>>>> 457 .2
>>>> enq: TX - allocate ITL entry 575 6,854
>>>> 11919 .1
>>>> -------------------------------------------------------------
>>>> Host CPU (CPUs: 64 Cores: 8 Sockets: 1)
>>>> ~~~~~~~~ Load Average
>>>> Begin End User System Idle WIO
>>>> WCPU
>>>> ------- ------- ------- ------- ------- -------
>>>> --------
>>>> 3.13 7.04 2.26 3.30 94.44 0.00
>>>> 7.81
>>>>
>>>> Statistic Total per Second
>>>> per Trans
>>>> --------------------------------- ------------------ --------------
>>>> ------------
>>>> redo synch time 435,852,302 120,969.3
>>>> 309.7
>>>> redo synch writes 1,400,807 388.8
>>>> 1.0
>>>> redo wastage 5,128,804 1,423.5
>>>> 3.6
>>>> redo write time 357,414 99.2
>>>> 0.3
>>>> redo writes 9,935 2.8
>>>> 0.0
>>>> user commits 1,400,619 388.7
>>>> 1.0
>>>>
>>>>
>>>> Environment : 11gr2 EE (11.2.0.1), Sol 10 Sparc
>>>>
>>>> Thanks,
>>>> Stalin
>>>>
>>>>
>>>
>>
>>
>> --
>> Thanks,
>>
>> Stalin
>>
>
>
>
> --
> Thanks,
>
> Stalin
>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Jul 21 2011 - 10:43:56 CDT

Original text of this message