Re: latch: cache buffers chains

From: Laurentiu Oprea <laurentiu.oprea06_at_gmail.com>
Date: Sat, 2 Oct 2021 22:39:00 +0300
Message-ID: <CA+riqSXYABWZ=CN1aeuKPCodw6UkoXrUnHh2Kb0J_ajPpJNvRg_at_mail.gmail.com>





Initially I was using some hints and now I removed them to let the optimizer decide the execution plan. I attached the text SQL Monitor reports. The primary execution is from another database (a copy around 2 weeks older), so there is a very small difference in the number of rows. (It is the only place I was able to do the test today.)

The Read bytes distribution is very strange for the standby execution. Not sure if I'm hitting a bug or if there is something I can do to mitigate this behavior.

On Sat, 2 Oct 2021 at 18:38, Jonathan Lewis <jlewisoracle_at_gmail.com> wrote:

> The most interesting difference in the two plans is that the primary plan
> does a serial scan of the GTT (Execs = 1), while the standby plan shows 16
> PX slaves attempting to do a parallel scan: Execs = 16 for the PX Block
> Iterator at operation 10, and Execs = 396 for the table access full at
> operation 11. That suggests something is broken for parallel scans of GTTs.
>
> There are more oddities when looking at the stats.
> For example,
>
> egrep -e "index scans kdiixs1" -e"table scan rows" -e"callback"
> snapper_stats.sql
> 29 of the 32 sessions report matching results like:
> 4297 @4, STAT, table scan rows gotten, 25811, 403.64, , , , , ~ per execution
> 4297 @4, STAT, index scans kdiixs1, 25811, 403.64, , , , , ~ per execution
> 4508 @4, STAT, table scan rows gotten, 25801, 403.48, , , , , ~ per execution
> 4508 @4, STAT, index scans kdiixs1, 25801, 403.48, , , , , ~ per execution
> 4614 @4, STAT, table scan rows gotten, 77421, 1.21k, , , , , ~ per execution
> 4614 @4, STAT, index scans kdiixs1, 77421, 1.21k, , , , , ~ per execution
>
> Why does every row gotten have a matching index range scan? (Maybe
> something has gone wrong with the "adaptive" features and feeding the
> statistics collector operation.) But why did 3 of the PX slaves not report
> either statistic in that minute - is something jammed? And why have (at
> least) 29 slaves been involved in tablescanning the GTT when you're running
> parallel 16?
>
> Another example (there are many)
> 4 @4, STAT, physical read total multi block requests, 678, 10.6, , , , , 342 total single block read requests
> 4 @4, STAT, table scan blocks gotten, 768, 12.01, , , , , ~ per execution
> 111 @4, STAT, physical read total multi block requests, 678, 10.6, , , , , 319 total single block read requests
> 111 @4, STAT, table scan blocks gotten, 768, 12.01, , , , , ~ per execution
> 217 @4, STAT, physical read total multi block requests, 673, 10.52, , , , , 314 total single block read requests
> 217 @4, STAT, table scan blocks gotten, 762, 11.92, , , , , ~ per execution
>
> Why are multi block read requests (usually an indicator of tablescans)
> getting just over one block at a time while tablescanning, when every
> extent in a temporary segment will be 1MB (by default, though you could
> have declared it to be something else)?
>
> One thing the last results remind me of is a bug where a trace file would
> show:
> WAIT "db file scattered read" file 5, block 256, block count 128
> WAIT "db file scattered read" file 5, block 257, block count 127
> WAIT "db file scattered read" file 5, block 258, block count 126
> WAIT "db file scattered read" file 5, block 259, block count 125
>
>
> I think you need to look at the ASH data to see if that gives you any
> further clues. Then you could enable extended SQL tracing, and you could
> enable PX tracing to see what is going on between the PX processes.
> You could also run the Monitor report (as a text output) several times in
> rapid succession immediately after starting the query, to see if the scan
> gets to 10M rows in the first couple of seconds and then keeps on working
> while getting no further.
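>
> A rough sketch of the sort of thing I mean (the &sql_id substitution
> variable and the 10-minute ASH window are just placeholders):
>
> -- text SQL Monitor report for the statement
> SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
>          sql_id       => '&sql_id',
>          type         => 'TEXT',
>          report_level => 'ALL') AS report
> FROM   dual;
>
> -- ASH breakdown of where the sessions running the statement spend their time
> SELECT session_state, event, COUNT(*) AS samples
> FROM   v$active_session_history
> WHERE  sql_id = '&sql_id'
> AND    sample_time > SYSDATE - 10/1440
> GROUP  BY session_state, event
> ORDER  BY samples DESC;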
>
> You didn't post the PX summary section of the Monitor report - that might
> have given a couple more little clues about what's going on.
>
> Something, possibly several somethings, is clearly broken in the mix of
> GTT, parallel execution, and standby database.
> The one thing I didn't see in the stats was any indication of a
> read-consistency problem, though the amount of cross-chat for global cache
> requests was a little surprising.
>
> Regards
> Jonathan Lewis
>
>
> On Sat, 2 Oct 2021 at 13:35, Laurentiu Oprea <laurentiu.oprea06_at_gmail.com>
> wrote:
>
>> Exactly, the code is using temp_undo_enabled = true at the session level. I
>> usually use Tanel's Snapper; I capture stats for 1 minute for all the
>> slaves involved.
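>>
>> Roughly like this, as a sketch (the QC SID 1234 is a placeholder, and it
>> assumes Snapper's qc= syntax for capturing the coordinator plus its slaves):
>>
>> ALTER SESSION SET temp_undo_enabled = TRUE;
>>
>> -- capture stats for the query coordinator and all of its PX slaves
>> -- for 60 seconds, in a single snapshot
>> @snapper stats 60 1 qc=1234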
>>
>>
>> On Sat, 2 Oct 2021 at 11:55, Jonathan Lewis <jlewisoracle_at_gmail.com>
>> wrote:
>>
>>>
>>> I forgot to ask for clarification about how you are populating a GTT on
>>> a standby database.
>>> Is this taking advantage of the 12c "temp_undo_enabled"?
>>> The primary example suggests you're looking at a table of 385MB, with
>>> read requests of 1MB each.
>>> Is the standby GTT likely to be about the same size? It's suffered 22GB
>>> of blocks read with an average read size of about 320KB. I'm now wondering
>>> if this is a problem of read-consistency: are the slaves doing 1MB reads of
>>> the GTT and then going crazy checking temp undo segments using single block
>>> reads?
>>>
>>> Can you run a test on the standby that runs in isolation so that
>>> v$sysstat will be (mostly) the sum of the PX session stats; or run the test
>>> and check v$px_sesstat for the query so that you can capture what the PX
>>> processes for the query coordinator are doing. Also a couple of (diffs of)
>>> snapshots of v$filestat might be a pointer to where the action is happening.
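>>>
>>> Something along these lines, for example (&qc_sid is just a placeholder for
>>> the query coordinator's SID):
>>>
>>> -- sum the statistics across all PX slaves working for one coordinator
>>> SELECT sn.name, SUM(st.value) AS total_value
>>> FROM   v$px_sesstat st
>>>        JOIN v$statname sn ON sn.statistic# = st.statistic#
>>> WHERE  st.qcsid = &qc_sid
>>> GROUP  BY sn.name
>>> HAVING SUM(st.value) > 0
>>> ORDER  BY total_value DESC;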
>>>
>>>
>>> Regards
>>> Jonathan Lewis
>>>
>>>
>>>
>>> On Fri, 1 Oct 2021 at 11:17, Laurentiu Oprea <
>>> laurentiu.oprea06_at_gmail.com> wrote:
>>>
>>>> Hello Jonathan,
>>>>
>>>> Regarding point 1, I'm now thinking that might be one reason: on the
>>>> primary, stats are collected for the GTTs after the insert, while on the
>>>> standby they are not. Still, the estimates look correct (maybe dynamic
>>>> sampling).
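>>>>
>>>> (A rough way to check, with MY_GTT as a placeholder name, is to look at the
>>>> 12c GLOBAL_TEMP_TABLE_STATS preference and whether any SESSION-scope stats
>>>> exist for the table:)
>>>>
>>>> SELECT dbms_stats.get_prefs('GLOBAL_TEMP_TABLE_STATS', user, 'MY_GTT')
>>>> FROM   dual;
>>>>
>>>> SELECT table_name, scope, num_rows, last_analyzed
>>>> FROM   user_tab_statistics
>>>> WHERE  table_name = 'MY_GTT';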
>>>>
>>>> The original flow and query are very complicated, so I tried to reduce
>>>> them to a simple count(*) select with a join to replicate the behavior
>>>> (SQL Monitor report in the screenshot).
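>>>>
>>>> In shape the reduced test is roughly this (table and column names are made
>>>> up for illustration):
>>>>
>>>> -- the GTT is defined with ON COMMIT DELETE ROWS and populated in the
>>>> -- same transaction before the query runs
>>>> SELECT /*+ parallel(16) */ COUNT(*)
>>>> FROM   my_gtt g
>>>>        JOIN some_table t ON t.id = g.id;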
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, 1 Oct 2021 at 12:34, Jonathan Lewis <jlewisoracle_at_gmail.com>
>>>> wrote:
>>>>
>>>>> First guess -
>>>>> Is there any reason why the primary would know that the parent session
>>>>> had written a very large amount of data to the GTT while the standby thinks
>>>>> the GTT is very small? In the latter case you may find that PX sessions are
>>>>> using buffered reads when you normally expect parallel tablescans to use
>>>>> direct path.
>>>>>
>>>>> Second guess -
>>>>> Is it possible that the execution plan has changed, so that the primary
>>>>> is doing a tablescan of the GTT but the standby is doing a parallel
>>>>> tablescan of something else and using that to drive an indexed access path
>>>>> into the GTT?
>>>>>
>>>>> Regards
>>>>> Jonathan Lewis
>>>>>
>>>>>
>>>>> On Fri, 1 Oct 2021 at 10:06, Laurentiu Oprea <
>>>>> laurentiu.oprea06_at_gmail.com> wrote:
>>>>>
>>>>>> Hello everyone.
>>>>>>
>>>>>> Version 12.1
>>>>>>
>>>>>> Looks like for a SQL statement executed in a standby database, with
>>>>>> parallel, that selects data from a GTT (with on commit delete rows), the
>>>>>> PX slaves spend a significant amount of time on latch: cache buffers
>>>>>> chains and read by other session. The same query executed in the primary
>>>>>> database has no issues. This issue is observed just for GTTs.
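>>>>>>
>>>>>> For reference, a rough sketch of how the slaves' wait time can be grouped
>>>>>> by event (&qc_sid is just a placeholder for the query coordinator's SID):
>>>>>>
>>>>>> SELECT se.event, SUM(se.time_waited) AS time_waited_cs
>>>>>> FROM   v$px_session px
>>>>>>        JOIN v$session_event se ON se.sid = px.sid
>>>>>> WHERE  px.qcsid = &qc_sid
>>>>>> GROUP  BY se.event
>>>>>> ORDER  BY time_waited_cs DESC;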
>>>>>>
>>>>>> Anyone have any ideas on how to mitigate this issue?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>



--
http://www.freelists.org/webpage/oracle-l


