RE: Parallel Query change from smart scan to cell single block access

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Mon, 20 Jun 2022 09:38:32 -0400
Message-ID: <05bf01d884ab$090c8880$1b259980$_at_rsiz.com>



re: am puzzled by the 40,000 calls to "get and advance SCN": any thought on how that could happen if the process is stuck on transaction table read consistency?  

Is that the old instance synchronization “bug” (or at least, some folks called it a bug)? I didn’t catch the version…  

If my questionable memory serves, trying to synchronize the SCN with another instance that was churning through a bunch of tiny commits was a dog-chasing-its-tail problem, or something like that; there was also some problem with using an actual different database as the source of information.  

Regarding the *possible* free space issue, I presume that Oracle has still resisted my enhancement request “untangle freelist.” That request was a response to the beta test of bitmap freeblock use, which used a Fibonacci sequence for super-fast first-use free block finding but was a disaster in the way blocks put back on the bitmap were handled after enough deletes requalified a block for inserts after it had been “full enough.”  

(Yet another reason to use the “Scaling to Infinity” method to do deletes.)  

Without that internal clean-up (which I *think* they rejected as not possibly a real concern), the best way to clean up a bitmap freelist depends on a trade-off: how much would have to be moved to a different tablespace followed by a tablespace drop, versus how much free space in the tablespace would be lost by creating a dummy table in slack time and churning through inserts until all the low blocks are no longer free (with the side issue of whether or not you ever have slack time).  

The sledgehammer approach is technically easier if the sizes to move are plausible.  

A tell-tale symptom of this kind of production freespace problem is that it goes away after an unload/load refresh to a test system, but not after a block-copy refresh.  

Finally, I can’t remember whether this:  

20220614 221231 cell physical IO interconnect bytes                                     546250752          7   1221
20220614 221231 physical read total bytes                                               546250752          7   1221



actually means every single read is being satisfied from a different instance. If so, that raises the question of whether the other process and this one could be started on the same instance with force-local parallelism, which would likely be faster by leaving out the interconnect and might also cure the SCN issue as a side effect.  
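
(A minimal sketch of the force-local idea, assuming a session-level setting is acceptable in your environment; the singleton-service alternative is only sketched in the comment:)

    -- keep the QC and all PX slaves on the instance you logged into
    ALTER SESSION SET parallel_force_local = TRUE;
    -- (alternatively, connect both processes through a service that
    --  runs on only one instance)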

mwf  

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Jonathan Lewis
Sent: Sunday, June 19, 2022 8:04 AM
To: Tanel Poder
Cc: Ls Cheng; Oracle Mailinglist
Subject: Re: Parallel Query change from smart scan to cell single block access    

Another little detail to pick up from the numbers is that there's been no change in the statistic "transaction tables consistent read rollbacks", which means you've spent all that time trying to take one undo segment header block back into the past to find out whether one specific transaction that you found in one data block committed before or after your current statement started.  

Tanel,

I am puzzled by the 40,000 calls to "get and advance SCN": any thought on how that could happen if the process is stuck on transaction table read consistency?  

I am also curious about "gc remote disk reads" - which seems to be how the I/O is being done. I've not noticed that statistic before. Any thoughts?  

Regards

Jonathan Lewis        


On Sun, 19 Jun 2022 at 06:47, Tanel Poder <tanel_at_tanelpoder.com> wrote:

That 60 second snapshot indicates that the full table scan / smart scan was not making any progress during that time. There is no increase in "table scan blocks gotten", "cell physical IO bytes eligible for predicate offload", or any other "cell blocks%" metrics.  

The "consistent gets", "consistent changes" and "transaction tables consistent reads - undo records applied" have all increased by the same amount (68471) without any table/index block access metrics. You might be stuck processing the same block and looks like you might be hitting something like a delayed block cleanout issue (no metric saying undo blocks applied to a datablock, just "transaction tables consistent reads - undo records applied").  

Smart scan can't process the block in the storage layer as it doesn't know whether some previous changes to that block had been committed before or after your smart scan started (snapshot SCN). So it sends the block back in passthrough mode and Oracle now starts examining (and rolling back) a CR copy of the undo segment header block, to get to the truth.  

Thoughts:

  1. Do you have any (unrelated) long-running transactions active at the same time? There's something called the "minimum active SCN" optimization, where Oracle keeps track of the lowest SCN of any still-active (uncommitted) transaction and can push this info to the storage cells too. If a select query does a consistent read on a block, requesting a snapshot SCN higher than the known "minimum active SCN" in the database, then we can be sure that the still-locked rows (locked by a transaction with an SCN lower than the known minact SCN) are real, and there's no need to go examining & rolling back undo segment header blocks to a past point in time. (A query sketch for spotting such transactions follows this list.)
  2. You could also try to clean table blocks after loads or large ETL/batch modifications: commit more often (so that the fast commit cleanout can clean the blocks still in the buffer cache on commit), or just run a buffered (not direct path or smart) full table scan on the table after the data load.
  3. Of course, if it's some bug, all bets are off, might be something more complicated :-)
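
(For point 1, a minimal sketch for spotting long-running active transactions, using the standard gv$transaction / gv$session views; adjust filters to taste:)

    SELECT s.inst_id, s.sid, s.serial#, s.username,
           t.start_time, t.status, t.used_urec
    FROM   gv$transaction t
           JOIN gv$session s
             ON  s.taddr   = t.addr
             AND s.inst_id = t.inst_id
    ORDER  BY t.start_time;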

This is somewhat related to the topic (long-running transactions causing some optimizations not to be usable):

  • EXADATA and SuperCluster : Check if long running transactions are preventing min active scn from progressing, resulting in Storage Indexes not being used (Doc ID 2081483.1)

--

Tanel Poder

https://learn.tanelpoder.com  

On Sat, Jun 18, 2022 at 8:10 PM Ls Cheng <exriscer_at_gmail.com> wrote:

Hi  

Sorry for the late reply.  

I did take snaps of gv$sesstat for the PQ slave which was doing cell single block reads but saw nothing strange  

The following was sesstat in two snapshots with 60 seconds interval, only statistics where value - prev_value > 0 is considered  

SNAP_DATE       NAME                                                             VALUE-PREV_VALUE    INST_ID    SID

--------------- ---------------------------------------------------------------- ---------------- ---------- ------
20220614 221231 messages sent                                                                      2          7   1221
20220614 221231 cluster wait time                                                                110          7   1221
20220614 221231 blocks decrypted                                                                 364          7   1221
20220614 221231 gc hash slot removed                                                             374          7   1221
20220614 221231 prefetched blocks aged out before use                                            461          7   1221
20220614 221231 pinned buffers inspected                                                         536          7   1221
20220614 221231 dirty buffers inspected                                                         2156          7   1221
20220614 221231 shared hash latch upgrades - no wait                                            4236          7   1221
20220614 221231 user I/O wait time                                                              5439          7   1221
20220614 221231 non-idle wait time                                                              5549          7   1221
20220614 221231 hot buffers moved to head of LRU                                                6684          7   1221
20220614 221231 calls to kcmgas                                                                40158          7   1221
20220614 221231 free buffer inspected                                                          66597          7   1221
20220614 221231 cell flash cache read hits                                                     66681          7   1221
20220614 221231 physical read requests optimized                                               66681          7   1221
20220614 221231 physical read total IO requests                                                66681          7   1221
20220614 221231 physical reads                                                                 66681          7   1221
20220614 221231 physical read IO requests                                                      66681          7   1221
20220614 221231 gc local grants                                                                66681          7   1221
20220614 221231 physical reads cache                                                           66681          7   1221
20220614 221231 free buffer requested                                                          66681          7   1221
20220614 221231 gc remote disk read                                                            66681          7   1221
20220614 221231 consistent gets from cache                                                     68471          7   1221
20220614 221231 consistent gets examination                                                    68471          7   1221
20220614 221231 consistent gets                                                                68471          7   1221
20220614 221231 consistent changes                                                             68471          7   1221
20220614 221231 transaction tables consistent reads - undo records applied                     68471          7   1221
20220614 221231 session logical reads                                                          68471          7   1221
20220614 221231 non-idle wait count                                                           161928          7   1221
20220614 221231 file io wait time                                                           54386966          7   1221
20220614 221231 physical read bytes                                                        546250752          7   1221
20220614 221231 cell physical IO interconnect bytes                                        546250752          7   1221
20220614 221231 physical read total bytes                                                  546250752          7   1221
20220614 221231 physical read total bytes optimized                                        546250752          7   1221
20220614 221231 logical read bytes from cache                                              560914432          7   1221

35 rows selected.  
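
(For reference, a delta like the above can be computed along these lines; sesstat_snap1 and sesstat_snap2 here are hypothetical copies of gv$sesstat joined to v$statname, captured 60 seconds apart, not real dictionary views:)

    SELECT s2.snap_date, s2.name,
           s2.value - s1.value AS "VALUE-PREV_VALUE",
           s2.inst_id, s2.sid
    FROM   sesstat_snap1 s1
           JOIN sesstat_snap2 s2
             ON  s2.inst_id = s1.inst_id
             AND s2.sid     = s1.sid
             AND s2.name    = s1.name
    WHERE  s2.value - s1.value > 0
    ORDER  BY s2.value - s1.value;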

Thanks    

On Wed, Jun 15, 2022 at 10:10 PM Tanel Poder <tanel_at_tanelpoder.com> wrote:

Check the v$sesstat metrics; nowadays there's plenty of feedback from the storage cells telling you how "deep" the offload processing actually went, or whether pushback/passthrough happened.  

Common causes for smart scan "hiccups" with single block reads:

  1. chained rows
  2. migrated rows for some operations (update, maybe select for update etc)
  3. uncommitted transactions in blocks (or not trivially cleanable blocks in storage)

But v$sesstat metrics will tell you more.  
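
(A rough sketch of that kind of check, using the standard gv$sesstat / gv$statname views; the binds are placeholders for the slave's instance and SID:)

    SELECT sn.name, st.value
    FROM   gv$sesstat st
           JOIN gv$statname sn
             ON  sn.statistic# = st.statistic#
             AND sn.inst_id    = st.inst_id
    WHERE  st.inst_id = :inst_id
    AND    st.sid     = :sid
    AND    sn.name    LIKE 'cell%'
    AND    st.value   > 0
    ORDER  BY st.value DESC;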

Here's an article from 12 years ago, where chained rows caused problems for a smart scan.

Things have somewhat improved by now, but back then the "cell chained rows%" metrics didn't even exist, so I had to reason by looking at other metrics, like "table fetch continued row" and the difference in "processing depth" at the cache & txn layers vs the data layer in the storage cells:  

    869, DWH       , STAT, table fetch continued row               ,          3660,        732,
    869, DWH       , STAT, cell blocks processed by cache layer    ,          5428,      1.09k,
    869, DWH       , STAT, cell blocks processed by txn layer      ,          5428,      1.09k,
    869, DWH       , STAT, cell blocks processed by data layer     ,          3625,        725,

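(And if chained rows are the suspect, the classic, if old-fashioned, way to confirm it is ANALYZE ... LIST CHAINED ROWS; target_table is a placeholder here, and the CHAINED_ROWS table must exist first, created by utlchain.sql:)

    ANALYZE TABLE target_table LIST CHAINED ROWS INTO chained_rows;

    SELECT count(*) FROM chained_rows WHERE table_name = 'TARGET_TABLE';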
 

--

Tanel Poder

https://learn.tanelpoder.com    

On Tue, Jun 14, 2022 at 6:26 PM Ls Cheng <exriscer_at_gmail.com> wrote:

Hi all  

I have a strange situation where a very simple query (no join, a single table in the FROM clause) is sometimes fast (seconds) and sometimes slow (hours). After digging a bit, it seems that one of the PQ slaves, instead of accessing the table via cell smart table scan, is accessing it via cell single block reads.  

This is 19.10; has anyone observed or faced such a problem?  

Thanks  

--

http://www.freelists.org/webpage/oracle-l

Received on Mon Jun 20 2022 - 15:38:32 CEST