Re: Full table scan -- uniform extent allocation - extent allocation map issue?

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Wed, 27 Nov 2019 12:05:38 +0000
Message-ID: <CWXP265MB1191D4571DB20D0E26AB9FF2A5440_at_CWXP265MB1191.GBRP265.PROD.OUTLOOK.COM>


Reading the rather lengthy post about where the data might be in the segment, in the extents, or on the disc: you're obviously capable of thinking through the threats; as you say, however, there are many possible variations, so the important point is to be aware of the possibilities so that when you see an example where there might be scope for improvement you can apply the general ideas to the specific case. I try not to generalize too much in my blog notes since they would otherwise turn into a sprawling, branching mass of "but if" scenarios.

I will raise two points though. First, going back to your original post and picking up the wait states:


WAIT #139711646963136: nam='db file scattered read' ela= 12269 file#=5 block#=97540 blocks=124 obj#=75252 tim=248446182677
WAIT #139711646963136: nam='db file scattered read' ela= 17440 file#=5 block#=97664 blocks=128 obj#=75252 tim=248446202405
WAIT #139711646963136: nam='db file scattered read' ela= 3282 file#=5 block#=97792 blocks=128 obj#=75252 tim=248446208027

=======================================
This looks like consecutive reads of the first three 1MB sections (excluding the 4 blocks of L1 bitmap) of an 8MB extent - and you can see that the read time varies massively even within a single extent. If you can get this kind of variation in a single extent with optimum-sized reads, when (nominally) the extent is contiguous in the file and you hope the file is "as contiguous as possible" on the disc (which it won't be, really), then there seems to be little point in trying to make all the extents contiguous in the file.
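For what it's worth, the variation is easy to quantify straight from the three wait lines quoted above (a quick sketch only; it assumes the standard 8KB block size and the exact WAIT line format shown):

```python
import re

# The three 'db file scattered read' lines quoted above, verbatim.
trace = """\
WAIT #139711646963136: nam='db file scattered read' ela= 12269 file#=5 block#=97540 blocks=124 obj#=75252 tim=248446182677
WAIT #139711646963136: nam='db file scattered read' ela= 17440 file#=5 block#=97664 blocks=128 obj#=75252 tim=248446202405
WAIT #139711646963136: nam='db file scattered read' ela= 3282 file#=5 block#=97792 blocks=128 obj#=75252 tim=248446208027
"""

pat = re.compile(r"ela= (\d+) .*?block#=(\d+) blocks=(\d+)")

for m in pat.finditer(trace):
    ela_us, block, blocks = map(int, m.groups())
    mb = blocks * 8192 / 1048576            # assuming an 8KB block size
    print(f"block#={block}: {mb:.2f} MB in {ela_us} us -> {mb / ela_us * 1e6:.0f} MB/s")
```

The third read runs roughly five times faster than the second, within the same (nominally contiguous) extent - which is the point: the variation inside one extent dwarfs anything extent ordering could save.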

Secondly - you may have the wrong idea about hash partitions, ASSM etc.


            c. If we have a hash partitioned table / index or a reverse key index, the data blocks may be placed partition/block, and with ASSM and multiple sessions data can be anywhere; in these cases a reverse order scan may not yield any performance benefit over the general scan.


With a hash partitioned table the pattern of "data inserted at the end of the segment" doesn't change; it's simply a case of there being multiple segments, so N "ends of segment" and N reversed segment scans. The potential for benefit still exists. I'm ignoring the comment about indexes since we're talking about tablescans - the mechanism is (as you clearly appreciate) irrelevant for index fast full scans because index entries go where they're supposed to, not at the end of the segment (even then, a reverse scan on a sequence/time based index might benefit). The comment about ASSM isn't quite correct; in general the data won't go "anywhere", there's a strong bias towards it going into the last extent added to the segment - though there have been bugs with Oracle adding new extents while existing extents still have a huge amount of free space in them.

This was an interesting thought:



Now that I am writing this mail, this came to mind... what precisely happens when a full table scan starts on a large table and, during the scan, space pressure causes a new extent to be allocated, with the possibility of row migration moving a row that has not yet been read into the new extent?

It took me a couple of minutes to realise that this isn't a problem. If the update happened after the tablescan started then the tablescan would reach the block containing the row header and apply undo to a clone of that block to construct a read-consistent version - so it doesn't need to worry about the fact that the current version of the row is actually in a different block, above the HWM that was known when the tablescan started.

Regards
Jonathan Lewis



From: Vishnu Potukanuma <vishnupotukanuma_at_gmail.com> Sent: 26 November 2019 20:30
To: Jonathan Lewis
Cc: Oracle L
Subject: Re: Full table scan -- uniform extent allocation - extent allocation map issue?

Hi Jonathan, thanks for your timely response.

I am just trying to see the entire picture of what it takes to perform a full table scan: every possibility and every path Oracle can take.

My understanding:
Oracle gets the high water mark information by reading the L3 bitmap block or the segment header initially, obtaining the low high-water mark and the high high-water mark along with the extent map, so that it knows which extents to read from. Oracle also knows that all blocks below the low HWM are formatted (whether used or free) and must all be read, while the blocks between the low HWM and the high HWM may be formatted or unformatted (and may or may not contain data), so it selectively reads only the formatted block ranges between the low HWM and the high HWM.

So if extents A, B, C, D, E are allocated to the table, E is the last extent, so E usually contains the high water mark unless users manually allocated extents. While performing a full table scan, Oracle has to read all the (formatted) blocks of the table. If extent A is stored at, say, the 31GB position and extent B at the 1GB position, the disk spindles - whether on RAID or single disks - have to read from A's position, then move to B's position at 1GB, and so on, depending on the locations.

My idea was that, since Oracle knows E is the last extent and contains the high water mark, if it read A, B, C, D in the order in which they are stored in the data file, the disk spindle would have to move along only one path over the surface of the disk before finally reading the last extent; the same applies to RAID volumes. That way most of the IO seek time could be minimised. This applies only to tables in tablespaces with uniform extent allocation; as the extent sizes grow with autoallocate, the issue I am describing diminishes. The only reason I brought this up is that, although negligible in most cases as you indicated, it is an interesting behaviour to note, and when the disks are busy a large random read (a pread64 reading a 1MB chunk in one IO operation) may affect other sessions as well.
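The seek argument can be sketched with a toy model (the extent positions below are made up for illustration; real controllers, RAID striping and request queueing will blur the effect considerably):

```python
# Toy model: extent start positions (in MB) within the data file, in the
# order they were allocated to the segment.  Positions are hypothetical.
alloc_order = [310, 10, 250, 40, 180, 70, 220, 120]

def total_travel(positions):
    """Sum of absolute head movement between consecutive reads."""
    return sum(abs(b - a) for a, b in zip(positions, positions[1:]))

print("allocation order:", total_travel(alloc_order), "MB of travel")   # 1250
print("file order:     ", total_travel(sorted(alloc_order)), "MB of travel")  # 300
```

Reading in file order turns the scan into a single sweep across the file, while allocation order bounces back and forth - which is exactly the pattern visible in the trace block numbers later in this mail.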

Now, cases where this issue can occur (in OLTP environments) - I mean it can grow over the years, not immediately:

1. Rebuilding indexes on tables over time, as it is mandatory in a few cases due to the dependency on ROWID.
2. Using shrink operations over time on indexes or tables to free up unused space or bloat. For example, tables holding orders that are passing through: we can't use a temporary table in this case because on restarting the database we would lose all orders, and there are days when order volume is high and days when it is low.
3. Users may rarely recreate tables to mitigate migrated rows.
4. Archiving old partitions and dropping them (partitioning), etc.

Same goes for OLAP:

1. Data is loaded initially into a table or segment, especially if the data is coming from different sources; here temporary segments also fail as they are not retained after a restart.
2. Old partitions are archived to tape and dropped; in some environments compliance requires us to keep only 5 or 10 years' worth of data, and anything older is archived to tape.
3. Movement of partitions between Tier-1 and Tier-2 storage as the data ages.


The idea of scanning the table in reverse order makes sense, but there are a lot of variables, cases, and alternate paths that can lead to loads of possibilities. Regarding the CR processing part and reverse order:

Jonathan, please correct me if I am wrong here:

  1. Reverse order is beneficial only when most of the activity (inserts, updates, or deletes) is concentrated in the last or latest extents, or when the application works only with the latest data and the table is very large, so that the effort required for CR processing is mitigated. But this comes with two possibilities when we are talking about full table scans.
  2. Use partitioning in this case and modify the application to be partition aware, or periodically archive the data (which may also require modifying the application)... and this ends up with two possibilities:
    • being at the mercy of the application team: whether they are willing to make any changes to the code;
    • the application being third party, in which case we don't have any control, and they call it best practice.
  3. The size of the buffer cache and of the table - this raises four possibilities yet again and can end in various paths.
  4. If we have a hash partitioned table / index or a reverse key index, the data blocks may be placed partition/block, and with ASSM and multiple sessions data can be anywhere; in these cases a reverse order scan may not yield any performance benefit over the general scan.

Now that I am writing this mail, this came to mind... what precisely happens when a full table scan starts on a large table and, during the scan, space pressure causes a new extent to be allocated, with the possibility of row migration moving a row that has not yet been read into the new extent?

Thanks,
Vishnu

On Tue, Nov 26, 2019 at 10:38 PM Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk> wrote:

No longer any need to send a full test case. Your observation has been known for a very long time and doesn't need to be demonstrated again.

First thought, of course, is that an application should not be dropping and creating segments at all (let alone frequently). If this type of processing appears to be needed then the requirement should be re-examined with global temporary tables in mind.

Second, if one is dropping and creating segments frequently they should be assigned to a tablespace that is (a) uniform extent sized at 1MB (or more if the O/S allows for a larger read) and (b) not used by permanent objects. Feature (a) ensures that Oracle has the best chance of reading the segment as efficiently as possible, and once the tablescan is doing the largest read the various components of the O/S can handle it; it's over-optimistic to assume that changing the order in which extents are accessed will make any significant difference in a multi-user system. If (some of) the segments you are dropping and recreating are guaranteed to be very small and you really don't want to "waste" 1MB per segment then assign the small objects to a tablespace with a small uniform extent size.

A significant problem, though, with having a good idea is that it's very difficult to think of all the ramifications and handle all the peripheral cases. For example: "shrink space compact" deletes rows from the table backwards from the last block of the last extent, inserting them forwards from the first block of the first extent - should the shrink code be modified to use the L1 bitmaps to decide where to delete from and where to insert to? What impact would that have on the potential for dropping extents?

Consider an idea that sounds much more useful: the idea that you should be able to do a tablescan "backwards" has been around for (literally) years - after all, the most recent, hence most interesting, data will be in the last extent(s) of the table - it's a feature that hasn't been made (publicly - but see https://jonathanlewis.wordpress.com/2016/03/14/quiz-2/ ) available despite the fact that from time to time people complain about performance problems because the end of a table has been changing a lot while their tablescan has been reading the unchanging start of the table. (And that's a good reason why reading by extent id (in reverse order) would be nicer than reading by L1 address.)

Regards
Jonathan Lewis



From: Vishnu Potukanuma <vishnupotukanuma_at_gmail.com> Sent: 26 November 2019 16:30
To: Jonathan Lewis
Cc: Oracle L
Subject: Re: Full table scan -- uniform extent allocation - extent allocation map issue?

Hi Jonathan,

The trace file does indicate autoallocate - yes, I must have pasted an old one from when I was testing autoallocate (with autoallocate this issue diminishes as the extent sizes become large).

Oracle allocates extents to a segment from anywhere in the data file (uniform size) when the demand for space arises, which is the logical thing to do. But since Oracle reads extents during a full table scan in the order they were allocated rather than the order in which they are stored, the IO becomes random rather than sequential over time when the extent sizes are small and the workload involves frequent segment allocation and deallocation. If other sessions doing index range scans saturate the IOPS at the storage layer (especially on HDDs, where such reads are random in nature), this reduces the overall throughput of the full table scan.

I mean, when full table scans are slow we usually end up looking at various things, but this could be a factor in very rare cases (OLTP) and could be common in warehouses where segments are frequently dropped and created; this random extent allocation can grow right under our noses and we never detect it. There are various other scenarios and edge cases I am looking at, but I am really interested in why a session blindly reads the extents in the order they were allocated rather than in the order they are stored, based on the L1 DBA.

Please give me some time (I have to get dinner before it's too late); I will send a complete test case with which this can be replicated.

Thanks,
Vishnu

On Tue, Nov 26, 2019 at 8:58 PM Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk> wrote:

  • create a tablespace with 350mb in size and uniform extent allocation,

The extent information you've supplied shows us that you've created the final table in a tablespace using autoallocate - the early extents are 64KB, the next 1MB, the last few you've shown are 8MB.

If you want to supply a test case that has a chance of being reproduced then you need to tell us WHAT uniform size, preferably giving us a script that allows us to change a couple of names (tablespace / file) and create the tablespace.

  • create 10 tables with 30 mb in size.

Needs one sample create table script - did you create it with "initial 30M" (obviously not); did you create it with "create as select ..."; did you do anything odd with pctfree; how did you populate it? Do you need the tables to allocate 30MB, or should the highwater mark be pushed to 30MB as well?

  • insert into table10 select * From table1;

That's probably going to work - because we'd probably assume you wanted all 10 tables to look alike.

  • drop table1;

That should work, but do you have your recyclebin enabled? That may affect where the next insert puts the data (especially if someone doesn't follow your instruction that the tablespace should be 350MB).

  • similarly perform the same randomly

If my results don't match your results, might that be because your results are highlighting a flaw in your system, or because I didn't choose the same random order as you did? Don't supply a "random" test if you want to demonstrate a reproducible anomaly. Tell us the exact order you used to insert and drop.

  • interestingly the fix is to read the extents one after the other based on the L1 DBA rather than the order in which they are allocated to the segment

Are you saying you think this is what Oracle does, or what you think Oracle should do?

Regards
Jonathan Lewis



From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> on behalf of Vishnu Potukanuma <vishnupotukanuma_at_gmail.com> Sent: 26 November 2019 14:12
To: Oracle L
Subject: Full table scan -- uniform extent allocation - extent allocation map issue?

Hi,

I observed an interesting behaviour today while working on other aspects, and my hunch appears to be true... I will start by saying that there are too many variables and scenarios where things can be optimal, but I am talking about the worst case scenario, where segments are frequently created and dropped in a tablespace.

This works by exploiting certain aspects of the database, such as how Oracle maintains the extent map in the segment header, how extent allocation is done, and how a full table scan reads the segment. Inherently there is nothing we can do here, as there are too many variables. The impact may be small in terms of overall response times, but it typically reduces the lifespan of the disks, as the disk spindles end up moving from higher to lower tracks and back over and over again instead of in a simple arc from higher tracks to lower tracks. In the worst case this could add delay when the segment is very large and the extent sizes are small (the probability of this occurring is very low, but there is still a chance).

When performing a full table scan, Oracle starts by reading the extent map and then reads the extents one by one until the last extent or the high water mark is reached. But what if the extents allocated to the segment are stored in random areas, just like a bad clustering_factor? Interestingly, the fix would be to read the extents one after the other based on the L1 DBA rather than the order in which they were allocated to the segment. In most cases Oracle does a pretty good job of allocating extents close to each other, but things can get messy.

This is the simplest way to explain it, and we can elaborate from here however we like: create a tablespace 350MB in size with uniform extent allocation, and create 10 tables each 30MB in size. Then:
insert into table10 select * from table1; drop table table1;
Perform the same steps in random order so that the end result is a single table. Here we are just simulating a case where the extents subsequently allocated to the segment are in random regions of the data file, and thus their locations on disk may be random as well (which may or may not be the case, depending on several factors such as how fragmented the disks are; I won't go into that as it adds too much complexity). Now perform a full table scan: Oracle reads the segment header to get the extent map and reads the extents one by one...
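The effect of that drop/re-insert shuffle can be simulated with a toy first-fit allocator (a sketch only: it assumes new extents always come from the lowest free space in the file, which is roughly what a locally managed tablespace does, and the slot numbers are hypothetical):

```python
# Toy tablespace: 12 uniform extent slots; allocation is first-fit
# (lowest free slot first).
FREE = None
slots = [FREE] * 12

def allocate(owner, n):
    """Grab the n lowest free slots; return their indexes in allocation order."""
    got = []
    for i, s in enumerate(slots):
        if s is FREE:
            slots[i] = owner
            got.append(i)
            if len(got) == n:
                break
    return got

def drop(owner):
    for i, s in enumerate(slots):
        if s == owner:
            slots[i] = FREE

allocate("table1", 4)                 # table1 takes slots 0-3
t10 = allocate("table10", 4)          # table10 takes slots 4-7
drop("table1")                        # slots 0-3 become free again
t10 += allocate("table10", 4)         # the big insert grows table10 downwards
print(t10)                            # [4, 5, 6, 7, 0, 1, 2, 3]
```

table10's allocation order no longer matches file order: a scan that follows the extent map reads the middle of the file first, then jumps back to the start.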

Notice the block numbers in the traces.

WAIT #139711646963136: nam='db file sequential read' ela= 612 file#=5 block#=71810 blocks=1 obj#=75252 tim=248445129880
WAIT #139711646963136: nam='db file scattered read' ela= 763 file#=5 block#=71811 blocks=5 obj#=75252 tim=248445130757
WAIT #139711646963136: nam='db file scattered read' ela= 771 file#=5 block#=71816 blocks=8 obj#=75252 tim=248445131732
....
....
WAIT #139711646963136: nam='db file scattered read' ela= 1941 file#=5 block#=44736 blocks=64 obj#=75252 tim=248445270874
WAIT #139711646963136: nam='db file scattered read' ela= 1788 file#=5 block#=44802 blocks=62 obj#=75252 tim=248445273696
WAIT #139711646963136: nam='db file scattered read' ela= 1902 file#=5 block#=44864 blocks=64 obj#=75252 tim=248445276573
WAIT #139711646963136: nam='db file scattered read' ela= 1756 file#=5 block#=44930 blocks=62 obj#=75252 tim=248445279335
....
WAIT #139711646963136: nam='db file scattered read' ela= 3078 file#=5 block#=132 blocks=128 obj#=75252 tim=248445887737
WAIT #139711646963136: nam='db file scattered read' ela= 3045 file#=5 block#=260 blocks=124 obj#=75252 tim=248445893065
WAIT #139711646963136: nam='db file scattered read' ela= 3399 file#=5 block#=384 blocks=128 obj#=75252 tim=248445898641
....
...
WAIT #139711646963136: nam='db file scattered read' ela= 12269 file#=5 block#=97540 blocks=124 obj#=75252 tim=248446182677
WAIT #139711646963136: nam='db file scattered read' ela= 17440 file#=5 block#=97664 blocks=128 obj#=75252 tim=248446202405
WAIT #139711646963136: nam='db file scattered read' ela= 3282 file#=5 block#=97792 blocks=128 obj#=75252 tim=248446208027
...
...
WAIT #139711646963136: nam='db file scattered read' ela= 3279 file#=5 block#=61058 blocks=128 obj#=75252 tim=248446281696
WAIT #139711646963136: nam='db file scattered read' ela= 3115 file#=5 block#=61186 blocks=126 obj#=75252 tim=248446287141
WAIT #139711646963136: nam='db file scattered read' ela= 2990 file#=5 block#=61312 blocks=128 obj#=75252 tim=248446292359
WAIT #139711646963136: nam='db file scattered read' ela= 3109 file#=5 block#=61440 blocks=128 obj#=75252 tim=248446297711
....
...
WAIT #139711646963136: nam='db file scattered read' ela= 3125 file#=5 block#=7812 blocks=128 obj#=75252 tim=248446303109
WAIT #139711646963136: nam='db file scattered read' ela= 3246 file#=5 block#=7940 blocks=124 obj#=75252 tim=248446309728
WAIT #139711646963136: nam='db file scattered read' ela= 3180 file#=5 block#=8064 blocks=128 obj#=75252 tim=248446315169
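Taking the start block# of the first read in each excerpt above, the to-and-fro movement through the file is easy to see (a quick sketch over just those six reads):

```python
# Start block# of the first multiblock read in each trace excerpt above.
starts = [71811, 44736, 132, 97540, 61058, 7812]

jumps = [b - a for a, b in zip(starts, starts[1:])]
backwards = sum(1 for j in jumps if j < 0)
print(jumps)    # [-27075, -44604, 97408, -36482, -53246]
print(f"{backwards} of {len(jumps)} extent-to-extent moves go backwards in the file")
```

Four of the five transitions move backwards in the file, and every jump is tens of thousands of blocks, which is the random-IO pattern the post is describing.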

the L1 DBA extent map:
  Auxillary Map


   Extent 0     :  L1 dba:  0x01411880 Data dba:  0x01411883
   Extent 1     :  L1 dba:  0x01411880 Data dba:  0x01411888
   ....
   Extent 35    :  L1 dba:  0x0140ae80 Data dba:  0x0140ae82
   Extent 36    :  L1 dba:  0x0140af00 Data dba:  0x0140af02
   Extent 37    :  L1 dba:  0x0140af80 Data dba:  0x0140af82
   ...
   Extent 38    :  L1 dba:  0x0140b000 Data dba:  0x0140b002
   Extent 39    :  L1 dba:  0x0140b080 Data dba:  0x0140b082
   Extent 40    :  L1 dba:  0x0140b100 Data dba:  0x0140b102
   Extent 41    :  L1 dba:  0x0140b180 Data dba:  0x0140b182
   Extent 42    :  L1 dba:  0x0140b200 Data dba:  0x0140b202
   Extent 43    :  L1 dba:  0x0140b280 Data dba:  0x0140b282
   ...
   ...
   Extent 92    :  L1 dba:  0x01417c80 Data dba:  0x01417c84
   Extent 93    :  L1 dba:  0x01418080 Data dba:  0x01418084
   ...
   Extent 94    :  L1 dba:  0x0140ee80 Data dba:  0x0140ee82
   Extent 95    :  L1 dba:  0x01401e80 Data dba:  0x01401e84
   Extent 96    :  L1 dba:  0x01402280 Data dba:  0x01402284
   Extent 97    :  L1 dba:  0x01402680 Data dba:  0x01402684
   Extent 98    :  L1 dba:  0x01402a80 Data dba:  0x01402a84

When it comes to index range scans, things are as expected: since the rowids are sorted, all the blocks are read in one arc (I am not talking about prefetching or batched reads - 'db file parallel read', which exploits IOPS at the storage layer - but a very simplistic case with all prefetching etc. turned off). For full table scans things appear to be different: the scan simply follows the extent map rather than the order of the L1 DBAs.

Maybe this can be explained by something, but I am still not able to figure out any valid reason why we should not perform a full table scan based on the storage-order extent map, with the exception of the last extent. Please let me know if I have missed anything.

Thanks,
Vishnu

--
http://www.freelists.org/webpage/oracle-l




Received on Wed Nov 27 2019 - 13:05:38 CET
