RE: Is Parallelism happening at INSERT level?

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Wed, 6 May 2020 10:47:11 -0400
Message-ID: <450501d623b5$39f923d0$adeb6b70$_at_rsiz.com>



AND (not in contradiction of anything that went before, at least not intentionally), a bit of this gets into how much human engineering is compatible with your apparent goal:  

“We are trying to tune a query to use parallelism at multiple levels. Here is the query along with the plan gotten by using dbms_xplan:  

PLAN_TABLE_OUTPUT


SQL_ID 1qrcwrva8jgba, child number 0


insert /*+ ENABLE_PARALLEL_DML APPEND PARALLEL(4) */ into abcd select

/*+ parallel(4) */ * from user1.sce where esk in

(select esk from efgh)”  

Now, with just a bit of work, you could compute the breakpoints for (unique) quartiles of efgh (either raw [probably improves things], or weighted by the number of records each quartile then selects from sce for insertion into abcd [better, best?]).
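Something like this would get you the raw version (just a sketch - table and column names are from the original post, everything else is assumed):

-- split the distinct esk values into 4 equal-height buckets;
-- the breakpoint for each quartile is the top of its bucket
select quartile, max(esk) as breakpoint
from   (select esk, ntile(4) over (order by esk) as quartile
        from   (select distinct esk from efgh))
group  by quartile
order  by quartile;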

If you deposit those into global temporary tables efgh1, efgh2, efgh3, efgh4 and run  

insert ... select
* from user1.sce where esk in
(select esk from efgh1)
union all select
* from user1.sce where esk in
(select esk from efgh2)
...

with the parameters/hints to build the components of the union all in parallel, you probably win. You get four smaller hashes and no query coordinator for each of the selects.
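Spelled out, that might look like the following (a sketch only - the GTT definitions and the hint placement are assumptions to be tested):

-- one GTT per quartile, loaded with that quartile's esk values
create global temporary table efgh1 on commit preserve rows
as select esk from efgh where 1 = 0;
-- ... likewise efgh2, efgh3, efgh4

alter session enable parallel dml;

insert /*+ append parallel(4) */ into abcd
select /*+ parallel(4) */ * from user1.sce
where  esk in (select esk from efgh1)
union all
select /*+ parallel(4) */ * from user1.sce
where  esk in (select esk from efgh2)
union all
select /*+ parallel(4) */ * from user1.sce
where  esk in (select esk from efgh3)
union all
select /*+ parallel(4) */ * from user1.sce
where  esk in (select esk from efgh4);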

You have to do the union because, with append, breaking it up into four separate insert jobs would just serialize on each other for the append... hmm, then again, running the quartiles one after another, each as a parallel 4 append, might still be faster and would keep your footprint smaller...
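That serialized alternative would be four statements of the shape below (again a sketch), with a commit after each one, since a direct-path insert must be committed before the next statement can touch the segment:

insert /*+ enable_parallel_dml append parallel(4) */ into abcd
select /*+ parallel(4) */ * from user1.sce
where  esk in (select esk from efgh1);
commit;
-- then the same statement for efgh2, efgh3, efgh4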

Unless, that is, you happen to have (or can create) exactly matching partitioning leading on esk, and you can arrange that each quartile appends into a distinct partition (or into a set of disjoint partitions - though multiple partitions per quartile means generating even more jobs, one separate SQL appending into exactly one partition per statement you run). You could then fire those off in threads balanced for your desired parallelism, so that every thread completes in as close to the same total elapsed time as possible.
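The one-partition-per-job piece might look like this (partition name is hypothetical):

-- partition-extended syntax: a row that does not belong in p_q1
-- raises ORA-14401, which is exactly the safety check you want
insert /*+ append parallel(4) */ into abcd partition (p_q1)
select /*+ parallel(4) */ * from user1.sce
where  esk in (select esk from efgh1);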

Now if you do the matching partitioning on efgh, then you just need to fire them off for the list of partitions, which is less software engineering but more DBA engineering with possibly good or bad ramifications for all your queries. IF you have a lot of queries for a single esk or esks that would often be in a single partition, it is possibly quite sweet. Or it could be a disaster.  
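Driving the jobs off the dictionary is then one query (sketch; assumes the matching partitioning is in place):

select partition_name
from   user_tab_partitions
where  table_name = 'EFGH'
order  by partition_position;

-- one job per partition, each of the shape:
--   insert /*+ append */ into abcd partition (<p>)
--   select * from user1.sce
--   where  esk in (select esk from efgh partition (<p>));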

Money equals speed. It increasingly requires more money than it is economic to spend to beat Oracle's generated plans by dissecting the queries - especially if you follow the advice of JL et al. in the first place. BUT IF the need justifies the expense, you can probably figure out a way to run this faster than you currently do, and possibly with a lower resource footprint.

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Jonathan Lewis
Sent: Wednesday, May 06, 2020 3:05 AM
To: ORACLE-L
Subject: Re: Is Parallelism happening at INSERT level?  

Ram,  

As Lothar says, you are doing a parallel insert as well as a parallel select.

You need only 4 slaves to run this one at parallel 4 because of the optimizer's choice of plan, which takes advantage of the 12c PQ_REPLICATE() mechanism.
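You can confirm the single slave set from v$px_session while the statement runs - a sketch, so verify against your version:

select qcsid, server_group, server_set, count(*) as slaves
from   v$px_session
where  server_set is not null  -- the coordinator's own row has no set
group  by qcsid, server_group, server_set;

One row with a count of 4 means one set of slaves; the 11g-style plan further down would show two such rows.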

Each slave reads the WHOLE efgh table into an in-memory hash table, then scans sections of the sce table to probe its own copy of the efgh hash table, then inserts surviving rows in batches.

If you were on 11g the plan would look more like:  



----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |          |       |       |   487 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                    |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10001 |   564K|    75M|   487  (11)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT (HYBRID TSM/HWMB)| T2       |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     OPTIMIZER STATISTICS GATHERING |          |   564K|    75M|   487  (11)| 00:00:01 |  Q1,01 | PCWP |            |
|*  5 |      HASH JOIN RIGHT SEMI          |          |   564K|    75M|   487  (11)| 00:00:01 |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE                   |          | 56764 |   277K|    42   (5)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX SEND BROADCAST           | :TQ10000 | 56764 |   277K|    42   (5)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   8 |         PX BLOCK ITERATOR          |          | 56764 |   277K|    42   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|*  9 |          TABLE ACCESS FULL         | T1       | 56764 |   277K|    42   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|  10 |       PX BLOCK ITERATOR            |          |   567K|    73M|   438  (10)| 00:00:01 |  Q1,01 | PCWC |            |
|* 11 |        TABLE ACCESS FULL           | T0       |   567K|    73M|   438  (10)| 00:00:01 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------

And this WOULD run with 8 slaves - 2 sets of 4.  
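If you wanted to reproduce that two-set broadcast shape on 12c for comparison, the NO_PQ_REPLICATE hint should disable the replicate mechanism - the placement below is my guess, so test it on your version:

insert /*+ enable_parallel_dml append parallel(4) */ into abcd
select /*+ parallel(4) */ * from user1.sce
where  esk in (select /*+ no_pq_replicate(efgh) */ esk from efgh);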

As for timing - your figures report 536 GB for sce, to be scanned in 1:51. That's roughly 5 GB per second - with each slave doing 1.25 GB per second and allowing no time for any of the CPU required to do the hash join and write the data. Is that realistic? It seems a little unlikely.
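The SQL Monitor report will tell you where the time is actually going, slave by slave; something like this from SQL*Plus:

set long 1000000 longchunksize 1000000 linesize 250 pagesize 0
select dbms_sqltune.report_sql_monitor(sql_id => '1qrcwrva8jgba') from dual;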

Are any of the direct path reads and writes also "temp"? Since you've got 4 copies of 10 GB (estimated) of data, it seems likely that the hash join is going to spill and end up doing a lot of temp space writes.
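That's easy to check while it runs - a sketch, and I'd verify the column names on your version:

select sid, operation_type, number_passes,
       round(actual_mem_used/1024/1024) as mem_mb,
       round(tempseg_size/1024/1024)    as temp_mb
from   v$sql_workarea_active
where  sql_id = '1qrcwrva8jgba';

-- a non-null tempseg_size, or number_passes > 0, means the hash join is spilling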

Regards

Jonathan Lewis          

On Tue, May 5, 2020 at 11:34 PM Ram Raman <veeeraman_at_gmail.com> wrote:

Hi,   

We are trying to tune a query to use parallelism at multiple levels. Here is the query along with the plan gotten by using dbms_xplan:  

PLAN_TABLE_OUTPUT


SQL_ID 1qrcwrva8jgba, child number 0


insert /*+ ENABLE_PARALLEL_DML APPEND PARALLEL(4) */ into abcd select

/*+ parallel(4) */ * from user1.sce where esk in

(select esk from efgh)  

Plan hash value: 1996703413  


------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |          |       |       | 2840K(100) |          |       |       |        |      |            |
|   1 |  PX COORDINATOR                    |          |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10000 |    30M|    10G| 2840K  (1) | 00:01:51 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT (HYBRID TSM/HWMB)|          |       |       |            |          |       |       |  Q1,00 | PCWP |            |
|   4 |     OPTIMIZER STATISTICS GATHERING |          |    30M|    10G| 2840K  (1) | 00:01:51 |       |       |  Q1,00 | PCWP |            |
|*  5 |      HASH JOIN RIGHT SEMI          |          |    30M|    10G| 2840K  (1) | 00:01:51 |       |       |  Q1,00 | PCWP |            |
|   6 |       TABLE ACCESS FULL            | efgh     |   193K|  1131K|    36   (3)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   7 |       PX BLOCK ITERATOR            |          | 1547M |  536G | 2839K  (1) | 00:01:51 |     1 |    32 |  Q1,00 | PCWC |            |
|*  8 |        TABLE ACCESS FULL           | sce      | 1547M |  536G | 2839K  (1) | 00:01:51 |     1 |    32 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------


......

Note
-----
   - dynamic statistics used: dynamic sampling (level=AUTO)
   - Degree of Parallelism is 4 because of hint

Given that P->S happens at line id 2, is this statement using parallelism for the INSERT part of it? I can see a parallelism of 4 reported in the v$px_session view for this SQL. Should I see 8 rows if there were parallelism of 4 for the SELECT and the INSERT each?  

Don't know why it is saying the statement will complete in 1:51 - it has been running for an hour. The waits are all on 'direct path read/write'. We have some bad IO, but we are still finding ways to tune the statement. Version is 12c.   

Thanks

Ram  

--
http://www.freelists.org/webpage/oracle-l
Received on Wed May 06 2020 - 16:47:11 CEST
