Re: sane number of the table partitions in DWH

From: Milen Kulev <makulev_at_gmx.net>
Date: Wed, 01 Apr 2009 14:05:40 +0200
Message-ID: <20090401120540.94110_at_gmx.net>



Hello Greg,
many thanks for your reply.
Due to a lack of time I really didn't provide many details. Now the details: this is an 8-node RAC cluster (Oracle version 10.2.0.4, OS = SUSE 9 SP4). All nodes have between 64 and 128 GB RAM and 16 to 32 CPUs.

A typical SQL statement looks like this (a very simplified version; dimension tables and many GROUP BY columns are left off, since they add very little to the total response time; all the tables (1 fact, 2 dimensions) are joined together via HASH joins):

SELECT
     t1.project_id project_1,
     t1.productgroup_id productgroup_id_d,
     t1.period_id period_id_d,
     t1.basechannel_id basechannel_id_d,
     t1.outlet_id outlet_id_d,
     MAX(t1.pd_distribution_factor) distributionfactor,
     SUM(t1.pd_sales_units*t1.pd_projection_factor*t1.pd_price_units_eur) salesvaluefactor
FROM
     DWH_RB.rb_fact_pd_out_itm t1
WHERE
     t1.period_id IN
     (20070199999030,20070299999030,20070399999030,20070499999030,20070599999030,20070699999030,20070799999030,20070899999030,20070899999060,20070999999030,20071099999030,20071099999060,20071199999030,20071299999030)
 AND t1.country_id IN
     (15,12,14,13,85,26,16,18,19,86,17,23,25,48,30,87,29,82,49,901,908,51,52,53,45)
 AND t1.domain_productgroup_id IN (32647,32672)
 AND t1.project_id IN
     (42175,42377,42495,42496,42497,42498,42500,42502,42514,42517,42519,42521,42523,42525,42531,42532,42550,42552,42556,42559,42644,42646,42647,42649,42651,42653,42667,42668,42753,42754,42836,42838,42853,42859,43030,43031,43943,43944,43964,43966,56248,56249,59669,59670,70615,70618,70620,70623,70625,70627,71220,71221)
 AND t1.productgroup_id IN
     (15647,15672,12582,12672,14982,14672,20809,20803,24191,24194,26101,26172,16583,16617,18081,18672,19420,19672,17672,23291,23293,25883,25617,48647,48672,30491,30472,87101,87172,29391,29672,82172,82101,49672,49647,901101,901172,17283,908647,908672,51101,51172,52101,52172,53101,53172,45581,45585)
 AND t1.flg_status_id IN ('1','2','4')
 AND t1.project_type_id IN ('1','3')
GROUP BY
     t1.project_id,
     t1.productgroup_id,
     t1.period_id,
     t1.basechannel_id,
     t1.outlet_id;

There are single-column bitmap indexes on the following columns:
domain_productgroup_id
country_id
period_id
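
(For completeness, a minimal sketch of how these indexes are defined; the index names below are placeholders, and since the fact table is partitioned the bitmap indexes are LOCAL:)

   CREATE BITMAP INDEX dwh_rb.bix_fact_dom_productgroup
      ON dwh_rb.rb_fact_pd_out_itm (domain_productgroup_id) LOCAL;
   CREATE BITMAP INDEX dwh_rb.bix_fact_country
      ON dwh_rb.rb_fact_pd_out_itm (country_id) LOCAL;
   CREATE BITMAP INDEX dwh_rb.bix_fact_period
      ON dwh_rb.rb_fact_pd_out_itm (period_id) LOCAL;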

The table DWH_RB.rb_fact_pd_out_itm is the fact table (~800-900 GB at the moment). This fact table is partitioned as follows (some parts of the definition are left off, since they are irrelevant to the problem):

 CREATE TABLE "DWH_RB"."RB_FACT_PD_OUT_ITM"
    ( "PROJECT_ID" NUMBER NOT NULL ENABLE,
      "PROJECT_TYPE_ID" NUMBER NOT NULL ENABLE,
      "PERIOD_ID" NUMBER NOT NULL ENABLE,
      "PERIODICITY_ID" NUMBER NOT NULL ENABLE,
      "OUTLET_ID" NUMBER NOT NULL ENABLE,
      "BASECHANNEL_ID" NUMBER NOT NULL ENABLE,
      "COUNTRY_CHANNEL_ID" NUMBER NOT NULL ENABLE,
      "CHANNEL_ID" NUMBER NOT NULL ENABLE,
      "COUNTRY_ID" NUMBER NOT NULL ENABLE,
      "ITEM_ID" NUMBER NOT NULL ENABLE,
      "ITEM_VERSION_ID" NUMBER NOT NULL ENABLE,
      "PRODUCTGROUP_ID" NUMBER NOT NULL ENABLE,
      "DOMAIN_PRODUCTGROUP_ID" NUMBER NOT NULL ENABLE,
      ...... )
 PARTITION BY RANGE ("PERIOD_ID")
 SUBPARTITION BY HASH ("PROJECT_ID")
 SUBPARTITION TEMPLATE
    ( SUBPARTITION "S1",
      SUBPARTITION "S2",
      SUBPARTITION "S3",
      .... ....

So the partitioning scheme is range->hash. PERIOD_ID is a time dimension.

PROJECT_ID is a kind of artificial key, a grouping of DOMAIN_PRODUCTGROUP_ID and PRODUCT_TYPE_ID. The data volume is expected to grow by 30-40% each year.

The biggest problems (in my opinion) are:

1) Customer expectations:
   The database design is for a typical DWH application (star schema). The customers of my customer expect response times in the 10-20 second range (at the moment the response time for a typical SQL statement like the one above is ~200-1000 seconds, depending on the amount of data processed).

2) Data volatility:
   This is an operational DWH database. Data is loaded almost constantly (mainly INSERTs + UPDATEs). Bitmap indexes are getting bloated/corrupt and must be rebuilt often. Furthermore, data in the fact tables cannot be pre-aggregated and/or compressed (because of the steady data changes).

3) Many customers are accessing this resource base, and different customers need different levels of aggregation (the SQL statement above is a kind of template: sometimes even more predicates are generated, sometimes the IN lists are longer/shorter, but the core structure of the statement is the same).

Here is an excerpt of the 10046 event trace file (processed with the OraSRP profiler):

Session Flat Profile


                                                                            Time Per Call
                                                                   -----------------------------------
Event Name                               % Time      Seconds      Calls        Avg        Min        Max
---------------------------------------- ------- ------------ ---------- ---------- ---------- ----------
db file sequential read                   47.1%     412.9400s    240,657    0.0017s    0.0000s    0.3119s
gc cr grant 2-way                         27.1%     237.9957s    225,795    0.0010s    0.0000s    0.1638s
FETCH calls [CPU]                         11.2%      98.3720s     42,273    0.0023s    0.0000s   96.9783s
SQL*Net message from client                7.6%      66.4010s     42,190    0.0015s    0.0005s   11.0692s
unaccounted-for time                       4.1%      35.6414s
SQL*Net message from client [idle]         1.9%      16.7582s          3    5.5861s    2.7191s    9.9124s
gc current block 3-way                     0.6%       5.5963s      2,438    0.0022s    0.0002s    0.0223s
gc cr grant congested                      0.2%       1.5404s        382    0.0040s    0.0013s    0.0237s

The average times for "db file sequential read" and "gc cr grant 2-way" are ~1 ms, so the performance (in terms of latency) is OK.

The idea is to have many partitions so that more granular partition pruning can occur. Each partition should then be read via a full table scan (FTS) using multiblock reads (MBRC = 16).
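
To make sure the pruning really happens, I check the Pstart/Pstop columns of the execution plan (a minimal sketch; the IN lists are shortened here and COUNT(*) only stands in for the real select list):

   EXPLAIN PLAN FOR
   SELECT /*+ FULL(t1) */ COUNT(*)
     FROM DWH_RB.rb_fact_pd_out_itm t1
    WHERE t1.period_id IN (20070199999030, 20070299999030)
      AND t1.project_id IN (42175, 42377);

   -- Pstart/Pstop show which range/hash partitions survive the pruning
   SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);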

The biggest concern is the parse time. At the moment the fact table has ~390 partitions. My customer claims that sometimes the SQL statements hang in the parse phase. I couldn't verify this (it never happened while I was on site at the customer), but the client suspects it *could* have something to do with the number of partitions.
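
The next time the customer reports such a hang I plan to look at the affected session directly (a minimal sketch; :sid is the SID of the hanging session):

   SELECT sn.name, st.value
     FROM v$sesstat st, v$statname sn
    WHERE sn.statistic# = st.statistic#
      AND st.sid = :sid
      AND sn.name IN ('parse time elapsed', 'parse time cpu', 'parse count (hard)');

   -- and what the session is currently waiting for
   SELECT event, state, seconds_in_wait
     FROM v$session_wait
    WHERE sid = :sid;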

Now my comments to your comments:

"

> Multi-column partitioning can work, however it behaves different when
> it comes to partition elimination based on the type of partitioning
> and how many of the partitioning keys are present in the predicate in
> the query.

"

It is no problem to supply all the partitioning keys as predicates. The application accessing the database is written in-house and can be changed accordingly. What cannot be changed is the data volume and the business customers' expectations (max. 20 sec response time).

"

> Based on these numbers I don't really see the benefit for the large
> number of partitions.  A 4TB table with 200K partitions means 20MB
> partitions (assuming even distribution).  That is quite small for a
> partition size.  I would say too small to warrant significant benefit.
>  And since you mention the fact tables are up to 1TB, well, that means
> single MB or smaller for each partition segment.  That is probably too
> small to be overly useful.

"

That is the point! Each partition should be small, so that after partition pruning each surviving partition amounts to a read of only 20-100 MB. Note that to satisfy a query, many (hopefully not *too* many) partitions (each ~20-100 MB) must still be read, preferably using PQ. At the moment no PQ occurs.
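
For example (a minimal sketch; the degree of 8 is only an assumption to illustrate the idea), PQ could be enabled either on the table or per statement:

   -- option 1: give the fact table a default degree of parallelism
   ALTER TABLE dwh_rb.rb_fact_pd_out_itm PARALLEL 8;

   -- option 2: hint the individual statement
   SELECT /*+ FULL(t1) PARALLEL(t1, 8) */
          t1.project_id, SUM(t1.pd_sales_units)
     FROM DWH_RB.rb_fact_pd_out_itm t1
    WHERE t1.period_id IN (20070199999030, 20070299999030)
    GROUP BY t1.project_id;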

"

> I would suggest range(date) with a granule size of 1 day because that
> tends to be the unit that business reports frequently use.  Unless you
> are either 1) doing fact-fact joins and want to leverage
> partition-wise joins or 2) single key access via location_id and not
> using indexes, then I would see little benefit from the hash
> subpartitioning at that scale.

"

The minimum time granule for reports is 1 week. The typical time granule is 3 months. There are users who need 1-month time granules. Many customers are accessing this resource base, and each customer has its own requirements in terms of time aggregations, region aggregations, and product type aggregations.

At the moment we are discussing the following design:

1) Each country gets its own fact table, e.g. FACT_USA, FACT_UK etc. There will be ~60 such fact tables.

2) Each fact table will be range->list partitioned:
   Range partitioning (level 1) -> time dimension (monthly), for the last 4 years + some special time partitions -> 280 partitions.
   List partitioning (level 2) -> product type dimension. There are about 300 different product type groups.
   Thus, a single country-specific fact table will have 280x300 = 84,000 partitions (a DDL sketch of this layout follows after point 3).

3) Using country-specific services:

	If a user wants aggregated UK data, he/she will connect to service UK, which is bound to instance1, for example.
	The hope is to exploit cache locality effects, e.g. table FACT_UK will be cached only in the buffer cache of instance1.
	The idea is to minimize interconnect traffic (see the profile above).
	Data loading is always country-specific, so there should be little "cache pollution" on each instance (the cache of instance1 should contain solely data from the FACT_UK table).

This should minimize the "gc cr %" interconnect traffic.
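
A minimal sketch of what such a country-specific fact table could look like (all names and partition bounds below are assumptions for illustration; the real DDL would list all ~280 range partitions and ~300 product type groups):

   CREATE TABLE dwh_rb.fact_uk (
       period_id        NUMBER NOT NULL,
       product_type_id  NUMBER NOT NULL,
       project_id       NUMBER NOT NULL,
       pd_sales_units   NUMBER
       -- ... remaining fact columns ...
   )
   PARTITION BY RANGE (period_id)
   SUBPARTITION BY LIST (product_type_id)
   SUBPARTITION TEMPLATE (
       SUBPARTITION pt_001 VALUES (1),
       SUBPARTITION pt_002 VALUES (2)
       -- ... one subpartition per product type group ...
   )
   (
       PARTITION p_200901 VALUES LESS THAN (20090299999030),
       PARTITION p_200902 VALUES LESS THAN (20090399999030)
       -- ... one partition per month plus the special time partitions ...
   );

And the country-specific service would be bound to its preferred instance roughly like this (10g srvctl syntax; database, service and instance names are assumed):

   srvctl add service -d DWHRB -s UK_SVC -r inst1 -a inst2
   srvctl start service -d DWHRB -s UK_SVC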

Best Regards. Milen

-------- Original Message --------
> Date: Tue, 31 Mar 2009 09:47:42 -0700
> From: Greg Rahn <greg_at_structureddata.org>
> To: makulev_at_gmx.net
> CC: oracle-l_at_freelists.o
> Subject: Re: sane number of the table partitions in DWH
> My first comments would be if you are going to have an extreme design,
> then your testing should also be extreme, in both scale and depth.
> 
> > The number of calculated/estimated partitions is 200 000 - 300 000 (Yep,
> this is NOT a typing error !).
> 
> Is this total for the db or per table?
> 
> > 1) What is the "manageable" number of partitions (from performance point
> of view == parse times) that I you have seen by clients.
> 
> The number of partitions does affect parsing, but it also has an
> impact on the buffer cache.  If the segment header is not in the
> buffer cache the execution will take longer than if it was (slightly
> obvious).  The not obvious part of this is that if you are using
> parallel query, the query coordinator reads the extent maps serially
> so there can be a long time between start of query and when the PQ
> slaves start their work, unless you are on 11.1.0.7 or have bug
> 6525904 fix as a one-off.  To put this in perspective, for a FTS of a
> table of 92,160 segments I have seen this "stall" take 30 minutes
> prior to the bug fix.
> 
> > 2) Any parameters (also underscore ...) to reduce the hard-parse times
> (because of the enormous amount of partitions) ?
> 
> Set your buffer cache large enough to keep all the segment headers in it.
> 
> > 3) If I have so many partitions , what is the influence on the
> fragmentation of the shared pool? Any experiences ?
> 
> I don't think this is really an issue.  Warehouses usually hard parse
> everything, use it once, and it ages out fast, so there is likely
> little need for more than a few GB of shared pool memory.
> 
> > 4) Is multi-column partitioning  a save bet (bugs) ?
> 
> Multi-column partitioning can work, however it behaves different when
> it comes to partition elimination based on the type of partitioning
> and how many of the partitioning keys are present in the predicate in
> the query.  Test this well noting partition elimination before
> deciding on a design.
> 
> > Fact tables have sizes ~ 700 - 1000 GB.
> > Application type = DWH ( ~ 4.5 TB)
> > At the moment partitioning schmeme is range(date)/hash(location_id)
> 
> Based on these numbers I don't really see the benefit for the large
> number of partitions.  A 4TB table with 200K partitions means 20MB
> partitions (assuming even distribution).  That is quite small for a
> partition size.  I would say too small to warrant significant benefit.
>  And since you mention the fact tables are up to 1TB, well, that means
> single MB or smaller for each partition segment.  That is probably too
> small to be overly useful.
> 
> I would suggest range(date) with a granule size of 1 day because that
> tends to be the unit that business reports frequently use.  Unless you
> are either 1) doing fact-fact joins and want to leverage
> partition-wise joins or 2) single key access via location_id and not
> using indexes, then I would see little benefit from the hash
> subpartitioning at that scale.  Of course, this is based off very
> little provided information, so take it with a grain of salt.
> 
> -- 
> Regards,
> Greg Rahn
> http://structureddata.org

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Apr 01 2009 - 07:05:40 CDT
