Re: Data Warehouse Guru Question (Data Model question) between FACT and DIMENSION tbl

From: Tim Gorman <tim.evdbt_at_gmail.com>
Date: Wed, 4 Aug 2021 07:34:40 -0700
Message-ID: <a8d1021e-a6c5-7b85-b17a-0e2365554c81_at_gmail.com>



Chris,

Apologies for slow response!

Comments in-line in /red italicized font/ below...

Thanks!

-Tim

On 7/30/2021 6:16 AM, Chris Taylor wrote:
> Tim,
>
> Thank you for the detailed response.
>
> There are a lot of dimensions involved in the query, this one jumps
> out though because this dimension is very large and the secondary pass
> on the "*TABLE ACCESS BY LOCAL INDEX ROWID" *with a filter condition
> is a bit annoying.
>
> So two MORE questions if you don't mind if I pick your brain a bit.
>
> 1.) *So if we were filtering on a lot of conditions on all the
> dimensions, it would be normal to see the filter condition applied
> after the TABLE ACCESS BY LOCAL INDEX ROWID on all the dimensions
> where we filter for some condition then?*  (That is, is it
> normal/expected to get a filter condition against the table access for
> the rows we are retrieving?  That's annoying because effectively a
> second pass/sanitation of the data but it sounds like it's expected.) 
> Some of these dimensions are really, really large.

    /[TG]: Yes, a joinback is expected, to go back to the dimension     table to retrieve columns which were not part of the original star     transformation operation./

>
> 2.) *Is it NORMAL (or a no-no) to join dimensions to dimensions as
> part of a data warehouse query?*  We've got the following two situations:
> a.)  A join between 2 dimension tables on an ID field and also joined
> to the FACT table on the same ID field.  I think the join between the
> DIMs is redundant here and causes an extra condition.
> b.) We have a situation where partition pruning isn't happening
> because a separate dimension table has the partition key but isn't
> being brought into the joins.  So to get to the partition key of the
> FACT table, we have to join 2 other DIMENSIONS together (Dim to Dim
> Join) and then join that result back to the FACT table to have the
> appropriate partition_key for the FACT table.  I've tested doing that
> and the performance didn't really improve much.  What's crazy is, if I
> do an IN statement instead and do a sub query against the other
> DIMENSION to pull out the partitions I want (SITE_IDs) , the query
> returns in 4 seconds (vs over a minute).

    /[TG]: Yes, I'm guessing that a DIM->DIM join would     exclude/disqualify both DIM tables from the initial star     transformation operation.  The IN statement on FACT is likely     happening at a stage subsequent to the massive filtering of the star     transformation, around the same time as the dimensional joinbacks.      So I think it would be expected for that option to perform better./

>
> For example of item 2.b):  (beware SQL text ahead!) - two examples
>
> *Example 1 of a DIM to DIM join to get the SITE partition key on the
> FACT table:*
> /*performance bad/
> WITH
>     SAWITH0
>     AS
>         (SELECT SUM (T1586.UOM_QNTY_CNT) AS c1,
>                 COUNT (DISTINCT T1526.CE_INTRNL_DRUG_CD) AS c2
>            FROM DM_CHARGE_DIM          T1526  /* Dim_CHARGE */
>                                             ,
>                 DM_DATE_DAY_DIM        T1568             /*
> Dim_DISP_SRV_DT */
>                                             ,
>                 DM_COVERED_ENTITY_DIM  T1032          /*
> Dim_COVERED_ENTITY */
>                                             ,
>                 DM_DISPENSING_FACT     T1586 */* Fact_DISPENSING
> partitioned by SITE_KEY_NB */,*
>                 , DM_SITE_DIM T1 */* Added Table with SITE_KEY_NB
> partition */*
>           WHERE ( T1032.CE_KEY_NB = T1586.CE_KEY_NB
>                  AND T1526.CHG_PRICE_KEY_NB = T1586.CHG_PRICE_KEY_NB
>                  AND T1568.DT_KEY_NB = T1586.SRV_DT_KEY_NB
>                  AND CASE WHEN T1526.HID_MAP_FLG = 1 THEN 'Yes' ELSE
> 'No' END =
>                      'No'
>                  AND CASE WHEN T1526.NON_RX_FLG = 1 THEN 'Yes' ELSE
> 'No' END =
>                      'No'
>                  AND CONCAT (
>                          CONCAT (TRIM (BOTH ' ' FROM T1568.MO_DESC), ' '),
>                          CAST (T1568.YR_NB AS CHARACTER (4))) =
>                      'December 2020'
>                  AND (T1032.CE_GRP_ID IN (29.0,
>                                           30.0,
>                                           31.0,
>                                           32.0,
>                                           741.0,
>                                           4223.0,
>                                           7091.0,
>                                           10552.0))
>                  AND (T1526.CE_GRP_ID IN (29.0,
>                                           30.0,
>                                           31.0,
>                                           32.0,
>                                           741.0,
>                                           4223.0,
>                                           7091.0,
>                                           10552.0)))
> *AND T1586.SITE_KEY_NB = T1.SITE_KEY_NB /* Added Join to get
> SITE_KEY_NB */*
> *AND T1032.CE_ID = T1.CE_ID  /* Added Join on CE_IDs to find
> appropriate SITE_KEY_NBs from DM_SITE_DIM , DIM-DIM join */*
>                                           )
> SELECT D1.c1 AS c1, D1.c2 AS c2, D1.c3 AS c3
>   FROM (SELECT DISTINCT 0 AS c1, D1.c2 AS c2, D1.c1 AS c3
>           FROM SAWITH0 D1) D1
>  WHERE ROWNUM <= 125001
>
> *Example 2 of adding an IN statement for the FACT table instead:*
> /*performance great!/
> WITH
>     SAWITH0
>     AS
>         (SELECT SUM (T1586.UOM_QNTY_CNT) AS c1,
>                 COUNT (DISTINCT T1526.CE_INTRNL_DRUG_CD) AS c2
>            FROM DM_CHARGE_DIM          T1526  /* Dim_CHARGE */
>                                             ,
>                 DM_DATE_DAY_DIM        T1568             /*
> Dim_DISP_SRV_DT */
>                                             ,
>                 DM_COVERED_ENTITY_DIM  T1032          /*
> Dim_COVERED_ENTITY */
>                                             ,
>                 DM_DISPENSING_FACT     T1586 * /* Fact_DISPENSING
> partitioned by SITE_KEY_NB */,*
>           WHERE ( T1032.CE_KEY_NB = T1586.CE_KEY_NB
>                  AND T1526.CHG_PRICE_KEY_NB = T1586.CHG_PRICE_KEY_NB
>                  AND T1568.DT_KEY_NB = T1586.SRV_DT_KEY_NB
>                  AND CASE WHEN T1526.HID_MAP_FLG = 1 THEN 'Yes' ELSE
> 'No' END =
>                      'No'
>                  AND CASE WHEN T1526.NON_RX_FLG = 1 THEN 'Yes' ELSE
> 'No' END =
>                      'No'
>                  AND CONCAT (
>                          CONCAT (TRIM (BOTH ' ' FROM T1568.MO_DESC), ' '),
>                          CAST (T1568.YR_NB AS CHARACTER (4))) =
>                      'December 2020'
>                  AND (T1032.CE_GRP_ID IN (29.0,
>                                           30.0,
>                                           31.0,
>                                           32.0,
>                                           741.0,
>                                           4223.0,
>                                           7091.0,
>                                           10552.0))
>                  AND (T1526.CE_GRP_ID IN (29.0,
>                                           30.0,
>                                           31.0,
>                                           32.0,
>                                           741.0,
>                                           4223.0,
>                                           7091.0,
>                                           10552.0)))
> *AND **T1586.SITE_KEY_NB IN (SELECT T1.SITE_KEY_NB /* Find the Site
> Key Numbers that belong to our CE_GRP_IDs */*
> *        from DIM_SITE_DIM T1*
> *                                  where T1.CE_ID IN*
> *                                        (*
> *                                        SELECT CE_ID*
> *                                        FROM DM_COVERED_ENTITY_DIM T2
> /* This table is our main join above as well */*
> *                                        WHERE T2.CE_GRP_ID IN (29.0,*
> *                30.0,
>                 31.0,
>                 32.0,
>                 741.0,
>                 4223.0,
>                 7091.0,
>                 10552.0*
> *)*
> *                                        )*
> *                                        )*
>                                           )
> SELECT D1.c1 AS c1, D1.c2 AS c2, D1.c3 AS c3
>   FROM (SELECT DISTINCT 0 AS c1, D1.c2 AS c2, D1.c1 AS c3
>           FROM SAWITH0 D1) D1
>  WHERE ROWNUM <= 125001
>
> I also tried creating a mapping View between DM_SITE_DIM and
> DM_COVERED_ENTITY_DIM that has SITE_KEY_NB, CE_ID, CE_GRP_ID and add
> that to the join and performance is also equally bad.
>
> Also we have the PKs and enabled FKs (novalidate) between the DIMs and
> FACT.
>
> Finally, I will look CREATE HIERARCHY stuff you mentioned along with
> the BITMAP join indexes.  Those concern me a bit because we do a lot
> of parallel DDL when running the ETL process and there seems to be a
> lot of Oracle caveats/warnings with using BITMAP join indexes.

    /[TG]: There are a lot of limitations on BJI, but if they can be     built, then they provide an efficiency similar to that of     materialized views have on normal table join operations and     filtering operations.  That is, the BJI is the intermediate result     of the star transformation materialized as a bitmap index, which     allows queries to skip over portions of the normal star     transformation process, and I believe their presence also helps the     CBO choose their use, effectively coercing a star transformation     operation./

>
> Chris
>
>
>
> On Thu, Jul 29, 2021 at 6:43 PM Tim Gorman <tim.evdbt_at_gmail.com
> <mailto:tim.evdbt_at_gmail.com>> wrote:
>
> Chris,
>
> It is just the two tables (i.e. FACT and DIM) involved in the
> query?  There are no additional dimension tables?
>
> If so, that is likely much of the problem, because the concept of
> a dimensional data model (a.k.a. star schema) as well as Oracle's
> "star transformation" mechanism is built around the idea of two or
> more dimensions being involved in any query on the fact table.  As
> many dimensions as possible, in fact.  The more, the merrier.
>
> The basic idea is this...
>
> 1. user specifies search criteria on columns on the dimensions
> * if no search predicates are specified for a dimension
> table, then that dimension is not included in steps 2-4 below.
> o instead it will be used in step 5 as a post-processing
> filter or join step
> 2. database uses search criteria to filter down to a result set
> from each dimension
> 3. database uses all the result sets from all the dimensions to
> create a consolidated set (a.k.a. star transformation)
> 4. database uses all of the dimension key values to join to the
> bitmap keys in the fact table (a.k.a. star join)
> 5. database uses additional filter predicates or join predicates
> against the fact table to produce final results
>
>
> The entire reason for having bitmap indexes on all of the foreign
> keys of the fact table is for that big operation on step 4 to
> work.  That operation in step 4 works well if the consolidated set
> created in step 3 is small, and that's only going to happen if
> filtering happens on the dimensions.
>
> The filtering in step 5 is merely an after-effect, ideally most of
> the work of joining all the tables was already performed in step
> 4, and step 5 is running against a really small final result set.
>
> The upshot is the more well-filtered dimensions that are involved
> in the star transformation, the more efficient is the star join. 
> Most (if not all) filtering predicates in star transformations
> should be found on columns in the dimensions.  Ideally there are
> no filter predicates against the fact table, but if there are,
> then we have to consider whether they should be converted into
> dimensions or left as "conformed dimensions".
>
> Are star transformations important?  Yes, if you are attempting to
> query a dimensional data model.
>
> There is more about star transformations HERE
> <https://docs.oracle.com/en/database/oracle/oracle-database/19/tgsql/query-transformations.html#GUID-76B6B58B-24B0-4DFE-AC1B-CFAC4D93C55A>. 
> Also, the STAR_TRANSFORMATION_ENABLED
> <https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/STAR_TRANSFORMATION_ENABLED.html#GUID-B2E6145D-164A-4453-9839-0F6E6442A922>
> parameter should not be FALSE, and there are additional guidelines
> to post for the Oracle optimizer to initiate star transformations
> more easily, such as hierarchies specified with the CREATE
> HIERARCHY
> <https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-HIERARCHY.html#GUID-73925877-992B-4624-AA28-8F565E9C3F0D>
> command, enabled PK constraints on the dimension tables, and
> ENABLED or RELY enforced foreign-key constraints on the fact table.
>
> Finally, one fun feature for star transformations is when Oracle
> database materializes the consolidated set mentioned in step 3
> above, which is called a bitmap-join index. Unfortunately, there
> are no specific documentation pages in the Oracle doc-set devoted
> to bitmap-join indexes, but there is a good discussion of DW
> optimization HERE
> <https://docs.oracle.com/en/database/oracle/oracle-database/19/dwhsg/data-warehouse-optimizations-techniques.html#GUID-79C29A60-3477-448D-835D-2940D060D050>as
> well Tim Hall's more detailed blog post HERE
> <https://oracle-base.com/articles/9i/bitmap-join-indexes>.  If it
> is possible to create a bitmap-join index on a fact table, then it
> is almost certain that any query using those dimensional search
> predicates will achieve a star transformation.
>
> Hope this helps,
>
> -Tim
>
> On 7/28/2021 6:41 AM, Chris Taylor wrote:
>> This is a sanity check question as I'm helping out with a data
>> model design and indexing strategy with our ETL team. None of us
>> are what I'd call real strong on data warehouse design.
>>
>> Here's the setup and the question:
>>
>> Setup:
>> We have a DIMENSION table partitioned by SITE (Client) and we
>> have a FACT table partitioned by SITE (Client).
>>
>> The PK on the DIMENSION is a combo of SITE+KEY_FIELD
>> The join between the DIMENSION and the FACT is on SITE+KEY_FIELD
>>
>> All that is fine and works great with bitmap indexes on the FACT etc.
>>
>> *HOWEVER* here is the problem:
>>
>> In one particular query, we're applying an /additional FILTER
>> /condition to the DIMENSION table (instead of the FACT) after the
>> join which causes a bit of performance penalty as the rows are
>> found between the DIMENSION and FACT but then the filter is then
>> applied to the rows from the DIMENSION table (which is very large
>> table also).
>>
>> My theory is that this filter clause (the columns used by the
>> filter) should actually also be in the FACT and apply the filter
>> there with a BITMAP index instead of applying it to the
>> DIMENSION.   RIght now the condition (columns) in the DIMENSION
>> that are checked for a value do not exist in the FACT table..
>>
>> The join & condition is something like this:
>>
>> *|  35 |     PARTITION LIST ITERATOR                    |        
>>                     |      0 |      1 |     |     1   (0)|
>> 00:00:01 |   KEY |   KEY |  Q1,01 | PCWP
>> |* 36 |                  INDEX UNIQUE SCAN                 |
>> DIMENSION_TABLE_NDX2      |  0 |      1 |       |     1   (0)|
>> 00:00:01 | KEY |   KEY |  Q1,01 | PCWP
>> |* 37 |                TABLE ACCESS BY LOCAL INDEX ROWID        
>>   | DIMENSION_TABLE
>>
>>
>>   36 - access("FACT_TABLE"."SITE_ID"="DIMENSION_TABLE"."SITE_ID"
>> AND "FACT_TABLE"."KEY_FIELD"="DIMENSION_TABLE"."KEY_FIELD")
>>   37 - filter((CASE "FIELD1" WHEN 'I' THEN 'Inpatient' WHEN 'O'
>> THEN 'Outpatient' END))*
>>
>> So we have the join between the FACT and DIMENSION on SITE_ID &
>> KEY_FIELD
>>
>> But then we have an additional FILTER being applied to the
>> DIMENSION table.
>>
>> Currently this doesn't kill performance, but definitely adds
>> about 25% execution time (6 secs out of 24) and I'm concerned
>> this will get worse as data volumes grow.
>>
>> So the question is , should CASE statement filter really get
>> applied to the FACT table and replicate that column (or some
>> version of it) to the FACT table instead?   (Generally as a rule
>> of thumb I mean)
>>
>> We can add the CASE statement to the PK index on the
>> DIMENSION_TABLE as well, and thats what the ETL team is
>> suggesting, but that seems like the wrong way to go.
>>
>> Chris
>>
>>
>>
>>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Aug 04 2021 - 16:34:40 CEST

Original text of this message