Re: Star Schema from 'Flattened' CSV

From: Peter H. Coffin <hellsop_at_ninehells.com>
Date: Wed, 8 Nov 2017 14:28:45 -0600
Message-ID: <slrnp06q7t.4ah.hellsop_at_nibelheim.ninehells.com>


On Thu, 2 Nov 2017 13:48:00 -0700 (PDT), Aaron Peters wrote:

> CSV Description:
>
> The CSV file contains multiple columns related to the performance of a
> given product. It has a product ID, a product description, unit sale
> price, total units sold, total units returned, as well as names/quotes
> from several vendors giving the price per unit. In the future I
> expect vendors will be added and will want my Schema to handle this
> eventuality gracefully.
>
> I. Consider the trivial case where the columns of the CSV reference
> only 1 vendor price:
>
> CSV COLUMNS:
>
> prod_id, prod_desc, unit_price, tot_sold, tot_ret, vend_name,
> vend_price
>
> STAR SCHEMA:
>
> fact_performance
> - unit_price
> - tot_sold
> - tot_ret
> - vend_price
>
> dim_product
> - prod_id
> - prod_desc
>
> dim_vendor
> - vend_name
>
>
> II. Consider the non-trivial case where the columns of the CSV reference multiple vendor quotes:
>
> CSV COLUMNS:
>
> prod_id, ..., vend1_name, vend1_price, vend2_name, vend2_price
>
> STAR SCHEMA:
>
> fact_performance
> - total_sales
> - total_units_returned
> - total_units_sold
> - vend1_price
> - vend2_price
>
> dim_product
> - prod_id
> - prod_desc
>
> How do I account for multiple vendors in a dimension table?

In the simplest answer, the same way you deal with multiples of ANY entity: more rows. Which will probably end up changing a 1-relationship to an n-relationship somewhere along the line. Which means both of your schemas are wrong, but schema 1 is closer, and you may end up with a snowflake regardless of your intentions. You can cheat by putting more rows into "fact_performance" of schema 2 but with only one vendor per row, and just live with that your "total" is going to be done by SUM().

IMHO "star schema" is almost always an oversimplification of snowflake arrangements, and one shouldn't try to actually BUILD one until you've got the data properly modelled somewhere else and you need a reporting tool for performance reasons.

-- 
The light at the end of the tunnel is not an oncoming train.

It is muzzle-flash.
Received on Wed Nov 08 2017 - 21:28:45 CET

Original text of this message