Re: data warehousing design - to denormalize or not to denormalize

From: Steven Lembark <lembark_at_wrkhors.com>
Date: Fri, 10 May 2002 18:13:18 -0800
Message-ID: <F001.0045EC89.20020510181318@fatcity.com>

> From what I remember, one of the main advantages of a normalized
> database is a certain level of data integrity and flexibility. The
> integrity comes from foreign keys, lack of duplicate data, and
> generally a data design based on characteristics of the data. The
> flexibility comes from the ability to query in many different ways
> (you are not restricted as you are in a hierarchical db). Though I
> have never designed a data warehouse, I always thought that the
> integrity piece would be less of an issue. Since there are very few
> writes to a warehouse, there is less chance to foul up the data
> integrity as long as there are good controls on the load. Flexibility
> on querying a warehouse seems to also be possible using other design
> methods (fact tables, snowflake, ...)

Basic warehouse operations use an offline update with read-only access for users. The offline update allows for complete validation before the data is used (in theory at least), which makes foreign keys less important for maintaining consistency. The normal ETL cycle is usually designed to reject data with missing keys in the transform step anyway -- usually by placing the offending items in a reject queue for later analysis.
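
Roughly, the reject step looks something like this -- all the table and column names below are made up for illustration, not anyone's real schema:

    -- Divert staging rows whose natural key has no matching dimension
    -- row into a reject table instead of loading them into the fact.
    INSERT INTO stage_sales_reject
        (sale_id, cust_code, reject_reason, rejected_on)
    SELECT s.sale_id, s.cust_code, 'no matching customer dimension row', SYSDATE
    FROM   stage_sales s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dim_customer d
                       WHERE  d.cust_code = s.cust_code);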

Avoiding snowflakes makes any foreign key issues moot in the dimensions: the ETL cycle will either merge the data successfully or reject it. The fact table is mostly key in the first place, with a small amount of data hung off the side. Inserting a record requires pre-merging all of the key information anyway, so at insert time the records should be valid (or the ETL code needs reworking).
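
For a rough picture of "mostly key" (made-up names again): nearly every column of the fact is a dimension key, with the measures hung off the end.

    CREATE TABLE sales_fact (
        date_key     NUMBER NOT NULL,
        store_key    NUMBER NOT NULL,
        product_key  NUMBER NOT NULL,
        customer_key NUMBER NOT NULL,
        qty_sold     NUMBER,          -- the small amount of data
        sale_amount  NUMBER(12,2),    -- hung off the side
        CONSTRAINT sales_fact_pk
            PRIMARY KEY (date_key, store_key, product_key, customer_key)
    );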

An RDBMS allows the same queries against either a star or a 3rd (or Boyce-Codd) normal form schema. The difference is that a star schema is much simpler to query, since the joins are only 1-2 levels deep (dim:fact or dim:fact:dim is as far as it goes). The joins tend to be faster also, since the indexes are 1:1, the dims are shorter tables, and the fact is "narrow" enough that it doesn't require excessive I/O to process down its length.
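
A typical star query only ever goes dim:fact, something like this (same made-up schema as above):

    SELECT d.calendar_month, s.region, SUM(f.sale_amount) total_sales
    FROM   sales_fact f, dim_date d, dim_store s
    WHERE  f.date_key  = d.date_key
    AND    f.store_key = s.store_key
    AND    d.calendar_year = 2001
    GROUP  BY d.calendar_month, s.region;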

In most cases a synthetic integer key is used for the dims as well. This is partly done to keep the fact table narrow, since most of the fact is the composite key for each record. This tends to help the indexes also.
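
Sketching the synthetic-key idea with the same made-up names: the dimension carries a small sequence-generated integer, and the natural key only gets used to look the surrogate up at load time.

    CREATE SEQUENCE dim_customer_seq;

    CREATE TABLE dim_customer (
        customer_key NUMBER       PRIMARY KEY,       -- synthetic key
        cust_code    VARCHAR2(20) NOT NULL UNIQUE,   -- natural source key
        cust_name    VARCHAR2(100),
        region       VARCHAR2(30)
    );

    -- The fact load trades the wide natural key for the narrow surrogate:
    INSERT INTO sales_fact
        (date_key, store_key, product_key, customer_key, qty_sold, sale_amount)
    SELECT s.date_key, s.store_key, s.product_key, d.customer_key,
           s.qty_sold, s.sale_amount
    FROM   stage_sales s, dim_customer d
    WHERE  d.cust_code = s.cust_code;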

Depending on the database, joining dimensions across the fact table can also be made more efficient. Red Brick was the first with their "star index", which basically pre-joins the dimension and fact records at load time. Informix picked up the technology by purchasing Red Brick, and IBM (DB2) picked it up by purchasing Informix; Oracle is currently working on a similar concept (I think). This basically trades a fairly expensive operation done once at load time for read efficiency. That works in a warehousing environment, where the ETL cycle can pre-sort records for better load speed and the loads happen offline on what is basically a dedicated system. After that, read efficiency is the only thing that matters: no one cares about update speed until tomorrow's load -- at which point they don't really care about user efficiency for a while.
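
If I have it right, the closest Oracle flavor of this is the star transformation: bitmap indexes on the fact table's foreign key columns let the optimizer combine the dimension filters before touching the fact rows. Something like (made-up index names):

    CREATE BITMAP INDEX sales_fact_date_bix    ON sales_fact (date_key);
    CREATE BITMAP INDEX sales_fact_store_bix   ON sales_fact (store_key);
    CREATE BITMAP INDEX sales_fact_product_bix ON sales_fact (product_key);

    ALTER SESSION SET star_transformation_enabled = TRUE;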

The cycle works pretty well in most cases, the biggest problem being the management of rolloff. If the data can be segmented into "rolloff units" (e.g., time buckets) then the process is simple. In Oracle, for example, with locally managed partitions you can take them offline, truncate them, and drop them without any real pain. If the unit of partitioning isn't the unit of rolloff -- not hard, since the primary key has to start with the partitioning field in nearly all cases -- then you end up having to perform deletes. THAT can be a Real, True Pain (tm) on a 3-4 TByte warehouse. It's even worse in cases where the rolloff units are not uniformly distributed, in which case Oracle will have to perform a table scan to find the records. In most cases the simplest fix is to force the rolloff value into the primary key and be done with it; hopefully that doesn't screw up the database.
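
The painless case, assuming a fact range-partitioned by month (partition and column names made up):

    -- Rolloff unit == partition unit: the old month just falls off the end.
    ALTER TABLE sales_fact DROP PARTITION sales_2000_05;

    -- or empty it but keep the partition definition around:
    ALTER TABLE sales_fact TRUNCATE PARTITION sales_2000_05 DROP STORAGE;

    -- Rolloff unit != partition unit and you are back to this, which is
    -- the Real, True Pain (tm) on a multi-terabyte table:
    DELETE FROM sales_fact WHERE date_key < 20000601;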

Data marting helps this in some ways, since the ETL process can pre-generate the aggregates required for marting. In that case a smaller database gets updated with a smaller load each cycle and can usually be placed online more quickly. The marts can also hold data longer without degrading performance, so their rolloff cycles tend to be longer. Many databases will have pre-aggregated data prepared for drilldowns already loaded into the fact table (this is a requirement of some querying tools). In that case the "rolloff" procedure consists of deleting the more detailed period records (say, daily totals) on a monthly basis, leaving monthly aggregate values online for perhaps 24 months. The delete cycle is less painful in a -- much smaller -- data mart than in the whole warehouse, and it leaves users able to make the business-cycle queries they need.
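
In the mart that might look like this (made-up names, with a grain column telling daily detail apart from the monthly aggregates):

    -- Drop daily detail older than three months; the monthly rows stay
    -- online for the longer business-cycle queries.
    DELETE FROM mart_sales_fact
    WHERE  grain     = 'DAY'
    AND    sale_date < ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -3);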

--
Steven Lembark                               2930 W. Palmer
Workhorse Computing                       Chicago, IL 60647
                                            +1 800 762 1582
