
fwd: denormalization (http://www.tdan.com/i014ht04.htm)

From: Eric D. Pierce <PierceED_at_csus.edu>
Date: Mon, 07 May 2001 16:19:39 -0700
Message-ID: <F001.002FB6FC.20010507151546@fatcity.com>

(follow up)

fwd: denormalization articles

from "The Data Administration Newsletter" (TDAN)

(earlier version?: http://www.tdan.com/i001fe02.htm )

-

http://www.tdan.com/i014ht04.htm

---begin 1 of 2---

DENORMALIZATION AND THE
RULES OF RECONSTRUCTION
Dick Root - Thistledown Consulting Services

Introduction

Software and database engineering are complex activities that require planning and control to be successful. By the time the DBA is called upon to tune the indices of a database, it is probably already too late. Efficiency should be designed into the data structure before the data is actually put on disk. Since the advent of CASE tools, a step has usually been missing from database design: the logical database design is set up in the modeling tool and then the DDL is generated, so the table design of the physical database ends up being simply the entity design of the logical database. Then, when tuning is required, data is moved around on disk, indices are applied, freespace is modified, and more memory is assigned.

The DBMS-level tuning steps are valid and will continue to be used. But there has been a missing step in the database design process.

Physical Design of Databases

The word ‘denormalization’ is used to describe changes to the table design that cause the physical tables to differ from the normalized entity relationship diagram. ‘Denormalization’ does not mean that anything goes; it does not mean chaos. The development of properly denormalized data structures follows software engineering principles that ensure that information will not be lost. If the table is read-only (periodically refreshed from the system of record), the rules are looser; star schemas and hyper-cubes are read-only denormalizations. If the data is to be distributed and/or segmented and then added to, changed, or deleted, the rules of reconstruction described below must be followed. Fundamentally, a single principle must be followed: if an individual table is updated in more than one system, it must be possible to reconstruct the original table as if the data had never been reformatted or taken apart.
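
To make the reconstruction principle concrete, here is a minimal sketch in Oracle-flavored SQL (the CUSTOMER table and its columns are hypothetical, not taken from the article). The logical table is split vertically into two physical tables that share the same primary key, and a view joins them back together to reproduce the original table as if it had never been taken apart:

    -- Logical table CUSTOMER split into two physical tables
    -- that share the same primary key (a 1:1 split).
    CREATE TABLE customer_core (
        customer_id  NUMBER(10)    PRIMARY KEY,
        name         VARCHAR2(100) NOT NULL,
        region_code  CHAR(3)
    );

    CREATE TABLE customer_profile (
        customer_id  NUMBER(10)    PRIMARY KEY
                     REFERENCES customer_core (customer_id),
        credit_limit NUMBER(12,2),
        notes        VARCHAR2(2000)
    );

    -- Reconstruction: the join reproduces the original logical table,
    -- so no information is lost by the split.
    CREATE VIEW customer AS
    SELECT c.customer_id, c.name, c.region_code,
           p.credit_limit, p.notes
    FROM   customer_core c, customer_profile p
    WHERE  p.customer_id = c.customer_id;

If every customer must have a profile row, the join returns exactly the original row set; if profile rows are optional, an outer join is needed to keep the reconstruction faithful.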

Denormalization

There are many techniques for denormalizing a relational database design. These include:

  1. Duplicated data - This is the technique of making copies of data, in whole or in part, and storing and using both the original and the copy (or copies). It works well unless you want to update the data; then you are in the realm of distributed updates and synchronization, on which whole texts have been written. The general idea is that processes outside the DBMS must ensure integrity and accuracy. Stored joins are an example of duplicated data.
  2. Derived data - The issues with storing derived data are accuracy and timeliness. When the base data changes, the derivations must change accordingly. When the semantics of the derived column is ‘current balance’, you have one sort of accuracy problem. When the semantics of the derived column is average sales by product, salesman, division, and month, and the salesmen are constantly being reassigned, you have another accuracy problem. Many designers also store derivations in tables with inappropriate keys; when derivations are not stored with their logical (functionally dependent) keys, subsequent (tertiary) derivations are inaccurate. Also, many derivations are non-additive (percentages, highest, lowest, etc.). This subject deserves many chapters in data warehousing texts; see references to summary data and slowly changing dimensions. A sketch follows this list.
  3. Surrogate keys - Very long and compound keys are a problem: they are hard to use when writing queries and they generate inefficient indices. If a table has a very long key and also has many rows, this can become a show-stopper. If the table has a maximum of 100,000,000 rows and a fifty-byte real compound key, assigning a ten-digit surrogate key (and indexing on it) will increase performance dramatically. Imagine that fifty-byte key used in an equi-join! The real key(s) should not be deleted after the surrogate key is added; that would make reversing out the surrogate key impossible and would violate the Rule of Reconstruction (see below). Usually the long real key is made up of many sub-keys that are useful in their own right. A sketch follows this list.
  4. Over-normalization (vertical partitioning/segmentation) - This is the technique of splitting the original logical table into two or more physical tables by assigning some of the columns to one physical table and some to another. Both tables end up with the same number of rows and the same keys (see the “Rule of Reconstruction”, below). Broadly, this will increase performance, since the individual tables are now smaller. In most DBMSs the negative effect of long row length is non-linear: the query time against a table with 1000-byte rows can be more than twice the query time against a table with 500-byte rows, so even arbitrary vertical partitioning will improve performance against each of the separate partitions. But if you are constantly joining the partitions back together, over-normalization is self-defeating. The trick, therefore, is to cluster together the columns that are used together.
  5. Horizontal segmentation - This is the technique of storing some of the rows in one table and some in another. Many modern DBMSs can do this automatically; when the segmentation criteria are not simple, segmentation must still be done programmatically. Of course, update anomalies occur when rows appear in more than one segment. A sketch follows this list.
  6. Stored joins - This is the technique of joining two or more tables together and storing the answer set as an additional table. It is one of the most common denormalizations. If the stored join table is never updated, there is no problem with it; but since it always contains duplicated data, updates are a problem. Watch out for query anomalies when a measurement column is on the ‘many’ side of the relationship being joined. A sketch follows this list.
  7. Recurring data groups (vector data) - When a fixed, small number of subordinate rows is associated with each parent row, collapsing the subordinate table into the parent table will increase performance. Care must be taken that the logical key of the subordinate table is not deleted or obscured; otherwise the join is not reversible and the Rule of Reconstruction is violated. A sketch follows this list.
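
For point 2, here is a hedged sketch of storing a derivation with its full logical key (the SALES tables and columns are hypothetical). The average is keyed on exactly the columns it is functionally dependent on, and it is re-derived from the base table rather than from other summaries:

    -- Normalized base data (the system of record).
    CREATE TABLE sales (
        sale_id     NUMBER(12)   PRIMARY KEY,
        product_id  NUMBER(10)   NOT NULL,
        salesman_id NUMBER(10)   NOT NULL,
        division_id NUMBER(6)    NOT NULL,
        sale_date   DATE         NOT NULL,
        amount      NUMBER(12,2) NOT NULL
    );

    -- Derived data stored with its logical (functionally dependent) key:
    -- product, salesman, division, and month.
    CREATE TABLE sales_summary (
        product_id  NUMBER(10),
        salesman_id NUMBER(10),
        division_id NUMBER(6),
        sale_month  DATE,
        avg_amount  NUMBER(12,2),
        PRIMARY KEY (product_id, salesman_id, division_id, sale_month)
    );

    -- Periodic refresh from the base table.
    DELETE FROM sales_summary;
    INSERT INTO sales_summary
    SELECT product_id, salesman_id, division_id,
           TRUNC(sale_date, 'MM'), AVG(amount)
    FROM   sales
    GROUP  BY product_id, salesman_id, division_id, TRUNC(sale_date, 'MM');

Because AVG is non-additive, a further roll-up (say, dropping salesman_id) must go back to the base table; averaging the stored averages would be wrong.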
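
For point 3, a sketch of adding a surrogate key while keeping the real compound key (the SHIPMENT tables and columns are hypothetical). The surrogate makes joins and indices cheap; the UNIQUE constraint preserves the real key so the surrogate can always be reversed out:

    -- Short surrogate key for joins and indexing; the long real
    -- compound key is kept and protected by a UNIQUE constraint.
    CREATE TABLE shipment (
        shipment_id  NUMBER(10)   PRIMARY KEY,   -- surrogate
        country_code CHAR(3)      NOT NULL,      -- real key part 1
        warehouse_cd CHAR(10)     NOT NULL,      -- real key part 2
        order_no     VARCHAR2(20) NOT NULL,      -- real key part 3
        line_no      NUMBER(4)    NOT NULL,      -- real key part 4
        ship_date    DATE,
        CONSTRAINT shipment_real_key
            UNIQUE (country_code, warehouse_cd, order_no, line_no)
    );

    CREATE SEQUENCE shipment_seq;   -- supplies surrogate values (Oracle)

    -- Child tables carry the short surrogate instead of the
    -- fifty-byte compound key, so the equi-join is cheap.
    CREATE TABLE shipment_event (
        shipment_id NUMBER(10) NOT NULL
                    REFERENCES shipment (shipment_id),
        event_date  DATE,
        status      VARCHAR2(30)
    );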
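
For point 5, a sketch of horizontal segmentation done programmatically (the ORDERS table and the region values are hypothetical). The segments partition the rows, and a UNION ALL view reconstructs the original table:

    -- Rows split across two physically separate tables by region.
    CREATE TABLE orders_east AS
        SELECT * FROM orders WHERE region = 'EAST';
    CREATE TABLE orders_west AS
        SELECT * FROM orders WHERE region = 'WEST';

    -- Reconstruction: the union must return every original row
    -- exactly once, so the segmentation criteria must not overlap.
    CREATE VIEW orders_all AS
        SELECT * FROM orders_east
        UNION ALL
        SELECT * FROM orders_west;

If a row ever lands in both segments, the view shows it twice and updates become ambiguous; that is the update anomaly the article warns about.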
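
For point 6, a sketch of a stored join (tables and columns are hypothetical, reusing the hypothetical CUSTOMER split from the earlier sketch). The answer set of the join is kept as a read-only, periodically refreshed table:

    -- The join of orders and customers stored as a reporting table.
    CREATE TABLE order_report AS
        SELECT o.order_no, o.order_date, o.amount,
               c.customer_id, c.name, c.region_code
        FROM   orders o, customer_core c
        WHERE  c.customer_id = o.customer_id;

Customer data now repeats on every order row, so a change to a customer's name must be propagated to (or blocked in) order_report; that is why stored joins are safe only when treated as read-only.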
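
For point 7, a sketch of collapsing a recurring group (a hypothetical phone-number child table with at most three rows per customer). The subordinate logical key, the phone sequence number 1 to 3, survives in the column names, so the child table can be reconstructed:

    -- Child rows folded into repeating column groups on the parent.
    CREATE TABLE customer_flat (
        customer_id NUMBER(10) PRIMARY KEY,
        phone_1     VARCHAR2(20),
        phone_2     VARCHAR2(20),
        phone_3     VARCHAR2(20)
    );

    -- Reconstruction of the original child table, sequence key intact.
    CREATE VIEW customer_phone AS
        SELECT customer_id, 1 AS phone_seq, phone_1 AS phone
        FROM   customer_flat WHERE phone_1 IS NOT NULL
        UNION ALL
        SELECT customer_id, 2, phone_2
        FROM   customer_flat WHERE phone_2 IS NOT NULL
        UNION ALL
        SELECT customer_id, 3, phone_3
        FROM   customer_flat WHERE phone_3 IS NOT NULL;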

...
(end)

---begin 2 of 2---

http://www.tdan.com/i008hy02.htm

GREAT NEWS -
THE RELATIONAL MODEL IS DEAD!
Michael Gorman, Whitemarsh Information Systems, Corp.

  1. Introduction

...

7. SQL/99's Impact on Database Applications

For the past 20 years, database designers and implementors have struggled with highly normalized databases that perform poorly. The only solution has been to denormalize by collapsing hierarchies of non-redundant tables into a single flat table with replicated data. While these highly redundant collapsed tables speed data reporting, they slow updating and also become a significant risk to data integrity, because the data is highly dispersed and duplicated across these report-tuned denormalized database structures, commonly known as data warehouses. For all these reasons, most organizations only allow reporting from data warehouse databases.

As DBMS vendors implement SQL/99, the database design process will transform itself: instead of designing third-normal-form tables and then denormalizing them to enable cost-effective reporting, designers will carry out a set of database design activities similar to the ones commonly performed in database design efforts from the mid-1970s through the mid-1980s. A greater knowledge of the application’s processing will be needed to take advantage of the natural data structure hierarchies now possible within SQL/99 tables.
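
As a hedged illustration of the kind of hierarchy the author means, here is a sketch using SQL:1999 constructed types (the DEPARTMENT table is hypothetical, and vendor support for this syntax varied widely at the time). A repeating group and a composite attribute live inside a single table definition instead of being normalized into separate tables:

    -- SQL:1999 collection and row types keep a natural hierarchy
    -- inside one table.
    CREATE TABLE department (
        dept_no   INTEGER PRIMARY KEY,
        dept_name VARCHAR(60),
        phones    VARCHAR(20) ARRAY[4],                  -- repeating group
        manager   ROW(emp_no INTEGER, name VARCHAR(60))  -- nested structure
    );

Reporting against such a table needs no join, which is where the predicted performance gain comes from; the cost is that the structure now encodes assumptions about how the application processes the data.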

While processing speeds will dramatically improve with SQL/99-conforming DBMSs, the effort and processing time required to accomplish database redesigns and reorganizations will dramatically increase.

In short, we are returning to the past, that is, adopting the data structures of the network and independent logical file DBMSs. While we will see increased performance for well-designed and highly tuned databases, we will also see the return of significant designer and analyst time for database design and redesign.

Keith Hare of JCC Consulting (www.jcc.com), a long-time member of H2 and a user of VAX DBMS products, put it best when he said, "With SQL/99 you can get the best of both worlds and, of course, you can get the worst of both worlds. It is up to the database practitioners to do the right thing."

The long version of the paper is available from the "What's New" section of the Whitemarsh web site:

http://www.wiscorp.com/whatsnew.html
http://www.wiscorp.com/featuredpapers.html

...
(end)

--
Please see the official ORACLE-L FAQ: http://www.orafaq.com
--
Author: Eric D. Pierce
  INET: PierceED_at_csus.edu
