Re: Are redundant fields ever appropriate?

From: Clifford Heath <cjh_nospam_at_osa.com.au>
Date: Thu, 22 Nov 2001 09:41:17 +1100
Message-ID: <3BFC2D8D.F97AB913_at_osa.com.au>


Thanks for the good feedback... I eventually get back to my original comment about the use of de-normalisation in clustering (which you didn't address) at the bottom of this somewhat long response.

> >> The join is implicit in the foreign key reference,.. <<
> In some products, the joins between primary and foreign keys are
> actually pre-joined.

I'm familiar with the concept though I haven't used a DBMS like that. I distinguish these with the terms "extensional" vs "generational", terms I used with deductive (rule-based) databases.

> In Standard SQL, there is no such thing as an index, much less a
> clustered index.

However clustering is a physical reality that can yield real benefits. When you have a table with 10^6 records, and you have to scan 10^4 of them, having the records clustered is a huge win. I don't particularly like (or even fully understand!) the implementation chosen in MS SQL Server, but it does improve the query performance.

> >> Apologies for using SQL-Server extensions in the form of IDENTITY
> fields - this reduces the keys to a manageable size. What equivalent
> support is there in other DMBSs (Oracle?) or in the SQL standard for
> this sort of thing (apart from allocating unique IDs manually)? <<

> Real SQL has nothing like the IDENTITY column. It is not only
> proprietary, it is non-relational, not verifiable aganst the reality
> being modeled, and usually the result of a lazy database designer who
> did not want to deal with thinking about how to model his situation.
> I have a really long rant about the evils of IDENTITY I might post
> later.

The previous version of the schema used the ComputerDN (an X.500 Distinguished Name, from Active Directory), which we arbitrarily limited to 512 bytes. Going to Unicode made 512 characters take 1024 bytes, and you're over the limit on key size before you add any more columns into the index. The same thing occurs wherever you have ComputerDN as a foreign key - the records and keys become large and you get poor performance. Using a ComputerID was a practical necessity, though it's not in the conceptual model - we did model the situation, but then we optimised the physical design.

The other cause for the change was that we also handle SMS, and that names computers differently, so having a Computer table was helpful to create a "local" naming scheme for computers by storing the external names in a single table.

Unfortunately the serial number of the computer is not always known, so can't serve as a primary key. The IDENTITY thing can of course be done manually, but SQL Server does it more efficiently by keeping the next-available value with the table metadata.

Finally the actual index scans produced by the query engine are *much* faster when they must only scan four-byte ID values than DNs, which are at least 40 and frequently over 100 bytes.

> Tables are sets of things, so their names should be plural.

From my O-O background I always name a collection in the singular, unless it's a collection of collections (which doesn't often happen in SQL). Personal preference I guess, but it has avoided confusion for me.

> You have a OrgUnit_id and a Computer_OrgUnit_id; please explain what
> the LOGICAL difference between these attributes is. I do not think
> that there is one, so they should have same name. NEVER name an
> attribute after its PHYSICAL location in the schema.

Ok, fair enough. Sometimes I have two relationships needing foreign keys in the same table though, and they can't be named the same. What do you do then?

> Let me try to put this into a normalized schema. But you need some
> rules first. My guess is that:

> 1) A computer is in one and only one organizational unit.
> 2) A software package residences on one and only one computer.
> 3) An Organizational unit can have multiple computers.
> 4) A computer has one or more software packages.

Modulo your name changes (which conflict with parts of our schema I haven't shown), that's correct.

SDOU's may also contain other SDOUs, but we didn't need to model the tree structure.

> I tried to imply that I had looked for industry standard codes.

Unfortunately there's no ISBN system for software! We have attempted to make use of cryptographic hash functions (MD5, SHA-1) applied to a standardised package format, but that relies on standardising the format...

We also model Software (application) distinct from its Versions, Software Products (licensable entities that may cover versions of multiple application), Packages (an application packaged for distribution), Package Versions, and a whole bunch of data about other subsystems (distribution etc) ... so you can see that I simplified quite a bit.

> CREATE TABLE SiteDomainOrgUnits
> (org_unit_id INTEGER NOT NULL PRIMARY KEY,
> org_unit_name CHAR(80) NOT NULL,
> -- other columns);

You're still using identity fields, but now the programmer must allocate them (perhaps using SELECT MAX(...id)+1 FROM ...). What's your point about the evils of identity? It seems better to let the DBMS do this internally if the value is meaningless anyway, even though it's non-standard.

> CREATE TABLE Computers
> (computer_serial_nbr INTEGER NOT NULL PRIMARY KEY,
> -- other columns);

Of course if we always had the serial number, we'd use it, though it wouldn't fit in an INTEGER, and would often have letters also.

> CREATE TABLE Configurations
> (org_unit_id INTEGER NOT NULL
> REFERENCES SiteDomainOrgUnits (org_unit_id)
> ON UPDATE CASCADE
> ON DELETE CASCADE,
> computer_serial_nbr INTEGER NOT NULL
> REFERENCES Computers (computer_serial_nbr)
> ON UPDATE CASCADE
> ON DELETE CASCADE,
> software_id INTEGER NOT NULL UNIQUE
> REFERENCES SoftwarePackages (software_id)
> ON UPDATE CASCADE
> ON DELETE CASCADE,
> PRIMARY KEY (org_unit_id, computer_serial_nbr, software_id));

> Notice that use of UNIQUE constraint on the software packages in the
> Configurations table, and the way the table takes care of itself with
> referential actions.

It would be nice if MS SQL Server supported reference propagation, but we must support MSDE and MS SQL Server 7 which don't, sigh.

> >> ... most important query is to summarise over a single
> SiteDomainOrgUnit. <<

> SELECT org_unit_id,
> COUNT(DISTINCT computer_serial_nbr) AS computer_tally,
> COUNT(software_id) AS package_tally
> FROM Configurations AS C1
> GROUP BY C1.org_unit_id;

> Then I would worry about clustered versus unclustered, etc.

And so to my original point. The OU doesn't belong as an FK in the Configurations table, because it's the OU of the computer referenced by the computer FK in the same table - this breaks normalisation. However it is helpful to cluster Configurations by OU, so we added it anyhow.

Phew, it took a long time to get back to here... :-)

--
Clifford Heath, ManageSoft Corporation
Received on Wed Nov 21 2001 - 23:41:17 CET

Original text of this message