Re: Are redundant fields ever appropriate?

From: --CELKO-- <71062.1056_at_compuserve.com>
Date: 21 Nov 2001 20:47:53 -0800
Message-ID: <c0d87ec0.0111212047.753e8a0_at_posting.google.com>

>> However clustering is a physical reality that can yield real
benefits.
When you have a table with 10^6 records, and you have to scan 10^4 of them, having the records clustered is a huge win. I don't particularly like (or even fully understand!) the implementation chosen in MS SQL Server, but it does improve the query performance. <<

When you get to a VLDB, I would be using Teradata or a product designed for that volume. SQL Server is still basically a file system under the covers -- physically contigous rows made of physically contigous columns.

>> The previous version of the schema used the ComputerDN (an X.500
Distinguished Name, from Active Directory), which we arbitrarily limited
to 512 bytes. Going to Unicode made 512 characters take 1024 bytes, and you're over the limit on key size before you add any more columns into the index. The same thing occurs wherever you have ComputerDN as a foreign key - the records and keys become large and you get poor performance. Using a ComputerID was a practical necessity, though it's not in the conceptual model - we did model the situation, but then we optimised the physical design. <<

I am not familar with the X.500 Distinguished Name standards. I find it a bit strange that your company does not have a property sticker, a serial number or some other physical label on their equipment that can be verified by walking over to the machinery.

>> From my O-O background I always name a collection in the singular,
unless it's a collection of collections (which doesn't often happen in SQL). Personal preference I guess, but it has avoided confusion for me. <<

Does a collection have one and only one item in it? SQL has no concept of a "table of tables", so we have no problems.

>> Ok, fair enough. Sometimes I have two relationships needing foreign
keys in the same table though, and they can't be named the same. What do you do then? <<

I do not understanda what you are saying. Two columns in the same table cannot have the same name. A FOREIGN KEY can reference the table in which it appears. A FOREIGN KEY has to reference a unique constraint, which menas a PRIMARY KEY (the most common case) or a UNIQUE constraint column set.

>> SDOU's may also contain other SDOUs, but we didn't need to model
the tree structure. <<

I have a neat way to model a tree, but that is another topic ..

>> Unfortunately there's no ISBN system for software! <<

I thought that the SIIA had a system for this ...

>> We also model Software (application) distinct from its Versions,
Software
Products (licensable entities that may cover versions of multiple application), Packages (an application packaged for distribution), Package Versions, and a whole bunch of data about other subsystems (distribution etc) ... so you can see that I simplified quite a bit. <<

That is ugly ... I thik I will pass on that one.

>> What's your point about the evils of identity? <<

The IDENTITY column is a hold over from the early programming language which were <i>very<i> close to the hardware. For example, the fields in a COBOL or FORTRAN program were assumed to be physically located in main storage in the order they were declared in the program. This meant that you could define a template that overlaid the same physical space and read the representation in several different ways. In COBOL, the command was REDEFINES, EQUIVALENCE in FORTRAN and a union in 'C'.

From a logical viewpoint, this redefinition makes no sense at all. It is confusing the numeral with the number the numeral represents. The history of programming after this point in time has been to divorce the logical and physical models completely.

The early SQLs were based on existing file systems. The data was kept in physically contiguous disk pages, in physically contiguous rows, made up of physically contiguous columns. In short, just like a deck of punch cards or a magnetic tape.

But physically contiguous storage is only one way of building a relational database and it is not always the best one. But aside from that, the whole idea of a relational database is that user is not supposed to know how things are stored at all, much less write code that depends on the particular physical representation in a particular release of a particular product.

One of the biggest errors is the IDENTITY column in the Sybase family (SQL Server and Sybase). People actually program with this "feature" and even use it as the primary key for the table! Now, let's go into painful details as to why this thing is bad.

The practical considerations are that IDENTITY is proprietary and non-portable, so you know that you will have maintenance problems when you change releases or port your system to other products. It also has some very strange bugs in both Sybase and SQL Server; go to a newsgroup and do a search.

But let's look at the logical problems. First try to create a table with two columns and try to make them both IDENTITY columns. If you cannot declare more than one column to be of a certain datatype, then that thing is not a datatype at all, by definition.

Next, create a table with one column and make it an IDENTITY column. Now try to insert, update and delete different numbers from it. If you cannot insert, update and delete rows from a table, then it is not a table by definition.

Finally create a simple table with one IDENTITY column and a few other columns. Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES  ('a1', 'b1', 'c1'); 
INSERT INTO Foobar (a, b, c) VALUES  ('a2', 'b2', 'c2'); 
INSERT INTO Foobar (a, b, c) VALUES  ('a3', 'b3', 'c3');

to put a few rows into the table and notice that the IDENTITY column sequentially numbered them in the order they were presented. If you delete a row, the gap in the sequence is not filled in and the sequence continues from the highest number that has ever been used in that column in that particular table.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;

Since a query result is a table, and a table is a set which has no ordering, what should the IDENTITY numbers be? The entire, whole, completed set is presented to Foobar all at once, not a row at a time. There are (n!) ways to number (n) rows, so which one do you pick? The answer has been to use whatever the physical order of the result set happened to be. That non-relational phrase "physical order" again.

But it is actually worse than that. If the same query is executed again, but with new statistics or after an index has been dropped or added, the new execution plan could bring the result set back in a different physical order. Can you explain from a logical model why the same rows in the second query get different IDENTITY numbers? In the relational model, they should be treated the same if all the values of all the attributes are identical.

Think about trying to do replication on two databases that differ only by an index, or by cache size or something that occasionally gives them different execution plans for the same statements. Want to try to maintain such a system?

There are better ways of creating identifiers.

>> It would be nice if MS SQL Server supported reference propagation,
but we must support MSDE and MS SQL Server 7 which don't, sigh. <<

Sigh. Time to get the next release ...

>> ... most important query is to summarise over a single
SiteDomainOrgUnit. <<

SELECT org_unit_id,

         COUNT(DISTINCT computer_serial_nbr) AS computer_tally,
         COUNT(software_id) AS package_tally

FROM Configurations AS C1
GROUP BY C1.org_unit_id;

>> And so to my original point. The OU doesn't belong as an FK in the
Configurations table, because it's the OU of the computer referenced by the computer FK in the same table - this breaks normalisation. However it is helpful to cluster Configurations by OU, so we added it anyhow. <<

I will have to play with that, but the Configurations table is what joins the computer, the software and the organizational units. A table that is all keys is automatically in 5NF. You just have the extra constraint that the organizational unit and the computer have a relationship. Received on Thu Nov 22 2001 - 05:47:53 CET

This message: [ Message body ]
Next message: Sandar: "Re: Need help on terminologies..."
Previous message: Sam: "hello"
Maybe in reply to: Eel: "Are redundant fields ever appropriate?"
In reply to Clifford Heath: "Re: Are redundant fields ever appropriate?"
Next in thread: Sandar: "Re: Need help on terminologies..."

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message