Re: Fails Relational, Fails Third Normal Form

From: James K. Lowden <jklowden_at_speakeasy.net>
Date: Sun, 15 Feb 2015 20:07:58 -0500
Message-Id: <20150215200758.4b07bde9.jklowden_at_speakeasy.net>


On Thu, 12 Feb 2015 06:55:38 -0800 (PST) Derek Asirvadem <derek.asirvadem_at_gmail.com> wrote:

Having read over your post several times, ISTM we're more in vociferous agreement than not. Your focus is on designing a database -- a particular one, based (of course) on a known enterprise of interest. The academics you malign are describing -- and still searching for -- a way to do that automatically. You have no particular beef with them except that from your point of view they've produced "zero" in twenty years, while all along you've been doing just fine, thanks, without said algorithm.

I think there's a there there. We should be able to generate normalized designs from a set of columns and constraints.

I also think there's essentially no academic interest in the hard part: the problem of devising a convenient notation (never mind tool) for capturing constraints. To the extent they consider it at all, language people consider it a database problem, and database people likewise consider it out of scope, either a trivial non-problem or one belonging to language design, not database theory.

Absent academic results, industry produces the language poverty we currently live with: SQL, with all its warts, and proprietary GUI tools for design.

I doubt you find much to disagree with there, except that you'd give SQL higher marks, and you have automatic database design generation filed under "pending results".

Below I try to clarify a few points and offer some particulars on the super/subtype design we used for securities. Finally, I put it to you that the design both obeys all rules of normalization and illustrates the need for deferred constraint enforcement.

> > Actually, I'm sure you agree. By "DBMS" I mean "database management
> > system". It's a machine, and it has no concept of meaning.
>
> Ok, so you don't means DBMS.

I'm using the term "DBMS" literally, in the way I daresay everyone on c.d.t. understands it. I wouldn't dream of using a private definition any more than you would!

> You mean the theoretical concept of a DBMS. Abstracted to the point
> where those statements can be true. Ok, I agree.

It's not abstract in the least. I'm describing the very DBMS you're using at your current place of employ.

Maybe you find it useful to think of the DBMS as "understanding" things, of enforcing rules as you conceive them according to the meaning you attach to the labels you attach to the columns. That's OK with me. It's something else entirely to claim that fiction is somehow more real, more concrete, than the actual reality: that the machine is only a machine. It can no more appreciate the meaning of your database than it can the beauty of the Mona Lisa.

> Here I am defining two discrete levels of meaning:
> - one within the table itself, relative to all the keys and
> attributes in the table (at whatever level of progress; whatever
> iteration)
> - a second, within the context of all the other tables in the
> cluster, that it could possibly relate to (ditto); reworking the keys
> overall; the hierarchy; etc

Thank you for explaining that, because I think it will help clear up a misunderstanding.

I cannot accept these definitions of "meaning". I don't dispute they're important. They're just not what "meaning" means in the context of database design.

When I use the word "meaning", I'm talking about what real-world objects the terms in the database represent. Remembering that the machine is just a machine, recognize that every label is arbitrary; to the machine, the column may as well be "x" as "price". But we human beings associate meaning with "price". To us, it means something. In fact, often it means several somethings.

The art of database design is to develop a single shared meaning for all the relevant terms in the enterprise of interest.

(IMO that fact and its importance are generally under-appreciated. I've never seen it discussed in any depth in any database-design or project-management treatise, and I've read a bunch.)

Until and unless we know what e.g. "price" means, we don't know where it belongs in the database schema. Probably it needs to be associated with a time, or at least a date. It may be qualified to distinguish it as the open/close/high/low/composite/bid/ask/trade/purchase price. Is the price relative to an exchange? Does it apply to a class of bonds rather than a particular one? And so on. That is the meaning we use when we design our databases. That is how we know what's a key and what's not, etc. Meaning doesn't derive from the database design; meaning exists without the database (in both senses) and is expressed in the design.

And that is what we mean when we speak of "meaningful results" from a query: a logical conclusion derived from a set of facts stored in the database. To the machine, a bunch of tuple-types, devoid of meaning. To us, yesterday's returns, the very stuff the enterprise of interest is interested in.

> > I'm not being pejorative: The "jumping out" is the practice of
> > associating meanings with (column) names and deciding what
> > identifies what.
>
> There you go with that word again. Yes. Agreed.
>
> Which is why I say your algorithm is stupid because it strips the
> meaning out of the names that require the meaning to be maintained;
> rolled over the tongue; and determined; in the context of all the
> tables in the cluster; not alone. It is self-crippling.

Count me out. I design databases much as you do. I determine which columns are unique (PK, UNIQUE), which attributes are required and which not (NULL), what cardinality (FK). Print, review, repeat.

Implicit in that process is a set of functional dependencies. I know they're there; I could isolate them if need be. But they don't drive my process. They're a formal model for my "is this a property of that?" question. As you put it,

> Whereas the Codd algorithm retains the meaning, finds the Key first,
> then uses the FD to validate the determined Key, then the attributes,
> etc. So yours is bottom-up, devoid of the meaning that you claim is
> relevant, and ours is top-down down with meaning, and the meaning
> gets clarified (eg. we improve the names, improving and
> discriminating the meaning) during the process.

Everyone agrees that the problem of discovering the FDs is non-trivial. Beyond the problem of knowing there is such a thing, there's the trick of extracting them from the incoherent thinking that passes for "business rules" in every organization. (That's not a remark on anyone's intelligence or seriousness of purpose. These are human beings; no system exists to ensure referential integrity within or among our skulls.)
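To pin down what's actually being extracted from those incoherent business rules, here's a minimal sketch (my own names and sample data, nothing from the thread) of what it means for a functional dependency X -> Y to hold in a relation instance:

```python
# Sketch: test whether a functional dependency X -> Y holds in a
# relation instance, represented as a list of dicts (rows).
# Attribute names and sample rows are invented for illustration.

def fd_holds(rows, lhs, rhs):
    """True iff every pair of rows agreeing on `lhs` also agrees on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

trades = [
    {"ID": 1, "Type": "equity", "Name": "ACME"},
    {"ID": 2, "Type": "bond",   "Name": "T-30"},
    {"ID": 3, "Type": "equity", "Name": "BETA"},
]
print(fd_holds(trades, ["ID"], ["Type", "Name"]))   # True
print(fd_holds(trades, ["Type"], ["ID"]))           # False
```

The check itself is trivial; the hard part, as above, is knowing which lhs/rhs pairs the enterprise actually intends.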

> > > - I am saying the theoreticians in the RDB space are stupid
> > > because they are using a hammer for a task that calls for an axe.

You would agree that if FDs were sprinkled down from the sky, it would be nice to apply an algorithm to them to generate your database design. What you're mocking isn't the logical process they would employ, but that they are, as it seems to you, working on a process with no input. The difficulty, in your experience, isn't in deriving a design from a set of FDs, but in discovering them in the first place.
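Supposing the FDs did fall from the sky, the first step of any such algorithm is the textbook attribute-closure computation, from which candidate keys follow. A sketch, with FDs and attribute names invented for illustration:

```python
# Sketch: the textbook attribute-closure algorithm, the starting point
# of FD-driven design generation. The FDs here are invented examples.

def closure(attrs, fds):
    """Closure of `attrs` under FDs given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [
    (frozenset({"ID"}), frozenset({"Type", "Name"})),
    (frozenset({"Type"}), frozenset({"SubtypeTable"})),
]
all_attrs = {"ID", "Type", "Name", "SubtypeTable"}
# {ID} determines every attribute, so it is a candidate key:
print(closure({"ID"}, fds) == all_attrs)   # True
```

Everything downstream (minimal covers, BCNF/3NF decomposition) is mechanical once this input exists; which is exactly the point.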

> Wouldn't it be better to retain both levels of meaning, all the way
> through ?

If you're designing a database, sure. If you're designing an algorithm to design a database, no.

I think the theoretical problem with the "Codd and Derek method" is that it's not exactly an algorithm. The process terminates when you're satisfied with it, when you don't see any more problems. An algorithm terminates when a specific condition is met. For a machine to evaluate it, that condition has to be clearer than "looks good to me".

> Having just one algorithm , that is tractable is not an economical
> position to be in.

Perhaps. And if your method is an algorithm, that's the position you're in, no? Because there's no two ways to do it, right? ;-)

> Any problem is tractable.

You are of course aware that some problems are not tractable. I assume you mean something like "all databases are designable", which I'm sure is true.

> > ERwin is a good tool, the best I ever saw. [...] It would be nice
> > to define a relationship-rule in a symbolic library, and be able to
> > apply it to a given set of tables.
>
> You must have been using a very old un-maintained version. By 1995,
> it had all that and more. The macro language is much more powerful.

We were using the current version. I guess I wasn't clear wrt the macro language. Yes, the ERwin macro language was quite powerful (and arcane). It was also trapped in the ERwin tool. I would have liked to have had a database-design macro language independent of the tool, something like m4 for databases.

[long ERwin discussion omitted]

I concede I was oversimplifying to some extent, and found ERwin guilty of not solving a problem it wasn't designed to solve, namely checking the model for normalization. To the extent it allows the designer to work at a higher level of abstraction and avoid repeating things, that's great. We remain a long way away from a tool that does automatically that which could be automated, but which ERwin leaves to the human brain.

> > I wrote macros to generate triggers to enforce relationships that
> > couldn't be declared, such as sub/supertype relationships that
> > required both parts of a two-part entity, where the first part was
> > fixed and the second part was in one of N mutually exclusive
> > tables.
>
> So, after reading your para four times, and I really want to
> understand the problem, I still have no idea what the problem is.
> Would you please give me a better description

Since you ask, I will indulge you. I don't say it's the only or best way. I haven't thought of a better one, but I haven't addressed myself to the problem in 15 years.

The Securities table has {ID, Type, Name, Active}. The ID is internally assigned because as you know we have an entire industry whose central product has no universally recognized identifier. Type is a discriminator, one of {'equity', 'bond', 'swap', etc.}; they reflect the enterprise of interest, not the whole securities market. Name is obvious, and Active distinguishes between instruments currently traded and those expired (e.g. an option past its exercise date). We called this a supertype for ERwin purposes.

Naturally there are other attributes to capture, and they vary by instrument type. For each Type in Securities there is a table to hold that instrument's particulars: Equities, Bonds, Swaps, etc. We called this a subtype in ERwin, although I'm not sure how useful it is to call Equities a "subtype" of Securities.

The rule is that a security is represented by a pair of rows, one in Securities and one in the subtype table. There *must* be two rows, and only two rows, and the subtype table must be the one indicated by Securities.Type.

Declare that, Kemosabe! ;-)
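For concreteness, here is the declarable part of that design, rendered in SQLite through Python. The DDL and names are my reconstruction, not the original schema; the point is what's missing from it.

```python
# Sketch of the Securities super/subtype schema (my rendering, not the
# original DDL). The FK from Equities to Securities is easy to declare;
# the rule "every Securities row with Type = 'equity' has exactly one
# Equities row, and no row in any other subtype table" is the part that
# plain declarations cannot express.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE Securities (
    ID     INTEGER PRIMARY KEY,
    Type   TEXT NOT NULL CHECK (Type IN ('equity', 'bond', 'swap')),
    Name   TEXT NOT NULL,
    Active INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE Equities (
    ID     INTEGER PRIMARY KEY REFERENCES Securities (ID),
    Ticker TEXT NOT NULL
);
""")
db.execute("INSERT INTO Securities VALUES (1, 'equity', 'ACME', 1)")
db.execute("INSERT INTO Equities VALUES (1, 'ACME')")
# Nothing stops this half-security: a Securities row with no subtype row.
db.execute("INSERT INTO Securities VALUES (2, 'bond', 'T-30', 1)")
db.commit()
```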

Of course, you could weaken the rule. You could eliminate the Type column, and require each subtype have a FK relationship to Securities. But that would not require a Securities row to have a related subtype, and would not prevent one Securities row from being referred to by more than one subtype table (e.g., "being" both Equity and Bond). And, as a practical matter, you couldn't get the security's type without scanning the union of subtypes looking for the appearance of the ID.
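Both failure modes of the weakened rule are easy to demonstrate (again a sketch, names mine):

```python
# Sketch: the weakened, FK-only version of the rule. Nothing prevents
# one Securities row from appearing in two subtype tables, and finding
# a security's type requires scanning the union of subtypes.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE Securities (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Equities (ID INTEGER PRIMARY KEY REFERENCES Securities (ID));
CREATE TABLE Bonds    (ID INTEGER PRIMARY KEY REFERENCES Securities (ID));
""")
db.execute("INSERT INTO Securities VALUES (1, 'ACME')")
# Both inserts succeed: security 1 "is" both an equity and a bond.
db.execute("INSERT INTO Equities VALUES (1)")
db.execute("INSERT INTO Bonds VALUES (1)")
# And "what type is security 1?" means scanning every subtype table:
row = db.execute("""
    SELECT 'equity' AS Type FROM Equities WHERE ID = 1
    UNION ALL
    SELECT 'bond' FROM Bonds WHERE ID = 1
""").fetchall()
print(row)  # two answers where there should be one
```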

You could also put all securities in a single table with the union of all security attributes, all NULL, and apply the mother of all CHECK constraints to get them right. Although I'm pretty sure SQL Server 7.0 at the time didn't support CHECK. We were pretty spartan with declared integrity constraint checking in any case, for feasibility reasons: bulk loads were vetted for correctness before and sanity-checked afterward, and interactive updates were required to use stored procedures (some of which used temporary tables as implicit parameters). But I wouldn't use a single table, even today; attributes with an optional relationship to the key usually should be in separate tables.
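The mother-of-all-CHECK approach looks like this in miniature (a sketch in SQLite; the subtype columns Ticker and Coupon are invented for illustration):

```python
# Sketch: the single-table alternative. One row per security, every
# subtype attribute nullable, with a CHECK per Type saying which
# attributes must and must not be present. Columns are invented.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Securities (
    ID     INTEGER PRIMARY KEY,
    Type   TEXT NOT NULL,
    Name   TEXT NOT NULL,
    Ticker TEXT,    -- equities only
    Coupon REAL,    -- bonds only
    CHECK ( (Type = 'equity' AND Ticker IS NOT NULL AND Coupon IS NULL)
         OR (Type = 'bond'   AND Coupon IS NOT NULL AND Ticker IS NULL) )
);
""")
db.execute("INSERT INTO Securities VALUES (1, 'equity', 'ACME', 'ACME', NULL)")
try:
    # A bond with no Coupon fails the CHECK.
    db.execute("INSERT INTO Securities VALUES (2, 'bond', 'T-30', NULL, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

With a real instrument's dozens of type-specific attributes, that CHECK grows without bound, which is one reason I'd still keep the separate tables.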

Aside: some years later, I looked into an incident of the kind that had by then acquired a name: "SQL injection". Of course I soon discovered they weren't using stored procedures, and had granted (possibly administrative) access rights to little Bobby Tables via the webserver. We have an entire industry built on application-developer paranoia based on willful ignorance of How to Use the DBMS. I understand the tech-politics that motivate the situation, but not why management abdicates its stewardship of the firm's data. [end aside]

Now that I've described the design, I want to take you to task.

You've posted a few colorful rants about the nonexistence of bona fide circular references in the real world, and therefore in database designs. ISTM you overlook mutual dependence. There are reasons to have tables with 1:1 cardinality with both components mandatory. Sometimes the reason is physical, because one part is updated or referenced or replicated more than the other; sometimes it is logical, as with this securities design. There's nothing at all wrong with it relationally, and yet the semantics cannot be enforced with standard SQL.

You have wickedly mocked the need for deferred constraint enforcement -- constraints that apply to the transaction, not to the statement -- on the theory that all such mutuality is illusory. But you are in a trap: if it's OK to have two non-null attributes on one row (1:1 mandatory cardinality), why is it not OK to have the same relationship between two attributes on two rows? And if two tables can be related that way, why not three or more?
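Deferred enforcement is exactly what makes the mutual 1:1 declarable where the DBMS supports it. A sketch (names mine) using SQLite, which, like PostgreSQL, implements DEFERRABLE INITIALLY DEFERRED; SQL Server, for one, does not:

```python
# Sketch: mutual mandatory 1:1 between Securities and Equities, each
# holding an FK to the other, with enforcement deferred to COMMIT.
# Neither INSERT alone satisfies both constraints; the pair together
# does, so the transaction succeeds. Table and column names are mine.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE Securities (
    ID   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    FOREIGN KEY (ID) REFERENCES Equities (ID) DEFERRABLE INITIALLY DEFERRED
);
CREATE TABLE Equities (
    ID     INTEGER PRIMARY KEY,
    Ticker TEXT NOT NULL,
    FOREIGN KEY (ID) REFERENCES Securities (ID) DEFERRABLE INITIALLY DEFERRED
);
""")
# Immediate checking would reject the first statement, whichever
# order we chose; deferred checking validates the transaction.
db.execute("INSERT INTO Securities VALUES (1, 'ACME')")
db.execute("INSERT INTO Equities VALUES (1, 'ACME')")
db.commit()
```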

--jkl
