Re: How to 'normalise' this scenario

From: Erwin <e.smout_at_myonline.be>
Date: Wed, 18 May 2011 06:31:01 -0700 (PDT)
Message-ID: <e0c977f6-866c-4fa3-8e84-14c47381ee2c_at_x10g2000yqj.googlegroups.com>



On 18 mei, 10:33, Frank Millman <fr..._at_chagford.com> wrote:
> On May 17, 5:16 pm, Erwin <e.sm..._at_myonline.be> wrote:
>
> > On 17 mei, 13:16, Frank Millman <fr..._at_chagford.com> wrote:
> > The contents of file systems are not the best example for discussing
> > management of graph data, because the "identity" (the means for
> > identification) of a file typically is a combination (concatenation)
> > of the "identity" of its containing directory and the file's own
> > distinguishing name _within that directory_.  This is different
> > compared to, say, bill-of-material structures or genealogical
> > relationships, where the "nodes" (parts, people, ...) have a property
> > "of their own" that uniquely identifies them "within the entire
> > universe".  This is not the case for file systems.  *IX systems may
> > have multiple directories each containing a /bin "file", and all
> > those /bin "files" are distinct things.  Windows systems have multiple
> > directories each containing an "Application data" folder, and those
> > are all distinct.
>
> Agreed - a file system is not a perfect analogy because, as you point
> out, files in different directories can have the same name, so you
> need a combination of directory name and file name to uniquely
> identify a file. In the scenario that I am trying to describe, that
> does not apply.
>
> > I don't fully see what you mean by "creating an entry in the
> > directories table that points to a file".  Relational database designs
> > do not include "pointers".  And you don't say on the directory level
> > which files it contains, instead you say on the file level to which
> > directory each file belongs.
>
> I don't know the correct terminology, but take your example of a bill-
> of-materials structure. Each element in the structure 'is',
> 'represents', 'points to', 'references', an item in a 'products'
> table, which will contain all the attributes of the product in
> question.
>
> This is also not a perfect analogy for the scenario I am trying to
> describe, because in my scenario only the 'leaf' nodes represent items
> in a separate table.
>
> How about a 'menu' system as an analogy? A typical cell phone has
> hundreds of functions available to the user. Instead of presenting the
> options in a long list, they are grouped in the form of menus, with
> sub-menus, of arbitrary depth, ultimately arriving at a function.
>
> Only the 'leaf' nodes represent something 'tangible' - an executable
> function. All the other nodes are purely descriptive, and their only
> purpose is to assist the user in arriving at the function they are
> looking for.
>
> Assume we have a table of 'functions', with columns such as function
> name, description, and a pointer to the executable code.
>
> If we create a separate table representing the 'menu' system, how do
> we ensure that all 'leaf' nodes represent functions?
>
> Maybe you gave a clue earlier, when you said -
>
> >  you don't say on the directory level which files it contains,
> > instead you say on the file level to which directory each file belongs.
>
> Substituting, are you saying that "you don't say on the menu level
> which function it contains, instead you say on the function level to
> which menu each function belongs."?
>
> Are you suggesting the following?
>
> CREATE TABLE menus
>   (node_id INT PRIMARY KEY,
>   ..., ..., ...)
>
> CREATE TABLE functions
>   (function_id INT PRIMARY KEY,
>   ...,
>   menu_id INT REFERENCES menus)
>
> I can see the benefit of this approach, but I can see some issues.
>
> 1. How do you ensure that functions.menu_id only references 'leaf'
> nodes?
>
> 2. How do you ensure that every 'leaf' node is referenced by a
> functions.menu_id?
>
> 3. How do you prevent a user from expanding a 'leaf' node and giving
> it children?
>
> All of these issues can be handled at the application level. Are there
> any SQL constraints that could be used?
>
> All comments welcome.
>
> Frank

I'll leave original message intact and copy-paste the parts to which I respond.

> This is also not a perfect analogy for the scenario I am trying to
> describe, because in my scenario only the 'leaf' nodes represent items
> in a separate table.
>
> How about a 'menu' system as an analogy? A typical cell phone has
> hundreds of functions available to the user. Instead of presenting the
> options in a long list, they are grouped in the form of menus, with
> sub-menus, of arbitrary depth, ultimately arriving at a function.

Graph theory applies only to scenarios where, and only to the extent where, there is a conceptual 'sameness' between the nodes. That 'sameness' in a file system is the fact that a directory, just like any other kind of file, has an identifying name within the file system, has some kind of physical content (bits&bytes) that is physically stored on disk, and that it can be, just like any other file, contained within another directory. The conceptual 'difference' is that typically, interpretation of the physical contents (the bits&bytes) of a directory is reserved for OS- or FS- provided services (system-provided at any rate), and not user programs.

Menus are completely analogous to file systems. Menus are like menu entries in that they can both be contained within another menu, menus are different from menu entries in that only 'actual' menu entries have an "executable function" linked to them. Well, on second thought, is that really so ? If you regard a function, perhaps systemprovided,  such as "display submenu" as an "executable function", then the conceptual difference has been abstracted away.

> Only the 'leaf' nodes represent something 'tangible' - an executable
> function. All the other nodes are purely descriptive, and their only
> purpose is to assist the user in arriving at the function they are
> looking for.

Is that really so ? If a menu is existant, but empty, is that menu then not a leaf node ? Can you not have empty menus ? Can you not have empty directories in a file system ? It never ceases to amaze me how people consistently consider the empty set as an indication of something being wrong ...

> Are you suggesting the following?
>
> CREATE TABLE menus
> (node_id INT PRIMARY KEY,
> ..., ..., ...)

I assume that the ellipsis here stand for either the attributes needed to make this a graph-represented-as-nested-sets, or else graph- -as-an-adjacency-list.

As an adjacency list, I'll assume the attributes will be {nodeID, parentNodeID}.

If there are other details to be kept about the menus (menu title and/ or description, e.g. ?), they could be in a separate table, or in the same table, depending on circumstances. This aspect is a bit irrelevant.

But note explicitly that _there must not be entries in this table for the containment of a *function* in a menu_. That is, there are no rows in this table where the nodeID is a value that is actually a _function_ID. You can do this (and I suspect it's indeed what you have in mind), but then you'll run into other problems. I'll discuss those later, separately.

What are the typical constraints and design problems related to this table ?

(1) What is the root node ?

A root node, unlike all the others, does not have a parent node. You can address this problem either by applying relational theory very rigorously, and provide a separate table that identifies the root node (CREATE TABLE ROOTNODE (nodeID)), or you can address this problem with a bit of a hack, by inserting a self-referencing row (or yet worse, a row containing a null).

At any rate you want to constrain your root node to be "unique in the entire universe", i.e. your "set of root nodes" must be a singleton (unless of course your business case requires you to be managing multiple independent trees/hierarchies simultaneously and independently).

In relational theory, you could do this by defining a key with no attributes on table ROOTNODE. Alas, SQL does not support this, so if you do define table ROOTNODE in an SQL system, you are still left with work to do to enforce this (an ASSERTION to the effect that SELECT COUNT(*) FROM ROOTNODE <= 1). If you do not define a separate table to identify the root node, then "identifying the root node" will look something like SELECT nodeID FROM MENUS WHERE nodeID = parentNodeID; In fact you could define ROOTNODE as being exactly that view of table MENUS ... But in this case too, you are left with work to do for enforcing the rule that there is at most one row that satisfies 'WHERE nodeID = parentNodeID'.

(2) A statement of something being a parent node, must be a valid reference to an existing node.

If you have the single-table design with the "self-referencing row hack", then this is achievable by declaring an FK from parentNodeID to nodeID.

If you have the two-table design with the "separately identified root node", then you'd need to create an ASSERTION to the effect that all values that appear as a parentNodeID, appear either as a nodeID in MENUS, or else as the nodeID in ROOTNODE. That is, you have a "foreign key" from parentNodeID to the nodeID as it appears _in a VIEW_ that is the UNION of ROOTNODE with 'SELECT nodeID FROM MENUS'. And while you could create such a view (CREATE VIEW ALLNODES AS SELECT nodeID FROM ROOTNODE UNION SELECT nodeID FROM MENUS), SQL will not allow you to create an FK reference from some table into this view.

So in the two-table design, you cannot enforce this rule declaratively. This is SQL systems (not SQL itself, mind you, the standard does support this) actively punishing you for choosing the design that is, from a logical perspective, the most (and in fact the only) correct one.

(3) Menu items are unique.

If you have the one-table design, you achieve this by the KEY declaration that you already mentioned.

If you have the two-table design, then in addition to the KEY on MENUS, you must enforce the rule that SELECT nodeID FROM ROOTNODE is disjoint with SELECT nodeID FROM MENUS. This is an ASSERTION to the effect that SELECT COUNT(*) FROM (SELECT nodeID FROM ROOTNODE NATURAL JOIN SELECT nodeID FROM MENUS) == 0. Once again, this is SQL systems actively punishing you for choosing the right logical design.

> CREATE TABLE functions
> (function_id INT PRIMARY KEY,
> ...,
> menu_id INT REFERENCES menus)

Yes.

> I can see the benefit of this approach, but I can see some issues.
>
> 1. How do you ensure that functions.menu_id only references 'leaf'
> nodes?

First, do you _really_ want to impose that rule ? Do you _really_ want to enforce the rule that "functions" can _only_ be mentioned on "leaf'" menus ?

Suppose you do. You obviously understand that a simple FK from FUNCTIONS is not strong enough. Because the set of nodeIDs that are "valid" for functions is a proper subset of the set of *all* nodeIDs. The FK would have to be from FUNCTIONS into a VIEW that lists all, and only, your leaf nodes : CREATE VIEW LEAFMENUS AS SELECT nodeID AS leafNodeID FROM MENUS A WHERE NOT EXISTS (SELECT * FROM MENUS B WHERE A.nodeID = B.parentNodeID). Or CREATE VIEW LEAFMENUS AS SELECT nodeID AS leafNodeID FROM MENUS EXCEPT SELECT parentNodeID AS leafNodeID FROM MENUS. If you really want this constraint, then it is up to you to enforce it no matter what. The problem is completely analogous as the one of enforcing an FK into that UNION view.

> 2. How do you ensure that every 'leaf' node is referenced by a
> functions.menu_id?
> 3. How do you prevent a user from expanding a 'leaf' node and giving
> it children?

I suspect that you are indeed thinking of having rows in table MENUS where the nodeID represents a function. I think this is a mistake, and here's why : because if you "reserve" the rows in the MENUS table for "genuine" menus only, then I think there is no need to enforce these constraints at all ! If you "reserve" the rows in MENUS for the "genuine" menus, then the mention of {functionID and menuID} in table FUNCTIONS makes that an "unexpandable" leaf node by definition.

But if you do want to enforce these, then :

As for (2), consider this again. You are saying that you want to prohibit "orphaned" functions in your database, right ? In that case, in addition to the FK from FUNCTIONS to MENUS (/LEAFNODES), you want a similar thing from LEAFNODES to FUNCTIONS. For one, this creates the kind of chicken-and-egg problem that SQL has addressed by defining DEFERRED CONSTRAINT CHECKING. But more importantly, because LEAFNODES is a view, SQL supports this stuff only in the form of CREATE ASSERTION, which no currently existing SQL system allows you to use, thus which isn't of any help to you.

As for (3), you probably can guess by now : this can be done by defining a certain ASSERTION, to the effect that no functionID that appears in the FUNCTIONS table, will also appear as a parentNodeID in MENUS. Thus, this is once again a disjointness constraint : 0 = SELECT COUNT(*) FROM (SELECT parentNodeID AS n FROM MENUS NATURAL JOIN SELECT functionID AS n FROM FUNCTIONS). Alas, ASSERTIONS are always for the designer to implement.

And now of course, you're going to ask, "if I can't have the full menu in the MENU table, then how can I know what the menu must look like ?"

The answer is, obviously enough, views :

CREATE VIEW ALLFUNCTIONS AS
 SELECT "F" AS itemType, functionID AS itemID, menuID FROM FUNCTIONS; CREATE VIEW ALLSUBMENUS AS
 SELECT "M" AS itemType, nodeID AS itemID, parentNodeID AS menuID FROM MENUS;
CREATE VIEW ENTIREMENUSTRUCTURE AS
 SELECT * FROM ALLFUNCTIONS UNION SELECT * FROM ALLSUBMENUS; Adapt according to personal taste and user requirements.

(Note that you might consider it desirable to also enforce a disjointness constraint between functionID values and nodeID values from MENUS, but this is not strictly necessary (because you can always inspect the ITEMTYPE and use it where necessary). If you do want this, once again no SQL system will help you and you're simply on your own.)

> All of these issues can be handled at the application level. Are there
> any SQL constraints that could be used ?

Well, I think I've said about all I had to say.

Except, you might try and figure out where exactly the "nested sets" approach makes addressing any of the problems mentioned here any easier or less labour-intensive.

Nested sets are a hack to bypass (and note that I didn't say "overcome" here) the difficulties when having to compute closures over adjacency lists. (And thus to bypass the shortcomings of certain SQL implementations whose support for recursive querying is anywhere from nonexistant to unacceptably poor.)

Oh, wait, no, I'll give you two references of good books that touch on these issues :

"Practical Issues in Database Management", by Fabian Pascal (I've heard that it's out of print, but the dump circuit might still provide copies). There's a dedicated chapter on "dealing with graphs and hierarchies" in this book.

"Applied Mathematics for Database Professionals", by De Haan and Koppelaars. It's a brilliant book on formally specifying database designs, plus there's a very large dedicated chapter on "enforcing database constraints in SQL systems". Received on Wed May 18 2011 - 08:31:01 CDT

Original text of this message