Relational and multivalue databases

From: Eric Kaun <ekaun_at_yahoo.com>
Date: Wed, 18 Feb 2004 01:55:04 GMT
Message-ID: <YBzYb.23125$Ir2.20735_at_newssvr16.news.prodigy.com>



This letter (monograph?) is in reply to Dawn Wolthius, who received what she seems to have considered a short and unhelpful series of responses from Fabian Pascal between October 9 and November 12, 2002. This was also motivated by postings on the comp.databases.theory newsgroup, specifically the “foundations of relational theory?” thread which began on September 26, 2003 and accumulated over 500 responses before petering out on November 7, 2003. The length of my response to all this indicates the depth of my fascination with the entire discussion.

For those who would flame me for opening old wounds and awakening sleeping dogs, I apologize, but I believe this discussion still has merit. I can appreciate Fabian Pascal's apparent crankiness, given that he's been answering the same questions and debating the same issues for over 20 years now. I do see reason to resurrect this debate (pedagogy), but it needn't be the same people who do it generation after generation.

And kudos to Dawn for always being willing to ask questions, explain herself, and to remain cheerful in the face of insults.

(I'm assuming by “database” you mean “DBMS,” and there are no RDBMSs yet, excepting the allegedly excellent Dataphor, which I've yet to use.)

It sounds as if the MV development environment (IDE, reporting tools, etc.) is to credit for productivity gains. I'm far more concerned with (provable) correctness than productivity (not that the two are unrelated, unless you're committed to a “code-and-fix” cycle), but have always found that a logically solid foundation (of which “conceptual integrity” is only a diluted description) will enhance productivity – in the short term as well as the long term, though I would also posit that there's an exponential benefit with the size and complexity of the system, lifetime of the product, and turnover within the team.

“Variable length data structures” are allowed by the relational model, which places no restrictions on types (aka domains). It's lousy “relational implementations” like SQL, and even lousy SQL implementations (pick any of them), that fail the relational model utterly by offering only a crippled type system, and until recently, either nonexistent or highly proprietary type definition facilities. The type of your attribute determines whether length is significant; for example, wouldn't you want a U.S. Social Security number to be restricted to a maximum of 9 digits (as well as a minimum of 9 digits)? Wouldn't you want a UPC code to be limited to 1-5-5-1 digits (if I remember correctly), and also to guarantee that the final check digit is a valid checksum?

Current SQL implementations have DATE, INTEGER, and others. “Data structure lengths” violate the relational model in at least two ways: not allowing users to define a type (including operations), and exposing physical implementation details (of built-in types).

The databases you referenced are not relational. Fixed lengths I addressed. As to strong typing, you'll have to be more specific... but if you like, you're free (even in SQL!) to define every attribute as a general-use type (e.g. a BLOB / CLOB / string), and then manipulate it as you like in individual application programs. Every program will have to be aware of how to properly address characters or bytes within the general type (since if you've bothered to define a field it typically has some meaning), and the DBMS will be unable to restrict what's placed in that “data element.” This loses a great deal, since declaring types to an RDBMS allows the definition of your “type rules” in one place, and disallows operations that would corrupt your database (for example, placing “ABCDEFGHI” into the aforementioned Social Security Number attribute).

By “specifying data,” I'm assuming you're referring a combination of type definition, relation definition (including normalization), and constraints.

The problem is that we're most of us not psychic, and any database I've ever developed (or inherited) has outgrown its originating application, which means that in designing a database, you have to look at the business domain and not just the queries being asked for right now, which won't necessarily predict future query needs. Unless, of course, your application is so awkward that the users have no desire to extend it, or the company's automation needs are so limited that they have no need to extend it.

Relational shines in its ability to model data of all sorts in a minimal and egalitarian fashion which is closely-aligned with logic, so that you don't repeat yourself (and have to write extra code to synchronize the redundant elements). This applies to every aspect of the relational model and its attendant (though orthogonal) techniques: normalization, type definition, constraints. All of these are designed to allow you to state a required predicate once and be done with it. Predicates underlie the relational model at every turn; it's the single driving concept, rooted in logic and directly applicable to all data. This is the concept that most people miss, as it's not stressed nearly enough (by enough people). It's a not-entirely-unexpected nicety that specification languages like Z and VDM use predicates as well; hence the practical value of relational in executing queries augments its ability to model high-level specification abstractions.

The root problem is that the “entity-relationship” view of data is horribly limited; relationships often require (or acquire) enough additional attributes and “business logic” to be reasonably viewed as entities in their own right, and at that point you'll find you've been treating a first-class citizen as a second-class one, and in addition marring your syntax with arbitrary (and unnecessary) path expressions. In “An Introduction to Database Systems (8th Edition),” C.J. Date gives a good example of this phenomenon: that of marriages. While you could regard a marriage as a simple “relationship” between two “entities,” queries like “How many marriages took place between 1972 and 1974 in Old St. Luke's Episcopal Church?” obviously “view” a marriage as an entity – and it has attributes such as date, time, venue, etc. Line items on an order, for example, could just be viewed as a relationship between an order and a stock item, right?

Don't think of entities and their relationships; those shift. Predicates are far more stable, and treating them as individual concepts will do much to enhance their use and reuse.

Perhaps, but that seems a highly subjective statement predicated on the fact that you know (and enjoy) the concept to begin with. I could just as easily state that people talk and think in terms of predicates, and that despite being an object-oriented programmer, I find the use of predicates far more useful and “roll them up” into objects only as late as humanly possible. I've found that business people talk much more in terms of raw “rules” - which correlate much more directly to predicates than anything else. Using objects (as in OOA) can be useful to flush (flesh?) out additional concepts, but are not the lingua franca of business, despite what OO pundits will tell you.

Besides, people often talk and think in terms that are various combinations of profane, vague, contradictory, unrealistic, nonsensical, wishful, etc... that doesn't mean our logic machines have to work that way!

I would take resemblance to XML to be a damning attribute until demonstrated otherwise... and as far as productivity, I'll assume you're right; that doesn't say much about the model itself, at least not to me. I personally want my DBMS to be as unforgiving as possible, as long as it's unforgiving of someone trying to violate its rules! The rules (relations, types, constraints) are how I state what the data “means.” That's how I protect my company's data from corruption (e.g. from having its representation or encoding mangled into a “value” that violates the data's meaning).

“Firewalling” is something every good programmer does, even within his or her own code. It helps prevent mistakes, it helps protect us from the logical (though sometimes undesirable and unpredictable) effects of combining multiple pieces of code, and best of all, it expresses our intentions. All software is description, and explicit is best.

For initial costs, I can't say; in any event, they're dwarfed by ongoing costs for a system of any degree of importance (and barring license price gouging, which I've seen). Ongoing costs are tricky to measure; however, the cost of data inconsistencies (violation of rules) is likely to dwarf productivity losses. Regarding productivity, this is nearly impossible to meaningfully compare, but I've been able to develop in 1 day (actually during the meeting where we were discussing the requirements for it!) an application consisting of 20 tables, a dozen screens, and several reports, against an MS Access “DBMS” which was later ported (in very little time) to SQL Server. I don't think this is either impressive or unusual (it occupied me during a dull meeting), but the constraints were never a hindrance; in fact, a properly relational language would have allowed me to express many more “business rules” directly in the database, rather than in my code, thus saving me time and potential errors.

  • It really isn't very time-consuming to add SQL tables, given even crude database design tools.

Dawn wrote: “Of what are we [those who would use XML] ignorant? Of some theory or practical advice?”

Possibly the theory, or possibly of the practical value of the theory, or possibly neither (I can't make a blanket statement of ignorance, but do believe that XML is a huge leap backward for this industry, and will hasten the decline of the reputation of programmers everywhere once light is shined on the naked Emperor).

Read chapters 3-10 in Date's “Intro to DB Systems (8th edition).” Then read additional chapters to see how the pure theory translates into direct benefits including such “pragmatic” topics as database distribution, concurrency and recovery, type systems (!), and even code generation (actually declarative programming, but code generation is one implementation that's more marketing-friendly).

First, see the dangers of dividing relations (predicates) into “entities” and “relationships.” And keep in mind the excellent analogy from Hugh Darwen: “Types are to relations as nouns are to sentences.” Types (domains / attributes, sort of) are the things we can speak about; relations are what we can say about them. Since you refer to language all too frequently in referencing MultiValue, the noun/sentence distinction should strike a chord.

This might be OK as long as the attribute is one on which you're not doing a computation (in which case its cardinality can't be easily increased or decreased), and as long as you never need to track additional information about those attribute values. How would you store, for example, the model and year of the car or the color of the bike? Additional attributes – and if so, how do you correlate those with the original car and bike attributes? If they're separate attributes, then what does it mean if I have 3 bike “values” and 2 color “values”?

And the relational model “makes sense” of this sort of “relationship” perfectly well; in addition, it allows you to formulate many more constraints about the data, and these constraints are an important aspect of “business rules” that you would otherwise need to program (perhaps repeatedly).

Hierarchies are much more uncommon, and much less useful, than most people seem to acknowledge. Even the cliché “org chart,” while hierarchical, fails to capture the matrix reporting which occurs in most organizations.

They are in relational too, and even in SQL.

Only if the data element isn't involved in actual logic (e.g. the text examples like car names I usually see in Pick examples). If the data element is numeric, does making it a list mean that reports now have to sum the elements in the list? Average them? If I'm only displaying them, why even a list? Why not a string?

And, for the record, the relational model supports lists (or relations!) as types as well. It just isn't typically a good idea, unless it's an attribute with which you do nothing but display on reports and screens.

Intuition is a poor basis for logical and data decisions, as it's subjective and prone to change. In particular, a database consisting of many relations can be used to generate an uncountable (?) number of hierarchies, depending on which (and how many) joins are done, and in what order, and on what attributes. Which one you choose depends on your immediate needs, but see above regarding the durability of “query needs”; they're prone to change, so why bother with a hierarchy in the first place when it expresses only one thing you're going to need to do?

Agreed. The “type constraint” you mention isn't one; it's because the non-relational SQL doesn't offer real type definition, and exposes unnecessary physical implementation details, that such things exist at all. That ain't relational. So how do MultiValue DBMSs enforce types; for example, that an SS# has to be 9 digits? Does the DBMS know or enforce this? Does it know or enforce any restriction at all on types? While unnecessary bounding is a problem, infinite bounding (allowing any values in any data element) is also a problem, as I'm sure you can see. Dawn wrote: “I suspect there are times when having a constraint coded into an RDBMS increases the maintenance work to an extent that is not cost-justifiable, particularly when the constraint is one that is prone to change.”

Constraints are or aren't; they're seldom prone to change. Changes like that usually indicate either an overly-specific constraint (which a good RDBMS would allow to be generalized) or shallow analysis. If relational systems (all 1 of them!) had a consistent catalog (which represents, in relational form, the structure of the database), then you could possibly add constraints to the catalog itself (since they're just relations too!), which would further enforce requirements. This is beyond me, but I theorize that it could both be done and be very expressive, not implicit in the code of many programs across the business.

This places the burden of maintenance in every program that uses your file/table; every application has to know enough about that data element not to violate how you intend it to be used. Because those types are not in the database, you'll need a document of some sort to describe to each programmer how to use the data! Or, at least, you have to hope your data element names are suggestive enough to describe to people how to use it. You risk much with this sort of assumption; I don't think there's anything simple about it. You're simply delaying paying for the “simplicity,” but that loan accumulates interest.

Again, if the data is only displayed, this might be acceptable. But type systems can be very rich, and that richness is extremely useful, and practical. It lets you (again) say something of critical business value, once and only once.

The hierarchies that MultiValue allows you to establish aren't single propositions, unless you restrict yourself you propositions about the “top-level entity” in your file. Attributes in a relation are conjunctive clauses (e.g. “the employee has id ID and has name NAME and was born on BIRTHDATE”). If you “nest” the employee's dependents, for example, you're making a number of additional assertions, and merely glossing over the fact that they're separate – furthermore, this approach becomes intractable if one of those dependents becomes an employee.

If you read about “antipatterns,” you might know one called Big Ball of Mud. The idea is that applications are best factored into multiple pieces, so no one piece becomes too complex. Propositions are similar – why lump them together unless you have to?

On the contrary: integrity constraints, and types, define not just the spine but the skeleton and muscle of your application. It's only flaccid relational-scented SQL implementations that have diluted this fact. I hope for great things (including application generation) from the relational implementation in Dataphor (and, I hope, soon many others). To repeat what Mr. Pascal said: “It [XML] was invented by people who know nothing about data management – text publishers.” Not a solid foundation for logical machines like programs.

Dawn wrote: “...instead of a programmer coding some procedural code (!) specific to a certain circumstance, the logic is declared in the database, but designated as a local constraint where the next developer working with the database might not want/need to apply it?”

Can you give an example? I suspect your logic isn't sufficiently general, and that the Alphora folks could volunteer some specifics on their D4 language's capabilities in this arena.

  • Eric
Received on Wed Feb 18 2004 - 02:55:04 CET

Original text of this message