Relational and MV (response to "foundations of relational theory")

From: Eric Kaun <ekaun_at_yahoo.com>
Date: Wed, 18 Feb 2004 02:05:23 GMT
Message-ID: <DLzYb.23126$ft2.1420_at_newssvr16.news.prodigy.com>

This is in reply to the “foundations of relational theory?” thread which began on September 26, 2003 and accumulated over 500 responses before petering out on November 7, 2003. Yes, I read them all, and was fascinated. Apologies - I found them only after the thread had run cold.

Perhaps, but it at least appeals to the “logical aesthetic” of Occam (the philosopher, not the language): not to multiply entities beyond necessity. Hierarchies are more complex than relations, and yet express no additional information. Agreed that KISS is not a scientific argument, but it is a practical one: we're trying to build logical machines here, and simpler concepts are easier to mechanize.

There is of course an argument to be made for expressiveness; you can “do” logic using various minimal subsets of combinators (S, K, I, etc.) which I barely understand, but I would argue they're too simple to be expressive for humans (although people have written programming languages using them, primarily - I suppose - as a learning exercise). You can manipulate relations using only two operators: remove and nor. This minimal set is relationally complete (as are many others), but difficult to work with, so comprehensibility was a factor in the design of the usual relational operators: restrict, project, join, etc.
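To make that concrete, here's a rough sketch in Python (the operator names and the tiny relations are mine, purely for illustration - nothing here is how any particular DBMS implements them):

def restrict(relation, predicate):
    # Keep only the tuples satisfying the predicate.
    return {t for t in relation if predicate(dict(t))}

def project(relation, attributes):
    # Keep only the named attributes of each tuple.
    return {tuple(sorted((a, v) for a, v in t if a in attributes)) for t in relation}

def join(r1, r2):
    # Natural join: combine tuples that agree on their common attributes.
    result = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                result.add(tuple(sorted({**d1, **d2}.items())))
    return result

emp = {(("dept", "d1"), ("emp", "smith")), (("dept", "d2"), ("emp", "jones"))}
dept = {(("city", "london"), ("dept", "d1")), (("city", "paris"), ("dept", "d2"))}

# "Which employees work in London?" - a composition of the three operators:
print(project(restrict(join(emp, dept), lambda t: t["city"] == "london"), {"emp"}))
# {(('emp', 'smith'),)}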

You focus on just one type of compound data because it gives you all you need, and strikes a workable balance among simplicity, expressiveness, and combinatorial power. You stray from it at the expense of one or more of those “attributes.”

Mathematicians and scientists have different definitions of “better.” For scientists, better means more accurate at predicting the future (typically of physical reactions and events). For mathematicians and logicians, better typically means more amenable to symbolic manipulation and mechanization, and better able to describe formerly disparate models. For a data model, we want a minimal set as described above, but as long as we can define “higher-level” operators and types in terms of the minimal ones, there's referential transparency: simple substitution allows us to write expressions that are easier for us to understand, but still correct by virtue of being defined by base concepts that are opaque but simple to mechanize.
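A toy example of that substitution principle (Python again, and again the names are my own choosing): intersection needn't be a primitive at all - for two relations with the same heading, it is just a natural join, so any expression using intersect can be rewritten in terms of join without changing its meaning.

def join(r1, r2):
    # Natural join over relations represented as frozensets of (attribute, value) pairs.
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(frozenset({**d1, **d2}.items()))
    return out

def intersect(r1, r2):
    # "Higher-level" operator, defined entirely by substitution into the base one.
    return join(r1, r2)

a = {frozenset([("id", 1)]), frozenset([("id", 2)])}
b = {frozenset([("id", 2)]), frozenset([("id", 3)])}
print(intersect(a, b))  # {frozenset({('id', 2)})}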

That said, hierarchies are neither higher-level nor more expressive for modeling data. They are more expressive in the limited context of a presentation to humans, in some cases – but then again, so are trees, tables, etc. In any event, those can all be generated from relations, but it must be understood that doing so involves implicit decisions (e.g. ordering, path) that are not present in the relations themselves.

The same can be said of many technologies. The losers in a horse race still live – sometimes even when they lose repeatedly. It doesn't mean you should bet on them. Saying that it's the norm for companies to stick with MV speaks primarily to any business's unwillingness to change (especially when it perceives no reason to).

The first part of this is completely wrong when you consider multiple applications which use the same database (or at least parts of it). This is the norm in companies in which I've worked. Yes, what one does with the data is part of what defines it; types and constraints in RDBMSs express what is and isn't allowed, yet also define what can be done through operators, both type-specific and relational.

Popular success proves nothing except success (i.e. A => A). And all applications have constraints, many more than are typically placed in the database; just because you're writing them in code doesn't mean they don't exist.

So evidently that “data set” was so unimportant that it was only displayed – sounds like any CLOB or BLOB or string would work in that case. Why bother to nest values? Relational could define a dumb List type to do this anyway.

I've never seen a requirement phrased like that, but why don't we still use assembly language and raw file system manipulation? The entire history of computing, math, logic, and even science is based on abstraction; at what point do you think we went too far? Today, compilers generate better code than most assembly programmers can write most of the time, thanks to years of analysis and “best practices” made concrete in code. Shouldn't we let our DBMSs do the same? Date's book makes it clear that the relational model enables much optimization (based on logic!) that the flaws in SQL effectively disable.

No, it's not. Since I can generate many “containment” hierarchies from a 3rd normal form database, I would argue the opposite. For example, from relations, I can show not only children “nested” inside parent “records,” but also the reverse. I suspect that just using the term “subset” isn't going to do either of us any good – perhaps a category theorist would pipe in with a proper definition of what we're each trying to say?
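A trivial illustration of what I mean (Python; the relations and attribute names are invented): from the same two flat relations you can produce either nesting on the way out, which is exactly why neither nesting belongs in the stored data.

departments = [{"dept": "d1", "name": "Sales"}, {"dept": "d2", "name": "Ops"}]
employees   = [{"emp": "smith", "dept": "d1"},
               {"emp": "jones", "dept": "d1"},
               {"emp": "brown", "dept": "d2"}]

# Hierarchy 1: departments "contain" their employees...
by_dept = {d["dept"]: [e["emp"] for e in employees if e["dept"] == d["dept"]]
           for d in departments}

# ...and Hierarchy 2: employees "contain" their department.
by_emp = {e["emp"]: next(d["name"] for d in departments if d["dept"] == e["dept"])
          for e in employees}

print(by_dept)  # {'d1': ['smith', 'jones'], 'd2': ['brown']}
print(by_emp)   # {'smith': 'Sales', 'jones': 'Sales', 'brown': 'Ops'}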

Which language? English? I'd wager that a human language has different requirements, and a completely different development process, than a computer language. Unless you're referring to Vulcan, human language isn't based on logic, nor should it be (?). Nor vice versa. What's the value in using human language as a basis for data and queries?

Since there have been no truly relational DBMSs until recently, you're right. Neither is there “proof” on the Pick/MV side, though there are some vociferous proponents.

Relations are explicit, and types and constraints make the meaning of your data far more explicit. MultiValues express, at best, one kind of hierarchy.

That implies no one uses Euclidean geometry or Newtonian mechanics, which I think is a false statement – any practicing engineers care to pipe in? It also implies that the same mathematics doesn't underlie quantum physics and relativity; at worst, Euclidean geometry and Newtonian mechanics are the same mathematics limited to a particular context, within a certain tolerance.

In any event, comparing mathematics and its correlation to the real world with the relational model's correlation to reality is a misplaced comparison; better to compare a given data model (for a specific business) with the real-world external predicate it's meant to represent.

So does practically anything. It does sound, however, as though the Pick environment(s) offer useful application-building tools, despite a limited data model. The Pick environments are probably excellent implementations of the “Pick data model,” whereas our current SQL implementations are pathetic implementations of something only vaguely pretending to have once been relational. Note, however, that however flawed SQL is (and it was even worse in the 80s!), it displaced the previous hierarchical and network data models very quickly, (possibly) indicating a great degree of dissatisfaction with those. We should remember that history, lest we be condemned to repeat it.

Strange, then, that the educational system is becoming more of a trade school, teaching not relational but SQL, and even specific implementations of SQL. Relational at least has a logical and mathematical foundation worthy of study. Since the Pick theory papers seem to have fallen behind a filing cabinet somewhere, there's not much to study.

Yes, it shows that logic and mathematical analysis have some place in designing logical machines (programs). Knuth's pride might just be based on the quality of his thoughts, and the logical strengths of the algorithms. It further reiterates the difference between levels of abstraction, and the design / implementation divide. It's not hard to fathom that given an implementation (e.g. a processor that represents numbers in a certain way, and manipulates words of a given size at a given speed), you can compare two algorithms based on what they do and the mapping of their elements to constructs within the processor. You don't need to code every possible algorithm to know something about the results; otherwise, how did any mathematics ever develop?

Don't forget that Donald Knuth (whose specialization was the analysis of algorithms) quoted C.A.R. Hoare (one of the pioneers of axiomatization in software) in writing: “Premature optimization is the root of all evil [in programming].” Forcing one particular type of premature optimization at the language or DBMS level is undoubtedly evil; even some pseudo-relational SQL DBMSs have clustering, which allows the DBMS to physically do exactly what you're describing with NO impact on database design or program logic, and which can furthermore be tailored later to actual usage and data patterns. Once again, the Pick “model” is a poor “subset” of the relational one.

The access pattern you describe is the simplest one of all: retrieving a record given a primary key value. If Pick does that quickly, fine; that still says nothing about its utility as a model. Any system I've ever used has many patterns of data access, and I'd hate to optimize for the simplest path at the expense of the others. Especially prematurely. And especially when that same optimization can (and should) be done behind the scenes without impacting every program that relies on that data.

Again, why don't you develop in assembly language? Much closer to the metal (as if that were an advantage). Or at least use file system primitives. Memory management primitives too – I'm sure you can gain some speed advantages with custom paging.

Though apparently you believe this lack of typing to be an advantage, its potentially catastrophic disadvantages are among the problems relational was designed to address. Why exactly do you feel more comfortable giving programmers instructions on how to “use” a field, as opposed to defining a type which restricts what can be put into a tuple's attribute, as well as the operations defined over it? Again, I wouldn't feel comfortable with someone putting “abcdefghi” into an SSN field – though if that were a special value of some sort, a true RDBMS would allow me to define a domain that allows just that! YOU decide what the data means, and the RDBMS helps you make it stick!
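For instance (a sketch only - the SSN format and the special value here are hypothetical, but any decent type system makes this trivial):

import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
SPECIAL_VALUES = {"UNKNOWN"}  # explicitly sanctioned by the domain's definition

class SSN:
    # A domain: construction fails for anything the domain does not admit.
    def __init__(self, value):
        if value not in SPECIAL_VALUES and not SSN_PATTERN.match(value):
            raise ValueError("not a legal SSN value: %r" % (value,))
        self.value = value

SSN("123-45-6789")   # fine
SSN("UNKNOWN")       # fine - the domain was *defined* to allow it
try:
    SSN("abcdefghi")
except ValueError as e:
    print(e)         # rejected by the domain itself, not by a programmer's memo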

Again: if the rules aren't in the database, then every application has to encode them separately. How is this an advantage?

There is only one file structure: a sequence of bytes (we're abstracting above magnetic fields and bits). There is only one programming language: the raw machine code (even assembly is an abstraction). Why don't you use those constructs? Once you've answered that, apply the same reasoning to what's being said here.

In software development, we consistently aim too low; the computer could do so much more work for us. But instead we feel we know better, can optimize code and data access better, etc. The history of compiler development shows that we're wrong.

That's just patently untrue (all tools are not the same – that's a fact in any field of endeavor), and would have us all content to keep using what we know and never reach for anything more. Again, as in any field, we need to evaluate tools on merits other than “well, it's worked for me so far!” And a business has to consider support costs, maintenance costs, training costs and economies of scale in selecting standard tools, rather than allowing every developer to use whatever he or she desires.

Chris Rock (a comedian) said (paraphrasing): “You can drive a truck with your feet if you want to – that doesn't make it a good idea.” The same applies to many things (including software development). The Pick model, apparently, works because you can ignore it? And if you can ignore it, then the DB is essentially a file wrapper and says little about what's actually contained in it. Since the database represents the logical predicates of your application, “using as much or as little of the DB as we care to” essentially allows programmers (possibly on different applications or projects) to ignore whatever portions of the system's logic they like!

I don't understand why persisting a “language proposition” (as if human language were a good model!) is better than persisting a logical one. Human language is a poor basis for mechanizing reasoning (unless maybe you're talking about Esperanto or Vulcan).

The sort of proposition Pick claims to model better isn't an atomic one. Breaking compound propositions into atomic ones allows better mechanization, substitutability, optimization, etc. And even if you don't break them apart, at the very least you should avoid nesting data about which you want to make assertions – if you do, you're cementing a single (and limiting) path.
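A small sketch of the difference (the order/line-item names are made up): the nested record asserts one compound proposition, while the flat form asserts several atomic ones, each of which can be queried or constrained without committing to a single access path.

nested = {"order": 42, "customer": "acme",
          "lines": [{"item": "bolt", "qty": 100}, {"item": "nut", "qty": 200}]}

# Atomic propositions: "order 42 was placed by acme", "order 42 includes 100 bolts", ...
orders      = [{"order": 42, "customer": "acme"}]
order_lines = [{"order": 42, "item": "bolt", "qty": 100},
               {"order": 42, "item": "nut",  "qty": 200}]

# "Which orders include nuts?" - no commitment to an order-to-line path required:
print([l["order"] for l in order_lines if l["item"] == "nut"])  # [42]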

The implied parent-child, one-to-many relationship can be modeled in relational too, very easily. Placing them in a single record, as Pick would have you do, may be several seconds less work than declaring a relational foreign key constraint. The flaw is in the word “primarily” - I've seen that assumption change many times in the course of a single project, and for any database worth its salt (one which truly models a piece of the business, not just some tiny segment), “primarily” is purely a matter of perspective.

More to come, in case you're having trouble sleeping. :-)

- Eric
Received on Wed Feb 18 2004 - 03:05:23 CET
