Re: some information about anchor modeling
Date: Sun, 10 Feb 2013 23:41:14 -0800 (PST)
Message-ID: <d38c66fc-2799-4b97-bc25-4f1c518dffb5_at_googlegroups.com>
On Monday, 11 February 2013 02:02:53 UTC+11, vldm10 wrote:
Vladimir
Thank you for your response.
That's not Codd's greatest contribution.
Agreed, he applied mathematical theory to databases (and did so completely, with full examples and discussion). Note that I state it in chronological order, personally, I would not say he "devised the theory of database as a mathematical theory", no, he applied pre-existing mathematical theory to database design.
I disagree, he did not suggest that he invented said theory (others have suggested that he did, which is incorrect).
Codd did not cite the source of the mathematical theories that he applied in the RM. I am not disputing your attribution of Frege (I have not researched his article on this subject), but what I do know, before I read your post, is that more than one great author before Codd had developed parts of it. Boole's Logical Calculus comes to mind. So at this stage I would not grant that it was one person, or that that one person was Frege (much as I respect him, for the articles that I have read).
I agree about the story and the fundament.
I think it is a bad paper as well, which is why it is not in my briefcase when I travel. Perhaps explained by the notion that it is exploratory, whereas the RM is definitive.
Ok, AM uses surrogate as per the RM/Ts. The majority (90%) of the database implementers out there use surrogates in a more primitive manner, with no reference to or knowledge of RM/T. I do not see this as a remarkable different when discussing surrogates and their problems.
Which begs the question: what exactly are you defining as surrogates ?
[While we are here, it should be noted that the RM/T gives a full and proper definition of what some imbecile has "defined" as "sixth normal form", written a "paper" about it, without attribution, and without knowledge of its purpose and use, and that his "colleagues" have cited and elevated in their orgies. One hundred percent plagiarism.]
Let me try to relate the issue in a chronological order or a set of progressions.
2. My Highly Normalised Tables are exactly RM/T, minus the surrogates. In IDEF1X, Entities and Non-entities are named Independent and Dependent, but that is a bit too general re the categories identified in RM/T (eg. I would say that Non-entities/Dependents are not allowed to be related (have relationships); only Entities/Independents can be related. It is more of a stricture to constrain the novice modeller from making silly mistakes.
2.1. In my "5NF" databases (no update anomalies; no duplication *of any kind*; all FDs and MVDs resolved [amongst themselves first, before application to the db!]; all present and future "normal forms" satisfied), given an entity has been Normalised correctly, all the attributes (P-Relations) that are mandatory are located in a single Independent table in which the Relational Key (I will not use the term E-Relation, in order to avoid confusion, and because you seem to have a problem with that) is the Primary Key and not a surrogate. All the optional attributes (P-Relations) are located in Dependent P-Relation tables, where the Primary key is the same as the Independent table to which it belongs. There are no Nulls in the db. There are no ambiguities in the db. No surrogates.
2.2. If any of those mandatory attributes were to be used in OLAP fashion, eg. "pivoting" or "columnar access", then I would remove it from the Independent table, and locate it in a separate Dependent P-Relation table, where the Primary key is the same as the Independent table to which it belongs. No Nulls. No surrogates.
2.3. We could do the same for all tables. Remove the consideration of mandatory/optional, and treat all attributes as P-Relations only. Then we have only Independent tables with RKs, no attributes, and Dependent tables with the RKs and one attribute each. In order to avoid royally confusing oneself, relationships are allowed only between Independent tables.
2.4 Generally, I provide a single View for each independent table and its entire cluster of Dependent tables. (Sometimes a series of Views, but that, and the predicates that drive it, are not relevant to this thread.)
3. Another way of stating that is:
the Identifiers are all unary relations
the Dependent tables have all been reduced to binary relations
the Associative tables are all ternary relations.
> > Now, we see that the RM / T needs to know all attributes of the entity, so that it is possible a binary decomposition of the entity.
Yes.
But it is not the RM/T that needs to know all the attributes. The designer decides what attributes are mandatory/optional.
[ I think it would be silly to argue that the mandatory attributes are onerous. If an Employee must have a salary, then we should not be attempting to add an Employee for whom the salary is missing or "missing" or "unknown". That has nothing to do with database design or RM or RM/T, it has everything to do with what the business has decided is required for an Employee.]
> > This is due to the use of surrogates.
No.
I have the same requirements in my HNTs, which have no surrogates. That is due to whatever business rules are implemented, not the use of surrogates.
4. At this stage I still have my genuine Relational Keys; no Nulls; no surrogates.
> > The authors of "Anchor Modeling" claim that their model solves problems with nulls. They wrote the following:
Absence of null values There are no null values in an anchor database. This eliminates the need to interpret null values [27] as well as waste of storage space. See
Anchor Modeling Agile Information Modeling in Evolving Data Environments , Section 9.2 (The article published in Data & Knowledge Engineering in 2010)
Moreover, these authors claim the following: Anchor Modeling is a technique that has been proven to work in practice for managing data warehouses. See section 11 of this paper.
To be fair, I read their earlier paper; the one you quote is more about their temporal implementation. Nevertheless ...
Well, the claims are correct, although I would not state the resolution of the Null issue that way.
5. The big difference (the only one?) between my HNTs and AM is that they use surrogates where I use RKs. I think this is the issue you have with them as well. Please agree or disagree with my simple chronology up to this point, before we launch into the next part.
dbdesign10
5.1. I would like to be able to say, at this point that my [4] is the same as your "DbDesign 10 Knowledge Data Model", at least in the sense that [4] is an implementation of dbdesign10, and dbdesign10 is a generic or template definition (not an implementation). But I can't say that yet, because:
the one big difference that stands out (in my reading thus far) is that I totally accept RKs, and RKs are compound keys, that AFAIC cannot be decomposed. Whereas, your "Keys" do not allow compound keys.
on the face of it your "Keys" are surrogates, but since you decry surrogates, I am sure you are trying to convey something else, that I have not absorbed yet.
CarId is the Car Key. CarKey is not a Key, it is a surrogate, and the column is therefore incorrectly and named, and leads to confusion.
(I think dbdesign10 needs to be elevated in terms of specific statements and clarity, because it takes undue effort to understand it, but let's not get into that here)
6. In general, before getting into the specifics of your claims, I agree that surrogates do not work. But I state that, and the reasons why, in quite a different way.
7. You seem to think that in RM/T, Codd defined and therefore prescribed surrogates. I totally disagree with that. If he did, he would be contradicting himself re what he defined in the RM. The essence of "relational" is relation-by-key, as opposed to the previous paradigm which was relation-by-record_number (or pointer). So my view of RM/T is that either for convenience, or to avoid dealing with the Relational Key requirement of Relational databases, which would complicate the new concepts presented in RM/T, Codd used surrogates. I am sure he is kicking himself in the shins, now that surrogates in the RM/T are used for vastly more than he intended. Note that I apply the RM/T in [4] and I have no surrogates; I do not accept that the absence of surrogates in my [4] as indicating that it does not comply with RM/T.
Relational Keys vs Surrogates
I am viewing RM/T in the context of RM. I do not view it as a stand-alone paper, or as a discipline. Direction and guidelines, yes, but it is not complete enough to be viewed as a discipline. I can erect a perfectly good Semantic Model using IDEF1X (plus a bit of textual documentation, which would be required for any Relational model), using primarily the RM as doctrine, discipline, mandate, and secondarily the RM/T as guidelines and direction. And do so without any angst or contradiction.
From my various reading, I notice that people either use RKs and refer to the RM, xor they use surrogates are refer to the RM/T. That latter is dishonest, as per the two paras above, they take the RM/T out of context.
I have no need to change the papers that were written, or to write books interpreting them, or misrepresenting them (the majority), so I hope I do not fall into your category of people who elevate the RM/T to something that it is not.
You seem to be taking RM/T as a stand-alone paper, a decree. Well, it isn't. But if you do take it like that, then yes, it would fail.
8. I think AM are foolish in giving up RKs and implementing surrogates. Not for the reasons you mention, but because the displacement of the RK-as-PK with a surrogate at each location in the db, eliminates the essential Relational property of the RM, deemes the database non-Relational, and is a complete loss of Relational power between the *parents* of the subject table (not the subject table itself), and its child tables of the subject table.
8.1. I disagree that AM *substitutes* or replaces the RK with a surrogate. Clearly, one of their attribute tables (P-Relation) contains the RK, the K-Relation or K-Role. So the surrogate is used in the normal manner, as a permanent Identifier, a substitute PK, that is an FK in all its child tables.
9. This is doubly foolish in the context of a data warehouse, where Dimensional access is implemented. AFAIC, If the db is Normalised to the degree required for DW use, ie. my HNTs, well then, the Dimensions are already there, as ordinary Entities (per RM/T) or Independents or E-Relations, with RKs, ready for use, in every table in which each Dimension has some content. There is no additional work to do, to produce a DW, or Dimensions. Note it is relying on RKs. So the displacement of RKs [8] where Dimensional access is required is doubly foolish: Relational capability is broken between every related pair of tables. That results in many more joins being demanded, than for a [4] database.
- Now let's take your specific points.
> > Now we have the following situation:
> > 1. I have shown that surrogates can not manage "nulls."
Reference please, I easily cannot find a passage that shows that explicitly.
On the assumption that I understand where you are heading with this one, to be clear, it is the decomposition to binary relations that eliminates Null. Whether the RK-as-PK is displaced with a surrogate-as-PK has nothing to do with it, for or against managing or removing Nulls.
> > 2. Authors of anchor modeling claim that their model manages "nulls", even better than existing solutions.
Since Anchor modeling using surrogates, then it seems to me that the statements of these authors are not accurate.
If by "manage" they mean eliminate, I agree, their design does.
If by "existing solutions", they mean generally "most implementations", I agree.
If I take "existing solutions" to be the body of technical information available to us, which includes the RM and the RM/T, then their claim is nonsense.
Surrogates do not imply elimination or removal of Nulls. The fact that they use surrogates is bad news, but it does not inhibit the removal of Nulls, or demand Nulls.
> > My results related to the design of databases that manage the history of events, the first time I put on this user group in September 2005. The name of the thread was "Database Design, Keys and some other things." In this thread, Joe Celko posted a very good comment regarding surrogates:
(Allow me to ignore the temporal issue, until it becomes something that I cannot ignore. I appreciate that dbdesign10 is primarily to support temporal requirements.)
> > David Cressey: 9/27/05
> > A VIN, a bank account number, and an SSN are all surrogate keys. <<
> > Celko:
> > No; read Codd's definiiton of a surrogate key. These are all
industry-standard, externally verifiable keys with known validation
rules. Honking BIG difference!! The big part of this is that they
are EXTERNAL to the database.
For the record, Celko is an idiot, and Cressey is an even bigger idiot, that even Celko can destroy.
I would not state it the way Celko does, he has it backwards. I would state: read Codd's definition of Relational Key, in the RM by that definition, a surrogate is not a Key a surrogate is entirely internal to the database, invisible to the user VIN, BankAccountNo, SocialSecurityNo are each data, clearly Keys (unique row identifier from data) completely visible to the user, and relied upon as Keys both inside the database and by the user.
(Since the purpose of any database is to record facts about the real world, well, all the data in the database is, er, um, gee whiz, "external".)
> In this thread, I presented five examples that show fundamental weaknesses of the surrogates.
- Agreed
- Agreed.
- I do not like your proposition or the way you have presented it. I think I understand your intent, so let's continue, rather than be hindered by that.
> > Note that a surrogate key is only in the database, it is not in the real world.
Noted.
> > The above decomposition is very bad. For instance, there is the
question: how will a user find the real world entity that has the
attribute C=c3 and the surrogate key K=k4?
Date, Darwen and some others use (a) examples that are ridiculous, then they (b) propose some absurd nonsense, which can only be entertained if suspension of disbelief (hollywood style) has been achieved (hence the hollywood style presentation in the"scientific" papers and books), then they (c) propose the danger involved is huge, due to some difficulty that the "user" will have when inserting rows into the database, which an user never does (more suspension of disbelief required). All that, each point, is laughable, and dishonest, the whole proposition is laughable, but those who have been programmed for hollywood suck it up. Honest people present examples from the real world (no suspension of disbelief required) that apply to the proposition, thereby giving it a credible foundation, and do not suggest dangers that do not exist in the real world. At best, these disgusting papers are entertainment, but they are marketed as "science'.
I think you are honest, you have put (a) and (b) squarely. But you damage its credibility because your (c) is laughable. Users do not walk up to databases (let alone highly normalised ones, at the cutting edge, which are not common), and make changes to single RM/T rows. No. First, they are isolated from the low level implementation of the database (if anything, they will see Views [2.4] ); second, there will be various constraints in place that prevent incorrect updates; third, whatever they are attempting will be encapsulated in a transactions (that convert the logical business action into a series of single-row updates, all of which together, constitute an Atomic change to the db).
So, no, the user will not be doing any such thing, and as we agree (I think) the surrogate is not visible to them anyway, so they will not be looking for k4 that they cannot see. The user will execute the relevant transaction, that looks up C=c3, and finding that 3 identifiers exist, it will fail. Or else the user will look up c3 on a search window first, find that 3 logical 5NF rows exist for it; choose one via proper Keys (let's assume column A is the Key), then execute the relevant transaction using Key A{value}, which will succeed or fail. Since column A is not given in your example, there is not enough detail to suggest either success or failure.
If you clean up the example, and tighten up the failure, you can use it. But as it is, you damage your paper by using the dishonest method (c).
3.1. I do not accept that "[Codd] was unsuccessful at [decomposition of a relvar into binary relvars] and was not able to show how this is done. " I think it is clear in RM/T, and I do it all the time. There may be marginal cases where the technique does not apply or where further techniques are necessary in order to provide resolution, but that does not subtract from the technique given, and you are not one of those idiots who argue at the margins (straining at the gnat and swallowing the camel).
4. I cannot say one way or the other, since I do not have RM/T paper at hand. But I will say that I would not show a surrogate to the user, or give the user a heading for the surrogate column.
5.
I cannot find the fifth one.
So I agree with your proposition that surrogates are bad.
Note that this proposition appears to contradict your dbdesign10, which is based on surrogates (incorrectly named "Keys").
That is not correct. In the RM, Codd identified the exact issue of surrogates, and of "merging" data, although he used terminology from that age. The former is clearly defined. He stated the latter point thus:
"The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for com- munication of bulk data between systems which use widely different representations of the data."
Of course the "normal form" referred to, is that which he provided in the RM, converting Hierarchical Normal Form into Relational Normal Form, and further, he provided details re the superiority of Relational Keys over surrogates.
And of course, most capable people, even if they had not read Codd, when they attempted to communicate bulk data between systems, without using Proper Keys, found that out. Which has been happening since the 1970s. It is hardly a first time or first time clearly presented.
Nevertheless, I agree with the main point, that surrogates cause chaos. Particularly because with the loss of Relational power [8], and the exponential number of constraints that are required in that absence, and the consequences therefrom. But it is possible for someone who is fixated to using surrogates, *and* who is diligent in their implementation of the morass of constraints, to substantially reduce the chaos. They still have the poor performance.
I dare say, I understand it better than most, and certainly better than authors of books allegedly about the RM.
I dare say, that *if* Anchor Modelling databases do not suffer from the consequences of data and referential integrity loss, then they have a good handle on it as well.
I do not argue the statements above because I have not finished reading your paper; the one thing I disagree with is that there are "important problems in RM". Please identify specific, and start a new thread for that subject.
(OO experts have not produced anything that lasted, they come and go like bright flashes of algae in the ocean, so I suggest you do not worry about them.)
> But more important than these specific examples is the theory behind all of this. One of the important theoretical questions is identification, surrogates are about identification. This is about the identification of abstract objects. Some of my abstract objects have more than one identifier. Identifiers of the complex abstract objects exist only in the databases, similar to surrogates, but unlike surrogates, my identifiers are doing well.
>
> All objects that are stored in the human's memory or in the db's memory are abstract objects.
(Again, the question begs: what is the exact difference between your "identifiers" and surrogates?)
I don't agree. In order to be able to discuss anything reasonably, we need to be using the same terminology, and meaning the same things. The dishonest people who write nonsensical papers and cite each other in mutual masturbation do that (abusing terminology, creating new conflicted application of existing terminology, and thus destroying it), on purpose. That is clearly not what you are doing, but you are using established terms in a way that means something else, which I would ask you to correct: either use the established term strictly in the established sense, or create a new term.
There is nothing abstract stored in the database, about:
a car
the attributes of a single car
a person
the attributes of a single person
at any given time
AFAIC, all data in any database is non-abstract.
Likewise, I think you should tighten up your loose use of these terms:
Key
Identifier (in the RM & IDEF1X it explicitly means Key)
Surrogate (use record or slot identifier, and do not use terms "row" or "key" in that definition)
Following that, statements in your para above are ambiguous and confusing. And I do not wish to dismiss it wholesale.
That is even more confusing than the previous para.
If the attributes are treated as Identifiers, then you solution *IS* the RM/T, unmodified! ... but you say it is not, which begs the question: what is the exact difference ?
If you don't mind, I would like to stay with one set of issues until we close them, rather than starting on another set of issues.
> See also my post in this thread since 28.January, 2013, this is also related to identification.
> > Note that this proposition appears to contradict your dbdesign10, which is based on surrogates (incorrectly named "Keys").
> On the other hand, identifier, which is given in my solution, has many advantages compared to surrogates.
Ok, so surrogates and "identifiers" are clearly different from your perspective. But that difference has not been identified clearly, at least to me. You mix up the terms in your different papers. I need a clear statement that uses terms in their established sense only; provides new terms for that which is not established; and does not mix up the terms.
> However, notice that my identifier also has certain problems with nulls. If I have this key and nulls, then I can solve many of the mentioned problems. The key, which is given in my solution, could find a real object and vice verse. If I have nulls, then I can apply three-valued logic, or I can extract the tuples with nulls and implement some of programming languages, etc. It is not possible apply nulls, surrogates and the three-valued logic, all together.
That makes me baulk. Now you are confusing "identifier" and "key" within two paras. Second, you have real objects (not abstract ones as you suggest elsewhere is the only content of a database). Third, and this is a long point, a Key cannot be Null; so the rest does not apply; even if for some reason the user had some problem in finding the Keys for which the (earlier rendition) attribute C of value c3, we would be relying on Keys, not attributes, for identification of the qualifying rows. In this rendition of the same example, you are using k3, which is a different problem again, nonetheless a problem.
Last, that has the stinking thinking of those who *implement* the "Null problem" in their databases. If they did not blindly implement that, they would not have the "Null problem". I do not think you are blind.
> Today more than 90% of the database has identifiers that are part of my solution; these are industry-standard identifiers. For entities with these keys does not make sense to introduce surrogates.
Yes, if you mean Identifiers in the normal established sense, but since they have not read your paper, you cannot suggest that they are using your solution, or part of it. Further, I argue the 90%, certainly good databases have good identifiers, but it is 5 or 10%. 90% are chock-full of surrogates and the Identifiers are badly handled.
> So, for over 90% of today's databases, it is nonsense to apply surrogates. This number is an astonishing example of the amount of misunderstanding. I am referring to the wide usage of surrogates in scientific papers, which are related to OOA, RM / T and Anchor Modeling.
Yes, I decry the wide use of surrogates in "scientific" papers, especially if they purport to explore aspects of the RM (but not RM/T, of course). OO is a joke. AM: well they can, if they have worked out the constraints diligently.
The number for me is 5 to 10%, the surrogates-users are 90%; they are already badly damaged; used; and abused. Your point is valid.
> On the other hand, the identifiers that are given in my solution do not have to be like industry-standard identifiers. Every company can define its own system of identifiers and identification, which is based on my solution. This db design is a great advantage and a great independence for each company. In this database design, it is essential that these identifiers are placed on the real objects of our business, for example, these identifiers should be in the documentation, receipts, invoices, etc. In this way, the identification is completely under our control. Of course, there are many variations on this solution.
That remains to be seen. The requirement to make an *additional* Identifier (correct use of term here, in your para) visible and maintained, is an onerous one. I think (not sure yet) that your purpose is to support temporal requirements. If that is the case, their are other methods for that. Yours look like it (a) implements surrogates (b) elevates them to Identifiers (c) demands the business takes on the burden. Xor, yours *is* RM/T, with some minor refinements.
> Imagine now a situation that everyone uses some of their surrogate keys. For example, that instead of the ISBN standard for books, every project leader uses his surrogates system. It is obvious that such a solution is impossible in real life. It is also obvious that if we use the ISBN identifier, then we do not need the surrogate key, at all.
I agree with the first part.
I disagree that we do not need surrogates *at all*. As per my first post, there is one condition where we have to use them, and when we do, we have to do so properly; in that situation, it remains a surrogate, it is not elevated to a Key.
> In this thread I pointed to a large group of objects from the business applications that can not be resolved with surrogates. This is the example about an Honda dealer who sells Honda cars, which all have the same attributes. Here we can not use surrogates, because they would show the same entities in a database. Therefore, we must introduce the VIN. And, again, it is obvious that if we use the VIN identifier, then we do not need the surrogate key, at all.
No.
the dealer is crazy if the system does not use VIN as an Identifier, agreed
they may have to use a surrogate *as well*
a surrogate is always an *additional* key and index. It does not *replace* the Key or Identifier. It is not either-one-or-the-other. It displaces the position of the PK to an AK, and takes the position of the PK. That is why we call it a surrogate (it has the same meaning in English; the term would not be correct otherwise). The original PK (VIN) cannot be released. It is now an Alternate Key (unique, Identifier, visible). The surrogate (invisible) is now the PK, and the FK in child tables.
so no, I disagree that the surrogate is not needed *at all*. Sometimes yes; sometimes, no; sometimes essential.
> Now after the above examples, we can set an important issue, it is the following question: is there a good theory of the surrogates. Note that such theory does not exist and this is the main problem with surrogates.
Let me assure that it exists, and I have used it since the 1980's. I do not have it written up as a theory; it is a practice or standard, explicitly identified in the documentation of every database that I wrote since then. I also write it up when I execute assignments, eg. Technical Audit of a database, when they use too many surrogates, or use *all* surrogates. I am quite sure that other capable Relational types have such a theory or practice, that I am not alone.
I am also quite sure that 90% of the database implementers out there have no clue about surrogates: they have neither theory nor practice/standards statement; they use surrogates on every table without appreciating the problems, and without adding the required protection.
> You can find my definition of abstract objects in my paper "Semantic Databases and Semantic Machines" section 1.1 at http://www.dbdesign11.com/
I am still labouring with dbdesign10. But I will get around to it.
That is great, and I would love to see a solution that does that, defined clearly. You do not mention here, but it appears to be a stated goal in dbdesign10, that your solution provides a temporal database.
But there seems to be a contradiction between that concept of "object" (para above) and the concept that "objects" are abstract (elsewhere).
> For example my solution completely solves what Codd unsuccessfully tried to solve with RM / T.
If that is the case, you should write that up clearly, as part of your paper (appendix ?): what RM/T provides; what RM/T does not provide or what it is missing or how it is incomplete: what your solution provides (that "completely solves" it).
To iterate that which I have already detailed, I do not view RM/T as either stand-alone or incomplete. I have been able to use RM/T in the context of RM, for the stated purpose and beyond, therefore I do not find it incomplete, or that it does not solve the problem identified in the synopsis. Since I use Relational Keys in 100% of my databasess, and I am aware of the problems with surrogates, I use them (a) only when I have to, and (b) implement the requirements for data & referential integrity that is lost by their use.
I would like to understand your paper and solution thoroughly, because I have a lot of experience implementing temporal requirements into genuine Relational databases, without any of the insanity proposed by Snotgrass, Wilderstein and Escher. But I have the biggest project of my life coming up, and it is going to be 100% temporal. I am currently polishing up my own temporal documentation and code, that is required (I have an extension to the SQL catalogue). I also have a completely different method that has a theoretical foundation, but I have no substantial experience (real implementations) with it, and it is surprisingly unheard of, not discussed. For this reason, I want to understand your solution and either accept it or reject it, sooner rather than later. Therefore I would ask you to maintain focus; to avoid repetition; and to just answer the points that are not closed (by number or by short quote).
I will erect a few models, which will hopefully be ready after your response to this.
Cheers
Derek
Received on Mon Feb 11 2013 - 08:41:14 CET