Re: Question on Structuring Product Attributes

From: <hugokornelis_at_gmail.com>
Date: Tue, 5 Feb 2013 04:59:38 -0800 (PST)
Message-ID: <1d93e1a3-b69b-4523-904b-dc140bbd0b42_at_googlegroups.com>

Hi Derek,

Thanks for your reply. I do need to address a couple of your points.

Op maandag 4 februari 2013 02:01:28 UTC+1 schreef derek.a..._at_gmail.com het volgende:
> Hugo
>
>
>
> On Sunday, 3 February 2013 22:45:43 UTC+11, hugoko..._at_gmail.com wrote:
>
> >
>
> > > A tighter calculation would be:
>
> > > 1. 10% additional overhead per Subtype (since it has only two columns)
>
> > > 2. 100% additional indexing overhead per Subtype (since it has only one index)
>
> > > 3. 100% additional indexing overhead (since it has only one index)
>
> > > [Corrected to]
>
> > > 3. 100% additional indexing overhead for the Supertype (since it has only one index)
>
> >
>
> > Well, that explains why your overhead estimation is so much higher than I could imagine. Your mathematics are off.
>
>
>
> <shopping snipped>
>
>
>
> > By your logic ...
>
>
>
> That is neither logic, nor my logic, nor my mathematics, not mathematics. That is your interpretation, and application to some other scenario. I do not see any point in commenting on it. I will answer on the basis that you are not being contrary, that you actually do not understand, and limit my answer to the statements made.
>
> If you want to understand what I stated, please just read my words (which I am responsible for), do not jump to conclusions (which in any case, you are responsible for).

You are indeed right that I jumped to a conclusion. I originally asked you where you got the 220% overhead figure that you introduced earlier in this discussion:

On Sat, 15 Oct 2011 02:41:24 -0700 (PDT), Derek Asirvadem <derek.a..._at_gmail.com> wrote:

(...)
>Ok, so you are both fools.
>
>It is not common, because 220% overhead and substantial cost, is just
>plain stupid. People are not quite as stupid as you think.

In your reply to my question about that 220%, you mentioned points 2 and 3 as above, plus point 1 but with a 20% estimate in that reply (you later changed that to 10%). You never told me how those 20, 100, and 100 numbers had to be combined to arrive at your claimed 220% overall overhead. Since adding happened to result in just the right number, I assumed that you had done that -which, of course, would violate logic and mathemetical rules. Apologies for this assumption. And, forgive my ignorance, I have to ask now - *how* do we get an overall 220% overhead from 10% on the subtype tables, 100% on the subtype indexes, and 100% on the supertype indexes?

> (In the genuine Relational world, we have implemented this Subtype from the outset. It is identified in the Relational Model, over 40 years ago, and it was fully catered for in the first publication for Standard for modelling Relational databases, which was 35 years ago. The difference is, until 2007 when Functions arrived, the constraints were implemented in the [required anyway] transactions. Since 2007, we can implement Functions, and implement Declarative CHECK Constraints for the Subtypes.)

And even before that, we could implement declarative CHECK constraints using subqueries, in products that implement the feature (SQL:2003, Feature F671 "Subqueries in CHECK constraints"). Or we could even directly declare the subtable hierarchy, again in products that implement the feature (SQL:2003, Feature S081 "Subtables"). These two options are far superior to both the method based on redundant columns as described by Joe in this topic and your function-based method.

> 2. Sure, the syntax is different. So what ? The syntax for every SQL statement is different, particularly in the area of (a) extensions and (b) features that each vendor implemented before the standard syntax was published. So what ?

So people who know that their code might be ported, or who for other reasons prefer to stay as close to the standard as possible, try to use ANSI features as much as possible, and non-standard features as little as possible. The method posted by Joe, that you criticize, uses CHECK and FOREIGN KEY constraints, directly portable across most (if not all) major RDBMS platforms. Your version requires much more work to port. That doesn't make it a bad solution. As I said before - it is a good solution, but it's not the be-all-end-all. There are situations where I might choose your solution over Joe's, but there are also many situations where I'd prefer Joe's.

> What is relevant, is that Functions (tables, datatypes, columns, everything in a database is User Defined, we do not have to keep telling ourselves that fact) are now Standard. And that you can use a function anywhere that you can use a column. So as long as the SQL platform does that (provide the standard requirement), you can implement standard Subtypes *declaratively* on that platform (and non-declaratively since whenever the platform was born). The syntax is irrelevant.

Before we go off on a debate on what is and what isn't declarative, maybe we should agree on the same definition?

The definition I have always heard and used so far is that declarative code only describes the intended effect without detailing the steps to achieve it, and that the alternative is procedural code that describes multiple steps that have to be taken in the specified order. By that definition, stored procedures, triggers, and functions (*) are all procedural, since they can have multiple statements in their body that have to be executed in order. (*) One exception to this is a very specific kind of function that SQL Server supports, "inline table-valued function" - actually a misnomer, for it should have been called a parametrized view.

The reason why I care is that the nature of procedural code makes it much harder for the DBMS to fully understand what the result of the code is. And that, in turn, makes it extremely hard, maybe even impossible, to implement an *efficient* way to guarantee that, no matter what we do, the constraint is never violated. (The unefficient way is trivial - simply check all CHECK constraints for all rows in all tables after every transaction (deferred) or data modifying statement (immediate). This is -if I read it correctly- how SQL:2003 defines the effect of CHECK constraints.)

> If you can't read my code, and make the necessary syntax changes on your platform to implement my code on your platform, then yes, there is a problem, of a different sort. Name your platform and I will write the code for you.
>

No worries, I can read and adapt your code just fine.

> Er, what "performance overhead" ? Please specify exactly, or withdraw the statement.

Executing the function, which involves evaluating a query, for every row inserted in each subtable, and for every row updated in each subtable with a statement that has the possiblity to update the primary key column.

How much that will be, and how much more or less that will be than the performance overhead of Joe's version, depends on lots of factors. Chosen platform (and choices made by the engineering team of that vendor) probably being the biggest.

> > When inserting or updating a single row, that won't be a big problem. But what happens when inserting or updating thousands or even millions of rows in a single statement?
>
>
>
> That is a very stupid thing to do, for any scenario, not only for Subtypes, because it will blow the transaction log. ANSI/IEC/ISO SQL is transactional. So whatever proscriptions apply to such attempts, apply to Subtype insertions as well, no more, no less. Each platform will handle that differently.

You think executing an INSERT ... SELECT is a stupid thing to do? Or an UPDATE ... WHERE?

Those are statements that potentially affect thousands or even millions of rows. I personally prefer to use them over writing code to loop over rowsets one by one - a single statement is less lines of code, easier to understand, and (at least on SQL Server) performs orders of magnitude better - as long as you don't do any stupid things. For SQL Server, using scalar user-defined functions qualifies as a stupid thing.

I don't see the relevance of your remark about log space. If I need to insert millions of rows in a single transaction, I'll need the same log space whether I use a single INSERT ... SELECT statement or a million INSERT ... VALUES statements. (In fact, the former case would probably use LESS log space, depending on how the vendor implements the transaction log).

> > This probably depends on the chosen DBMS. I know that SQL Server is unable to inline scalar user-defined functions into an execution plan, so for SQL Server, the only option is to execute a seperate call to the function for each row (or at least for each unique set of input parameters -which in this case would be the same, since the primary key is included in the input parameters-). So insert a thousand rows, and instead of converting the subquery that is hidden in the function to a join, SQL Server will have to perform thousand executions of the subquery. Your performance will, excusez mon French, suck.
>
>
>
> Excuse me, no, *your* performance will suck.

*Your* as in: the performance of your code on SQL Server.

> My performance (DB2 and Sybase) flies. I already have close to one hundred standard Subtype structures in production. Millions of transactional inserts have been tested. Probably a thousand full table unload/reloads (thousands to millions of rows each) have been tested.

Great! Have you done a performance comparison test of your method versus "the Joe construction"?

> When you get into contemplating that level of "performance" problem, it becomes very platform dependent, and dependent on the quality of code, for both transactional inserts, and streamed inserts. We are better of if we limit this thread to logical issues and measurements, and keep away from physical implementations of those logical constrcuts, which are heavily dependent of platforms, their specific limitations, etc.

The same can be said for storage overhead. Unique constraints don't have to be implemented by adding an index. Foreign keys don't have to be implemented by adding a column (some vendors use pointer chains instead). Etc.

Also, it is quite surprising to see someone who said "you are both fools" to Joe Celko and Roy Hann because they advocate a subtype/supertype solution that, allegedly, has 220% overhead now saying that we should keep away from physical implementations.

If we look at the logical side only, not caring about implementation details such as storage or performance, both solutions are good. They both achieve what they set out to achieve - ensure that there are no integrity violations in a subtype/supertype relationship.

If we look at implementation efffect as well, we should look at all effects. In this case, the effects are on storage and on performance.

> Actually, I have people who use my Subtype structure, and who have implemented it on MS SQL without any problems. So I can say for sure that (a) they did so without jumping through the hoops that you are speculating about, and (b) suffer no performance problems or "performance problems", but I do not know what MS SQL does internally in each version, to agree or refute your speculation about it. Therefore, I will state simply, I seriously doubt that that is the case for MS SQL.

I'll try to find the time to do a simple test. If I find the time, I'll post repro code and test results in a later message.

> > Another possible problem you have to be aware of with your solution is that, because the subquery is hidden in the UDF, it is unknown to the DRI mechanisms. Some scenarios will allow violations of the CHECK constraints, becuase the RDBMS does not know it has to re-check.
>
> Excuse me, but it *IS* a DRI CHECK Constraint. Please read the code again.
>
> "Some scenarios" is not correct, no, scenarios implies the general case, not platform specific. What you are talking about is how a particular platform operates, not scenarios. So the honest statement is, "some platforms will allow violations of the CHECK Constraint".

I don't have to read the code again. I know what it does. Will the sequence of two INSERT statements and one UPDATE statement cause a constraint violation on any of the platforms you use? On SQL Server, it won't. The UPDATE will be accepted, causing the data in my table to violate the subtype/supertype rules. The corresponding sequence of two INSERT statements and one UPDATE statement against Joe's implementation *will* cause an error to be raised. This can be fixed - I would need to add a CHECK constraint on the Type column in the supertype table that checks that the corresponding primary key value does not exist in any of the subtype tables that are mutually exclusive with the one indicated by the type value. for a supertype/subtype relationship with (n) mutually exclusive types, that requires (n-1) subqueries.

(snip)
> Yeah, sure. It appears that you do not understand context.

The context is that:

1. Joe Celko posted a generic framework that can be used to protect integrity in a subtype/supertype relationship;
2. You called him and Roy Hann "fools" for proposing and endorsing that framework;
3. You then posted your, supposedly better, alternative; and
4. I assumed (yup, did it again!) that this alternative was designed to meet the same goals as the original: being a generic framework to protect integrity in a subtype/supertype relationship.

If the code you posted is *not* intended to be a generic method for subtype/supertype relationships, or if protecting integrity is *not* what the code sets out to do, please say so now - that would save both you and me a lot of time and effort!

> 1. The example code provides the implementation details for standard Subtypes. There is no suggestion that it does anything more than that. It is not a full blown application.

Ah, there's part of the answer already.

> 2. The Supertype-Subtype structure is supposed to be considered as a single unit, not as a bunch of tables with disparate and singular relationships between them. You should not be updating anything in the cluster without considering the whole cluster, transactionally.

Agreed that I shouldn't. Just as I should not delete a "parent" (hate those terms!) that still has a "child" in a foreign key relationship. But if I do so anyway, I trust the FOREIGN KEY constraint to slap me and prevent the change. Similarly, I expect whatever construct I use to guard the supertype/subtype relationship to slap me and prevent the update if I do something that would break integrity. The code I posted proves that there is at least one case on at least one platform where it doesn't.

> 3. When writing an application, you are best advised to implement all inserts/updates/deletes as transactions; you should never write singular inserts/updates/deletes.

When writing singular inserts/updates/deletes, a transaction is implicitly started. SQL Server (and probably other vendors too) support two options on when these are committed - implicitly at the end of the statement (if no constraint was violated), or only after an explicit COMMIT statement.

But I fail to see the point. Constraints (and I consider both your and Joe's code as a way to work around a limitation of a constraint, so I expect the same behaviour) should always work. Even when application code does not follow best practices. (In fact, in that case constraints are even MORE important!)

(snip)
> 5. The example code, being an example of Subtype implementation, uses singular inserts/updates/deletes, rather than transactions, as is commonly accepted in posted examples.

So why slap _me_ for using singular inserts/updates/deletes in my posted examples?

(snip)
> 11. If you have trouble writing the required Declarative Constraints and transaction code (no offence, but this structure is new to you, and you do not understand it yet) that is required for ordinary Data and Referential integrity, that you are implying is part of the Subtype integrity, at your request, I would be happy to supply it. Using Standard SQL, and you will have to make the minor syntactical changes required for your platform. i won't be posting transaction basics or tutorials, I will assume the reader knows that, I will post just the code.

This is now the fourth time in a single post that you hint at my incompetence. You are free to believe anything you want of me, but saying this out loud and then repeating it multiple times does not really add anything to the credibility of your arguments.

> > Bottom line: I like your solution. But it's not 100% safe, and though you avoid the storage overhead, you pay for this with a performance overhead. I'd recommend people to choose between the two solutions on a case by case basis.
>
>
>
> I refute the "not 100% safe" on the basis described above: implement whatever safety you need, that is not shown in , and cannot reasonably be expected to be shown in, the example of Subtypes.

Your code as posted does not provide 100% safety against integrity problems in the supertype/subtype relationship on at least one platform. The code I provided will result in a supertype indicating a CD, with a row in the table for the Book subtype.

Before starting this reply, I spent a lot of time trying to decipher the relevant sections of the SQL:2003 standard (not the latest, I know, but I have to make do with what I have). If I didn't misunderstand anything, your code (or at least the ANSI-standard equivalent code) will provide 100% safety in a fully compliant RDBMS, because (again, if I understand completely) the standard says that the database should re-check all CHECK constraints for all rows in all tables after any modification - or at least behave as if it does.

I have no experience with other mainstream RDBMS products, but I would be very surprised if it turns out that all other vendors do manage to pull that off with CHECK constraints that use reference other tables through a user-defined function. (But I am not excluding the possibility - I have been surprised before). For those vendors that, like SQL Server, only check a CHECK constraint for inserted and modified rows, an extra CHECK constraint and an extra function would be required. I sketched a rough outline of this extra function above, but I'll be happy to provide code if you want me to.

> I refute the "performance overhead", as detailed above, but I cannot be specific because you have not identified specifically what the alleged "performance overhead" is.

I replied to this above.

> I strongly recommend that you do not recommend anything at all about this subject until:
>
>
>
> a. you understand it, which clearly, has not been achieved yet

And there's number five. :)

> b. you have actually implemented it and tested it (both transactional inserts/updates/deletes, and singular streams inserts, etc), and you have gained actual experience, and some basis for confidence

As I said above - if I find the time, I'll run some tests on SQL Server 2012 and post back with the results when finished.

> Until then, your recommendations are based on contemplation, not facts, not experience.

And until you run an actual side-by-side comparison test of your version against Joe's version, the exact same can be said about your recommendations.

(BTW, the only recommendation I actually made so far is to "choose between the two solutions on a case by case basis")

Cheers,
Hugo Received on Tue Feb 05 2013 - 13:59:38 CET

This message: [ Message body ]
Next message: hugokornelis_at_gmail.com: "Re: Question on Structuring Product Attributes"
Previous message: derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"
In reply to: derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"
In reply to derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"
Next in thread: derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"
Reply: derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message