Re: Question on Structuring Product Attributes

From: <hugokornelis_at_gmail.com>
Date: Fri, 8 Feb 2013 16:42:27 -0800 (PST)
Message-ID: <8907bf41-e4a3-47f4-b249-90a0d005eb01_at_googlegroups.com>


Hi Derek,

Wow, you sure put in a lot of effort. Thanks for your long reply! I'll try to keep my reply from being overly long by limiting the quotebacks, not responding to any ad-hom (that saves a LOT of time!), and only responding to the main issues.

> 100% plus 100% plus 10 equals 210%

That's not how percentages work. I tried to explain this in a previous post using a simple shopping analogy, but you still fail to see the fallacy in your computation. I can't explain it any better than that. You may want to pick up a high-school maths book and read up on the basics. Or you can continue to yell insults my way from your high horse without really listening to what I actually say.

> In the meantime, in this thread, we have two submissions are being compared. Probably not a good idea to keep introducing obtuse issues. Better to stick to the thread.

I thought you were interested in implementations that are possible in a full ANSI implementation. Apparently I misunderstood. I'll stick to just the two options under consideration in this thread.

> • Declarative means declarative
>
> ••• as identified by E F Codd in the Relational Model
>
> ••• specified in SQL, as DDL
>
> ••• not procedural

Thanks for the definition. SQL Server only has a procedural implementation for functions that return a scalar value. I guess that makes your suggested method unusable on SQL Server. I'm glad I haven't done any performance comparisons yet, for they would not have been fair.

> My question was intended to remain in the logical performance area (you know, count the indices, think about what overhead the constraint and the function would have).

"Count the indexes"? I see. That explains our misunderstandings. You seem to think the number of indexes is a relevant performance metric. I don't. I measure response time, because I care about my end users, who want the next screen to load as fast as possible. I measure disk space, because I know that the company has no unlimited budget for enterprise storage. And I measure all kinds of metrics that are early warning signs for response time and disk space.

But we're in c.d.t. here. In theory, one could measure performance by counting indexes and only thinking about their impact without ever actually measuring them. By that theory, your solution is great.

> (no idea what a "rowset" is.)

You know "row". You know "set". Now combine the two.

> The Auditor will make a declaration such as "maximum transaction size is 1000" or "maximum locks per connection is 100" or both.

You must have inherited some very ineptly designed databases, or be working with very inferior DBMS products and/or hardware. In a properly tuned database on modern hardware, affecting 10,000 or 100,000 rows in a single statement should not be any problem at all. Millions is indeed over the edge during business hours, but perfectly acceptable as part of a batch job during off hours.

> It is up to people like me to write the standards. I did that 20 years ago, and they have not changed,

Ah, that explains why you still work with 1000 as the cutoff number.

> And, yes, that (not knowing your platform specific limitations) would be a very stupid thing to do. You know, come in on a weekend to perform db maintenance; code and test the changes; set 32 parallel threads up; start them off; go for a long boozy lunch; come back in three hours; only to find that 29 threads crashed after 10 mins, three threads are still rolling back. MS type think that that is "normal". We call it subnormal.

Your last experience with the SQL Server platform must be well before my first. Which, for the record, is about twenty years ago now.

> > For SQL Server, using scalar user-defined functions qualifies as a stupid thing.
>
> So don't use them!

I don't use them, you do. In a solution you present to the world as being "generic" and superior.

> No, *your* as in, _your_ fixation about low-level performance of _your_ code on the _only_ server you know. Qualified professionals have no such problems. Recall the context. You made a declaration about code performance, and you provided excruciating detail about the performance of one plaform.

If you claim something to be better and generic, don't be surprised if someone challenges you on a specific implementation. If it sucks on one platform, it is not generic.

> Second it is relevant to only to one platform. (FYI, although not relevant to the thread, the two that I know well have no such problems, my code does not suck, it flies.)

So it is fast on two platforms. Maybe even faster than the Celko version. (You claim it is, based on your understanding of those platforms, but not based on actual performance comparisons.) And it's probably slower than the Celko version on one other platform (based on my understanding of that platform, since I haven't yet done my performance comparisons either). Neither of us has sufficient knowledge of other platforms to make claims about them.

Based on the above, I'd call both methods valuable tools that have to be assessed on a case by case basis. I would not bash either method the way you bash Celko's method. Maybe your next job will take you to a platform that performs better with Celko's method. Always keep an open eye, never discard theories, however unlikely they seem, without proper evidence.

> Third, if God ever cursed me, the way so many others are cursed by HIM, and I had to implement Subtypes on MS SQL, I would spend some time and get to know what works well, and what does not, then I would implement Subtypes, using the best method, based on knowledge and actual testing. I would not be writing reams on c.d.t or SQL Central; I would not have to have long discussions; I would not be speculating and contemplating; I would just do it.

So that would probably be the extra indexes and extra columns.

> And whatever I do, it would be standard SQL (not pedantic or obsessed about syntax).
>
> And whatever I do, it would *NOT* have one redundant column plus one additional index per subtype, plus one additional index per supertype. No, I couldn't justify that to anyone, unless they came from an asylum.

So that would probably not be "the best method, based on knowledge and actual testing".

Hmmmm.

> > Great! Have you done a performance comparison test of your method versus "the Joe construction"?
>
> (snip irrelevant rant)
>
> 2. Test
>
> No, I have not done a test on Joe's cancerous method on my platform. That would be very stupid. Given all the benchmarks that I have done, that include Functions, and indices, and duplicate/redundant indices, I already have a good idea of the logical cost. My memory has not failed. I do not take drugs. I do not need to test the same thing, from scratch, every time, to find out, gee whiz, it produces the same results.

The scientific method has five steps: phrase a question, formulate a hypothesis based on prior proven knowledge, make a prediction based on the hypothesis, test the prediction, and analyse whether the test results support or disprove the hypothesis. You did the first three, and now you claim that the reasoning behind your hypothesis is so solid and logical that testing is not required.

There is a reason scientists work the way they do.

> > If we look at the logical side only, not caring about implementation details such as storage or performance, both solutions are good. They both achieve what they set out to achieve - ensure that there are no integrity violations in a subtype/supertype relationship.
>
> That's all ?
>
> You call that "logic" ?
>
> Must be a private definition.
>
> See my response [2] re Test.

Indexes are an implementation detail. Disk space used is an implementation detail. Response time is an implementation detail. You focus only on the one implementation detail that supports your position, and take a blind eye to the others. That definitely doesn't meet my definition of logic. I'll let others decide on theirs.

> Well, you could look at storage and performance logically, without getting into platform-specifics. But that does not appear to be in your private definition of "logical".

That would mostly appear in my private definition of "utter waste of time". Since the actual storage and performance impact of extra columns and extra constraints varies vastly by vendor, looking at those aspects without getting into platform-specifics would be moot.

(Note that I say "extra constraints" above, not "extra indexes". The code posted by Joe does not call for extra indexes. You suffer the same problem I suffered in some of my previous posts: looking at how *your* two platforms implement a feature and assuming that all others do the same. The extra uniqueness constraints Joe adds are redundant, needed only to satisfy a requirement of the SQL syntax. Any vendor that wishes to do so could add code to detect redundant uniqueness constraints and avoid the unnecessary extra indexes.)
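For readers who haven't seen the pattern, here is a minimal sketch of it. I use SQLite (via Python's sqlite3) as a neutral stand-in engine, and hypothetical table names (Product, ProductBook); the point is that the composite FOREIGN KEY in the subtype table must reference a declared unique key, which is the only reason the logically redundant UNIQUE (ProductID, ProductType) constraint exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite has FK checks off by default
conn.execute("""
    CREATE TABLE Product (
        ProductID   INTEGER PRIMARY KEY,
        ProductType CHAR(1) NOT NULL CHECK (ProductType IN ('B', 'C')),
        Title       VARCHAR(40) NOT NULL,
        UNIQUE (ProductID, ProductType)  -- redundant; required only by FK syntax
    )""")
conn.execute("""
    CREATE TABLE ProductBook (
        ProductID   INTEGER PRIMARY KEY,
        ProductType CHAR(1) NOT NULL CHECK (ProductType = 'B'),
        PageCount   INTEGER NOT NULL,
        FOREIGN KEY (ProductID, ProductType)
            REFERENCES Product (ProductID, ProductType)
    )""")

conn.execute("INSERT INTO Product VALUES (1, 'B', 'Gone with the Wind')")
conn.execute("INSERT INTO ProductBook VALUES (1, 'B', 1037)")  # accepted

# A ProductBook row whose parent is not of type 'B' violates the composite FK.
conn.execute("INSERT INTO Product VALUES (2, 'C', 'Some CD')")
try:
    conn.execute("INSERT INTO ProductBook VALUES (2, 'B', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Whether a given engine backs that UNIQUE constraint with a physical extra index is exactly the vendor-specific question under discussion.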

> Last time I ran a very similar benchmark, for an Oracle guru, 2 x 16 million rows inserted took 13.2 secs, and that was on a tiny demo box (laptop), I did not have access to my dev box at the time. Because Oracle carks it with subqueries, the benchmark on that side never finished; the modified version that used derived tables took 30 something mins.

Thirty *MINUTES* ?????
Sheesh!
I guess that makes Oracle a second platform that doesn't like your solution.

> > 3. You then posted your, supposedly better, alternative; and
>
> No, no, no ! I didn't !
>
> It is only better, or less overhead, or standard, only to logical IT qualified people.
>
> Not for the rest. Your statement implies better overall, for everyone, and that is simply not true. I have evidence on c.d.t, of a couple of people who are absent logic, who have private definitions of "logic", that prove it it not "better" for everyone.

Thanks for overloading my sarcasm detector. Surely, any argument that is so dripping with sarcasm has to be true in spite of all other evidence.

Your version is "better, or less overhead" for everyone who considers counts of columns and indexes to be the most important metrics. For the (apparently ill-informed) people like me, who think response times and disk space are more important, your solution might or might not be better on your two platforms (I accept your claims that they perform well, but since you never tested the alternative there's no way to know if they perform better), on Oracle (though the 30 minute figure you quote suggests that it sucks), or on other platforms.

Comparisons on SQL Server would be unfair. SQL Server allows neither declaratively defined UDFs nor subqueries in a check constraint, so your solution cannot be properly implemented there at all. The closest one could get is a procedural UDF, with the issues I pointed out before.

> > The code I posted proves that there is at least one case on at least one platform where it doesn't.
>
> No, it doesn't. Read [1] and [2] again, slowly.

If by "It is not a full blown application", you meant "depending on your platform, you may have to add a check constraint on the supertype table and a function that references all the subtype tables to be used in that check constraint" - then yes, you are right.

If you meant anything else, you are wrong. I can read [1] and [2] a million times, but if I then hit the Execute button on my SQL script again, I still get the same results. In at least one implementation (SQL Server), check constraints that use a UDF are only checked when rows are updated or inserted, and only for those rows. That means that in at least one implementation (SQL Server), you *need* that extra check constraint and extra function.
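The behaviour is easy to reproduce in miniature. The sketch below uses SQLite (via Python's sqlite3) rather than SQL Server, and a hypothetical function name, but it demonstrates the same principle: a CHECK constraint that calls a user-defined function is evaluated only for rows being inserted or updated in that table, so changes to the data the function peeks at leave existing rows unchecked.

```python
import sqlite3

book_rows = {10}  # simulated contents of a ProductBook subtype table

def subtype_row_exists(product_id):
    # Stand-in for a UDF that would query the subtype tables.
    return 1 if product_id in book_rows else 0

conn = sqlite3.connect(":memory:")
# SQLite insists the function be declared deterministic before allowing it
# in a CHECK constraint -- precisely the assumption that breaks when the
# data it looks at changes behind the engine's back.
conn.create_function("subtype_row_exists", 1, subtype_row_exists,
                     deterministic=True)
conn.execute("""
    CREATE TABLE Product (
        ProductID INTEGER PRIMARY KEY
            CHECK (subtype_row_exists(ProductID) = 1)
    )""")

conn.execute("INSERT INTO Product VALUES (10)")  # checked now; passes

book_rows.clear()  # the "subtype row" disappears...
# ...but the existing Product row is never re-checked: no error is raised,
# and the now-inconsistent row simply stays put.

# Only a new modification of the Product table triggers the constraint.
try:
    conn.execute("INSERT INTO Product VALUES (11)")
except sqlite3.IntegrityError:
    print("new row rejected; old inconsistent row untouched")
```

The same mechanics apply to a UDF-based check constraint on a subtype table: delete or change the row it inspects in another table, and nothing fires.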

I may not be as smart as you (hey, you say that so often that it simply has to be true, right?), but I was at least smart enough to see this potential problem. Other people might take your words at face value, copy your code, make the minor modifications to make it run on their platform, and implement it. Then, depending on their test matrix, they have either wasted some development time, or they may have introduced inconsistent data into their database.

If you claim your code to be generic, then at least add a disclaimer that some platforms require an additional check constraint on the supertype table.

> I didn't slap you, I informed you that one had a transactional context (there's that word you do not understand again) and the other did not. And it is for you (assuming you can write the required code), to apply the transactional context when required, especially after that it pointed out to you.

With a transactional context, there can still be bugs in the code. If that were not the case, we would not need to define any constraints in the database at all - but we do, and for good reasons.

If we define a constraint and then still can execute code that introduces a constraint violation without raising an error, it's a bug. On at least one platform, your code has a bug, and fixing that bug requires the addition of an extra check constraint and an extra function (that includes as many subqueries as there are subtypes).

> > > We can implement Declarative Constraints to ensure that a child cannot exist without the appropriate parent (CONSTRAINT Child.FK REFERENCES Parent.PK), but we cannot do so to ensure that at least one child exists for each parent. How do you implement that ? In Standard SQL, please, not worrying about syntactical differences.
> >
> > In standard SQL, this *is* in fact possible with declarative constraints. On the OrderItems table, add a constraint CHECK (EXISTS (SELECT * FROM Orders WHERE Orders.OrderID = OrderItems.OrderID)) [actual syntax may be slightly different; I never worked on a product that supports subqueries in a CHECK constraint so I don't know the exact syntax for implementing the correlation]. Then make this constraint deferred (unless you want a very complicated way of making the two tables read-only).
> >
> > In a real database, this only works if both subqueries in CHECK constraints and deferrable constraints (feature F271 in SQL:2003) are supported.
>
> That is a contradiction

Prove. Or retract.

But you might want to reread first. To help you understand, here is a short overview of my logic:

[1] You can ensure that a parent has at least one child in ANSI standard SQL, using (list of specific features).
[2] In an actual implementation, you can only use this method if that implementation supports (list of optional ANSI features used in [1]).
[3] Hence, an implementation that does not support those features cannot use the method suggested in [1].


> Real databases do not support deferred constraint checking.

By what definition of "real databases"?
Are you even aware that deferred constraint checking is ANSI standard?

> Real databases support standard Subtypes, with full Data & Referential Integrity (both directions, yes!) without deferred constraint checking.

Doesn't that make your whole function-based idea redundant?

> > > It is worth noting that the seriously deranged will implement an FK in the parent, that references the child, which results in a circular reference, and then insist that the platform must support "deferred constraint checking". These people have no sense of Rules, or Order of Rules. But those of us who don't have that form of insanity in the first place, don't have the demands in the second place. DB2 and Sybase do not have "deferred constraint checking"; we have enjoyed secured databases with full data and referential integrity for 35 years without it. Typically the platforms that provided "deferred constraint checking" did not have transactions, and were not SQL anyway. Some of them have implemented a weird form of non-ACID non-SQL "transactions" since then. Therefore as stated, SQL only, and no weirdness with non-SQLs that provide asylum facilities.
> >
> > A foreign key instead of a check constraint would even be better, indeed. So, does this make me "fully deranged"? And if so - why?
>
> Er, the foreign key that you suggest *IS* the circular reference.

Yes. And any other implementation of the same business rule would be the same circular reference, just less visible.

You dismiss deferred constraint checking without any argument (other than "real" databases don't do it), then say that the very problem they are designed to solve is unsolvable.
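For the record, deferred constraint checking does solve the circular case. Here is a sketch in SQLite (one of the engines that implements ANSI deferrable constraints), with hypothetical table names: the order row points at one of its own items, the item points back at its order, and both foreign keys are checked only at COMMIT, so the pair can be inserted in either sequence inside one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE Orders (
        OrderID     INTEGER PRIMARY KEY,
        FirstItemID INTEGER NOT NULL
            REFERENCES OrderItems (ItemID)
            DEFERRABLE INITIALLY DEFERRED
    )""")
conn.execute("""
    CREATE TABLE OrderItems (
        ItemID  INTEGER PRIMARY KEY,
        OrderID INTEGER NOT NULL
            REFERENCES Orders (OrderID)
            DEFERRABLE INITIALLY DEFERRED
    )""")

# The circular pair is inserted within one transaction; both deferred
# FKs are verified only when the transaction commits.
with conn:
    conn.execute("INSERT INTO Orders VALUES (1, 10)")
    conn.execute("INSERT INTO OrderItems VALUES (10, 1)")

# An order committed without its item is rejected at COMMIT time.
try:
    with conn:
        conn.execute("INSERT INTO Orders VALUES (2, 20)")
except sqlite3.IntegrityError:
    conn.rollback()  # a failed COMMIT leaves the transaction open in SQLite
    print("commit rejected: order 2 has no item")
```

This enforces "every parent has at least one child" declaratively, without triggers or UDFs, on any engine that supports deferrable constraints.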

But this is getting off-topic, so I'll stop here and return to the main issue.

> > I am fully aware that your function-based framework does not force people to insert a row in the ProductBook table for a product of type 'B', or to insert a row in the ProductCD table for a product of type 'C'.
>
> Er, it does force exactly that !
>
> You say you can read and understand code, but the evidence is, you cannot! Maybe you can read and understand isolated code snippets. But you definitely cannot understand the effect, the overall total logic, of all the posted code snippets working together in an integrated "system".

My ability to read code is obviously way better than your ability to read English. Please look at what I actually wrote, not at what you want to believe I wrote.

"your function-based framework does not force people to insert a row in the ProductBook table for a product of type 'B'" - in case you still don't understand, let me illustrate this using an example:

INSERT Product VALUES (99, 'B', 'Gone with the Wind')

This statement inserts a product of type 'B'. It does not raise an error. In other words, it does not force me to insert a row in the ProductBook table. So again, I am fully aware that your code allows this. And that Joe's code allows it too. No difference between the two methods here.
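To make the point mechanically checkable: with only the usual child-to-parent foreign key in place (a minimal hypothetical schema, sketched here in SQLite rather than either of our platforms), the insert above succeeds and nothing forces a matching ProductBook row to exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE Product (
        ProductID   INTEGER PRIMARY KEY,
        ProductType CHAR(1) NOT NULL,
        Title       VARCHAR(40) NOT NULL
    )""")
conn.execute("""
    CREATE TABLE ProductBook (
        ProductID INTEGER PRIMARY KEY REFERENCES Product (ProductID),
        PageCount INTEGER NOT NULL
    )""")

# The insert from the post: no error, even though no ProductBook row exists.
conn.execute("INSERT INTO Product VALUES (99, 'B', 'Gone with the Wind')")
print(conn.execute(
    "SELECT COUNT(*) FROM ProductBook WHERE ProductID = 99").fetchone()[0])
```

The FK only prevents orphaned ProductBook rows; it cannot demand that one exists, which is why this direction of the integrity rule needs extra machinery on every platform.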

> I agree it is not allowed. But the simple evidenced fact is, you have allowed it. Further proof of my point immediately above.

I assume that by "you have allowed it" you mean "I [Derek] posted code and claimed it to be generic, you [Hugo] saw that it has an issue on at least one platform, you [Hugo] then posted a simple script to prove the issue, and then you [Hugo] outlined how to fix this issue and warned about the performance implications".

Also note that the code to fix this is far longer than the code you posted. It's not a simple, small change. If it were, I wouldn't be making such a fuss about it. This is too big to be dismissed as minor platform-dependent changes.

> Now please, we do not need to hear about what one idiotic, broken platform does or does not do with regard to performance.

You claim that your solution is generic and better. I disprove that statement by proving it is not better in at least one case. And now you try to refute that proof by calling that case "idiotic" and "broken". That is logic I cannot refute. I bow to the sheer logic of these arguments.

P.S.: Don't forget to exclude Oracle as well, given the 30 minute duration you measured on your test. It only has the largest market share overall, and MS SQL Server has the largest market share on the Windows market, so who cares about those niche products, eh? It works on the two platforms you happen to have access to, so who cares about the rest. And you can still call it generic because it works on more than one platform.

> > 1. < portions irrelevant to a logical evaluation of the subject <snipped> >
> >
> > 2. < portions irrelevant to a logical evaluation of the subject <snipped> >
> >
> > Add 1 and 2 together, and ... < portions irrelevant to a logical evaluation of the subject <snipped> >

Earlier, you wrote that tests are not necessary because you can predict performance based on logical understanding of how constructs are implemented in a specific database. Now you label my explanation of how constructs are implemented in a specific database as irrelevant.

Please, make up your mind. Don't switch between positions all the time.

> > And the number of statements in the function is not related. See for example this blog post, where I demonstrate the performance difference with a UDF to triple the input value: http://sqlblog.com/blogs/hugo_kornelis/archive/2012/05/20/t-sql-user-defined-functions-the-good-the-bad-and-the-ugly-part-1.aspx
>
> Nah, it is not for me. The article takes far too long to explain simple things that logical people can understand in one sentence.

In my blog posts, I try to ensure that everyone can understand me. Smart people can skim them, zoom in on the relevant bits and skip the rest. Normal people can read the rest as well. I write like that because I try to get my message across to as many people as possible.

If I had a deep desire to come across as way superior to everyone else, I would probably write differently.

Cheers,
Hugo

Received on Sat Feb 09 2013 - 01:42:27 CET