Re: Question on Structuring Product Attributes

From: <hugokornelis_at_gmail.com>
Date: Sun, 3 Feb 2013 03:45:43 -0800 (PST)
Message-ID: <32225297-5492-4292-bad6-af1c6e49d4b7_at_googlegroups.com>

Hi Derek,

Thanks for your reply. (And for your email pointing me to it - I stopped doing usenet a long time ago, so I would not have found it otherwise. That's also why I may "look" different: I no longer have a usenet account, and I'm posting for the first time in my life through Google Groups).

Op zondag 3 februari 2013 05:32:54 UTC+1 schreef derek.a..._at_gmail.com het volgende:
> I must have missed it that day, sorry. Your questions are posted in a reply to Kevin, which I did not read too much white noise in other people's posts on this thread). Sorry about the one year lag! I am responding to keep the thread complete.

I did have a reply to Kevin, but my original response in this topic, where I ask you how to implement the CHECK constraints you hinted at in other posts, was a direct response to one of your posts. However, I can understand that you'd miss posts in this long topic, so no problem. I'm glad you posted now!

> The 220% overhead figure is loosely calculated as:
>
> 1. One additional redundant column in each Subtype
>
> 2. One additional redundant index in each Subtype
>
> 3. One additional redundant index in the Supertype
>
>
>
> Of course the base for computing "additional overhead" is the model with the redundant columns removed, which has one index per supertype & subtype. So two extra indices are 200%, and one extra column times two subtypes is 20%
>
>
>
> A tighter calculation would be:
>
> 1. 10% additional overhead per Subtype (since it has only two columns)
>
> 2. 100% additional indexing overhead per Subtype (since it has only one index)
>
> 3. 100% additional indexing overhead (since it has only one index)

Well, that explains why your overhead estimation is so much higher than I could imagine. Your mathematics are off.

Example one: Yesterday, I bought bread, milk, and eggs. Each of them cost me exactly a dollar, so $3 total. Today, the bread is up 20%, to $1.20, and the milk and eggs are up 100%, to $2.00 each. By your logic, my total pricetag would have gone up by 20%+100%+100%=220%, to $9.60. In reality, I now pay $5.20, an increase of "only" 73 1/3%.

Example two: Yesterday, I bought bread for $1.50, milk for $0.80, and eggs for $0.70, so (again) $3 total. Today, the bread is up 20%, to $1.80, and the milk and eggs are up 100%, to $1.60 and $1.40. Now, I now pay $4.80, and the relative increase is now 60%.

Example three: Same as above, but I also bought ham, jelly, and ground beef, for $12.00 total. They have not changed in price. The price increase is still $1.80, but as a percentage, it's now down to 12%.

The actual overhead of the extra columns and indexes depends on a lot of things, like how many other indedxes there are, how many columns, what data types for the extra columns and the other columns, how the RDBMS implements indexes, unique constraints, and foreign key constraints, etc. To get a first approximation by RDBMS, I would implement both versions (with a minimal set of columns and no extra indexes), populate them with a number of rows high enough to minimize the impact of overhead (e.g. 10,000 per subtype), then calculate space used for each scenario. That tells me the overhead as an exact number (divide by 10,000 to get a per-row figure), or the MAXIMUM overhead as a percentage (extra columns and extra indexes make the percentage go down).

(snip)

> Anyway my basic point is, it is silly to implement sub-standard constructs substantial addition overhead for each subtype, and for the supertype, when
>
> (a) the standard that has been with us for 35 years, does so without the additional columns and indices, and
>
> (b) the implementation is pure DRI since 2007 (it was code until then).

The last version of the ANSI standard I have access to is 2003, so I can't comment on point b. re point a: I was under the impression that user-defined functions were not in the ANSI standard, but luckily I verified and turned out to be wrong. They are. So technically your are right - your method that relies on user-defined functions is in the standard. However, the goal of the standard is ease of moving (both code and people) to other platforms. Unfortunately, the major DBMS vendors all have their own legacy syntax for functions. I checked SQL Server, Oracle, DB/2, and MySQL (using syntax documentation I found through Google). The only implementation that is at least close to the standard is DB/2, all others have their own syntax - so until that changes, whether or not the feature is part of the standard is a moot point.

(snipped links)

I checked the links you provided, and despite all I've said above, I do like your approach. In the future, if people ask me how to implement subtypes, I will probably give them the two versions.

Choosing which one to use depends on various things. Your version eliminates the storage overhead, but at the price of a performance overhead. When inserting or updating a single row, that won't be a big problem. But what happens when inserting or updating thousands or even millions of rows in a single statement? This probably depends on the chosen DBMS. I know that SQL Server is unable to inline scalar user-defined functions into an execution plan, so for SQL Server, the only option is to execute a seperate call to the function for each row (or at least for each unique set of input parameters -which in this case would be the same, since the primary key is included in the input parameters-). So insert a thousand rows, and instead of converting the subquery that is hidden in the function to a join, SQL Server will have to perform thousand executions of the subquery. Your performance will, excusez mon French, suck.

Another possible problem you have to be aware of with your solution is that, because the subquery is hidden in the UDF, it is unknown to the DRI mechanisms. Some scenarios will allow violations of the CHECK constraints, becuase the RDBMS does not know it has to re-check. Here's an example:

I've got a new book: INSERT Product (ProductId, ProductType, Name) VALUES (1, 'B', 'Talk That Talk'); INSERT ProductBook (ProductId, NumPages) VALUES (1, 11);
Oh, wait - it's not a book, it's a CD! UPDATE Product SET ProductType = 'C' WHERE ProductId = 1;

This won't trigger an error in SQL Server, because the CHECK constraint is only checked when the specific row in ProductBook is inserted or updated. Now this is an unusual scenario, but there are also lots of subtype situations where instances can migrate to another subtype. E.g. an employee that starts as a contractor but then decides to join the company and moves to the Payroll subtype. Or a supertype Person with subtypes Single, Married, Widowed, Divorced.

Bottom line: I like your solution. But it's not 100% safe, and though you avoid the storage overhead, you pay for this with a performance overhead. I'd recommend people to choose between the two solutions on a case by case basis.

Cheers,
Hugo Received on Sun Feb 03 2013 - 12:45:43 CET

This message: [ Message body ]
Next message: derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"
Previous message: derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"
In reply to derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"
Next in thread: derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"
Reply: derek.asirvadem_at_gmail.com: "Re: Question on Structuring Product Attributes"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message