Re: Question on Structuring Product Attributes

From: <derek.asirvadem_at_gmail.com>
Date: Sun, 3 Feb 2013 17:01:28 -0800 (PST)
Message-ID: <341d8a98-9245-4b46-b233-5066d8fa371e_at_googlegroups.com>


Hugo

On Sunday, 3 February 2013 22:45:43 UTC+11, hugoko..._at_gmail.com wrote:
>
> > A tighter calculation would be:
> > 1. 10% additional overhead per Subtype (since it has only two columns)
> > 2. 100% additional indexing overhead per Subtype (since it has only one index)
> > 3. 100% additional indexing overhead (since it has only one index)
> > [Corrected to]
> > 3. 100% additional indexing overhead for the Supertype (since it has only one index)
>
> Well, that explains why your overhead estimation is so much higher than I could imagine. Your mathematics are off.

<shopping snipped>

> By your logic ...

That is neither logic, nor my logic, nor my mathematics, nor mathematics. That is your interpretation, and its application to some other scenario. I do not see any point in commenting on it. I will answer on the basis that you are not being contrary, that you actually do not understand, and limit my answer to the statements made.

If you want to understand what I stated, please just read my words (which I am responsible for), do not jump to conclusions (which in any case, you are responsible for).

2. If my Subtype table has one index, and Joe's Subtype table (to perform the same function, provide the same integrity, etc) has two indices, then it is a simple fact that Joe's table has "100% additional indexing overhead per Subtype"

3. If my Supertype table has one index, and Joe's Supertype table (to perform the same function, provide the same integrity, etc) has two indices, then it is a simple fact that Joe's table has "100% additional indexing overhead per Supertype"

I did not state "100% additional overall overhead"; I limited my calculations to the subject matter, the implementation of Subtypes. I did not imply anything about the rest of the content of the tables.

Now if you wish to apply my statements to the price of sausage, that's your business, but I can't join you there.

A more reasonable example, if you are trying to apply it to everyday life, would be:
  • I [Supertype] move around the world, and go about my consulting business, with one heavy briefcase; Joe [Supertype] moves around carrying two heavy briefcases wherever he goes.
  • Additionally, my whole development team [Subtypes] moves around the country carrying one heavy briefcase each; Joe's subordinates [Subtypes] each carry two heavy briefcases.
  • Joe's team performs the same tasks as mine, using 100% additional *briefcase* overhead. No implications are made re the weight of each person, or their speed, or their eyesight, or the price of fish.

> The actual overhead of the extra columns and indexes depends on a lot of things,

Yes.

> like how many other indexes there are, how many columns, what data types for the extra columns and the other columns, how the RDBMS implements indexes, unique constraints, and foreign key constraints, etc.

That is not so relevant. All that exercise would produce is a difference between 100% and 98% or between 100% and 102%.

In the example we have one index, and Joe has two, so unless you are moving away from the example, the overhead of the second *additional, redundant* index is 100% ((two minus one) divided by one).

The number of indices that the table may have for other reasons is irrelevant to the problem at hand; we are discussing Subtype implementation. Regardless of the number of indices on the table, if you implement the standard Subtype structure, you need one more index; if you implement Joe's structure, you need two more indices, which is "100% additional indexing overhead per Subtype, and 100% additional indexing overhead per Supertype" compared to the standard.
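For readers following the counting, here is a minimal sketch of the two structures. The Product/ProductBook names are taken from the example later in this thread; the exact column lists, and Joe's structure as I describe it in comments, are illustrative reconstructions, not anyone's posted code.

```sql
-- Standard Subtype structure: one index per table (the primary key).
CREATE TABLE Product (                    -- Supertype
    ProductId   INT         NOT NULL,
    ProductType CHAR(1)     NOT NULL,     -- discriminator
    Name        VARCHAR(60) NOT NULL,
    CONSTRAINT PK_Product PRIMARY KEY (ProductId)
    );

CREATE TABLE ProductBook (                -- Subtype
    ProductId INT NOT NULL,
    NumPages  INT NOT NULL,
    CONSTRAINT PK_ProductBook PRIMARY KEY (ProductId),
    CONSTRAINT FK_ProductBook_Product
        FOREIGN KEY (ProductId) REFERENCES Product (ProductId)
    );

-- Joe's structure repeats the discriminator in every Subtype, so that a
-- composite foreign key can enforce the type. That demands a second,
-- redundant unique index on the Supertype for the FK to reference, and an
-- index per Subtype to support the composite key:
--
--   Product:     PRIMARY KEY (ProductId)
--                + UNIQUE (ProductId, ProductType)      -- second index: (2-1)/1 = 100%
--   ProductBook: PRIMARY KEY (ProductId)
--                + ProductType column                   -- extra column per Subtype
--                + index on (ProductId, ProductType)    -- second index per Subtype
--                + FOREIGN KEY (ProductId, ProductType)
--                      REFERENCES Product (ProductId, ProductType)
```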

> To get a first approximation by RDBMS, I would implement both versions (with a minimal set of columns and no extra indexes), populate them with a number of rows high enough to minimize the impact of overhead (e.g. 10,000 per subtype), then calculate space used for each scenario. That tells me the overhead as an exact number (divide by 10,000 to get a per-row figure), or the MAXIMUM overhead as a percentage (extra columns and extra indexes make the percentage go down).

Yes, that might be a good approach to benchmark the *overall* overhead, a new subject, not relevant to the statements that I made.

The fact remains that (two minus one) divided by one = 100% additional. If there are 10 Subtypes, Joe has 11 additional redundant indices (one per Subtype, plus one on the Supertype), and 10 additional redundant columns.

> > Anyway my basic point is, it is silly to implement sub-standard constructs with substantial additional overhead for each subtype, and for the supertype, when
> > (a) the standard that has been with us for 35 years, does so without the additional columns and indices, and
> > (b) the implementation is pure DRI since 2007 (it was code until then).
>
> The last version of the ANSI standard I have access to is 2003, so I can't comment on point b.
>
> re point a: I was under the impression that user-defined functions were not in the ANSI standard, but luckily I verified and it turned out I was wrong. They are. So technically you are right - your method that relies on user-defined functions is in the standard.

Yes, I know.

(In the genuine Relational world, we have implemented this Subtype structure from the outset. It is identified in the Relational Model, over 40 years ago, and it was fully catered for in the first published Standard for modelling Relational databases, 35 years ago. The difference is, until 2007 when Functions arrived, the constraints were implemented in the [required anyway] transactions. Since 2007, we can implement Functions, and implement Declarative CHECK Constraints for the Subtypes.)

> However, the goal of the standard is ease of moving (both code and people) to other platforms. Unfortunately, the major DBMS vendors all have their own legacy syntax for functions. I checked SQL Server, Oracle, DB/2, and MySQL (using syntax documentation I found through Google). The only implementation that is at least close to the standard is DB/2, all others have their own syntax - so until that changes, whether or not the feature is part of the standard is a moot point.

  1. Your statement re the goal is too simplistic, but let's not get distracted.
  2. Sure, the syntax is different. So what? The syntax for every SQL statement is different, particularly in the area of (a) extensions and (b) features that each vendor implemented before the standard syntax was published. So what?

What is relevant is that Functions (tables, datatypes, columns, everything in a database is User Defined; we do not have to keep telling ourselves that fact) are now Standard, and that you can use a function anywhere that you can use a column. So as long as the SQL platform does that (provides the standard requirement), you can implement standard Subtypes *declaratively* on that platform (and non-declaratively since whenever the platform was born). The syntax is irrelevant.
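For readers who have not followed the links, the shape of the construct is as follows. This is a sketch only, in approximately Sybase/SQL Server syntax; the function and constraint names here are indicative, not the full posted code.

```sql
-- Sketch: a scalar Function that verifies the Supertype row carries
-- the expected discriminator value for this Subtype ...
CREATE FUNCTION ValidateExclusive_Product (
    @ProductId   INT,
    @ProductType CHAR(1)
    )
RETURNS INT
AS
BEGIN
    DECLARE @Exists INT
    SELECT @Exists = COUNT(*)
        FROM Product
        WHERE ProductId   = @ProductId
          AND ProductType = @ProductType
    RETURN @Exists    -- 1 if the Supertype row is of the expected type, else 0
END

-- ... called from a Declarative CHECK Constraint on each Subtype,
-- using the function exactly where a column expression would go:
ALTER TABLE ProductBook
    ADD CONSTRAINT ProductBook_Exclusive_ck
    CHECK ( dbo.ValidateExclusive_Product( ProductId, 'B' ) = 1 )
```

No additional columns and no additional indices are required in either table; the discriminator lives once, in the Supertype.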

If you can't read my code, and make the necessary syntax changes on your platform to implement my code on your platform, then yes, there is a problem, of a different sort. Name your platform and I will write the code for you.

3. Regardless of the consideration of Subtype implementation, migration from any flavour of SQL to another has a series of problems. Those problems are neither more nor less due to the use of standard support for Subtypes, or of sub-standard support for Subtypes which doubles the number of indices. The point does not apply to the discussion.

> I checked the links you provided, and despite all I've said above, I do like your approach. In the future, if people ask me how to implement subtypes, I will probably give them the two versions.
>
> Choosing which one to use depends on various things. Your version eliminates the storage overhead, but at the price of a performance overhead.

Er, what "performance overhead" ? Please specify exactly, or withdraw the statement.

And please, limit your mathematics to the problem at hand, and identify whatever factors you determine to be "overhead" against the same for Joe's structure. We are comparing the two structures, not identifying "overhead" on one side only, that is the point of the thread. So identify whatever "performance overhead" Joe's structure has as well (hint: inserting/deleting to a table with two indices will perform slower than with one index).

> When inserting or updating a single row, that won't be a big problem. But what happens when inserting or updating thousands or even millions of rows in a single statement?

That is a very stupid thing to do, for any scenario, not only for Subtypes, because it will blow the transaction log. ANSI/IEC/ISO SQL is transactional. So whatever proscriptions apply to such attempts, apply to Subtype insertions as well, no more, no less. Each platform will handle that differently.

> This probably depends on the chosen DBMS. I know that SQL Server is unable to inline scalar user-defined functions into an execution plan, so for SQL Server, the only option is to execute a separate call to the function for each row (or at least for each unique set of input parameters -which in this case would be the same, since the primary key is included in the input parameters-). So insert a thousand rows, and instead of converting the subquery that is hidden in the function to a join, SQL Server will have to perform a thousand executions of the subquery. Your performance will, excusez mon French, suck.

Excuse me, no, *your* performance will suck.

My performance (DB2 and Sybase) flies. I already have close to one hundred standard Subtype structures in production. Millions of transactional inserts have been tested. Probably a thousand full table unload/reloads (thousands to millions of rows each) have been tested.

Oracle carks itself with subqueries, so I wouldn't try even the single row insert on it. If an Oracle person reads this, and decides that they cannot implement Standard Subtypes, that they have to implement Joe's monstrosity instead, that would be a valid decision. (There are many standard features which Oracle does not support, resulting in its famously bloated databases, but let's not get distracted.)

When you get into contemplating that level of "performance" problem, it becomes very platform-dependent, and dependent on the quality of code, for both transactional inserts and streamed inserts. We are better off if we limit this thread to logical issues and measurements, and keep away from physical implementations of those logical constructs, which are heavily dependent on platforms, their specific limitations, etc.

Actually, I have people who use my Subtype structure, and who have implemented it on MS SQL without any problems. So I can say for sure that (a) they did so without jumping through the hoops that you are speculating about, and (b) they suffer no performance problems or "performance problems". But I do not know what MS SQL does internally in each version, so I cannot agree with or refute your speculation about it. Therefore, I will state simply, I seriously doubt that that is the case for MS SQL.

> Another possible problem you have to be aware of with your solution is that, because the subquery is hidden in the UDF, it is unknown to the DRI mechanisms. Some scenarios will allow violations of the CHECK constraints, because the RDBMS does not know it has to re-check.

Excuse me, but it *IS* a DRI CHECK Constraint. Please read the code again.

"Some scenarios" is not correct; "scenarios" implies the general case, not something platform-specific. What you are talking about is how a particular platform operates, not scenarios. So the honest statement is, "some platforms will allow violations of the CHECK Constraint".

Well, that is a serious issue, and on the face of it I would say the indicated platform is totally unreliable, as well as non-compliant.

> Here's an example:
>
> -- I've got a new book:
>
> INSERT Product (ProductId, ProductType, Name) VALUES (1, 'B', 'Talk That Talk');
> INSERT ProductBook (ProductId, NumPages) VALUES (1, 11);
>
> -- Oh, wait - it's not a book, it's a CD!
>
> UPDATE Product SET ProductType = 'C' WHERE ProductId = 1;
>
> This won't trigger an error in SQL Server, because the CHECK constraint is only checked when the specific row in ProductBook is inserted or updated.
>
> Now this is an unusual scenario, but there are also lots of subtype situations where instances can migrate to another subtype. E.g. an employee that starts as a contractor but then decides to join the company and moves to the Payroll subtype. Or a supertype Person with subtypes Single, Married, Widowed, Divorced.

Yeah, sure. It appears that you do not understand context.

  1. The example code provides the implementation details for standard Subtypes. There is no suggestion that it does anything more than that. It is not a full blown application.
  2. The Supertype-Subtype structure is supposed to be considered as a single unit, not as a bunch of tables with disparate and singular relationships between them. You should not be updating anything in the cluster without considering the whole cluster, transactionally.
  3. When writing an application, you are best advised to implement all inserts/updates/deletes as transactions; you should never write singular inserts/updates/deletes.
  4. Due to the limitations of the SQL Spec (the Committee is about 20 years behind the high-end vendors, and about 35 years behind E F Codd), part of your data and referential integrity will be implemented declaratively, and the remainder will be implemented in transaction code. Those of us who care about quality and maintenance maximise the declarative part.
  5. The example code, being an example of Subtype implementation, uses singular inserts/updates/deletes, rather than transactions, as is commonly accepted in posted examples.
  6. The code is syntactically Sybase as declared in the header, but it is presented as standard code; you need to make the minor syntactic changes required for your platform.
  7. More important, when contemplating an application of said Subtypes, (as you have pointed out) you need to consider how your platform operates with regard to constraint checking: deferred constraint checking; limitations of functions; limitations of check constraints; limitations of transactions; your transaction design; etc., and design accordingly. I have not provided that; I have provided standard code only, and it is up to the competent implementer to implement it appropriately for their platform.
  8. I consider the constructs required to prevent your UPDATE to be ordinary Data Integrity or ordinary Referential Integrity, and that is another reason why the relevant declarations and code are not in my posted example. UPDATE Product without the corresponding DELETE ProductBook and INSERT ProductCD falls into the same category as:
     • INSERT Product without an INSERT ProductBook
     • DELETE ProductBook without the corresponding DELETE Product
     • etc.

Therefore:

9. It is a no-brainer for me, on my platform, to ensure that your example UPDATE, as well as those of the same category [8], are disallowed, as part of the implementation of ordinary Data and Referential Integrity for the whole database. How I do that will be standard SQL, but the implementation will be specific to my platform; its limitations; my existing transactions; my transaction standards; etc.

  10. However, you are free to perceive those considerations as part of Subtype Integrity, limited to this cluster, and not as ordinary Data Integrity or ordinary Referential Integrity that would be part of your whole database (which would mean that your database outside this cluster is devoid of that basic integrity!). In which case, feel free to add them in, in a way that is appropriate to your platform. Any way that you choose to do it is fine with me, as long as it maintains Data and Referential Integrity. Given your discourse re MS SQL, I would not be asking for it to be limited to Standard SQL.
  11. If you have trouble writing the Declarative Constraints and transaction code (no offence, but this structure is new to you, and you do not understand it yet) that are required for ordinary Data and Referential Integrity, which you are implying is part of the Subtype integrity, then at your request I would be happy to supply it, using Standard SQL; you will have to make the minor syntactical changes required for your platform. I won't be posting transaction basics or tutorials; I will assume the reader knows that, and post just the code.
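For the avoidance of doubt, the Subtype-migration case in your example is handled as one transaction on the cluster, along these lines. This is a sketch only; the ProductCD table and its NumTracks column are assumed for illustration.

```sql
-- Sketch: migrating Product 1 from Book to CD as a single transaction,
-- treating the Supertype-Subtype cluster as one unit.
BEGIN TRANSACTION
    DELETE ProductBook                          -- remove the old Subtype row
        WHERE ProductId = 1
    UPDATE Product                              -- change the discriminator
        SET ProductType = 'C'
        WHERE ProductId = 1
    INSERT ProductCD ( ProductId, NumTracks )   -- add the new Subtype row
        VALUES ( 1, 11 )
COMMIT TRANSACTION
```

Executed as a unit, there is no intermediate state in which the discriminator disagrees with the Subtype rows, which is exactly why singular updates against the cluster are proscribed.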

> Bottom line: I like your solution. But it's not 100% safe, and though you avoid the storage overhead, you pay for this with a performance overhead. I'd recommend people to choose between the two solutions on a case by case basis.

I refute the "not 100% safe" on the basis described above: implement whatever safety you need; that is not shown in, and cannot reasonably be expected to be shown in, the example of Subtypes.

I refute the "performance overhead", as detailed above, but I cannot be specific because you have not identified specifically what the alleged "performance overhead" is.

I strongly recommend that you do not recommend anything at all about this subject until:

  1. you understand it, which clearly has not been achieved yet
  2. you have actually implemented it and tested it (both transactional inserts/updates/deletes, and singular streamed inserts, etc), and you have gained actual experience, and some basis for confidence

Until then, your recommendations are based on contemplation, not facts, not experience.

E.g. I recommend that people use standard structures which require half the number of indices, over ones that require twice the number of indices. That is a general recommendation, not limited to Subtypes.

Cheers
Derek

Received on Mon Feb 04 2013 - 02:01:28 CET
