Re: Dreaming About Redesigning SQL

From: Anthony W. Youngman <thewolery_at_nospam.demon.co.uk>
Date: Wed, 22 Oct 2003 22:46:10 +0100
Message-ID: <BEP3KaCiqvl$EwcX_at_thewolery.demon.co.uk>


In article <bn4cca$dj0$1_at_nyytiset.pp.htv.fi>, Lauri Pietarinen <lauri.pietarinen_at_atbusiness.com> writes
>Anthony W. Youngman wrote:
>
>>
>>Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
>>MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
>>say the spec said "extract maximum performance from the hardware
>>available".
>>
>So what's wrong with getting a machine with lots of memory? How much does
>2G of memory for an Intel box cost nowadays? Is this some kind of new
>ultimate sport, trying to get along with as little memory as possible?

I presume you didn't read the bit below ... what if you have SEVERAL tables, and EACH of them is a gigabyte or two in size?

If an engineer has a problem, throwing brute force at it is rarely the solution. Let's be topical (near enough) and look at the Titanic (seeing as there was this film recently). If they'd foreseen the problem, they could have thrown brute force at it and doubled the thickness of the steel plate. Except she would then have sunk when they launched her, before she even had a chance to hit the iceberg.

Or look at aviation - especially in the early years. They had gliders that could fly, and they had engines that could easily provide the power to get a glider airborne. The problem was, every time they increased the power of the engine they got *further* *away* from the possibility of powered flight, because the increased power came at the price of increased weight.

You're welcome to live in your mathematical world where power can be gained for no cost, but that doesn't work in the real world. And the cost isn't necessarily dollars. Like in the aircraft example, the cost could be a case of "sorry, technology ain't that advanced yet mate!"
>
>>You're assuming that you can throw hardware at the problem - fine, but
>>that's not always possible. You might have already maxed out the ram,
>>you might have a "huge" database, you might be sharing your db server
>>with other programs (BIND really likes to chew up every available drop
>>of ram, doesn't it :-).
>>
>>I'm not saying that you shouldn't throw hardware at it, but what if you
>>can't?
>>
>>
>>Except my example was an *average* case, and yours is a *best* case. Oh,
>>and my data is still normalised - I haven't had to denormalise it! AND I
>>haven't run an optimiser over it :-)
>>
>Are you hiding your optimiser behind the curtain? ;-)

Well, if you include getting optimisation for free because "that's the way things work", maybe I am ;-)
>
>Well, if it is normalised, how easy is it for you to change the
>customer_id of an order? Anyway,

Incredibly easy. Just update the "customer_id" field of the invoice record. A single change to a single "row".
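
Spelt out - a sketch in SQL syntax, with hypothetical table and column names just to show the scale of the change - the whole operation is:

   -- re-point one order at a different customer: one row, one column
   UPDATE invoice
      SET customer_id = 'NEWCUST'
    WHERE invoice_no  = '0001234';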

>if we stick to your example and even if we don't normalise using e.g.
>clustering features of Oracle, as Bob pointed out, we are getting at most
>the same number of I/O's. So, answer to your question: our formula is at
>least as good as yours.

Except I think Bob said we could "optimise to favour *certain* transactions". I think ANY transaction actually benefits. You're relying on stuff that's outwith your theory; we're relying on stuff that's inherent to our model.
>
>>>Now, that was a *conservative* estimate, and we assumed that we did not have
>>>any rows lying around in the (global!) cache. As the size of the cache grows
>>>in proportion to the size of the total database we can assume less and less
>>>disk I/O.
>>>
>>>
>>
>>You're relying on the hardware to bale you out :-) We can do the same!
>>
>Well why don't you?

We let the hardware help us out if it can. There's a big difference. If you can't get the hardware, you're stuffed. We don't need it, so while we may have a hard time of it, it's nowhere near as bad for us.

And again, relational separates the physical from the logical. You're being hypocritical if you call upon the physical representation to help out with the (speed of the) logical representation.
>
>>>Note also that the cache can be configured many ways, you can put different
>>>tables (or indexes) in different caches, and even change the size of the cache
>>>on the fly (you might want a bigger cache during evening and night when your
>>>batch programs are running) so you can rig your system to favour certain
>>>types of queries.
>>>
>>>I haven't even gone into the topic of using thick indexes so table access can
>>>be totally avoided (=we are reading into memory only interesting columns).
>>>
>>>Now, in your example, what if the product department comes along and
>>>wants to make a report with sales / product? What would be your formula
>>>in that case?
>>>
>>>
>>
>>I'm not quite sure what you're trying to do. I'll assume you want a
>>report of all invoices which refer to a given product. Assuming I've got
>>the relevant indices defined, I can simply read a list of invoices from
>>the product code index, a second list of invoices from the month index,
>>and do an intersect of the two lists.
>>
>I want a list with all products with corresponding total sales, read
>from order detail e.g.
>
>Hammer 10000$
>Nail 5000$
>Screw 1200$
>
>How many disk reads (or head movements)?

Actually, probably the same as you here, if we're indexed on order detail. If Hammer appears in N invoices, then T = (1+N) * ST * 1.05 for hammers, and the same for every other product.

Theory favours us, in that if a product appears X times in one invoice, that's one read for us and X for you. But hardware will probably help you more than us (assuming thrashing cuts in), in that you stand a marginally higher chance of getting multiple instances of a product in any given read.
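
For reference, here's Lauri's report spelt out - a sketch in SQL, assuming a hypothetical ORDER_DETAIL table with PRODUCT and AMOUNT columns:

   -- total sales per product, read from order detail
   SELECT product,
          SUM(amount) AS total_sales
     FROM order_detail
    GROUP BY product;

The (1+N) * ST * 1.05 above is simply the MV cost of producing each line of that report.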
>
>>So again, T = (2+N) * ST * 1.05 where N is the number of invoices that
>>reference that product. And now ALL the invoice data has been retrieved
>>from disk to ram ...
>>
>>
>>>And: what if I was just reading customer-data. Would the same formula
>>>apply (= (2+N)*ST*1.05)?
>>>
>>>
>>
>>Nope. If I understand you correctly, you want attributes that belong to
>>the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
>>way, billing and/or invoice address (for example) are invoice
>>attributes, not company attributes.)
>>
>No, I want you to give me a list of all your customers. How many disk
>reads?

T = N * 1.05 where N is the number of customers. What do you want to know about those customers? Address? Phone number*s*? Anything else? That's *all* at no extra cost.
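
For comparison, the same listing as a sketch in SQL - hypothetical CUSTOMER and CUSTOMER_PHONE tables, the second one needed because phone numbers are multi-valued and so get normalised out into their own table:

   -- every customer, with address and all their phone numbers
   SELECT c.customer_id, c.name, c.address, p.phone_number
     FROM customer c
     LEFT JOIN customer_phone p ON p.customer_id = c.customer_id;

In the MV file the numbers sit inside the customer record itself, which is why the N * 1.05 reads already include them.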
>
>>>>But as I understand relational theory, such a question is completely
>>>>outside the scope of the theory. Seeing as it tries to divorce the
>>>>database logic from the practical implementation ...
>>>>
>>>>
>>>>
>>>The theory, indeed, does not say anything about buffer pools, but by
>>>decoupling logic from implementation we leave the implementor (DBMS) to
>>>do as it feels fit to do.
>>
>>
>>>As DBMS technology advances, we get faster systems without having to
>>>change our programs.
>>>
>>>
>>
>>But with MV, if our database is too large for current technology, we
>>kick the shit out of relational for speed ...
>>
>>Don't forget. You've already said that, if nothing is cached, my average
>>case exceeds your best. And my case is *already* assuming that the
>>system is seriously stressed and struggling ...
>>
>>
>>>When we design databases we can decouple logical planning from performance
>>>considerations, which, you must agree, are two separate issues.
>>>
Yes. BUT what's the point of having a database that is logically perfect, but whose performance is slow to the point of being unusable?

Don't forget - in practice MultiValue ends up with a database that is *inherently* optimised such that it almost invariably outperforms an equivalent SQL database, AND we don't normally have DBAs to help us achieve that nirvana ...
>>>
>>
>>I can't find the post now :-( but is Christopher reading this? You know
>>I compared that relational system on a twin Xeon 800, to an MV system
>>running on a P90? Christopher made the (reasonable in the circumstances)
>>assumption that the relational consultants must be crap, and the MV guy
>>a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
>>experience tells me that MV query was probably thrown together, by an
>>average programmer, in 30 seconds. On the other hand, those SQL
>>consultants had an axe to grind and a point to prove. They couldn't
>>afford to let this "old fashioned" system beat them. That SQL query
>>would have been optimised to within an inch of its life over weeks.
>>Don't forget how proud they were to beat this MV system! Yet with
>>hardware that was so much more powerful and a query that was heavily
>>optimised, they had great difficulty beating a query that was thrown
>>together in seconds by an average MV guy (or even just a luser!).
>>
>>Don't forget. I said I am a database *engineer*. Engineers believe in
>>elegance, they believe in beauty. And when I look at relational, all I
>>see is the theorists pleading "power", "hardware", "brute force", to get
>>them out of trouble.
>>
You said that logical planning and performance are separate issues. And I wouldn't expect you to address the above example in a discussion of relational, because performance is irrelevant to relational.

But surely, the fact that I am SUPREMELY CONFIDENT that I can get superior performance from inferior hardware should give you pause for thought that maybe, just maybe, the relational model is flawed from an engineer's or scientist's viewpoint?

From the mathematician's (or logician's) viewpoint I agree it's flawless. But that's true of plenty of broken scientific theories...

Cheers,
Wol

-- 
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let 
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett