
Re: What is a good blocksize to use.

From: Ben Brugman <benbrugman_at_onbekend.nl>
Date: Fri, 27 Sep 2002 11:44:12 GMT
Message-ID: <3d943e55.14500718@news.nl.uu.net>


On Fri, 27 Sep 2002 09:49:51 +0100, "Jonathan Lewis" <jonathan_at_jlcomp.demon.co.uk> wrote:

Thanks for your information and the good arguments; some notes inline.

>
>The reasoning above is not entirely consistent with the
>original description below. You have very skewed data,
>so any solution is likely to be a compromise of some sort -
>but you need to decide on the more important factor, and
>bias yourself in that direction. It's too easy to say (as you
>have done) "I don't like X because of argument A, and Y is
>wrong because of argument B"
>

This is completely correct. Wanting the best of all worlds is not realistic, so I will have to remind myself of that more often.

>
>In the notes above you don't like clustering because it
>will waste space. But the original requirement below
>pointed out that though the average number of rows per
>client was 40, the average number of rows per client ACCESSED
>was 300-400. If the critical performance requirement is to
>access those customers with 300-400 rows then who cares
>that 20,000 blocks with tiny customers are nearly empty - you
>don't access them much, the space (per se) doesn't matter.
>
>Estimating from your figures:
> you have 10 customers accounting for about 32M rows.
> you have 20,000 customers at about 400 rows each
> you have 180,000 customers with virtually no data.
>

Yes, being Dutch and used to working with a non-relational legacy database, I tend not to waste space. (With our legacy database we took this to extremes: RAD50 coding, and a boolean took only one bit, with no administration in the row.) Going from the legacy system to a relational database already made the database at least five times larger, and using a 'better' datamodel gave another factor of five. So we are now using 25 times the disk space for the same information (at less cost, I must add).
But we must definitely re-evaluate this. Does it hurt to lose space when gaining the advantages of clustering? GOOD ARGUMENT.
>At 10 customers per cluster, your 180,000 customer will
>leave about 18,000 blocks nearly empty in the worst case.
>
>By the way - if you put 10 customers per cluster, and 10%
>of your customers are large customers, then on average
>1 customer per cluster will be a large customer - so you won't
>be wasting space anyway - the large customer will fill the block.
>

My thought was that the spread would be very bad. I just did the calculations; based on the above, and assuming a random spread of the large customers, I get the following:

34.8 percent of the clusters contain no large customer.
38.7 percent of the clusters contain one large customer.
19.3 percent of the clusters contain two large customers.
5.7 percent of the clusters contain three large customers.
(The next values are 1.12, 0.15 and about 0.01 percent.)

So the wasted space is less than 40 percent, and about 25 percent of the clusters contain more than one large customer.
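
Just to document the calculation: it can be reproduced with a short script like the one below (my own sketch, nothing Oracle-specific), assuming about 10 customers per cluster block, roughly 10 percent of the customers being large, and an independent random spread, i.e. a binomial(10, 0.1) distribution:

    # Sketch only: distribution of large customers per cluster block,
    # assuming 10 customers per block, 10% large, independent placement.
    from math import comb

    CUSTOMERS_PER_BLOCK = 10
    P_LARGE = 0.10   # roughly 20,000 large out of ~200,000 customers

    for k in range(7):
        p = comb(CUSTOMERS_PER_BLOCK, k) * P_LARGE**k \
            * (1 - P_LARGE)**(CUSTOMERS_PER_BLOCK - k)
        print(f"{k} large customer(s) per block: {100 * p:5.2f} %")

    # prints 34.87, 38.74, 19.37, 5.74, 1.12, 0.15 and 0.01 percent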

I do not know how the overflow affects the large or the small customers. Some small customers might get hidden in clusters with a lot of overflow.

300 rows of 200 bytes gives 60,000 bytes, so most large customers will need a few overflow blocks.

So for a large customer, on average 8 to 16 blocks have to be read (8 KB blocks assumed).
For most small customers, 1 to 8 blocks will be read (about 25 percent of the small customers still fall in a cluster with a bit more overflow).
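
To make that estimate concrete, a rough back-of-the-envelope sketch (again my own illustration, assuming 200-byte rows and 8 KB blocks as above; it ignores row and block overhead, so it only gives a lower bound):

    # Sketch only: minimum number of 8 KB blocks needed for one customer,
    # ignoring row headers, block overhead and PCTFREE.
    BLOCK_SIZE = 8 * 1024    # 8 KB blocks, as assumed above
    ROW_SIZE   = 200         # bytes per row, as assumed above

    def min_blocks(rows):
        return -(-rows * ROW_SIZE // BLOCK_SIZE)   # ceiling division

    for rows in (300, 400):
        print(rows, "rows ->", min_blocks(rows), "blocks at least")

    # 300 rows -> 8 blocks, 400 rows -> 10 blocks; reading 8 to 16 blocks
    # in practice (average about 12) is roughly 1.5 times that minimum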

>
>Overflow is also not an issue -- in your context -- You are interested
>in clients who have 400 rows to report. How big are the rows, how
>many blocks will they fill ? Your target is to reduce 400 rows located
>in 400 different places to 400 rows located in a small number of places.
>You only get overflow if the data requires the space.
>
>

Conclusion.
Space wasted is not too bad (about 40 percent of the clusters contain a lot of empty space).
Access to large customers is fairly optimal (on average about 1.5 times the optimal number of blocks is read).
Access to small customers is less optimal, but because such accesses are relatively rare, this will only account for a small increase over other storage designs.

Thank you for pointing me in this direction, and for making me realise that this method is far more effective than my initial guess. THANKS.
ben brugman
