Re: What is a good blocksize to use.

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Fri, 27 Sep 2002 09:49:51 +0100
Message-ID: <an163g$ipb$1$8300dec7@news.demon.co.uk>
Notes inline.
--
Regards

Jonathan Lewis
http://www.jlcomp.demon.co.uk

Next Seminar dates:
(see http://www.jlcomp.demon.co.uk/seminar.html )

____England______September 24/26, November 12/14

____USA__________November 7/9 (MI), 19/21 (TX)

The Co-operative Oracle Users' FAQ
http://www.jlcomp.demon.co.uk/faq/ind_faq.html








Ben Brugman wrote in message <3d940f99.2536218_at_news.nl.uu.net>...


>On Fri, 27 Sep 2002 08:20:59 +0100, "Jonathan Lewis"

><jonathan_at_jlcomp.demon.co.uk> wrote:

>

>>

>

>We did look into clustering, but the spread of data

>is very skewed. about 10 of the clients are responsible for

>80-90 percent of the data. So most clients only have a few

>rows. About 10 percent of the clients have about a few

>hundred rows. Only a very  few clients have thousands of rows.

>

>Using clustering and estimating the average size of

>a cluster (number of clients in the cluster) probably would

>not work because at a cluster factor of 10 clients per cluster,

>a lot of clusters would be empty or almost empty, and a lot

>of clusters would have a lot over overflow because those

>clusters contain a very large client or several large clients.

>

>We did not try this, because of the above assumptions.

>If there is something wrong with the above reasoning, please

>inform me.

>



The reasoning above is not entirely consistent with the
original description below.  You have very skewed data,
so any solution is likely to be a compromise of some sort -
but you need to decide on the more important factor, and
bias yourself in that direction.  It's too easy to say (as you
have done) "I don't like X because of argument A, and Y is
wrong because of argument B"


In the notes above you don't like clustering because it
will waste space. But the original requirement below
pointed out that though the average number of rows per
client was 40, the average number of rows per client ACCESSED
was 300-400. If the critical performance requirement is to
access those customers with 300-400 rows then who cares
that 20,000 blocks with tiny customers are nearly empty - you
don't access them much, the space (per se) doesn't matter.

Estimating from your figures:
    you have 10 customers accounting for about 32M rows.
    you have 20,000 customers at about 400 rows each
    you have 180,000 customers with virtually no data.

At 10 customers per cluster, your 180,000 customer will
leave about 18,000 blocks nearly empty in the worst case.

By the way - if you put 10 customers per cluster, and 10%
of your customers are large customers, then on average
1 customer per cluster will be a large customer - so you won't
be wasting space anyway - the large customer will fill the block.


Overflow is also not an issue -- in your context -- You are interested
in clients who have 400 rows to report. How big are the rows, how
many blocks will they fill ?  Your target is to reduce 400 rows located
in 400 different places to 400 rows located in a small number of places.
You only get overflow if the data requires the space.




>>> (From 40 miljoen plus records on average between 300 and 400

>>> get selected on a 'client_id') Some clients have 3 records some

>>> clients have 3000 records. On average a client has 40 records,

>>> but the average client which is accessed has as said 300 to 400

>>> records. Offcourse there is an index on the client_id, but even then

>>> the table has to be accessed 300 to 400 times. This means in

>>> 300 reads from disk and 300 blocks in the cache.
Received on Fri Sep 27 2002 - 03:49:51 CDT