
Re: How to optimize a table for segment data compression?

From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk>
Date: Mon, 25 Nov 2002 09:35:30 -0000
Message-ID: <arsrth$64h$1$830fa17d@news.demon.co.uk>

Interesting question - I've been thinking about it over the weekend, but haven't tested anything.

A compression factor of 12 suggests that more than 90% of EVERY row was duplicated, and could be compressed with virtually no overhead - which seems a little unlikely in most cases. However, compression is done at a block level, with column re-ordering within the block - so if, on a block-by-block basis, there were lots of null columns, Oracle could choose to migrate them to the ends of the rows and make an interesting 'extra' win there.

The mechanism of compression appears to take a single copy of each repeated value and store it in a 'table zero' set of rows in the block, and then change each row to hold pointers to the correct 'table zero' entry instead of holding the value. Given this, I would expect very short columns to offer no benefit from compression. As far as medium to long columns go, I think there is too much room for variation.
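
A quick way to check this for yourself is to dump a block from a compressed table and look for the 'table zero' entries. A rough sketch, run from SQL*Plus - T1 is a hypothetical compressed table, and you need the privilege to dump data blocks and access to the trace file in user_dump_dest:

    -- Capture the file and block number of one row of T1, then dump
    -- that block. In the trace file the shared values should appear
    -- in a "tab 0" (table zero) section, with the real rows in the
    -- following section referencing them.
    column file_no  new_value m_file
    column block_no new_value m_block

    select
            dbms_rowid.rowid_relative_fno(rowid)   file_no,
            dbms_rowid.rowid_block_number(rowid)   block_no
    from    t1
    where   rownum = 1;

    alter system dump datafile &m_file block &m_block;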

Critically, you need to get enough compression to allow extra rows into the block if you really want to save space - eliminating 30 duplicate copies of a 10-byte value may give you two extra rows in a block, whereas eliminating two spare copies of a 150-byte value could have the same effect. So I would be inclined to sort by

    long columns with small numbers of repeats

before

    short columns with large numbers of repeats

But I wouldn't expect this to be a perfect solution in every case - it's something to test on your own data (see the sketch below).
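
As a quick check, something like the following could build two compressed copies of one partition's data with different sort orders and compare the block counts. The source table name and the column names (a long, low-repeat VARCHAR2 and a short, high-repeat flag) are made up for illustration:

    -- Two compressed copies of the same data, sorted differently.
    -- cdr_copy, big_varchar_col and flag_col are hypothetical names.
    create table cdr_sorted_long pctfree 0 compress as
    select * from cdr_copy
    order by big_varchar_col, flag_col;

    create table cdr_sorted_short pctfree 0 compress as
    select * from cdr_copy
    order by flag_col, big_varchar_col;

    -- Compare the space used by the two copies
    select segment_name, blocks
    from   user_segments
    where  segment_name in ('CDR_SORTED_LONG','CDR_SORTED_SHORT');

The PCTFREE 0 is there because the data never changes once loaded, so there is no need to reserve space in each block for updates.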

--
Regards

Jonathan Lewis
http://www.jlcomp.demon.co.uk

Next Seminar dates:
(see http://www.jlcomp.demon.co.uk/seminar.html )

____England______January 21/23


The Co-operative Oracle Users' FAQ
http://www.jlcomp.demon.co.uk/faq/ind_faq.html





Vsevolod Afanassiev wrote in message
<4f7d504c.0211220233.1255e5f_at_posting.google.com>...

>First, thanks to the people who responded to my previous post
>regarding data compression.
>
>* * *
>
>We have a database with a large historic table, partitioned by day.
>New records are constantly loaded "at the top" and the N-days-old
>partition is truncated "at the bottom". Once loaded, the records
>never change. It looks like a perfect candidate for 9.2-style data
>compression.
>
>Our tests show a compression ratio of about 2 - 2.5, with a 16K
>block size (the biggest block size for Solaris) and a row size of
>approx. 200 bytes. We thought that the compression ratio could be
>improved by loading pre-sorted records, to increase the probability
>of records in the same block having similar values. It is possible
>to dump a whole partition into a file, sort it, and then load it
>back into the database.
>
>However, which columns should we use for the sort? Probably start
>with the columns that have the smallest number of distinct values
>(like Yes/No)? Do we need to take the column size into account as
>well? It may be more important to group together similar
>VARCHAR(100) values, even if there are 1000 of them per partition,
>than to group CHAR(1) values with 10 distinct values.
>
>So my suggestion is: rank all columns by ratio
>
>Ratio=(number of distinct values)/(average size of one value)
>
>and then sort, starting from the column with the smallest Ratio,
>then the second smallest, and so on.
>Does it make sense?
>
>
>* * *
>
>The White Paper on Data Segment Compression mentions that the
>highest compression ratio they achieved was 12, for Call Data
>Records (CDRs). Well, my table also contains CDRs, but so far I
>couldn't get a compression ratio of more than 2.5.
>
>Thanks,
>Sev
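
As a footnote to the ranking Sev suggests above: once statistics have been gathered, both inputs to his Ratio are already available from the data dictionary, so a candidate sort order could be listed with something like the sketch below, run from SQL*Plus. The table name is made up, and num_distinct and avg_col_len are only rough stand-ins for "distinct values per partition" and "average size of one value":

    -- Rank the columns by Ratio = (distinct values) / (average value size),
    -- smallest first, using the dictionary statistics as approximations.
    -- CDR_COPY is a hypothetical table name.
    execute dbms_stats.gather_table_stats(user, 'CDR_COPY')

    select  column_name,
            num_distinct,
            avg_col_len,
            -- nullif avoids a divide-by-zero for columns that are all null
            num_distinct / nullif(avg_col_len, 0)   ratio
    from    user_tab_col_statistics
    where   table_name = 'CDR_COPY'
    order by ratio;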
