Re: Sotirng data in row versus storing the same in column

From: Arijit Mukherjee <arijit_at_lucent.com>
Date: Fri, 17 May 2002 09:44:01 +0530
Message-ID: <3CE48389.2E0AE062@lucent.com>

All,
I must have misunderstood this email - or the other ones. The "newfangled way" here is, perhaps, referring to the relational way - where the parameters of table A (i.e., a,b,c,...) are stored as columns. Please note these are columns of table A - which will have rows containing values for these columns in rows. And the number of rows, even, in this case will be huge - a very small amount of data for a busy enough table goes upto 50000+ rows.

I have got replies saying that the model itself is a mess (to this email - that I later forwarded to comp.databases.oracle.misc) - YES!!! - and we are trying to make a little bit of sense out of this mess - as we will have to live with the mess for some time now. So, we do not have the luxury of scrapping this model altogether - as the model is something of a legacy.

Now, therefore, given the situation, I will need to store these information. The way that I could think of was to use the column model - with a cap on the number of columns that I can have on a table set to 600(leaving a healthy 400 to take care of arrays in the columns as Oracle sets a limit of 1000 columns per table - and this cap can be briught up/down based on an empirical value definition in my code) - and the following 600 columns go over to the next extended table with a row identifier stringing the rows in the extended tables together. However, what I have also done is that, when, I actually start working on the rows that span multiple extended table - I have noticed (atleast for now, might have surprise later!!!) that I do not need all these column data for rows in one go. So, at the time of actual working - I create a new table having only those columns that I need spanning multiple extended table and get the row information into this table - mostly after a select query - so that the data I am left with is much smaller - both in terms of rows and columns - on which I create index and do the actual work. Therefore at the time of storing data into these extended table is not slowed down - for creating index values at the time of storage - and I do not use up huge space to stoe these indexes. However, when I actually start to use this data - the creation of index speeds up my work - which I remove, once I am done with the index. Also, that ensures that I create index on only those columns that I need.

There was an existing implementation that was using the name value pair in the 2-column format as mentioned as row format of data storage. There is a logic floating around that there will be performance cost to pay if we have to handle this bizzare case where the columns (including the array size) goes beyond thousands if we use the column way (the RDBMS way) - as opposed to the name-value 2 column way - where the number of rows will be huge (for a small sized database - a table of the hard hitting variety takes more than a few million rows of data) - but the number of columns will not be big.

Now, I would like to know whether anyone can advise me on a comparison between the 2 approaches - so far as speed is concerned - as I am pretty certain that so far as ease of use is concerned - the column way(Relational way) wins hands down,

Regards,
Arijit

Galen Boyer wrote:
>
> On Thu, 16 May 2002, arijit_at_lucent.com wrote:
>
> [...]
>
> > In column format - I will be using the normal RDBMS format - with
> > the only exception that I will need to put additional columns in
> > extended tables having an identifier between the same rows in the
> > extended table. Regards, Arijit
>
> Do you think this will be faster or something? Lets say you have 100
> columns, with 99 indexes. This way will have an index for every column,
> but every column will only have one row (?). In Oracle, at least, 99
> blocks would be used to store these 99 single values along with their
> rowids. If you stored it in relational form, row form as you say, a
> very small number of blocks (two in my example below) would be used to
> store this one index.
>
> Lets say each column takes 4 bytes. Lets say a block is 2K. Lets say
> the rowid is 10 bytes. Then, 2000/40 = 50. You would have to have more
> than 50 rows in your index before one extra block would be used in your
> relational index (Not totally true with headers and other stuff but
> close enough for argument). In your newfangled way, 99 2K blocks would
> be used to store your 99 indexes. (To tell you the truth, with 100
> rows, an index would even make sense. Oracle could scan much faster
> than reading both index and then table)
>
> Now, remember that the database will read an entire block to get at the
> data in that block. It will have to do almost exactly the same work to
> read the two, 100 value index as reading any of your 99 one value
> indexes. Do you really think you are speeding things up? You do realize
> that any dml on this newfangled table would have to maintain all these
> 99 indexes don't you?
>
> Now, lets say you have 1 million values. Are you going to have
> multitudes of tables all with one row but 100's of columns, all indexed?
> Do you really think this is speeding things up? Are all of these tables
> going to be linked together by the same id? Whew!
> --
> Galen deForest Boyer
> Sweet dreams and flying machines in pieces on the ground.
Received on Thu May 16 2002 - 23:14:01 CDT