Re: Modelling objects with variable number of properties in an RDBMS

From: Frank Hamersley <terabitemightbe_at_bigpond.com>
Date: Fri, 04 Nov 2005 00:49:02 GMT
Message-ID: <2eyaf.8082$Hj2.1929_at_news-server.bigpond.net.au>


VC wrote:
> "Bernard Peek" <bap_at_shrdlu.com> wrote in message
> news:xGctaeGgBWaDFwgS_at_shrdlu.com...
>

>>In message <xpednZGpW5rOofTeRVn-pQ_at_comcast.com>, VC 
>><boston103_at_hotmail.com> writes
>>
>>>>Should these tests really be considered attributes?  Wouldn't the act
>>>>of presenting them as a two-tuple relation <test#, test result> simple
>>>>be an wise step of normalization - not an example of an EAV
>>>>decomposition?  Am I totally missing the boat on this?
>>>
>>>It's a single test with ~6000 measurements.
>>
>>That doesn't sound like something that requires multiple attributes. If 
>>it's a single test then the results are presumably in one domain so a 
>>single entity should work.

>
> I am not intimately familiar with the testing process. According to the
> person who created the model, it's a drug discovery chemical compound
> testing process which runs daily. The test result for a given compound is
> represented by many thousands of numbers, hence 600 attributes per entity.
> They generated close to half a terabyte of experimental data each day.
>
>>What distinguishes one set of test results from another? Is it the time >>the test was done, the temperature?

It seems to me that this sounds like a situation where practical concerns to cope with the volume of data has induced an attempt to achieve compression (by reducing the repeated occurence of key information were the schema to be normalised per RM theory.

Consider an automated gel electrophoresis device that scans a plate at discrete time and distance intervals (indeterminate and 6000 respectively) and records a dye density at each point.

To conserve space you might be inclined to form a relation like ....

     TestRun,PlateName,SampleDateTime,Density1,Density2,....,Density6000.

... until you blow the page size :-) Basically the Distance attribute is lodged in metadata.

The (simplistic) RM view of this would be ....

     TestRun,PlateName,SampleDateTime,Distance,Density.

If the first model produces 1/2 Tb then the second will surely occupy much more space. However I don't think I could ever sell out the RM and go with option 1. It is much better IMO to find ways to overcome the problems of the second option by choosing the right DBMS and configuring it appropriately, rather than (to coin a metaphor) to "use an SUV (or even a fleet of SUV's) to move 20 tonnes of freight from coast to coast"!

Salut, Frank. Received on Fri Nov 04 2005 - 01:49:02 CET

Original text of this message