Re: A searchable datastructure for represeting attributes?

From: Obnoxio The Clown <obnoxio_at_hotmail.com>
Date: Wed, 27 Feb 2002 21:35:59 GMT
Message-ID: <3c7d4f20.20729737@News.CIS.DFN.DE>

On Wed, 27 Feb 2002 14:13:00 -0500, no_spam_for_me_at_thankyou.very_much wrote:

>Obnoxio The Clown wrote:
>
>> On Wed, 27 Feb 2002 11:16:15 -0500, no_spam_for_me_at_thankyou.very_much
>> wrote:
>
>> Use approach 2.
>
>I would love to. In fact we did use approach_2 for a a long time and
>unfortunately it proved impractical. Yes approach_2 has all the benifits I
>described plus as you correctly pointed out it's easier to report on. The
>problem is that in our case is 2-fold:
>
>1) the attribute keys change frequently and on the fly (ie when the
>application is running) so "alter table drop/add column" is not practical
>when the table is large.
>
>2) There are many many attrubutes (~ 500) and only 3 or 4 at most are set
>for any given object. The application is structured in such a way that when
>an object does not have an attribute set it gets a default value associated
>with that attribute key. So from the perspective of the application all
>objects have all fields specified but the database only stores separate rows
>for those object that have value different from default. If you structure
>the schema with approach_2 your usage of space increases by a factor of
>about 100 consumed for storage of NULLs or defaulft values.
>
>Even though the object model of my particular situation is too long and
>boring to go into I will give the following example: Imagine you have a
>database of resdidents in a particular town and you have attributes like
>phone number, address etc. But you also have attribtutes like place of birth
>which by nature of the data is the same for 99% of the residents of the town
>(eg the town itself)and is different for the 1% who moved in from elsewhere.
>Now imagine you have 500 attibutes with this quality 99% of objects have the
>same value and 1% has a different value. If you use the trivial solution (as
>I described in approach_2) your database is going to be 100 bigger than it
>needs to be. It becomes a case of denormalization in a way. Imagine also
>that attributes are constantly added and deleted as the application is
>running. When you think about this example you will see that approach_1
>makes a lot more sense except for the fact that searching it requires lots
>of self-joins (or correlated subqueries). If it was not for searching
>approach_1 below would be a far superior solution. Now my question is: Is
>there any datastructure that has the fliexibility and "normalization" of
>approach_1 and the ease of search of approach_2. I do not mean ease in a
>sense that the query has to be trivial but in a sense that it would be fast.
>Combining SQL with application logic is also fair game as long as it
>achieves the results.

Hey! This is a forum for discussion of a relational database. What you're describing sounds like something else, and denormalisation is the closest I come to swearing.

OK, I'll try something else (9.x only): why don't you build a row that has a structure for all the stuff that *is* "unique" or relatively so, and has a set component at the end, with pairs of values that indicate which field(s) differ and the associated value.

Unfortunately, I'm a bit busy at the moment, I just dipped in here for some light relief, but if you have 9.x get the manuals out and look into collections....

>> >THE QUESTION: Is there any good datastructure for representing
>> >attributes which has the advantages of approach_1 and the search and
>> >sort convenience of approach_2 below? Seems like a problem many people
>> >would
>> >have run into.
>> >
>> >
>> >1)
>> >
>> >objectid key value
>> >78 phone 555-1212
>> >78 addr Main Street
>> >79 phone 222-2222
>> >79 addr Green Street
>> >
>> >(keys could be normalized of course but I skip that for this example)
>> >
>> >
>> >2)
>> >
>> >objectid phone addr
>> >
>> >78 555-1212 Main Street
>> >79 222-2222 Green Street
>> >
>> >The first approach has several advantages
>> >
>> >a) It is flexible. That is you do not need to change the schema to add
>> >new attribute types.
>>
>> I think this could best be described as a false economy. Also, it
>> doesn't play well with third-party report writers, etc.
>>
>> >b) It is normalized in a sense that if an attribute does not have a
>> >value you can represent it by its absense in the table istead of
>> >duplicating NULL in every feild that does
>> > not have a value. In my case this is actually a critical thing
>> >because there are a lot of attributes whose values are not set.
>>
>> <shudder>
>>
>> >Overall the first approach serves my case much better than second
>> >approach. However it has important disadvantages in that the search by a
>> >
>> >combination of attributes as
>> >well is sort by attributes is very cumbersome and slow. For example if I
>> >
>> >wanted to find all objects where phone number is 555-1212 AND address
>> >equals "Main Street" I would need to
>> >do a self join like
>> >
>> >select t1.objectid from mytable t1, mytable t2 where
>> >t1.objectid=t2.objectid and t1.phone='555-1212' and t2.addr='Main
>> >Street'. There are many problems here
>> >
>> >1) very slow when table is large
>> >2) you need to do a self join for every attribute in the AND-ed
>> >expression.
>> >3) If you want to sort by an attribute other than the search key you
>> >need to self join yet again.
>> >
>> >
>> >By contrast with approach 2 all these problems go away:
>> >
>> >select id from mytable where phone='555-1212' and addr='Main Street'
>> >order by some_field
>
Received on Wed Feb 27 2002 - 15:35:59 CST