Re: eliminating redundant data

From: Alan <alan_at_erols.com>
Date: Thu, 3 Jun 2004 10:10:55 -0400
Message-ID: <2i8prgFkm5tvU1_at_uni-berlin.de>


No database vendor, version, platform... Why are you using tables as indexes when you can use indexes as indexes? Normalizing a data model generally slows it down. Denormalizing generally speeds things up.

"gordy" <gordy_at_dynamicsdirect.com> wrote in message news:40be960f$1_1_at_Usenet.com...
> edit: this came out longer than I thought; any comments about anything
> here are greatly appreciated. Thank you for reading.
>
> My system stores millions of records, each with fields like firstname,
> lastname, email address, city, state, zip, along with any number of
> user-defined fields. The application allows users to define message templates
> with variables. They can then select a template, and for each variable
> in the template, type in a value or select a field.
>
> The system allows you to query for messages you've sent by specifying
> criteria for the variables (not the fields).
>
> This requirement has made it difficult to normalize my data model at all
> without hurting speed. What I have is this:
>
> [fieldindex]
> id int PK
> name nvarchar
> type datatype
>
> [recordindex]
> id int PK
> ...
>
> [recordvalues]
> recordid int PK
> fieldid int PK
> value nvarchar
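>
> In DDL terms that's roughly the following (a sketch in SQL Server-style
> syntax since I use nvarchar; exact types, lengths and constraint details
> are simplified):
>
> create table fieldindex (
>     id    int identity primary key,
>     name  nvarchar(100) not null,
>     type  nvarchar(30) not null    -- the field's datatype (representation approximate)
> );
>
> create table recordindex (
>     id    int identity primary key
>     -- ... (other per-record columns omitted)
> );
>
> create table recordvalues (
>     recordid int not null references recordindex(id),
>     fieldid  int not null references fieldindex(id),
>     value    nvarchar(400) null,
>     primary key (recordid, fieldid)
> );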
>
> Whenever messages are sent, I store which field was mapped to which
> variable (or what value was typed in) for that deployment. So a query
> with variable criteria looks like this:
>
> select coalesce(vm.value, rv.value)
> from sentmessages sm
> inner join variablemapping vm on vm.deploymentid=sm.deploymentid
> left outer join recordvalues rv on
> rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
> where coalesce(vm.value, rv.value) ....
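>
> (For reference, the two tables above that I haven't shown look roughly
> like this -- only the columns the query touches, everything else
> omitted:)
>
> create table sentmessages (
>     recordid     int not null,      -- which record received the message
>     deploymentid int not null       -- which send/deployment it was part of
>     -- ... (other columns omitted)
> );
>
> create table variablemapping (
>     deploymentid int not null,
>     fieldid      int null,          -- set when the variable was mapped to a field
>     value        nvarchar(400) null -- set when a literal value was typed in
>     -- ... (other columns omitted)
> );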
>
> This model works pretty well for searching messages with variable
> criteria and for looking up variable values for a particular message. The
> big problem I have is that the recordvalues table is HUGE: 1 million
> records with 50 fields each = 50 million recordvalues rows. The value
> column, the two int columns, plus the two indexes I have on the table
> make it a beast. Importing data takes forever. Querying the records
> (with field criteria) also takes longer than it should.
>
> That makes sense; the performance was largely IO bound.
>
> I decided to try to cut into that IO. Looking at a recordvalues table
> with over 100 million rows in it, there were only about 3 million unique
> values, so I split the recordvalues table into two tables:
>
> [recordvalues]
> recordid int PK
> fieldid int PK
> valueid int
>
> [valueindex]
> id int PK
> value nvarchar (unique)
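>
> With the split, the lookup side of the query picks up one more join,
> roughly (same shape as the query above, criteria elided as before):
>
> select coalesce(vm.value, vi.value)
> from sentmessages sm
> inner join variablemapping vm on vm.deploymentid=sm.deploymentid
> left outer join recordvalues rv on
> rv.recordid=sm.recordid and rv.fieldid=vm.fieldid
> left outer join valueindex vi on vi.id=rv.valueid
> where coalesce(vm.value, vi.value) ....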
>
> Now valueindex holds 3 million unique values and recordvalues
> references them by id. To my surprise, this shaved only 500 MB off a
> 4 GB database!
>
> Importing didn't get any faster either; although it's no longer IO
> bound, it appears the new CPU bottleneck outweighs the old IO bottleneck.
> This is probably because I haven't optimized the queries for the new
> tables (I was hoping it wouldn't be so hard without the IO problem).
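>
> (The extra work the import now has to do is resolving every incoming
> value to its id in valueindex. Done set-based, one batch at a time, it
> looks roughly like this -- a sketch; #incoming is a hypothetical staging
> table holding a bulk-loaded chunk of the CSV, and it assumes
> valueindex.id is an identity column:)
>
> -- add any values not already in valueindex
> insert into valueindex (value)
> select distinct i.value
> from #incoming i
> where not exists (select 1 from valueindex v where v.value = i.value);
>
> -- then write the record values using the ids
> insert into recordvalues (recordid, fieldid, valueid)
> select i.recordid, i.fieldid, v.id
> from #incoming i
> join valueindex v on v.value = i.value;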
>
> Is there a better way to accomplish what I'm trying to do (eliminate
> the redundant data)? Does SQL have built-in constructs for this? It
> seems like maybe I'm trying to duplicate, at a high level, functionality
> that may already exist at a lower level.
>
> IO is becoming a serious bottleneck.
> The million-record, 50-field CSV file is only 500 MB. I would have
> thought that after eliminating all the redundant first names, cities,
> last names, etc., it would be less data, not 8x more!
>
> -
> Gordon
>
Received on Thu Jun 03 2004 - 16:10:55 CEST