eliminating redundant data

From: gordy <gordy_at_dynamicsdirect.com>
Date: Wed, 02 Jun 2004 20:06:40 -0700
Message-ID: <40be960f$1_1_at_Usenet.com>



edit: this came out longer than I thought; any comments about anything here are greatly appreciated. Thank you for reading.

My system stores millions of records, each with fields like firstname, lastname, email address, city, state, zip, along with any number of user-defined fields. The application allows users to define message templates with variables. They can then select a template, and for each variable in the template, type in a value or select a field.

The system allows you to query for messages you've sent by specifying criteria for the variables (not the fields).

This requirement has made it difficult to normalize my data model at all without giving up speed. What I have is this:

[fieldindex]

id int PK
name nvarchar
type datatype

[recordindex]

id int PK
...

[recordvalues]

recordid int PK
fieldid int PK
value nvarchar
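
In DDL terms that's roughly the following (the nvarchar lengths are placeholders and the representation of the type column is a guess; the composite key on recordvalues is what I actually have):

create table fieldindex (
    id       int           not null primary key,
    name     nvarchar(100) not null,
    type     int           not null   -- datatype code (actual representation varies)
)

create table recordindex (
    id       int           not null primary key
    -- ... other per-record columns
)

create table recordvalues (
    recordid int           not null references recordindex(id),
    fieldid  int           not null references fieldindex(id),
    value    nvarchar(250) not null,
    primary key (recordid, fieldid)
)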

Whenever messages are sent, I store which fields were mapped to which variables for that deployment. So a query with a variable criterion looks like this:

select coalesce(vm.value, rv.value)
from sentmessages sm
inner join variablemapping vm
    on vm.deploymentid = sm.deploymentid
left outer join recordvalues rv
    on rv.recordid = sm.recordid
    and rv.fieldid = vm.fieldid
where coalesce(vm.value, rv.value) ....
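
For example, searching for messages whose 'city' variable came out as 'portland' (column names on sentmessages/variablemapping simplified for illustration):

select sm.id, coalesce(vm.value, rv.value) as cityvalue
from sentmessages sm
inner join variablemapping vm
    on vm.deploymentid = sm.deploymentid
    and vm.variablename = 'city'                  -- illustrative column/name
left outer join recordvalues rv
    on rv.recordid = sm.recordid
    and rv.fieldid = vm.fieldid
where coalesce(vm.value, rv.value) = N'portland'  -- example criterion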

This model works pretty well for searching messages with variable criteria and for looking up variable values for a particular message. The big problem I have is that the recordvalues table is HUGE: 1 million records with 50 fields each = 50 million recordvalues rows. The value column, the two int columns, plus the two indexes I have on the table make it a beast. Importing data takes forever. Querying the records (with a field criterion) also takes longer than it should.

Makes sense; the performance was largely IO-bound.

I decided to try to cut into that IO. Looking at a recordvalues table with over 100 million rows in it, there were only about 3 million unique values, so I split the recordvalues table into two tables:

[recordvalues]

recordid int PK
fieldid int PK
valueid int

[valueindex]

id int PK
value nvarchar (unique)
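
The one-time remap was essentially something like this (simplified; recordvalues_old is just the renamed original table, and valueindex.id is an identity column):

-- build the list of distinct values
insert into valueindex (value)
select distinct value
from recordvalues_old

-- remap each (recordid, fieldid, value) row to the value's surrogate id
insert into recordvalues (recordid, fieldid, valueid)
select rv.recordid, rv.fieldid, vi.id
from recordvalues_old rv
inner join valueindex vi on vi.value = rv.value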

Now valueindex holds 3 million unique values and recordvalues references them by id. To my surprise, this shaved only 500 MB off a 4 GB database!
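
For what it's worth, sp_spaceused shows how much of that is data vs. indexes per table:

-- refresh space usage stats for the current database first
dbcc updateusage (0)

-- rows / reserved / data / index_size / unused for each table
exec sp_spaceused 'recordvalues'
exec sp_spaceused 'valueindex'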

Importing didn't get any faster either; although it's no longer IO-bound, the CPU appears to have become the new bottleneck, and it outweighs the old IO one. This is probably because I haven't optimized the queries for the new tables (I was hoping it wouldn't be so hard without the IO problem).
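
For reference, the variable-criteria search against the split schema picks up one more join, something like this (example criterion made up):

select coalesce(vm.value, vi.value)
from sentmessages sm
inner join variablemapping vm
    on vm.deploymentid = sm.deploymentid
left outer join recordvalues rv
    on rv.recordid = sm.recordid
    and rv.fieldid = vm.fieldid
left outer join valueindex vi
    on vi.id = rv.valueid                         -- extra hop through the value table
where coalesce(vm.value, vi.value) = N'portland'  -- example criterion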

Is there a better way to accomplish what I'm trying to do (eliminate the redundant data)? Does SQL have built-in constructs to do stuff like this? It seems like maybe I'm duplicating functionality at a high level that may already exist at a lower level.

IO is becoming a serious bottleneck. The million-record, 50-field CSV file is only 500 MB; I would've thought that after eliminating all the redundant first names, cities, last names, etc., it would be less data, not 8x more!
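
Rough math on where the 8x could be coming from, guessing an average field value of about 10 characters:

csv:           ~10 chars + delimiter                         ~ 11 bytes per value
recordvalues:  nvarchar at 2 bytes/char (~20) + two int keys (8)
               + ~10 bytes of per-row overhead               ~ 38 bytes per value
indexes:       each nonclustered index repeats its key columns
               plus a row pointer for all 50 million rows

So the base table alone is roughly 3-4x the CSV, and the indexes can roughly double that again.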

-
Gordon
