Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US
 


Re: When and when not to put information in a database.

From: Steve Kass <skass_at_drew.edu>
Date: Mon, 03 Mar 2003 02:31:41 -0500
Message-ID: <b3v0a1$h39$1@slb6.atl.mindspring.net>


Jesse,

  I don't quite see how you figure 10,000 times as many reads with normalization. The same number of prices needs to be read in either case. The amount of data stored does go up (and the non-normalized table would exceed SQL Server's limit on the length of a row).

Not normalized: 10,000 floats + 3-character tag per instrument
Normalized: 10,000 * (1 float + 1 simulation_id (2 bytes?) + one 3-byte tag)

So each instrument goes from 80,003 bytes to 130,000 bytes of raw data. Yes, many more rows need to be accessed, but I don't think accessing all 10,000 prices from one row (even if rows were allowed to be so long) would be as fast as accessing the information from one very short row of the normalized table.
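That arithmetic can be checked in a few lines (a sketch assuming 8-byte floats, a 2-byte simulation_id, and a 3-byte instrument tag, as above):

```python
# Back-of-the-envelope storage per instrument, assuming 8-byte floats,
# a 2-byte simulation_id, and a 3-byte instrument tag (as in the text).
N_PRICES = 10_000

denormalized = N_PRICES * 8 + 3        # one wide row: 80,003 bytes
normalized = N_PRICES * (8 + 2 + 3)    # 10,000 short rows: 130,000 bytes

print(denormalized, normalized)  # 80003 130000
```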

Indexes can change this analysis, but there may actually be a benefit here. The short rows may provide more opportunities for caching or parallelism, and more possible locking strategies. There may also be some creative approaches to query writing.

It may turn out not to work at all, but before you decide not to normalize, you may want to do both a more thorough analysis of required I/O and some testing. I'm not sure a normalized table is better, but the query optimizer and the reliability of the theory constantly surprise me.
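One cheap way to run that test is to mock up the normalized layout in an embedded database. This is a minimal sketch using SQLite; the table and column names follow the query quoted later in the thread, but the schema itself is illustrative, not something fixed by the discussion:

```python
import sqlite3

# Minimal normalized layout: one row per (instrument, simulation) pair,
# with a composite key playing the role of the clustered index.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TheoreticalPrices (
        InstrumentID TEXT    NOT NULL,
        Sequence     INTEGER NOT NULL,
        Price        REAL    NOT NULL,
        PRIMARY KEY (InstrumentID, Sequence)
    )
""")

# Load a toy slice: 5 instruments x 10 simulated prices.
rows = [(inst, seq, 50.0 + seq)
        for inst in "ABC DEF GHI JKL MNO".split()
        for seq in range(1, 11)]
conn.executemany("INSERT INTO TheoreticalPrices VALUES (?, ?, ?)", rows)

# The ad-hoc access pattern from the thread: all prices for one instrument.
prices = conn.execute(
    "SELECT Price FROM TheoreticalPrices "
    "WHERE InstrumentID = ? ORDER BY Sequence",
    ("ABC",)).fetchall()
print(len(prices))  # 10
```

Scaling the toy slice up toward the real row counts and timing the batch access pattern would give a first estimate of whether the normalized design can meet the throughput requirement.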

SK

Jesse wrote:

>Steve,
>
>I'm in agreement with a normalized database being a good fit for the
>problem, in theory. I agree with everything you've said, in fact. My
>problem with accepting an RDBMS is the overhead of normalization on any
>implementation that I've ever seen.
>
>The "aggressive timeframe" in which the main application must run is about
>three hours, but I always like to use as little of this as possible. This
>is an inflexible process flow constraint, and so whatever we do has to
>perform. Making all 13 million accesses in three hours means about 1200
>reads per second. If the database is normalized, we're now looking at 1200
>* 10,000, or 12,000,000 rows per second. Can this be done in 2003 in any
>RDBMS? I don't know.
>
>My solution, OTOH, is up and running (as of this afternoon) at over 2000
>records per second on my P3-500 with 640megs ram, however on a reasonably
>small subset of the data (but still 10K prices per record). On a big
>multiprocessor machine with ample main memory, it should scream. (If not
>then I'll have to eat my words and my wasted time)
>
>But back to the main point, 12 million records per second? The normalized
>database strategy is both robust and aesthetically pleasing on paper, but if
>it's infeasible, then what? Does this not seem to break the normalized
>camel's back? Or are there (RDBMS) solutions/optimizations/tricks that I
>don't know about?
>
>Jesse
>
>
>"Steve Kass" <skass_at_drew.edu> wrote in message
>news:b3udqh$g2k$1_at_slb5.atl.mindspring.net...
>
>
>>Jesse,
>>
>> It makes a little more sense. But when you say that price 4249
>>is not 4249th in sequence, presumably 4249 is still an important
>>distinguishing factor of price4249, since when you evaluate a
>>portfolio using price4249 for ABC, you will be using price4249
>>for DEF, GHI etc. So 4249 has no numerical meaning, but it is
>>still data (and not metadata).
>>
>>In order to be able to associate the ABC price of 54.223
>>with the JKL price of 13.781 (and use them together
>>when evaluating), you need to answer the question "which
>>price for JKL?". It's not enough to say "the one in the same
>>column as the ABC price of 54.223," since in the relational
>>model, columns contain metadata, not data, and questions must
>>be answerable using data. (The question can contain metadata,
>>of course, as in "what _price_, what _age_, what _date")
>>
>>So the number 4249 specifies the particular "simulation instance"
>>a price belongs to. This is why I want it stored in the table, not
>>misplaced in a column name.
>>
>>I don't recommend it for efficiency, but using GUIDs to answer
>>"which" by specifying a price list instance would make it harder
>>to misunderstand the meaning of that information. This is always
>>a risk when one uses 1, 2, 3 to label things that don't have
>>respective 1-ness, 2-ness, 3-ness. But integers are handy, and
>>safe here when you realize this.
>>
>> In any case, I do see that you have more data than I originally
>>thought. It looks to me like you are still doing nothing more than
>>matrix multiplication (your statistical stuff aside). If you need to
>>serve ad hoc queries as well as this matrix multiplication, I would
>>still recommend trying a normalized table with the appropriate
>>clustered index, presumably (Instrument, simulation_instance).
>>Whether another solution is faster, I don't know. If you need
>>both query-like information and computational information from
>>the same data, it's not a surprise that no one solution works, and
>>it may be that the solution you've developed is a good one when
>>both are needed.
>>
>>Again, though, an RDBMS has a great many features beyond
>>query processing, and if those, such as atomicity, integrity, etc.
>>are important, be careful if you think you can provide them with
>>something that wasn't invented yesterday.
>>
>>Steve
>>
>>Jesse wrote:
>>
>>
>>
>>>Steve,
>>>
>>>I'll try to explain the data and its use-cases a little better.
>>>
>>>200K instruments, 10,000 simulated prices each, but let me do this with 5
>>>instruments and 10 simulated prices:
>>>
>>>
>>>Instrument: price1, price2, ..., price10
>>>
>>>ABC: 50.234, 54.223, ..., 49.841
>>>DEF: 23.401, 22.198, ..., 28.598
>>>GHI: 34.422, 34.199, ..., 34.841
>>>JKL: 15.208, 13.781, ..., 14.234
>>>MNO: 86.942, 81.295, ..., 78.384
>>>
>>>Now we have some portfolios, with positions:
>>>
>>>Portfolio #
>>>(Instrument Quantity)
>>>(Instrument Quantity)
>>>. . .
>>>
>>>
>>>Portfolio #1:
>>>ABC 150
>>>JKL 200
>>>
>>>#2:
>>>GHI 150
>>>JKL 1000
>>>MNO 5000
>>>
>>>#3:
>>>ABC 8000
>>>GHI 1000
>>>JKL 4000
>>>
>>>Etc.
>>>
>>>For each portfolio, we grab each instrument with all 10 simulated prices,
>>>and then do some statistical stuff. (This involves all 10 prices for each
>>>instrument included)
>>>
>>>But in my case, I've got ~200,000 instruments, not 5, and 10,000 simulated
>>>prices, not 10. Several hundred portfolios with anywhere from 1 or 2, up to
>>>over a hundred thousand positions each. (Some of these portfolios are
>>>large firms like Merrill Lynch, Goldman Sachs, etc.) A total of around
>>>13,000,000 positions, when summed over all portfolios.
>>>
>>>In a given day, about 150K of the 200K instruments will end up being
>>>accessed, and all 10,000 prices are needed for each position. (Which
>>>instruments will be needed won't be known in advance, so we calculate for
>>>all instruments)
>>>
>>>The data storage should be able to serve ad-hoc queries, as well as daily
>>>batch jobs. But when you access an instrument, the 4249th simulated price
>>>has no useful meaning out of the context of the other 9999 prices. SQL on a
>>>completely normalized database: "SELECT Price FROM TheoreticalPrices WHERE
>>>InstrumentID = ? AND Sequence = 4249". So any ad-hoc queries will be
>>>accessing an instrument with ALL prices, not just one or a few. (I could
>>>explain the why of this, but that would take a while--our quants explained
>>>it to me and it makes perfect sense.)
>>>
>>>Does this make it a little more clear?
>>>
>>>
>>>"Steve Kass" <skass_at_drew.edu> wrote in message
>>>news:b3n5ur$969$1_at_slb3.atl.mindspring.net...
>>>
>>>
>>>
>>>
>>>>Jesse,
>>>>
>>>> If I understand this right, while you have a large volume of data,
>>>>you need relatively little I/O and arithmetic. You don't say whether
>>>>you check every simulated move for each portfolio, but even if you
>>>>don't, 13M positions processed, even assuming no overlap, means you
>>>>access less than 1% of the data.
>>>>
>>>> What am I missing or what haven't you mentioned? - this doesn't
>>>>sound hard for a normalized table that's keyed on InstrumentID
>>>>and ScenarioSequence. It sounds like there's a good chance it
>>>>will parallelize well, too, since the simulated moves sound independent
>>>>of one another.
>>>>
>>>> It sounds like you are basically doing some (large) matrix
>>>>multiplication, and it might be faster in a traditional array-type
>>>>data structure, but I'm guessing there are other issues - updates,
>>>>the frequency with which the scenarios are re-enacted, etc.
>>>>
>>>>SK
>>>>
Received on Mon Mar 03 2003 - 01:31:41 CST

