Re: When and when not to put information in a database.

From: Steve Kass <skass_at_drew.edu>
Date: Sun, 02 Mar 2003 21:15:41 -0500
Message-ID: <b3udqh$g2k$1_at_slb5.atl.mindspring.net>


Jesse,

  It makes a little more sense. But when you say that price4249 is not the 4249th in a sequence, 4249 is presumably still an important distinguishing attribute of price4249, since when you evaluate a portfolio using price4249 for ABC, you will also be using price4249 for DEF, GHI, etc. So 4249 has no numerical meaning, but it is still data (and not metadata).

In order to associate the ABC price of 54.223 with the JKL price of 13.781 (and use them together in an evaluation), you need to answer the question "which price for JKL?". It's not enough to say "the one in the same column as the ABC price of 54.223," because in the relational model columns contain metadata, not data, and questions must be answerable using data. (The question itself can contain metadata, of course, as in "which _price_, which _age_, which _date_.")

So the number 4249 specifies the particular "simulation instance" a price belongs to. This is why I want it stored in the table, not misplaced in a column name.
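As a minimal sketch of the normalized shape I mean (table and column names are my own invention, not your schema; sqlite stands in for whatever RDBMS you use), with the simulation instance stored as data:

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TheoreticalPrices (
        InstrumentID       TEXT    NOT NULL,
        SimulationInstance INTEGER NOT NULL,  -- the answer to "which price?", as data
        Price              REAL    NOT NULL,
        PRIMARY KEY (InstrumentID, SimulationInstance)
    )""")
conn.executemany(
    "INSERT INTO TheoreticalPrices VALUES (?, ?, ?)",
    [("ABC", 1, 50.234), ("JKL", 1, 15.208),
     ("ABC", 2, 54.223), ("JKL", 2, 13.781)])

# "Which JKL price goes with the ABC price of 54.223?" is now answerable
# with data: both rows share the same SimulationInstance.
rows = conn.execute("""
    SELECT a.Price, j.Price
    FROM TheoreticalPrices a
    JOIN TheoreticalPrices j
      ON j.SimulationInstance = a.SimulationInstance
    WHERE a.InstrumentID = 'ABC'
      AND j.InstrumentID = 'JKL'
      AND a.SimulationInstance = 2""").fetchall()
print(rows)  # [(54.223, 13.781)]
```

Had the instance lived only in a column name (price4249), that join could not be written at all.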

I don't recommend it for efficiency reasons, but using GUIDs to answer "which" (by labeling each price-list instance) would make it harder to misunderstand the meaning of that information. There is always a risk when one uses 1, 2, 3 to label things that don't have respective 1-ness, 2-ness, 3-ness. But integers are handy, and safe here so long as you keep this in mind.
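Purely to illustrate the labeling idea (the names here are mine, not from your system):

```python
import uuid

# Label each simulation instance with an opaque identifier rather than
# an integer, so nobody mistakes the label for a sequence position.
instances = [uuid.uuid4() for _ in range(3)]

# A price is then keyed by (instrument, instance-label); the labels
# carry no order, which is exactly the point.
prices = {("ABC", instances[0]): 50.234,
          ("JKL", instances[0]): 15.208}

# All labels are distinct, and sorting them would be meaningless.
assert len(set(instances)) == 3
```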

  In any case, I do see that you have more data than I originally thought. It still looks to me as though you are doing nothing more than matrix multiplication (your statistical stuff aside). If you need to serve ad hoc queries as well as this matrix multiplication, I would still recommend trying a normalized table with the appropriate clustered index, presumably (Instrument, SimulationInstance). Whether another solution is faster, I don't know. If you need both query-like information and computational information from the same data, it's no surprise that no single solution wins, and it may be that the solution you've developed is a good one when both are needed.
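To make the matrix-multiplication point concrete, here is a sketch using the small example from your message (the prices and portfolio are from your tables; the code itself is only illustrative, with three of the ten instances shown):

```python
import numpy as np

# Rows: instruments; columns: simulation instances.
instruments = ["ABC", "DEF", "GHI", "JKL", "MNO"]
prices = np.array([
    [50.234, 54.223, 49.841],   # ABC
    [23.401, 22.198, 28.598],   # DEF
    [34.422, 34.199, 34.841],   # GHI
    [15.208, 13.781, 14.234],   # JKL
    [86.942, 81.295, 78.384],   # MNO
])

# Portfolio #1 (150 ABC, 200 JKL) expressed as a quantity vector.
qty = np.array([150.0, 0.0, 0.0, 200.0, 0.0])

# One matrix-vector product values the portfolio under every simulation
# instance at once; the statistical work then runs over this vector.
values = qty @ prices
print(values)  # values[0] == 150*50.234 + 200*15.208
```

With 200K instruments and 10K instances the matrix is large, but the operation is the same, and it parallelizes well since the instances are independent.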

Again, though, an RDBMS has a great many features beyond query processing, and if those (atomicity, integrity, etc.) are important, be careful about assuming you can provide them with something home-grown; they weren't invented yesterday.

Steve

Jesse wrote:

>Steve,
>
>I'll try to explain the data and its use cases a little better.
>
>200K instruments, 10,000 simulated prices each, but let me do this with 5
>instruments and 10 simulated prices:
>
>
>Instrument: price1, price2, ..., price10
>
>ABC: 50.234, 54.223, ..., 49.841
>DEF: 23.401, 22.198, ..., 28.598
>GHI: 34.422, 34.199, ..., 34.841
>JKL: 15.208, 13.781, ..., 14.234
>MNO: 86.942, 81.295, ..., 78.384
>
>Now we have some portfolios, with positions:
>
>Portfolio #
>(Instrument Quantity)
>(Instrument Quantity)
>. . .
>
>
>Portfolio #1:
>ABC 150
>JKL 200
>
>#2:
>GHI 150
>JKL 1000
>MNO 5000
>
>#3:
>ABC 8000
>GHI 1000
>JKL 4000
>
>Etc.
>
>For each portfolio, we grab each instrument with all 10 simulated prices,
>and then do some statistical stuff. (This involves all 10 prices for each
>instrument included)
>
>But in my case, I've got ~200,000 instruments, not 5, and 10,000 simulated
>prices, not 10. Several hundred portfolios with anywhere from 1 or 2, up to
>over a hundred thousand positions each. (Some of these portfolios are
>large firms like Merrill Lynch, Goldman Sachs, etc.) A total of around
>13,000,000 positions, when summed over all portfolios.
>
>In a given day, about 150K of the 200K instruments will end up being
>accessed, and all 10,000 prices are needed for each position. (Which
>instruments will be needed won't be known in advance, so we calculate for
>all instruments)
>
>The data storage should be able to serve ad-hoc queries, as well as daily
>batch jobs. But when you access an instrument, the 4249th simulated price
>has no useful meaning out of the context of the other 9999 prices. SQL on a
>completely normalized database: "SELECT Price FROM TheoreticalPrices WHERE
>InstrumentID = ? and Sequence = 4249" So any ad-hoc queries will be
>accessing an instrument with ALL prices, not just one or few. (I could
>explain the why of this, but that would take a while--our quants explained
>it to me and it makes perfect sense.)
>
>Does this make it a little more clear?
>
>
>"Steve Kass" <skass_at_drew.edu> wrote in message
>news:b3n5ur$969$1_at_slb3.atl.mindspring.net...
>
>
>>Jesse,
>>
>> If I understand this right, while you have a large volume of data,
>>you need relatively little I/O and arithmetic. You don't say whether
>>you check every simulated move for each portfolio, but even if you
>>don't, 13M positions processed, even assuming no overlap, means you
>>access less than 1% of the data.
>>
>> What am I missing or what haven't you mentioned? - this doesn't
>>sound hard for a normalized table that's keyed on InstrumentID
>>and ScenarioSequence. It sounds like there's a good chance it
>>will parallelize well, too, since the simulated moves sound independent
>>of one another.
>>
>> It sounds like you are basically doing some (large) matrix
>>multiplication, and it might be faster in a traditional array-type
>>data structure, but I'm guessing there are other issues - updates,
>>the frequency with which the scenarios are re-enacted, etc.
>>
>>SK
>>
>>
>>
>>
>
>
>
>
Received on Mon Mar 03 2003 - 03:15:41 CET
