Re: When and when not to put information in a database.

From: Jesse <me_at_here.com>
Date: Sun, 2 Mar 2003 17:34:48 -0600
Message-ID: <b3u4a8$mr8$1_at_bob.news.rcn.net>


Steve,

I'll try to explain the data and its use cases a little better.

200K instruments, 10,000 simulated prices each, but let me do this with 5 instruments and 10 simulated prices:

Instrument: price1, price2, ..., price10

ABC:  50.234, 54.223, ..., 49.841
DEF:  23.401, 22.198, ..., 28.598
GHI:  34.422, 34.199, ..., 34.841
JKL:  15.208, 13.781, ..., 14.234
MNO:  86.942, 81.295, ..., 78.384

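If I normalized this completely, it would be one row per instrument per simulated price, something like the sketch below (the column names match the query further down; the data types are just approximate, not anything settled):

  CREATE TABLE TheoreticalPrices (
      InstrumentID varchar(12)   NOT NULL,  -- 'ABC', 'DEF', ...
      Sequence     int           NOT NULL,  -- 1..10 here, 1..10,000 for real
      Price        decimal(18,6) NOT NULL,  -- one simulated price
      PRIMARY KEY (InstrumentID, Sequence)
  )

At full scale that's 200,000 x 10,000 = 2,000,000,000 rows.
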
Now we have some portfolios, with positions:

Portfolio #
(Instrument Quantity)
(Instrument Quantity)
. . .

Portfolio #1:
ABC 150
JKL 200

Portfolio #2:
GHI 150
JKL 1000
MNO 5000

Portfolio #3:
ABC 8000
GHI 1000
JKL 4000

Etc.
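
The positions would be another narrow table, again only a sketch (table and column names here are just for illustration):

  CREATE TABLE Positions (
      PortfolioID  int           NOT NULL,
      InstrumentID varchar(12)   NOT NULL,
      Quantity     decimal(18,2) NOT NULL,
      PRIMARY KEY (PortfolioID, InstrumentID)
  )

Summed over all portfolios, that table ends up with the ~13,000,000 rows I mention below.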

For each portfolio, we grab each instrument it holds with all 10 simulated prices, and then do some statistical stuff. (The statistics involve all 10 prices for every instrument included.)
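
To picture the access pattern in SQL terms, a per-scenario revaluation of one portfolio against the tables sketched above would be roughly

  SELECT tp.Sequence,
         SUM(p.Quantity * tp.Price) AS PortfolioValue
  FROM Positions p
  JOIN TheoreticalPrices tp ON tp.InstrumentID = p.InstrumentID
  WHERE p.PortfolioID = ?
  GROUP BY tp.Sequence

which produces 10 values here (10,000 in real life) per portfolio. The real statistical stuff is more involved than a simple sum, but the data it touches is exactly this: every price row for every instrument in the portfolio.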

But in my case, I've got ~200,000 instruments, not 5, and 10,000 simulated prices, not 10. There are several hundred portfolios, with anywhere from 1 or 2 positions up to over a hundred thousand positions each. (Some of these portfolios are large firms like Merrill Lynch, Goldman Sachs, etc.) That's a total of around 13,000,000 positions summed over all portfolios.

In a given day, about 150K of the 200K instruments will end up being accessed, and all 10,000 prices are needed for each position. (Which instruments will be needed isn't known in advance, so we calculate for all of them.)

The data storage should be able to serve ad-hoc queries as well as daily batch jobs. But when you access an instrument, the 4249th simulated price has no useful meaning out of the context of the other 9,999 prices, so on a completely normalized database a single-price lookup like

  SELECT Price FROM TheoreticalPrices WHERE InstrumentID = ? AND Sequence = 4249

is never useful by itself. Any ad-hoc query will be accessing an instrument with ALL of its prices, not just one or a few. (I could explain the why of this, but that would take a while--our quants explained it to me and it makes perfect sense.)
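
In other words, the query that actually gets run is more like

  SELECT Sequence, Price
  FROM TheoreticalPrices
  WHERE InstrumentID = ?
  ORDER BY Sequence

which brings back all 10,000 rows for a single instrument, every time.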

Does this make it a little more clear?

"Steve Kass" <skass_at_drew.edu> wrote in message news:b3n5ur$969$1_at_slb3.atl.mindspring.net...
> Jesse,
>
> If I understand this right, while you have a large volume of data,
> you need relatively little I/O and arithmetic. You don't say whether
> you check every simulated move for each portfolio, but even if you
> don't, 13M positions processed, even assuming no overlap, means you
> access less than 1% of the data.
>
> What am I missing or what haven't you mentioned? - this doesn't
> sound hard for a normalized table that's keyed on InstrumentID
> and ScenarioSequence. It sounds like there's a good chance it
> will parallelize well, too, since the simulated moves sound independent
> of one another.
>
> It sounds like you are basically doing some (large) matrix
> multiplication, and it might be faster in a traditional array-type
> data structure, but I'm guessing there are other issues - updates,
> the frequency with which the scenarios are re-enacted, etc.
>
> SK
>
>