Re: schema help

From: David BL <davidbl_at_iinet.net.au>
Date: Tue, 25 Dec 2007 20:10:38 -0800 (PST)
Message-ID: <259688da-b70f-4595-9610-15ae0458047c_at_s8g2000prg.googlegroups.com>


On Dec 26, 7:40 am, Rob <rmpsf..._at_gmail.com> wrote:
> On Dec 25, 12:37 pm, magawake <magaw..._at_gmail.com> wrote:
> > Hello:
> > I am trying to learn SQL and RDBMS theories. To achieve this, I
> > decided to assign myself a project.
>
> > My project is basically, account for my NFS mounts, and keep track of
> > them. I want to see which filesystems are growing, and which
> > filesystems are staying the same. I will get an inventory daily. I
> > will grab the data and place it into a database, using PERL or AWK.
>
> [some illustrative data]
>
>
>
> > So my intention is to do stats on these filesystems, and see
> > when it grows and when it stays the same. I will get these stats
> > daily, and hopefully in a month, I can see what is growing, shrinking,
> > and staying the same.
>
> > Any ideas on the schema? do I really need a RDBMS for this?
>
> > TIA
>
> A fascinating question I've long pondered. Since I am interested in
> the general question, I'll take the liberty to restate it, accurately
> I hope:
>
> Since a relational database represents the model state of a known
> universe, is it possible to ask/answer the question: How has the
> database changed over time?
>
> Snodgrass advocates adding begin/end time attributes, for the data
> entry window and the data validity window (with representations for
> minus- and plus infinity). But am I really modeling change-over-time
> or just adding timeframes to snapshots? (Like using a timed exposure
> to take a picture with a camera when there's too little light.)
>
> Would it not be better to utilize the OP's data capture approach by
> periodically evaluating a query, saving the query response externally
> and computing "deltas" between query responses (extensions) over
> time?? (This is fundamentally the data warehouse approach in which a
> relational database is used to capture periodic snapshots.)
>
> What do other people think about this? Is it better to implement a
> temporal database or to capture snapshots of a non-temporal database
> (externally or in a data warehouse) and compute the changes?

Interesting how this relates to a discussion in another thread...

Let S' = S + d denote the application of delta d to state S to yield a new state S'. Often S+d is much easier to calculate than d = S' - S. In fact the latter is not always uniquely defined.
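
To make the asymmetry concrete, here is a minimal sketch (my own illustration, not anything from the thread) with a relation modelled as a Python set of tuples and a delta as an (inserts, deletes) pair. Applying S + d is a cheap set operation; recovering d = S' - S by diffing is not unique in general, since e.g. an update can be decomposed into delete+insert in more than one way.

```python
def apply_delta(state, delta):
    """S' = S + d: apply one delta to a relation (a set of tuples)."""
    inserts, deletes = delta
    return (state - deletes) | inserts

# Filesystem-usage example in the spirit of the OP's project:
S = {("fs1", 100), ("fs2", 200)}
d = ({("fs1", 120)}, {("fs1", 100)})   # fs1 grew from 100 to 120
S2 = apply_delta(S, d)
# S2 == {("fs1", 120), ("fs2", 200)}
```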

Consider that there are two separate databases. The State-DB records the current state S without any regard for history. The History-DB only records the history as a sequence of deltas [d1,d2,...,dn] but is not directly concerned with calculating the current state.

There is a concept of applying outstanding deltas to S to bring it up to date:

    S' = S + [dm,...,dn].

In that sense the State-DB is just a read-only view, and only the History-DB is regarded as the primary source of the data to which updates are applied with transactions. Furthermore these updates tend to only add extra tuples to relvars rather than removing or modifying existing tuples. Therefore in theory the History-DB can provide excellent ingestion rates and be optimised for writing new deltas as sequential data to disk. Note that the information (as deltas) in the History-DB is totally ordered.
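
The two-database split can be sketched in a few lines (a toy illustration under my own modelling assumptions): the History-DB is an append-only, totally ordered list of deltas, and the State-DB is simply a left fold of that list.

```python
from functools import reduce

def apply_delta(state, delta):
    """S + d for a relation modelled as a set of tuples."""
    inserts, deletes = delta
    return (state - deletes) | inserts

# History-DB: an append-only, totally ordered sequence of deltas.
history = [
    ({("fs1", 100)}, set()),            # d1: fs1 appears at 100 MB
    ({("fs2", 200)}, set()),            # d2: fs2 appears at 200 MB
    ({("fs1", 120)}, {("fs1", 100)}),   # d3: fs1 grows to 120 MB
]

# State-DB: a read-only view, computed by folding the history.
current = reduce(apply_delta, history, set())
# current == {("fs1", 120), ("fs2", 200)}
```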

IMO bringing the State-DB up to date is best performed asynchronously. The State-DB can persist a sequence number to ensure it applies each delta in the total order exactly once. A thread can access the State-DB and, by reading the sequence number, easily determine the subset of the History-DB that is associated with the current state of the State-DB. Furthermore, in theory with pure additive changes to the History-DB it would be possible to allow read-only threads to access the History-DB concurrently with its mutative transactions. This is kind of like MVCC for free. See

    http://en.wikipedia.org/wiki/Multiversion_concurrency_control
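
The exactly-once catch-up described above can be sketched as follows (a hypothetical StateDB class of my own; the name and shape are illustrative). The persisted sequence number makes the catch-up idempotent: re-running it, even over an overlapping slice of the history, never applies a delta twice.

```python
class StateDB:
    """Toy State-DB that persists a sequence number so each delta in
    the total order is applied exactly once, even if the catch-up
    routine is re-run (e.g. after a crash)."""

    def __init__(self):
        self.state = set()
        self.seq = 0        # count of deltas already applied (persisted)

    def catch_up(self, history):
        """Asynchronously apply the outstanding deltas history[seq:]."""
        for inserts, deletes in history[self.seq:]:
            self.state = (self.state - deletes) | inserts
            self.seq += 1   # in practice, persisted atomically with state

db = StateDB()
db.catch_up([({("fs1", 100)}, set())])
db.catch_up([({("fs1", 100)}, set()),          # already applied; skipped
             ({("fs1", 120)}, {("fs1", 100)})])
# db.state == {("fs1", 120)}, db.seq == 2
```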

Often deltas can be regarded as events occurring at some point in space and time. For example, a marriage event causes the state to change from unmarried to married. A birth event causes a person to be added to a family tree database.

Clearly such events are to be recorded but shouldn't require subsequent modification. I tend to think of a History-DB as an *events* database, which often represents the proper way to record the primary source of information (as distinct from the current state, which is going to depend a lot more on one's conceptual model and the application). I would go as far as saying that the events DB should be associated with the base relvars, and the DBMS should support the asynchronous calculation of the current state as a read-only view.
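
Using the family-tree example above, an events database might look like this (event names and the shape of the derived state are my own illustrative assumptions): events are recorded once and never modified, and the current state is a derived, read-only view computed by folding over them.

```python
# Events are appended in total order and never modified afterwards.
events = [
    ("birth",    {"person": "alice"}),
    ("birth",    {"person": "bob"}),
    ("marriage", {"a": "alice", "b": "bob"}),
]

def current_state(events):
    """Derive the current state as a read-only view over the events."""
    people, marriages = set(), set()
    for kind, data in events:
        if kind == "birth":
            people.add(data["person"])
        elif kind == "marriage":
            marriages.add(frozenset((data["a"], data["b"])))
    return {"people": people, "marriages": marriages}

view = current_state(events)
# view["people"] == {"alice", "bob"}
```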

I imagine that only an events DB has some realistic chance of making the world's data look like a single information source, because it has less application bias and completely avoids the quandaries of distributed transactions.

For certain applications I anticipate that an OO physical representation could be desirable for caching specialised uses (navigations) of the data in a derived State-DB. However there is little doubt that the RM is best for general use.

Note that a History-DB can easily and efficiently make its deltas available to any number of asynchronous State-DBs without blocking its mutative threads (or upsetting the History-DB's ingestion rate).

When scaling to distributed data sources (i.e. a State-DB works against multiple History-DBs), vector times are relevant. See

    http://en.wikipedia.org/wiki/Vector_clocks

Received on Wed Dec 26 2007 - 05:10:38 CET
