Re: TRM - Morbidity has set in, or not?

From: Bob Badour <bbadour_at_pei.sympatico.ca>
Date: Fri, 19 May 2006 03:00:07 GMT
Message-ID: <Xwabg.9678$A26.240438_at_ursa-nb00s0.nbnet.nb.ca>


Keith H Duggar wrote:

> Bob Badour wrote:
>

>>I think now is a good time to revisit Keith's statement
>>about simulation. One should first note that there are
>>different types of simulation.
>>
>>Simula was invented for the sort of simulation that builds
>>large unpredictable state machines by combining lots of
>>small predictable state machines arranged in complex
>>patterns.

>
> [snip interesting circuit simulation example]
>
> (forgive my ignorance, what language was your example
> written in? And would I need a particular DBMS to run a
> simulation coded in this way? I'm trying to work through the
> example; it's just a little new to me.)

Basically, it is Tutorial D. I didn't have a copy of The Third Manifesto handy, so I used the grammar from an extended Tutorial D I found online at http://dbappbuilder.sourceforge.net/TutorialDGrammar.html

> The type of simulation I had in mind was one having a large
> (though fairly simple) state with a complex unpredictable
> (stochastic) transition function. For example, a few years
> ago I implemented (in C++) a bovine spongiform encephalopathy
> (BSE), or mad cow disease, simulator as part of a team working
> for the USDA. The state of the simulation is simply a
> population (up to a hundred million or so) of bovines. Some
> of these bovines may be infected with BSE. Those that are
> infected possess additional attributes. For example, here is
> my attempt at describing this with two relations:
>
>
> bovines
>
> purpose : BEEF DAIRY BREED
> gender : MALE FEMALE
> birth : date
>
> sick bovines
>
> purpose : BEEF DAIRY BREED
> gender : MALE FEMALE
> birth : date
> infected : date -- when infected
> clinical : date -- when symptoms will appear
> contagious : date -- when maternal transfer is possible
> termination : date -- when BSE will kill this animal
>
>
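For concreteness, those two relations might be declared in Tutorial D
roughly as follows. This is only a sketch: the attribute types, the
surrogate id, and the assumption of a user-defined, ordered DATE scalar
type are my guesses and additions, not part of your description.

     VAR bovines REAL RELATION {
         id INTEGER,            // surrogate key, my addition
         purpose CHAR,          // 'BEEF', 'DAIRY' or 'BREED'
         gender CHAR,           // 'MALE' or 'FEMALE'
         birth DATE
     } KEY { id };

     VAR sick_bovines REAL RELATION {
         id INTEGER,
         purpose CHAR,
         gender CHAR,
         birth DATE,
         infected DATE,         // when infected
         clinical DATE,         // when symptoms will appear
         contagious DATE,       // when maternal transfer is possible
         termination DATE       // when BSE will kill this animal
     } KEY { id };
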
> The stochastic transition function is composed of such
> things as natural birth and death, culling and subsequent
> rendering (conversion to animal feed components) or
> slaughter for human consumption, death from BSE and
> subsequent disposition (burial or rendering), feeding of the
> bovines with possible infection from rendered BSE,
> spontaneous infection, etc. These functions are complicated
> and stochastic, though they typically do not vary with time
> (they can, however).
>
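(To make the relational framing concrete: one of those sub-steps, death
from BSE, could be expressed against relvars like the ones sketched
above in a single statement. Here 'today' is a hypothetical scalar
variable holding the current simulated date; a real step would also
record the disposition.)

     // Remove animals whose BSE termination date has arrived.
     DELETE sick_bovines WHERE termination <= today;
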
> As the simulation progresses from state to state, we observe
> and summarize the state as well as track some aspects of the
> transition. For example, we might summarize the number of
> infected cattle, and we might track how much BSE went into
> the human food supply.
>
> Now of course we made choices on how to structure the state
> and parameters, which leads to this point:
>
> Marshall wrote:
>

>>Physical independence comes in when one considers how much
>>work many C++ programmers have to do to, for example, lay
>>out their data in a way that will satisfy a graphics
>>coprocessor, or enable them to use SIMD instructions. It
>>would be better if this was abstracted from the code.

>
> Or even laying out data in a structure convenient for your
> own analysis. A particular network may be very convenient
> for one analysis and a nightmare for another. I'm beginning
> to understand this is the concept of "access paths", correct?
> And that a network model encodes and optimizes a particular
> access path whereas a relational model does not? And thus an
> RM allows many access paths? Efficiently?

Actually, the point you are asking about above is more about expression bias than access paths. Network and hierarchic data models limit both expression and access paths by combining those separate concerns.

The RM is about expression. One wants to express things as close to the level of intent as is practical in any given situation. That was the basis of the title for Date's _What Not How_, for instance.

Because one might want to express anything with some given data, it makes no sense to bias what one can express. Unidirectional pointers bias expression from one end of the pointer toward the other end. Relations and values have no such bias. With the RM, one expresses what one wants and not how to get it.
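To illustrate with the circuit example: if each device held a pointer
to its nodes, asking "which nodes does this device touch?" would be
easy to express, while "which devices touch this node?" would mean
fighting the structure. A binary relation has no such bias. A rough
Tutorial D sketch (the relvar and attribute names are illustrative
only, not the ones from my earlier example):

     VAR wiring REAL RELATION { device_id INTEGER, node_id INTEGER }
         KEY { device_id, node_id };

     // Which nodes does device 7 touch?
     VAR nodes_of_7 VIRTUAL ( ( wiring WHERE device_id = 7 ) { node_id } );

     // Which devices touch node 3? The expression has the same shape;
     // neither direction is privileged by the structure of the data.
     VAR devices_at_3 VIRTUAL ( ( wiring WHERE node_id = 3 ) { device_id } );
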

Because it deals with expression and with intent, the RM deals effectively with the concern for correctness.

Performance is a separate and important concern. The principle of physical independence states that the DBMS should provide as much flexibility as possible for translating what is expressed into how to get it. The DBMS should automate that translation as far as possible to achieve the desired performance characteristics.

For example, in SQL, when one creates an index, one adds a new access path that the DBMS can use. Sometimes the DBMS will use the index, and sometimes it will use another access path instead.

Sadly, with respect to physical independence, the 'barely good enough' has proved the enemy of the truly good.

I know of one individual who performs complex simulations on a farm of OS/2 boxes he keeps networked in his basement and built--I believe--from office discards. He has created everything at the lowest level to achieve performance and because nothing existed above the network transport protocols that met his needs. He has to allocate processes among the boxes, implement message queues, etc. Everything.

The fine folks at Google did something similar with Linux boxes and search.

Ideally, I think he would benefit from a truly relational DBMS that would allow him to distribute the data and the workload across a very heterogeneous network of computers. While some higher-end DBMSes support distribution, I don't think they would support what he needs.

In the simple simulation example I gave previously, the bulk of the simulation is a single compound statement:

WITH ( EXTEND devices ADD (
           state_transitions(
               device,
               nodes JOIN ( connections(device) {node} ),
               t
           ) AS transitions
       ) ) {transitions} AS new_transitions,

     UNGROUP new_transitions (transitions) AS new_events :

     events := events UNION new_events;

In the environment I described above, somehow that statement would have to get broken up and distributed (and possibly redistributed) among the available CPUs. One should have the ability to specify which CPUs have local copies of which devices, nodes and events. One should have the ability to cluster data or pointers by node. One should have the ability to create physical pointers from devices to nodes or their clusters.

Just as SQL has "CREATE INDEX", an industrial D would have to have statements for specifying all of those physical options.
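Purely as an illustration, such statements might look something like the
following. I want to stress that this syntax is entirely hypothetical--no
product or proposal I know of defines it--and the names are made up:

     // Hypothetical physical-design statements for an imagined industrial D.
     DISTRIBUTE devices, nodes, events ACROSS basement_farm BY HASH ( device );
     CLUSTER events BY ( node );
     CREATE POINTER PATH FROM devices ( device ) TO nodes;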

I don't think any product allows all of that, and if one does, it would probably take the fiscal budget of a small country to get licenses for all of the throw-away computers in the basement.

That said, I want to get back to the distinction between access paths and expression bias.

Suppose you need to distribute your bovines across scores of networked computers to achieve adequate performance for your simulation. Afterward, you want to perform some analysis. A truly relational dbms will allow you to express what you want very close to the level of intent, which will make writing the program relatively easy.
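For instance, a question like "how many sick animals of each purpose will
show clinical signs within the next year?" stays a few lines of Tutorial D
no matter how the bovines are physically distributed. A rough sketch,
assuming relvars along the lines sketched earlier, an ordered DATE type, a
scalar variable 'horizon' holding the cutoff date, and a result relvar
'analysis' already declared with attributes { purpose, upcoming_cases }:

     // Count, per purpose, the sick animals that will turn clinical
     // on or before the horizon date.
     analysis := SUMMARIZE ( sick_bovines WHERE clinical <= horizon )
                     PER ( sick_bovines { purpose } )
                     ADD ( COUNT ( ) AS upcoming_cases );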

You might find that having distributed all of your bovines makes the new analytic query run slow. You might not care about that as long as you get your analytic result within a day or two. You might decide you need to create some snapshot or index or physical structure to meet performance requirements for the analysis. In any case, you deal with it.

The problem I perceived in what you expressed above amounts to: How the hell am I going to write that analytic program in the first place?!?

Because the concern for performance is mixed with the concern for correctness in network data models, one often finds that--after the performance needs of one requirement are met--it is extremely difficult to express anything else one wants. Even after creating such a program, one will encounter the same performance issues. If you have to change the access paths for the other analysis, you are pretty much screwed.

> I realize that those questions are somewhat flawed because
> matters of implementation, efficiency, etc. are orthogonal
> (partially? totally?) to the relational data model.

Totally orthogonal, or at least as orthogonal as possible.

> For some
> reason, however, I haven't fully assimilated this concept.
> Probably from years of following the Tao of pointers.

It's true. When one programs in C or C++, one becomes conditioned to think in terms of how and not what.

> Finally, forgive my repeating this question (from a thread I
> tried to move this discussion to); as I say, I'm simply VERY
> curious and would love some input.
>
> Marshall wrote:
>

>>To achieve the big wins, though, we need a programming
>>language that uses the RM at its core, and that has
>>support for physical independence. I am afraid that at
>>this time this is just a wish.

>
> I'm glad you brought this up because I'm VERY curious about
> this. Is it so that such a language is still just a dream?

The language(s) exist(s). It's the physical independence part that is missing.

> What about APL, Joy, K, and Prolog for example? What are
> their good and bad points from a relational support
> perspective?

They are all first and foremost programming languages, which makes them orthogonal to the RM.

> Is there an RM programming language on the
> horizon or under development?

In a sense, "RM programming language" is an oxymoron. Tutorial D, though, is a sample language designed by Date and Darwen that tightly couples truly relational data management with a Turing-complete programming language.
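To give a flavour of that coupling, here is a tiny sketch with ordinary
scalar variables and control flow sitting right next to relational
assignment. The relvar and variable names (sick_bovines, new_cases,
today) are illustrative only:

     VAR infected_today INTEGER INIT ( 0 );

     infected_today := COUNT ( sick_bovines WHERE infected = today );

     IF infected_today > 0 THEN
         // relational assignment inside ordinary control flow
         new_cases := new_cases UNION ( sick_bovines WHERE infected = today );
     END IF;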
