Re: Relational Model and Search Engines?

From: Nick Landsberg <hukolau_at_NOSPAM.att.net>
Date: Mon, 03 May 2004 21:24:15 GMT
Message-ID: <3Mylc.15583$Ut1.466323_at_bgtnsc05-news.ops.worldnet.att.net>


Anne & Lynn Wheeler wrote:
> Nick Landsberg <hukolau_at_NOSPAM.att.net> writes:
>

>>For a 30 GB database, this is still on the order
>>of 10+ minutes to initialize even if there were
>>no journal files to read (graceful shutdown as
>>opposed to crash.)  This is limited by the throughput
>>of the array because, even with an array, there
>>is a physical limit on how fast you can get the
>>data off the disks. (1-2 ms. clock-time per logical disk
>>read - measured on a large array).

>
>
> lets say i initialize from some checkpointed image ... then recovery
> is replay of the journal from the checkpointed image to the current
> entry. the size of the journal is proportional to the update rate
> since the last checkpoint (and could be independent of database size;
> for some databases it could be trivial over an extended period of time).

Hey! Someone who speaks my language :)

>
> lets say we build a stripe array that, when read sequentially,
> saturates a 30mbyte/sec i/o interface. 20+ years ago, i demonstrated
> sequential recovery sequences that would do single-disk recovery
> reading 15 tracks in 15 revolutions (effectively achieving very near
> the disk media transfer rate of 3mbyte/sec; in this situation the i/o
> bus rate and the disk transfer rate were the same) ... and be able to
> do multiple in parallel on different channels. so say a single striped
> array at 30mbytes/sec recovers 30gbytes in approx. 1000 seconds or 17
> minutes. spread across two such i/o interfaces would cut it to 8.5
> minutes, and spread across four such i/o interfaces cuts it to a
> little over four minutes.

I would have liked this, but we were constrained by the DBMS vendor to use a standard "file" in a file system for each checkpoint image. Our actual measured time for a RAID 5 on a 30 GB file was about 10.5 minutes. (And I was *VERY* impressed with that performance. The RAID was tuned for sequential access rather than random access, and the checkpoints are guaranteed to write the file sequentially, since two empty checkpoint files are created when the database is first set up.)
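
(Just to sanity-check that against the arithmetic in the quoted
paragraph, here's a little back-of-envelope sketch in Python. The 30
MB/s-per-interface figure and the parallel-interface scaling come
straight from the quote; the helper name and everything else are
mine, for illustration only, not measured on our system.)

    GB = 1024 ** 3
    MB = 1024 ** 2

    def recovery_minutes(image_bytes, mb_per_sec, interfaces=1):
        """Sequential read time, assuming the striped array saturates
        each I/O interface and the interfaces are read in parallel."""
        seconds = image_bytes / float(mb_per_sec * MB * interfaces)
        return seconds / 60.0

    print(recovery_minutes(30 * GB, 30))      # ~17 min on one interface
    print(recovery_minutes(30 * GB, 30, 2))   # ~8.5 min on two
    print(recovery_minutes(30 * GB, 30, 4))   # ~4.3 min on four

Our measured 10.5 minutes on the RAID 5 sits comfortably inside that
range.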

In previous projects where we "rolled our own DBMS", we spread data over multiple spindles (or "arms", as you put it) and had parallel reads going on at startup. It worked very nicely, thank you!

>
> The problem now is that you're getting into operating system restart
> times. Given that you are doing a 30gbyte image recovery ... it is too
> bad that you can't go a little further and do something like the
> laptop suspend operation that writes all of memory to a protected disk
> location for instant restart. With a trivial amount more I/O,
> checkpoint all of physical memory for "instant on" system recovery.

That would have been nice too, but we're constrained to commercially available HW. Total RAM is 96 GB. I, personally, don't know how to flush all of that to disk in the event of a power failure.
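
(For scale, the same sort of arithmetic applied to dumping 96 GB of
RAM. This is a sketch only, using the 30 MB/s array figure from the
quote and a hypothetical four-way spread, and it ignores the real
problem of keeping the box powered long enough to do the write.)

    GB = 1024 ** 3
    MB = 1024 ** 2

    def flush_minutes(ram_bytes, mb_per_sec):
        """Time to stream a memory image to disk sequentially."""
        return ram_bytes / float(mb_per_sec * MB) / 60.0

    print(flush_minutes(96 * GB, 30))    # ~55 min to one 30 MB/s array
    print(flush_minutes(96 * GB, 120))   # ~14 min spread across four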

Operating system restart time was measured at about 2 minutes, and application restart was constrained to 3 minutes *after* the DBMS came up, for a total of 15 minutes following a graceful DBMS shutdown. (More about the times later.)
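
(Breaking that budget down with the numbers above; the DBMS portion is
implied rather than stated, so treat the split as a sketch.)

    os_restart = 2      # minutes, measured
    dbms_restart = 10   # minutes, roughly the 10.5-minute checkpoint read
    app_restart = 3     # minutes, the constraint after the DBMS is up
    print(os_restart + dbms_restart + app_restart)   # ~15 minutes total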

>
> Backup images can be done like some of the hot disk database backups.
> Since you aren't otherwise doing a lot of disk i/o ... and you
> probably have at least ten times the disk space in order to get enough
> disk arms, then you could checkpoint versions to disk with journal
> cursors for fuzzy image states. Frequency could possibly be dictated
> by the trade-off between the overhead of more frequent checkpoints
> vis-a-vis having to process more journal records ... as well as
> projected MTBF.

Absolutely! With checkpoints every 10 minutes or so and an update rate of about 8,000 TPS, we computed that there would be an additional 5 GB of journal files to process if the failure occurred during the busy hour.
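
(A quick sketch of that journal-growth arithmetic. The 8,000 TPS and
the 10-minute checkpoint interval are the real numbers above; the
roughly 1 KB of journal written per update is my assumption, backed
out so the total matches the ~5 GB figure.)

    def journal_gb(tps, checkpoint_minutes, bytes_per_txn):
        """Journal volume accumulated between two checkpoints."""
        updates = tps * checkpoint_minutes * 60
        return updates * bytes_per_txn / float(1024 ** 3)

    print(journal_gb(8000, 10, 1100))   # ~4.9 GB between checkpoints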

> ... Five-nines availability allows 5 minutes of downtime per
> year. At four minutes recovery ... that says you get one outage per
> year. For really high availability ... you go to replicated
> operations. Recovery of a failed node is then slightly more
> complicated since it has to recover the memory image, the journal and
> then the journal entries done by the other processor.

Agreed again. We're running replicated, and we've computed all of that using simple Markov models. Of course, it's based on assumptions about how many HW, SW and pilot errors will happen over that time. (BTW - commercial HW is only 99.98% available in our case, excluding travel time for delivery of spare parts, so we knew when we started that we had to have a replicated scheme.)
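
(For the curious, here is a minimal sketch of the kind of Markov model
I mean: a two-node replicated pair with a constant per-node failure
rate, a constant repair rate, and one repair at a time. The MTTF/MTTR
figures and the function name are illustrative only, and the model
deliberately leaves out the SW, pilot-error and common-mode terms, so
it will look far better than the five and a half nines we actually
computed.)

    def pair_availability(mttf_hours, mttr_hours):
        """Steady-state availability of a two-node active/standby pair."""
        lam = 1.0 / mttf_hours        # per-node failure rate
        mu = 1.0 / mttr_hours         # repair rate
        # Balance equations for states {2 up, 1 up, 0 up}:
        #   pi2 * 2*lam = pi1 * mu    and    pi1 * lam = pi0 * mu
        pi2 = 1.0
        pi1 = 2.0 * lam / mu * pi2
        pi0 = lam / mu * pi1
        total = pi2 + pi1 + pi0
        return (pi2 + pi1) / total    # up whenever at least one node is up

    # e.g. a node that is ~99.98% available on its own
    # (MTTF ~5000 h, MTTR ~1 h), run as a replicated pair:
    a = pair_availability(5000.0, 1.0)
    print(a)                          # ~0.9999999 (hardware only)
    print((1 - a) * 365 * 24 * 60)    # downtime in minutes per year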

>
> With replicated systems, there is some question of whether you can
> get by with two 30mbyte/sec transfer arrays per system for an
> 8-minute system recovery time ... since the other system would mask
> downtime. Each system would have two 30mbyte/sec transfer array
> configurations rather than a single system having four 30mbyte/sec
> transfer arrays.
>

We didn't get into that level of detail, for the reasons noted above. We knew we had to have a replicated scheme and plugged the numbers into our reliability models. Fortunately, they came out to five and a half nines with our assumptions, so we didn't investigate any deeper than we had to. We did specify a replicated network, though, to try to avoid any issues regarding synchronization after network connectivity had been lost. Even so, that also has to factor into the computations.
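
(Just for reference, the downtime budgets behind those figures; this
is only the standard nines-to-minutes conversion, nothing specific to
our models.)

    def downtime_minutes_per_year(nines):
        """Allowed downtime for an availability of 1 - 10**(-nines)."""
        return 10 ** (-nines) * 365 * 24 * 60

    print(downtime_minutes_per_year(5))     # ~5.3 min/year (five nines)
    print(downtime_minutes_per_year(5.5))   # ~1.7 min/year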

Now, it's time to validate the assumptions from field data (which is beginning to trickle in).

Nick L.

P.S. - I really enjoyed this exchange of views/opinions/data. It's relatively rare to find folks who have an appreciation for the physical realities of reliability engineering, like MTTR and the speed of disk transfers. :)

P.P.S. - In case you couldn't guess, I'm both the "performance engineer" and the "reliability engineer" on the project. Why they chose to split the titles is beyond me, unless it is because I also have to deal with the database schema designers in order to get those 8,000+ TPS out of the system. But that's a different discussion.

-- 
"It is impossible to make anything foolproof
because fools are so ingenious"
  - A. Bloch
