Oracle FAQ Your Portal to the Oracle Knowledge Grid
HOME | ASK QUESTION | ADD INFO | SEARCH | E-MAIL US

Re: RAC versus simplistic load-balancing of replicated databases

From: Kevin Murphy <murphy_at_genome.chop.edu>
Date: 4 Jun 2003 10:42:31 -0700
Message-ID: <7782b51c.0306040942.2ec3c9a8@posting.google.com>


All who replied to my post about RAC,

Thanks very much for your replies, and I apologize for my DB naivete. The gist of the replies seemed to be that RAC is way overkill, that a UNIX 4-way with enough memory to cache the entire database should be more than sufficient, and that my concern about scalability was puzzling. Actually, a UNIX 4-way is what I had already spec'd out in our grant proposal, so I'm not entirely from Mars.

I have one more question, and the rest of this message is optional reading for those of you with way too much time on your hands.

Q: One person referred to a RAM disk. This is interesting to me. When using a RAM disk, would you try to disable Oracle's buffering as much as possible? When would you use a RAM disk as opposed to giving Oracle tons of buffer RAM?
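To make the question concrete, the two alternatives might look something like this (a hypothetical sketch for Linux and Oracle 9i; the parameter values and mount point are illustrative, not recommendations):

```
# Alternative 1: give Oracle the RAM directly via the buffer cache
# (init.ora / spfile; the 8G figure is purely illustrative):
db_cache_size = 8G

# Alternative 2: put the (read-only) datafiles on a Linux tmpfs RAM disk;
# tmpfs contents vanish on reboot, so the files would have to be copied
# back from disk at startup:
#   mount -t tmpfs -o size=10g tmpfs /u02/oradata_ram
```

With alternative 2, Oracle would still buffer the blocks a second time in its own cache unless the cache is kept small, which is presumably the motivation for the question.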

The reason for my original questions was that the grant reviewers seemed skeptical that our proposed architecture would be sufficient for our purposes. Our first submission was rejected, probably not because of the technical architecture, but in any case we'd like to smooth off all the rough corners for the second submission.

Our originally proposed architecture was roughly as follows: database is Oracle 9i, webserver is AOLserver, 99.9% of application logic is in stored procedures, pages are thin templates.

Year 1: Dell PowerEdge 7150 4-way Itanium machine (or equivalent) running Linux, outfitted the first year with memory and CPUs up to a total of $20K or so. (Today a Dell 6650 4-way with two 2GHz Xeon CPUs and 8 GB RAM would be around $20K, maxing out with four CPUs, 32 GB RAM, and 5 internal disks at around $50-60K.) The web server would be a preexisting dual-CPU 1GHz G4 PPC. The two development servers are preexisting dual 1GHz G4 PPCs, which could also function as production servers in an emergency, although we would have to throttle traffic. Other hardware expenditures the first year would be a UPS, tape backup, and a gigabit ethernet switch.

Years 2-5: Depending on how the production load progresses, the plan is to add CPUs and memory to the original 4-way, purchase an additional 4-way for development and failover use, and add load balancers and web servers as required.

The proposal allocates roughly $30K for hardware and software per year for five years, assuming traffic growth over time. We don't believe we could get any more than this, but we could perhaps slide more of the money up front.

Let me know if something seems outrageously wrong with this. It really doesn't to me (except for the possible Oracle license situation). I'm afraid too many people in the bioinformatics community are addicted to large clusters, which are mostly required for computational purposes rather than data retrieval. And even the big bioinformatics groups that are mostly database-focused, such as http://www.ensembl.org/, have large setups -- they run their website off of six 4-CPU 1.25 GHz Alpha ES45s with 16 GB RAM each. They take 3 million requests per week, which doesn't seem like very much for that amount of hardware. I got a price quote on the Ensembl configuration, and it's ~ $100K per machine, not counting the fact that they are clustered. (They use mySQL, by the way; maybe that's the reason why they need so much power ;-)
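Doing the arithmetic on that 3-million-requests-per-week figure (my own back-of-envelope calculation, nothing more):

```python
# Back-of-envelope: average request rate implied by 3 million requests/week.
requests_per_week = 3_000_000
seconds_per_week = 7 * 24 * 3600  # 604,800 seconds in a week
avg_rps = requests_per_week / seconds_per_week
print(round(avg_rps, 1))  # prints 5.0 -- about 5 requests/second on average
```

Even allowing a generous 10:1 peak-to-average ratio, that's on the order of 50 requests/second spread across six big machines.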

There is now no question in my mind that our application can run even on a partially populated 4-way machine ... partly because a similar precursor site already works on a lame database (4D: http://www.4d.com/ -- I don't recommend this product for large-ish databases!) on a wimpy Mac OS X box. As you all have pointed out, the size of the DB and the fact that it is read-only and updated infrequently make this a no-brainer.

I would have hoped that the reviewers would find it persuasive that we already have a site which is probably 2/3 of the way there in terms of technical complexity, if not functionality, running on a single CPU in a crappy database. Our precursor site is up at http://genome.chop.edu/, running on an 867 MHz G4 PPC with 1.5 GB RAM and Mac OS X, the database engine being the aforementioned 4D (hey, we're porting, we're porting!). The webserver for the current site is a separate 2x450MHz G4 PPC running Mac OS X and WebSTAR, although dynamic pages are actually served from the DB machine by a web server built into 4D.

Currently all of the application logic exists as stored procedures in 4D (written in the proprietary 4D language), and when we port to Oracle or another database, we would probably keep the same basic design. So in the case of Oracle we would be porting from 4D to PL/SQL or Java. Our web pages would remain lightweight templates. In contrast, the vogue in much of the academic bioinformatics world seems to be to put all application logic in Perl (sometimes Java) in the web/application server and use a simple database such as mySQL.[1]

Thanks for your previous comments. Any further comments are appreciated but not expected ;-)

Kevin Murphy

[1] The rationale for this is that bioinformaticists tend to know Perl and write a lot of data filters in it. Perl is a beautiful language for manipulating text and getting one-offs done quickly. And with the small, underpaid, and/or work-study programming staff characteristic of academia, there are good reasons to use as few languages as possible.

Received on Wed Jun 04 2003 - 12:42:31 CDT
