Re: Hardware requirements question

From: Andrew Mobbs <andrewm_at_chiark.greenend.org.uk>
Date: 10 Apr 2001 22:34:02 +0100 (BST)
Message-ID: <I9q*s8nTo@news.chiark.greenend.org.uk>

Bastiaan Schaap <bschaap_at_desyde.nl> wrote:
>Hi all,
>
>Currently I am working on a project to build a website that will gather
>statistics about intranet applications. We're trying to determine what kind
>of hardware we'll need. Because I've seen that some of you have experience
>with *huge* databases, I'd like to ask for your help.
>
>Actually my question is very simple: we're estimating that in the database
>for this website, > 10,000,000 records a day will be inserted. After
>insertion of this data no mutations will be performed on this data, only
>calculations and selects. If it's possible we want to keep this data in the
>database as long as possible, with the minimum of a month. This means that
>about 3,000,000,000 records will exist within the database at all times.
>Which hardware/software platform should we choose to install our database
>on?
>
>As with most simple questions, I don't know a simple answer... I am *not* a
>DBA, and I think I have experience with too few platforms to make this
>kind of decision. Performance is a major issue, since we need to calculate
>statistics from this data on-the-fly.
>
>Please share your thoughts with me,

Sorry, the information above isn't anywhere near enough to give a detailed answer, but I'll attempt to explain how to go about sizing.

In general, it's a case of building a model of your development system through measurements, and scaling it to your production system.

First, estimate your I/O requirements.

For the insertions, you say 10,000,000 per day; you need to decide what the peak is over a much shorter time scale. Peak inserts per second is a useful metric. If you're lucky enough to have a fairly constant load, then it's easy. If you have massive peaks in the load, you should pick an acceptable maximum.
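
To make that concrete, the back-of-envelope arithmetic is just this (the
3x peak factor is only an assumption; replace it with what you measure):

  -- average insert rate, and an assumed peak of 3x the average
  select round(10000000/86400)     as avg_inserts_per_sec,
         round(3 * 10000000/86400) as assumed_peak_per_sec
    from dual;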

Writes are relatively easy: run tests on your development system to determine things like the average number of I/O operations required for each insert. This can be scaled up reasonably safely, assuming your development and production systems are configured identically for things like block size.
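
For example, snapshot v$sysstat on the development system before and after
loading a known number of test rows; the deltas divided by the row count
give you I/Os and redo per insert (the statistics are cumulative, so you
need both snapshots):

  select name, value
    from v$sysstat
   where name in ('physical reads', 'physical writes',
                  'redo size', 'user commits');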

Pay attention to the amount of redo you're generating, and to your commit rate. This will scale too, and keeping the redo logs performing well is vital to maintaining a high insert rate. As a side note on application design: for heavy insert activity, _don't_ commit after every row inserted.
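
The same v$sysstat snapshots give you redo per commit, and the wait events
below tell you whether the log writer is keeping up; how much they matter
depends entirely on your load:

  -- roughly how much redo each commit is carrying
  select r.value / greatest(c.value, 1) as redo_bytes_per_commit
    from v$sysstat r, v$sysstat c
   where r.name = 'redo size'
     and c.name = 'user commits';

  -- time spent waiting on commits and on redo log writes
  select event, total_waits, time_waited
    from v$system_event
   where event in ('log file sync', 'log file parallel write');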

For reads, the physical read rate is more difficult to scale up, as it's heavily influenced by the amount of buffer cache you have. If you have enough memory on your development system, you may be able to develop a model relating the average number of I/Os per query to the size of the buffer cache and the number of queries per second. You can then scale this model to your production environment, where you know the number of queries per second you require, and work out where you want to sit on the curve between an in-memory database and everything on disk.
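
A starting point for that model is the fraction of block requests that turn
into physical reads at a given buffer cache size; repeat the measurement at
a few cache sizes with your real queries and you have the curve (this is
only a crude first cut, not a substitute for proper testing):

  select p.value / (d.value + c.value) as cache_miss_ratio
    from v$sysstat p, v$sysstat d, v$sysstat c
   where p.name = 'physical reads'
     and d.name = 'db block gets'
     and c.name = 'consistent gets';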

All this will give you a total number of I/Os per second that you'll need to perform, which in turn indicates how many spindles you'll need in the stripe sets for each tablespace. I really can't state strongly enough that for performance you buy spindles, not gigabytes, even if this means "wasting" some space on the drives. Buy many smaller disks rather than the biggest available ones. This is, of course, a performance specialist's fantasy, and in the real world more disks cost much more, both for the disks themselves and for the infrastructure. However, it's another trade-off you have to make: if you want high-performance I/O, you buy four 18GB disks, not one 72GB disk. High-performance redo logs *really* use up disks at an alarming rate; for example, you might want a 10-disk stripe holding just 20GB of redo log.
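
As a worked example of the spindle arithmetic, assuming (and these numbers
are only assumptions) a sustained 2,000 random I/Os per second from the
model above and roughly 100 random I/Os per second per spindle:

  -- spindles needed for the I/O rate, ignoring capacity entirely
  select ceil(2000 / 100) as spindles_needed from dual;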

Right, since the buffer cache means you've already started sizing your memory requirements, it may be time to think about the other memory consumers: most likely the sort area and the library cache, but you may have others. You'll need enough log buffer, but on a large system this is trivial compared to everything else. If your queries are never going to fit their sorts in memory you'll need to make a compromise, but you may be able to size your sort area to keep all sorts in RAM.
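
The 'sorts (memory)' and 'sorts (disk)' statistics tell you whether your
current sort_area_size is keeping sorts in RAM; if 'sorts (disk)' climbs
while you run your test queries, either raise sort_area_size or accept the
compromise:

  select name, value
    from v$sysstat
   where name in ('sorts (memory)', 'sorts (disk)');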

If nearly all of your SQL uses bind variables, you can get away with quite a small library cache, which will be much the same size on your development system as on the production server. Otherwise you may need to do the measure-and-scale exercise for this too.
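
If you're unsure what I mean by bind variables, the difference is the
literal in the first statement below against the placeholder in the second;
the first needs a new cursor in the library cache for every distinct value,
the second is shared (table and column names are made up):

  -- one cursor per literal value
  select count(*) from app_stats where app_id = 42;

  -- one shared cursor, whatever value the application binds
  select count(*) from app_stats where app_id = :app_id;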

CPU use is the usual scaling exercise. If your development system has different CPUs to the production one, I tend to use SPECint_rate benchmarks to compare them.
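
The scaling itself is just a ratio. For instance, if the development box
sits at 40% CPU serving 50 queries per second, you need 500 per second in
production, and the SPECint_rate figures are (say) 10 for the development
machine and 80 for the candidate server:

  -- projected CPU utilisation on the production box (all inputs assumed)
  select 0.40 * (500/50) * (10/80) as projected_cpu_fraction from dual;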

Storage requirements may well just be satisfied by the I/O requirements. Remember, for performance you buy spindles, not gigabytes. Otherwise, simply find the number of bytes each row takes on average, and multiply by the number of rows.
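
Once a representative table has been loaded and analysed on the development
system, Oracle will tell you the average row length, and the rest is
multiplication (table name is illustrative; indexes and block overhead come
on top of this figure):

  analyze table app_stats estimate statistics sample 5 percent;

  select round(avg_row_len * 3000000000 / 1024 / 1024 / 1024) as approx_gb
    from user_tables
   where table_name = 'APP_STATS';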

Another design hint: you should strongly consider partitioning your 3-billion-row table.
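
Range partitioning by date, roughly as below, keeps each month's inserts
local to a small number of partitions and lets you age out old data by
dropping a partition instead of deleting rows (names and columns are only
illustrative):

  create table app_stats (
    sample_date  date,
    app_id       number,
    hits         number
  )
  partition by range (sample_date)
  ( partition p200104 values less than (to_date('2001-05-01','YYYY-MM-DD')),
    partition p200105 values less than (to_date('2001-06-01','YYYY-MM-DD'))
  );

  alter table app_stats drop partition p200104;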

Wow, that was much longer than I expected. I've put a copy of this on my web site at:
<URL:http://www.chiark.greenend.org.uk/~andrewm/ora/sizing.html>. I may add to it over the next few weeks; I'd welcome any review comments on anything I've missed out or got wrong.

-- 
Andrew Mobbs - http://www.chiark.greenend.org.uk/~andrewm/
Received on Tue Apr 10 2001 - 16:34:02 CDT
