RE: CPU Capacity Planning

From: Cary Millsap <cary.millsap_at_hotsos.com>
Date: Sun, 07 Dec 2003 08:24:25 -0800
Message-ID: <F001.005D91D9.20031207082425@fatcity.com>

My answers are in-line, preceded with “[Cary Millsap]”...

Cary Millsap
Hotsos Enterprises, Ltd.
http://www.hotsos.com

Upcoming events:

- Performance Diagnosis 101: 12/16 Detroit, 1/27 Atlanta
- SQL Optimization 101: 12/8 Dallas, 2/16 Dallas
- Hotsos Symposium 2004: March 7-10 Dallas
- Visit www.hotsos.com for schedule details...

-----Original Message-----
Boris Dali
Sent: Sunday, December 07, 2003 9:54 AM
To: Multiple recipients of list ORACLE-L

Thanks a lot for the reply, Cary. Yes, your explanation makes all the sense in the world even though it is precisely the weighted average approach that I've seen on some capacity planning spreadsheets.

Two additional questions if I may, Cary. Would it be correct to say that when I throw additional users on a system it is only the queueing component of response time that climbs, while service time stays the same?

[Cary Millsap] “Sort of,” but not exactly. There are lots of scalability threats that begin to manifest in reality when you crank up the load. For example, you’ll see “latch free” waiting on applications that parse too much, but only at higher user volumes (never in unit test). You can consider the new appearance of “latch free” events to be a type of queueing if you want, but it’s really not queueing in the sense of a simple CPU queueing model.
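
A minimal sketch of the part of this that a simple CPU queueing model does capture. It assumes an M/M/m model (the Erlang C formula), which fits the m and rho notation used later in this thread but is not spelled out here; the function name and sample numbers are made up for illustration. Service time is held constant and only the queueing delay grows with utilization. Out-of-model effects such as "latch free" waits are not represented, so the model will understate response time once they appear.

```python
from math import factorial


def mmm_response_time(service_time, servers, utilization):
    """Expected response time R = S + Wq in an M/M/m queue (Erlang C).

    service_time: mean service time S per request (seconds)
    servers:      number of CPUs m
    utilization:  per-CPU utilization rho, 0 <= rho < 1
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    offered = servers * utilization  # offered load a = m * rho
    # Erlang C: probability that an arriving request has to queue.
    tail = offered ** servers / factorial(servers) / (1 - utilization)
    head = sum(offered ** k / factorial(k) for k in range(servers))
    p_wait = tail / (head + tail)
    # Mean queueing delay; the service time itself never changes.
    queue_delay = p_wait * service_time / (servers * (1 - utilization))
    return service_time + queue_delay


# Hypothetical 5-second transaction on a 2-CPU box at rising load.
for rho in (0.1, 0.5, 0.7, 0.9):
    r = mmm_response_time(5.0, 2, rho)
    print(f"rho={rho:.1f}  service=5.00s  response={r:.2f}s")
```

In this model the gap between response time and the fixed 5-second service time is pure queueing delay. Anything that shows up only under load and is not CPU queueing would have to be measured, not predicted from these inputs.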

If that's true, then does it matter how I measure the service time of my Bus.Tx1 - on a loaded system where hundreds of users run this operation, or when nobody executes it at all? Also, is it important to have the other two operations - Bus.Tx2 and Bus.Tx3 - running concurrently (as they would in real life) for the c measurements?

[Cary Millsap] You’ll put yourself at risk if you simply try to use a queueing model to extrapolate big-system performance from data collected in a unit testing environment. It’s because of the potentially out-of-model scalability threats.

In other words assuming I have an identical replica of a production environment where I am the only user - would service time/rate measured there be applicable for a loaded system with heterogeneous workload?

[Cary Millsap] ...Only if your production environment doesn’t trigger any new serialization issues that weren’t visible in your unit test env.

And another stupid question.
Knowing individual business tx. characteristics (response time, number of CPUs required to comply with SLA requirements, average utilization per CPU, etc), how does one go about sizing the box in terms of the overall "system" required CPU capacity? Or put it another way - what do I tell a hardware vendor?

That is, if what comes out of a queueing exercise is:

           m       rho
         --------  ---
Bus.Tx1   2-way    70%
Bus.Tx2   3-way    50%
Bus.Tx3   4-way    80%

What should be the optimistic (let's assume perfect linear CPU scalability for now) recommendation to decision makers in terms of the horsepower required to run this "system" on?
After all, yes, individual business transactions have their own SLA requirements (e.g. worst tolerated response time), but they all use the same resources, don't they? So even though the service time of Bus.Tx1 might remain constant, the queueing delay (and hence the response time) would likely increase due to other concurrent activities on the system. Is there a way to account for this if capacity planning is done at the individual bus.tx level?

[Cary Millsap] The hardest part about capacity planning is that there’s no useful industry-wide standard unit of CPU work to use. You can’t use MHz, you can’t use MIPS, and you can’t use SPECints, or anything else like that. But you can use Oracle LIOs. It’s not hard to test a system to see how many LIOs/sec it can handle; this is your supply (capacity). It’s also not hard to see how many LIOs/sec an application needs; this is your demand (workload). With this realization, capacity planning is much simpler. The game is to ensure that supply exceeds demand at all times, and by a sufficient amount so that you don’t have unstable response times.

[Cary Millsap] ...And, of course, as I mentioned previously, you have to keep your peripheral vision open for the possibility that some new scalability threat will manifest and surprise you.
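
To make the supply-versus-demand idea concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the function name, the LIO/sec figures, the choice to add the three transactions' peak demands together, and the 70% utilization ceiling standing in for "a sufficient amount" of headroom. None of these come from a real system.

```python
def lio_headroom(supply_lio_per_sec, demand_lio_per_sec, max_utilization=0.7):
    """Compare measured LIO capacity (supply) with workload LIO demand.

    Returns (utilization, ok), where ok is True when demand stays at or
    below the chosen utilization ceiling. The ceiling is a planning
    assumption, not a rule from the thread.
    """
    if supply_lio_per_sec <= 0:
        raise ValueError("supply must be positive")
    utilization = demand_lio_per_sec / supply_lio_per_sec
    return utilization, utilization <= max_utilization


# Hypothetical measurements: LIOs/sec the box can sustain, and the
# LIOs/sec each business transaction needs at its peak.
supply = 200_000
demand_by_tx = {"Bus.Tx1": 60_000, "Bus.Tx2": 45_000, "Bus.Tx3": 25_000}

# Assumes the three peaks coincide, which is the pessimistic case.
peak_demand = sum(demand_by_tx.values())
utilization, ok = lio_headroom(supply, peak_demand)
print(f"peak demand {peak_demand} of {supply} LIOs/sec "
      f"({utilization:.0%} utilized): {'OK' if ok else 'undersized'}")
```

The check only covers the in-model part of the problem. It says nothing about new serialization issues that might appear at higher load.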

Thanks,
Boris Dali.

Cary Millsap <cary.millsap_at_hotsos.com> wrote:
> Boris,
>
> If you mean that some people on your system execute Bus.Tx1, some
> others execute Bus.Tx2, and some others (maybe with some overlap)
> execute Bus.Tx3, then my answer to your question is:
>
> No, I would strongly encourage you *not* to do this!
>
> It was exercises like this that first led me to discover the fact
> that there's no such thing as a "system" in the sense that most
> people use the term (that is, as a big mishmash of different
> transactions, in which averages have any real meaning).
>
> Combining your three CDFs will hurt you in the way described in "Why
> understanding distribution is important" on pp238-239 of the
> Optimizing Oracle Performance book. Here's another example: Imagine
> the following "system"...
>
>       avg.      avg.
>       runs/day  sec/run  who uses it
> Tx1     10,000        1  Group A
> Tx2      1,000       10  Group B
> Tx3        100      100  Group C
>
> So, what's this "system's" average response time? A naïve
> "mathematician" might think it's the weighted average of all the
> response times: (10000*1 + 1000*10 + 100*100) / (10000+1000+100) =
> 2.7 sec. But what use is this figure? Nobody's response time is
> "ever" really 2.7 sec. <footnote>I say "ever" here because it's of
> course possible that a program whose "avg. sec/run" is 1 (or even 10)
> will occasionally have a true response time of 2.7.</footnote>
>
> If you're *anybody* actually using the system, the number "2.7
> sec/run" is just stupid! The 2.7s figure is especially ludicrous if
> you're a member of Group B or C, because your average response time
> is either really 3.7x that number (B) or 37x that number (C)! The
> mathematical explanation for the stupid-looking-ness is that, no
> matter what you're doing, this 2.7 number is an average influenced by
> stuff that you're *not* doing.
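
The arithmetic in the example above can be checked with a short Python sketch. The variable names are illustrative; the figures are the ones from the table.

```python
# (runs per day, average seconds per run) for each transaction type.
workload = {
    "Tx1": (10_000, 1),
    "Tx2": (1_000, 10),
    "Tx3": (100, 100),
}

total_runs = sum(runs for runs, _ in workload.values())
total_secs = sum(runs * secs for runs, secs in workload.values())

# The "system" average: every run weighted equally, regardless of type.
weighted_avg = total_secs / total_runs

# How far each group's own average is from the blended figure.
ratios = {name: secs / weighted_avg for name, (_, secs) in workload.items()}

print(f"weighted average: {weighted_avg:.1f} sec/run")
for name, ratio in ratios.items():
    print(f"{name}: own average is {ratio:.2f}x the blended figure")
```

Each transaction type contributes the same 10,000 seconds per day to the numerator, so the blended figure is dominated by Tx1's run count and matches no group's actual experience.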
>
> There is no such thing as an "average user" (any more than there's an
> American family with 2.3 children); in this example, there are only
> members of Groups A, B, and C. What if you're a member of two groups
> simultaneously (e.g., you run different transaction types in the same
> day)? It's the same problem, because your expectation of, for
> example, Tx1 response time is completely different from your
> expectation of Tx3 response time. Clumping response times from Tx1
> and Tx2 into one average makes no sense even then. Expecting
> *anything* you do to take 2.7 sec/run is going to leave you
> unfulfilled.
>
> The number 2.7 has no useful meaning here. Certainly, if you have
> some kind of service level agreement (SLA) wired to the number 2.7
> sec/run in this case, then I would say you have a SLA that's worse
> than having no SLA at all.
>
> Maybe an easier analogy is this... How would you respond to the
> question, "Using the global air transportation system, how long does
> it take to fly someplace?" One way to answer would be to compute a
> weighted average of all flight durations recorded by IAPA for the
> past 12 months. Imagine that the worldwide average is 2.7 hours. How
> much good does this do someone who really wants to know how long it
> takes to get from Chicago to Sydney? How 'bout Chicago to Detroit?
> No, it's fundamentally the wrong way to respond. The right way to
> respond to "How long does it take to fly someplace?" begins with
> asking the question "From where to where?"
>
> The "problem with averages" comes when the statistics you're trying
> to average don't come from a single well-behaved distribution. See
> pp236-254 of "Optimizing Oracle Performance" for a more complete
> explanation of what I mean by this.
>
> Virtually every computer system used by participants on this list has
> two or more transaction types whose performance characteristics do
> not come from a single well-behaved statistical distribution. On
> these systems, it is impossible to come up with a single number (an
> average) that will present a useful description of your "system." And
> I mean "impossible" in the strictest, most carefully considered
> sense.
>
> Bottom line: Do not attempt to combine your Bus.Tx# data in the way
> you describe.
>
>
> Cary Millsap
> Hotsos Enterprises, Ltd.
> http://www.hotsos.com
>
> -----Original Message-----
> Boris Dali
> Sent: Saturday, December 06, 2003 11:59 AM
> To: Multiple recipients of list ORACLE-L
>
> Let's say I have 3 business transactions (consisting of numerous
> Oracle transactions each) and I know the total service time for each
> (from c readings off sql traces for the length of the bus.tx). Doing
> a queueing theory exercise I can also get CDF(r max) for each. Let's
> say
>
> Bus.Tx1 - CPU time=5s   CDF(r)=97%
> Bus.Tx2 - CPU time=8s   CDF(r)=95%
> Bus.Tx3 - CPU time=10s  CDF(r)=90%
>
> How can I combine these three together and draw any conclusions as to
> what the overall CDF(r) would be for the whole system consisting of
> the above 3 business transactions? Is this doable?
>
> Thanks,
> Boris Dali.
=== message truncated ===


-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.net
-- 
Author: Boris Dali
  INET: boris_dali_at_yahoo.ca

Fat City Network Services    -- 858-538-5051 http://www.fatcity.com
San Diego, California        -- Mailing list and web hosting services
---------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: ListGuru_at_fatcity.com (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Received on Sun Dec 07 2003 - 10:24:25 CST