
Re: oracle cannot produce valid timed statistics on linux machine with 2 non-identical CPUs

From: Tim X <timx_at_spamto.devnul.com>
Date: 30 Jan 2003 23:58:28 +1100
Message-ID: <878yx2ssjv.fsf@tiger.rapttech.com.au>


>>>>> "Sergey" == Sergey Lukashevich <lukash33_at_mail.ru> writes:

 Sergey> Below I will describe a problem of using oracle for linux on
 Sergey> an SMP intel machine with CPUs of a different
 Sergey> bogomips-measured speed.  This can be a linux kernel bug.

It's been a while since I did anything with Linux and SMP kernels. However, one thing which used to be very important was that, in addition to your CPUs being of the same speed, cache, type etc., they also had the same stepping. Apparently it is possible to have two CPUs which appear to be the same but are from different batches and have different steppings, and this might explain things.

 Sergey> First of all what's wrong:

 Sergey> 1) We cannot get reasonable figures in any of the 'elapsed'
 Sergey> columns in any of the Oracle statistics when timed_statistics
 Sergey> is enabled in the init.ora. All the figures we see look like
 Sergey> '##########' or are enormous (totally unreal). It does not
 Sergey> matter whether we SELECT them from a V$ view, look at TKPROF
 Sergey> output, or take a StatsPack snapshot. There is no problem
 Sergey> when there is only one CPU.

 Sergey> 2) Even more, the Oracle RDBMS obviously becomes ill-behaved:
 Sergey> strange, unresolvable performance problems arise, especially
 Sergey> with different kinds of latches like 'free buffer waits'. Users
 Sergey> wait, wait, and get stuck. A checkpoint takes VERY long, some
 Sergey> 20-30 minutes, while DBWR does almost nothing (we watch 'top')
 Sergey> and the I/O is less than 10% of the capacity of the disk
 Sergey> subsystem (we have our disks benchmarked).

 Sergey> 3) There is possibly one more Linux symptom:
 Sergey> the 'top' output looks wrong: sometimes we saw some 5 to 8
 Sergey> processes consuming 99.9% of cpu. That's impossible while
 Sergey> having only 2 CPUs!

Unless it has been fixed, top is broken with respect to SMP kernels. You cannot rely on its output because it does not handle more than one CPU correctly.

 Sergey>   PID USER   PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 Sergey>   136 root    15  0     0    0     0 RW   99.9  0.0 20:22 kjournald
 Sergey> 18767 root    15  0  1928 1868  1552 S    99.9  0.0  0:00 sshd
 Sergey> 20601 oracle  15  0  5728 5728  5176 S    99.9  0.2  0:00 oracle
 Sergey> 20603 oracle  16  0  192M 192M  190M D    99.9  9.5  0:22 oracle
 Sergey> 20605 oracle  17  0  7112 7112  6536 S    99.9  0.3  0:15 oracle
 Sergey> 20618 oracle  15  0 56152  54M 55224 S    99.9  2.7  0:12 oracle
 Sergey> 22011 oracle  26  0  222M 222M  217M R    99.9 11.0  4:42 oracle
 Sergey> 22045 me      15  0   904  904   728 R    99.9  0.0  0:00 top
 Sergey>     1 root    15  0   468  428   412 S     0.0  0.0  0:10 init
 Sergey>     2 root    15  0     0    0     0 SW    0.0  0.0  0:00 keventd
 Sergey>     3 root    34 19     0    0     0 SWN   0.0  0.0  0:00 ksoftirqd_CPU0
 Sergey>     4 root    34 19     0    0     0 SWN   0.0  0.0  0:00 ksoftirqd_CPU1
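
To put a number on "impossible": a rough sketch in Python 3 that adds up the %CPU column of the eight busy rows quoted above and compares the total with what two CPUs can deliver. The function names and the column index are my own choices for this illustration, and it assumes %CPU is the ninth whitespace-separated field, as in the header you posted.

```python
# Rows copied from the quoted top output (only the eight busy processes).
BUSY_ROWS = [
    "136 root 15 0 0 0 0 RW 99.9 0.0 20:22 kjournald",
    "18767 root 15 0 1928 1868 1552 S 99.9 0.0 0:00 sshd",
    "20601 oracle 15 0 5728 5728 5176 S 99.9 0.2 0:00 oracle",
    "20603 oracle 16 0 192M 192M 190M D 99.9 9.5 0:22 oracle",
    "20605 oracle 17 0 7112 7112 6536 S 99.9 0.3 0:15 oracle",
    "20618 oracle 15 0 56152 54M 55224 S 99.9 2.7 0:12 oracle",
    "22011 oracle 26 0 222M 222M 217M R 99.9 11.0 4:42 oracle",
    "22045 me 15 0 904 904 728 R 99.9 0.0 0:00 top",
]

CPU_COLUMN = 8  # zero-based index of %CPU in the header PID USER PRI NI SIZE RSS SHARE STAT %CPU ...


def total_cpu_percent(rows, cpu_column=CPU_COLUMN):
    """Sum the %CPU field over all rows of top output."""
    return sum(float(row.split()[cpu_column]) for row in rows)


def exceeds_capacity(rows, ncpus):
    """True if the rows claim more CPU time than ncpus processors can supply."""
    return total_cpu_percent(rows) > 100.0 * ncpus


if __name__ == "__main__":
    total = total_cpu_percent(BUSY_ROWS)
    print(f"claimed {total:.1f}% against a ceiling of {100.0 * 2:.1f}% for 2 CPUs")
    print("over capacity:", exceeds_capacity(BUSY_ROWS, ncpus=2))
```

Eight processes at 99.9% each is far more than two processors can supply, so whatever top is measuring here, it is not a trustworthy share of CPU time.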

 Sergey> 4) The simplest way to determine whether you have this Linux
 Sergey> bug is to run the command:

 Sergey> yes date | bash | uniq

 Sergey> If the result looks like mine then that's the case:

 Sergey> Wed Jan 29 20:22:17 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:19 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:19 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:19 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:19 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:19 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:19 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:19 MSK 2003
 Sergey> Wed Jan 29 20:22:18 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:18 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003
 Sergey> Wed Jan 29 20:22:18 MSK 2003
 Sergey> Wed Jan 29 20:22:16 MSK 2003

 Sergey> The time/date continuously jumps forward and backward in a
 Sergey> range of a few seconds. Who can guess why? Possibly different
 Sergey> CPUs show different date/time. But I appreciate that they
 Sergey> almost agree with each other ;)

Not sure about this. It does look a bit odd, but have you ever run this command on a dual-processor machine and not seen something like this? It is possible it has more to do with I/O contention than anything sinister in the CPUs.
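
One way to take I/O and process start-up out of the picture is to read the clock from inside a single process and count how often it goes backwards. The following is only a sketch, assuming Python 3 on Linux; the function names are made up for this example, and the CPU hopping relies on os.sched_setaffinity, which may be unavailable or refused on some systems (in which case the code simply samples without pinning).

```python
import os
import time


def find_backward_steps(readings):
    """Return (index, previous, current) for every reading lower than its predecessor."""
    return [
        (i, readings[i - 1], readings[i])
        for i in range(1, len(readings))
        if readings[i] < readings[i - 1]
    ]


def allowed_cpus():
    """CPUs this process may run on, or an empty list if the OS cannot tell us."""
    try:
        return sorted(os.sched_getaffinity(0))
    except (AttributeError, OSError):
        return []


def pin_to(cpus):
    """Best-effort attempt to restrict this process to the given set of CPUs."""
    try:
        os.sched_setaffinity(0, set(cpus))
    except (AttributeError, OSError):
        pass  # no pinning available; keep sampling wherever we are scheduled


def sample_wall_clock(rounds=2000):
    """Read the wall clock repeatedly, alternating between the allowed CPUs."""
    cpus = allowed_cpus()
    readings = []
    try:
        for i in range(rounds):
            if len(cpus) > 1:
                pin_to([cpus[i % len(cpus)]])
            readings.append(time.time())
    finally:
        if len(cpus) > 1:
            pin_to(cpus)  # restore the original affinity
    return readings


if __name__ == "__main__":
    steps = find_backward_steps(sample_wall_clock())
    print(f"{len(steps)} backward steps in the sampled readings")
```

If this reports backward steps with no shell, pipe or disk involved, I/O contention is unlikely to be the explanation. A clean run proves less, since the pinning is only best effort.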

 Sergey> http://groups.google.com/groups?selm=375BF011.C600C5EB%40best.com&oe=UTF-8&output=gplain

 Sergey> We reproduced all the symptoms on several dual-Pentium
 Sergey> machines and now we have to replace their CPUs I think.

 Sergey> My hardware is:

 Sergey> Intel based server of 2*Pentium III (Coppermine)

>> grep bogomips /proc/cpuinfo

 Sergey> bogomips : 1861.22
 Sergey> bogomips : 1599.07

That's a pretty large difference in bogomips, but don't forget that the bogo in bogomips does stand for 'bogus'.

 Sergey> but

>> grep -i mhz /proc/cpuinfo

 Sergey> cpu MHz : 932.943
 Sergey> cpu MHz : 932.943

What about the motherboard, bus speeds, clock rates etc?

 Sergey> You can see that both CPUs appear to run at the same speed of
 Sergey> 932.943 MHz, but the bogomips differ.

 Sergey> My software is:

 Sergey> Linux host.domain 2.4.18-3custom #4 SMP Thu Jan 23 09:14:40
 Sergey> MSK 2003 i686 unknown

I heard there was a security issue with 2.4.18. Maybe try upgrading to 2.4.21. I also notice it looks like you have built a custom kernel; maybe there is something in there which is not quite correct? You are not running Oracle in parallel mode, are you? (It needs 3 processors for that.)

If your bogomips were only slightly different, I wouldn't be too worried, but the difference you have does seem to be very big. However, I don't know how reliable bogomips readings are on multi-processor systems. Check the stepping of your CPUs (it is listed in /proc/cpuinfo). If they are different, then it is possible you have problems.
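
Rather than grepping one field at a time, the per-CPU entries can be compared in one go. This is a minimal sketch, assuming Python 3 and the usual /proc/cpuinfo layout of "key : value" lines with a blank line between processors; the function names and the list of fields to compare are my own choices, not anything standard.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict of fields per processor."""
    cpus, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current processor's block.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def differing_fields(cpus, fields=("model name", "stepping", "cpu MHz", "bogomips")):
    """Return {field: [value per CPU]} for every listed field that is not identical on all CPUs."""
    diffs = {}
    for field in fields:
        values = [cpu.get(field) for cpu in cpus]
        if len(set(values)) > 1:
            diffs[field] = values
    return diffs


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            info = parse_cpuinfo(handle.read())
    except OSError:
        info = []  # not on Linux, or /proc is not mounted
    for field, values in differing_fields(info).items():
        print(f"{field} differs between CPUs: {values}")
```

On your machine I would expect this to list bogomips at least. If stepping shows up as well, the CPUs are not a matched pair.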

Tim

-- 
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you 
really need to send mail, you should be able to work it out!
Received on Thu Jan 30 2003 - 06:58:28 CST
