Re: oracle cannot produce valid timed statistics on linux machine with 2 non-identical CPUs
>>>>> "Sergey" == Sergey Lukashevich <lukash33_at_mail.ru> writes:
Sergey> Below I will describe a problem of using oracle for linux on
Sergey> an SMP intel machine with CPUs of a different
Sergey> bogomips-measured speed. This can be a linux kernel bug.
It's been a while since I did anything with Linux and SMP
kernels. However, one thing which used to be very important was that,
in addition to your CPUs being of the same speed, cache, type etc.,
they also had the same stepping. Apparently it is possible
to have two CPUs which appear to be the same, but are from different
batches and have different steppings, and this might explain
things.
Sergey> First of all what's wrong:
Sergey> 1) We cannot receive reasonable figures in all the 'elapsed'
Sergey> columns in all the oracle statistic when timed_statistics is
Sergey> 'on' in the init.ora. All the figures we'll see look like
Sergey> '##########' or are enormous, very big (totally unreal). It
Sergey> does not matter whether we SELECT them from a V$ view or we
Sergey> look at TKPROF result or we take a StatsPack snapshot. No
Sergey> problem when the CPU is only one.
Sergey> 2) Even more, Oracle rdbms obviously becomes ill-behaved --
Sergey> strange unresolvable performance problems arise, especially
Sergey> with different kind of latches like 'free buffer waits'. Users
Sergey> wait, wait, and stuck. Checkpoint is executed VERY lengthy -
Sergey> some 20-30 minutes while DBWR does almost nothing (we watch
Sergey> 'top') and the I/O is less than 10% of the power of the disk
Sergey> subsystem (we have our disks benchmarked).
Sergey> 3) There is one more linux symptom possibly:
Sergey> the 'top' output looks wrong: sometimes we saw some 5 to 8
Sergey> processes consuming 99.9% of cpu. That's impossible while
Sergey> having only 2 CPUs!
Unless it has been fixed, top is broken with respect to SMP kernels. You cannot rely on its output, because it does not handle more than one CPU correctly.
Sergey>   PID USER    PRI  NI  SIZE   RSS SHARE STAT %CPU %MEM  TIME COMMAND
Sergey>   136 root     15   0     0     0     0 RW   99.9  0.0 20:22 kjournald
Sergey> 18767 root     15   0  1928  1868  1552 S    99.9  0.0  0:00 sshd
Sergey> 20601 oracle   15   0  5728  5728  5176 S    99.9  0.2  0:00 oracle
Sergey> 20603 oracle   16   0  192M  192M  190M D    99.9  9.5  0:22 oracle
Sergey> 20605 oracle   17   0  7112  7112  6536 S    99.9  0.3  0:15 oracle
Sergey> 20618 oracle   15   0 56152   54M 55224 S    99.9  2.7  0:12 oracle
Sergey> 22011 oracle   26   0  222M  222M  217M R    99.9 11.0  4:42 oracle
Sergey> 22045 me       15   0   904   904   728 R    99.9  0.0  0:00 top
Sergey>     1 root     15   0   468   428   412 S     0.0  0.0  0:10 init
Sergey>     2 root     15   0     0     0     0 SW    0.0  0.0  0:00 keventd
Sergey>     3 root     34  19     0     0     0 SWN   0.0  0.0  0:00 ksoftirqd_CPU0
Sergey>     4 root     34  19     0     0     0 SWN   0.0  0.0  0:00 ksoftirqd_CPU1
Sergey> 4) The simplest way to determine whether you have this linux
Sergey> bug is to
Sergey> yes date | bash | uniq
Sergey> If the result looks like mine then that's the case:
Sergey> Wed Jan 29 20:22:17 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:19 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:19 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:19 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:19 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:19 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:19 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:19 MSK 2003
Sergey> Wed Jan 29 20:22:18 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:18 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> Wed Jan 29 20:22:18 MSK 2003
Sergey> Wed Jan 29 20:22:16 MSK 2003
Sergey> The time/date continuously jumps forward and backward in a
Sergey> range of a few seconds. Who guess why? Possibly different CPUs
Sergey> show different date/time. But I appreciate they almost agree
Sergey> each other ;)
Not sure about this. It does look a bit odd, but have you ever run this command on a dual processor and not seen something like this? It is possible it has more to do with I/O contention than anything sinister in the CPUs.
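For anyone who wants a number rather than eyeballing the uniq output, here is a rough Python sketch of the same check. It is not what Sergey ran: the function names are made up for illustration, and it assumes the scheduler moves the process between CPUs often enough to expose a disagreement, which a single long-running process may not do as readily as the stream of short-lived date processes in the pipeline above.

```python
import time


def count_backward_steps(samples):
    """Return how many readings are earlier than the reading just before them."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def sample_wall_clock(n=100000):
    """Collect n consecutive wall-clock readings (hypothetical helper)."""
    return [time.time() for _ in range(n)]


if __name__ == "__main__":
    readings = sample_wall_clock()
    # On a healthy machine this should normally be 0, barring clock adjustments.
    print("backward steps:", count_backward_steps(readings))
```

A non-zero count on a machine where nothing is adjusting the clock would point the same way as the jumping date output quoted above.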
Sergey> http://groups.google.com/groups?selm=375BF011.C600C5EB%40best.com&oe=UTF-8&output=gplain
Sergey> We reproduced all the symptoms on several dual-pentium
Sergey> machines and now we have to replace their CPUs I think.
Sergey> My hardware is:
Sergey> Intel based server of 2*Pentium III (Coppermine)
>> grep bogomips /proc/cpuinfo
Sergey> bogomips : 1861.22
Sergey> bogomips : 1599.07
That's a pretty large difference in bogomips - but don't forget that the 'bogo' in bogomips does stand for 'bogus'.
Sergey> but
>> grep -i mhz /proc/cpuinfo
Sergey> cpu MHz : 932.943
Sergey> cpu MHz : 932.943
What about the motherboard, bus speeds, clock rates etc?
Sergey> You can see that both CPUs look like of the same speed of
Sergey> 932.943 MHZ, but bogomipses differ.
Sergey> My software is:
Sergey> Linux host.domain 2.4.18-3custom #4 SMP Thu Jan 23 09:14:40
Sergey> MSK 2003 i686 unknown
I heard there was a security issue with 2.4.18. Maybe try upgrading to 2.4.21. I also notice it looks like you have built a custom kernel. Maybe there is something in there which is not quite correct? You are not running oracle in parallel mode, are you? (It needs 3 processors for that.)
If your bogomips were only slightly different, I wouldn't be too worried, but the difference you have does seem to be very big. However, I don't know how reliable bogomips readings are on multi-processor systems. Check the stepping of your CPUs - if they are different, then it is possible you have problems.
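A quick way to do that comparison is to diff the per-processor blocks of /proc/cpuinfo. The Python below is only a sketch: the helper names are invented for illustration, and it assumes the usual x86 layout of "key : value" lines with a blank line between processors and the field names "model name", "stepping", "cpu MHz" and "bogomips".

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict of fields per processor."""
    cpus = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends one processor's block.
            if current:
                cpus.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def differing_fields(cpus, fields=("model name", "stepping", "cpu MHz", "bogomips")):
    """Return {field: [value per CPU]} for fields whose values are not all equal."""
    diffs = {}
    for field in fields:
        values = [cpu.get(field) for cpu in cpus]
        if len(set(values)) > 1:
            diffs[field] = values
    return diffs


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(differing_fields(parse_cpuinfo(f.read())))
```

On the machine described here one would expect at least the bogomips entry to show up. Whether stepping shows up too is exactly what is worth finding out.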
Tim
--
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is to a company in Australia called rapttech and my login is tcross - if you really need to send mail, you should be able to work it out!

Received on Thu Jan 30 2003 - 06:58:28 CST