Re: LIO/sec per CPU limit? Is it Hardware or Oracle code?

From: Henry Poras <henry.poras_at_gmail.com>
Date: Fri, 11 Aug 2017 13:27:46 -0400
Message-ID: <CAK5zhLLia49axxGzTHkv5=o=+u_rSCPH0N2A8TkHG39y_8=GdQ_at_mail.gmail.com>



Some more minor updates and responses to suggestions:

Lothar - I've had that happen too, but don't think that is the case here. Everything running is CPU heavy, so each process basically pegs a CPU. However, %idle typically runs at ~50%-70%.

Niall - I don't like working without evidence either. Just trying to stretch and see what to look at next. As far as your points 1 & 3, /proc/cpuinfo is identical (same number, model, version of CPU/box). And dmidecode --type 17 (for memory) is also identical. BTW, thanks all. I've never used dmidecode before. One new command. Not sure how to check for power management.
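
On the power management question, one option is to follow Jon's /proc/cpuinfo suggestion (quoted further down) and compare each core's reported "cpu MHz" against the rated speed in "model name". Below is a minimal Python sketch of that comparison, not a definitive check. The function name throttled_cores and the 5% tolerance are my own assumptions, and the sample text mixes the values from Jon's snippet with one made-up downclocked core for illustration.

```python
import re

def throttled_cores(cpuinfo_text, tolerance=0.05):
    """Return (processor, current_mhz, rated_mhz) for cores below rated speed.

    Assumes the usual /proc/cpuinfo layout, where each core's "model name"
    line (ending in something like "@ 2.00GHz") precedes its "cpu MHz" line.
    The tolerance is an arbitrary allowance for small MHz differences.
    """
    slow = []
    proc = rated = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            proc = int(value)
        elif key == "model name":
            # Rated speed is taken from the trailing "<n>GHz" in the model name.
            m = re.search(r"([\d.]+)\s*GHz", value)
            rated = float(m.group(1)) * 1000 if m else None
        elif key == "cpu MHz" and rated is not None:
            mhz = float(value)
            if mhz < rated * (1 - tolerance):
                slow.append((proc, mhz, rated))
    return slow

# Core 0 uses the values from Jon's snippet; core 1 is a hypothetical
# downclocked core added only to show what a hit looks like.
sample = """\
processor : 0
model name : Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz
cpu MHz : 1997.881
processor : 1
model name : Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz
cpu MHz : 1200.000
"""

for proc, mhz, rated in throttled_cores(sample):
    print(f"processor {proc}: {mhz:.0f} MHz, rated {rated:.0f} MHz")
```

On a real box, the text would come from reading /proc/cpuinfo on each server. It is worth sampling a few times while the test query is running, since the reported speed can change with load.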

Stefan - nice articles. Unfortunately neither turbostat nor perf stat is available. I've asked the sysadmins about this.

Mladen - Thanks for the good wishes. Hey, it's been a while (not for good wishes. You know what I mean.) Running SLOB would be nice. Not sure if I can get a window in which to do this. Memory types are the same. Don't see anything with paging/swapping, but this seems to be a constant effect regardless of time of day, load, ...

$ sar -u 3 20 (slow)
Linux 2.6.32-696.1.1.el6.x86_64 (xxx) 08/11/2017 _x86_64_ (72 CPU)

01:07:14 PM CPU %user %nice %system %iowait %steal %idle
01:07:17 PM all 8.36 0.00 0.05 0.00 0.00 91.59
01:07:20 PM all 8.37 0.00 0.09 0.00 0.00 91.54
01:07:23 PM all 8.35 0.00 0.06 0.00 0.00 91.58
01:07:26 PM all 8.49 0.00 0.08 0.00 0.00 91.42
01:07:29 PM all 8.55 0.00 0.08 0.00 0.00 91.37
01:07:32 PM all 8.36 0.00 0.06 0.00 0.00 91.58
01:07:35 PM all 8.44 0.00 0.12 0.00 0.00 91.44
01:07:38 PM all 8.35 0.00 0.07 0.00 0.00 91.58
01:07:41 PM all 8.37 0.00 0.08 0.00 0.00 91.55
01:07:44 PM all 8.36 0.00 0.06 0.00 0.00 91.57
01:07:47 PM all 8.36 0.00 0.04 0.00 0.00 91.60
01:07:50 PM all 8.36 0.00 0.06 0.00 0.00 91.58
01:07:53 PM all 8.36 0.00 0.06 0.00 0.00 91.57
01:07:56 PM all 8.42 0.00 0.07 0.00 0.00 91.51
01:07:59 PM all 8.40 0.00 0.07 0.00 0.00 91.52
01:08:02 PM all 8.41 0.00 0.08 0.00 0.00 91.51
01:08:05 PM all 8.39 0.00 0.06 0.00 0.00 91.55
01:08:08 PM all 8.39 0.00 0.10 0.00 0.00 91.50
01:08:11 PM all 8.39 0.00 0.06 0.00 0.00 91.54
01:08:14 PM all 8.39 0.00 0.06 0.00 0.00 91.56
Average: all 8.39 0.00 0.07 0.00 0.00 91.53

$ sar -u 3 20 (fast)
Linux 2.6.32-696.1.1.el6.x86_64 () 08/11/2017 _x86_64_ (72 CPU)

01:07:20 PM CPU %user %nice %system %iowait %steal %idle
01:07:23 PM all 15.04 0.00 0.29 0.00 0.00 84.67
01:07:26 PM all 15.04 0.00 0.28 0.01 0.00 84.66
01:07:29 PM all 14.80 0.00 0.26 0.20 0.00 84.74
01:07:32 PM all 14.74 0.00 0.27 0.14 0.00 84.85
01:07:35 PM all 14.68 0.00 0.30 0.20 0.00 84.81
01:07:38 PM all 15.03 0.00 0.25 0.07 0.00 84.65
01:07:41 PM all 15.10 0.00 0.24 0.00 0.00 84.66
01:07:44 PM all 15.12 0.00 0.23 0.00 0.00 84.64
01:07:47 PM all 15.10 0.00 0.25 0.00 0.00 84.66
01:07:50 PM all 15.10 0.00 0.24 0.00 0.00 84.66
01:07:53 PM all 15.08 0.00 0.25 0.01 0.00 84.67
01:07:56 PM all 15.02 0.00 0.27 0.06 0.00 84.66
01:07:59 PM all 14.94 0.00 0.38 0.04 0.00 84.64
01:08:02 PM all 15.00 0.00 0.32 0.02 0.00 84.66
01:08:05 PM all 15.06 0.00 0.28 0.00 0.00 84.66
01:08:08 PM all 15.07 0.00 0.26 0.00 0.00 84.66
01:08:11 PM all 15.09 0.00 0.28 0.01 0.00 84.62
01:08:14 PM all 15.08 0.00 0.27 0.00 0.00 84.64
01:08:17 PM all 14.97 0.00 0.32 0.03 0.00 84.67
01:08:20 PM all 15.05 0.00 0.29 0.00 0.00 84.65
Average: all 15.01 0.00 0.28 0.04 0.00 84.68

Ran timing test right after running sar and still see the same behavior.
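
For comparing the two sar runs, it may help to turn the averages into a rough count of fully busy CPUs. Below is a minimal Python sketch, using the Average lines above and the 72 CPUs reported in the sar header. The function name busy_cpus is just illustrative, and the conversion is only an approximation.

```python
def busy_cpus(user_pct, system_pct, cpu_count):
    """Convert sar 'all' percentages into an equivalent number of fully busy CPUs."""
    return (user_pct + system_pct) / 100.0 * cpu_count

CPU_COUNT = 72  # from the "(72 CPU)" field in the sar header

# %user and %system taken from the "Average:" line of each run
slow = busy_cpus(8.39, 0.07, CPU_COUNT)
fast = busy_cpus(15.01, 0.28, CPU_COUNT)

print(f"slow: {slow:.1f} busy CPUs, fast: {fast:.1f} busy CPUs")
```

That works out to roughly 6 busy CPUs on the slow box and 11 on the fast one. So the two samples were not taken under the same load, and neither box looks short of CPU.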

Karl - I'll take a look at the scripts, thanks. I hesitate to run them on a live system. /proc/cpuinfo shows identical versions.

Jon - no VM. numactl --hardware shows 2 nodes on each.

Thanks again to everyone.

Henry

On Fri, Aug 11, 2017 at 1:05 PM, Henry Poras <henry.poras_at_gmail.com> wrote:

>
> ---------- Forwarded message ----------
> From: Jon Crisler <jcrisler_at_us.ibm.com>
> Date: Thu, Aug 10, 2017 at 9:43 PM
> Subject: Re: LIO/sec per CPU limit? Is it Hardware or Oracle code?
> To: Henry Poras <henry.poras_at_gmail.com>
> Cc: joncrisler_at_gmail.com
>
>
> Do me a favor and CC the Oracle list :) I am not registered yet.
>
> 1) Is NUMA active at the server level or Oracle level ?
> 2) Check the memory speeds per other suggestions.
> 3) Are they native machines, or some sort of VM ? If VM then other
> workload may be impacting your VM.
>
> You said the CPUs were identical in /proc/cpuinfo, so I assume it's the
> exact same CPU model. Here is a snippet from one of my machines-
>
> processor : 31
> vendor_id : GenuineIntel
> cpu family : 6
> model : 47
> model name : Intel(R) Xeon(R) CPU E7- 4820 @ 2.00GHz  <-- E7-4820 is the exact cpu model.
> stepping : 2
> cpu MHz : 1997.881
> cache size : 18432 KB
>
> If the cpu models are in any way different, then - well, they are
> different and your performance could be different. I would not worry
> about a few percentage points on the cpu MHz- that is not going to make
> much of a difference. You can look on the Intel ARK website for specific
> specs on the cpu. Also look if your cpu speed changes due to the tuned,
> cpuspeed or similar daemons.
>
> cat /proc/cpuinfo | grep MHz
>
> If the various numbers are less than the rated speed of the CPU, then
> the clock speed is being shifted up and down based on load. There is
> nothing wrong with that in principle, and it should barely change
> execution times, but it can confuse Oracle and cause different
> execution / explain plans. It's best to just turn it off until it can be
> ruled out as the cause, but it comes at the expense of a $10-50 higher
> electric bill per month, which for a production system is probably trivial.
>
> cat /proc/cpuinfo | grep MHz
>
> cpu MHz : 1997.881
> cpu MHz : 1997.881
> cpu MHz : 1997.881
> cpu MHz : 1997.881
> cpu MHz : 1997.881
> cpu MHz : 1997.881 <- if it comes back like 1200, 1600 etc or
> something lower, then it has an energy saving mode engaged (aka cpuspeed,
> Intel SpeedStep).
>
> -----Henry Poras <henry.poras_at_gmail.com> wrote: -----
> To: Jon Crisler <jcrisler_at_us.ibm.com>
> From: Henry Poras <henry.poras_at_gmail.com>
> Date: 08/10/2017 12:50PM
> Subject: Re: LIO/sec per CPU limit? Is it Hardware or Oracle code?
>
>
> Jon,
>
> Thanks for the suggestions. Looked through most of your stuff and nothing
> shows up. I think it's more cpu/memory related, however. Basically every
> session (including my test sql) grabs and pegs a cpu. Snapper and v$session
> (and top) all show ~100% cpu (well, minimal pio). So cutting logical read
> rate in half will double runtimes. That's what I am seeing. But what would
> cut the lio rate like that?
>
> Henry
>
> On Wed, Aug 9, 2017 at 10:26 PM, Jon Crisler <jcrisler_at_us.ibm.com> wrote:
>
>> I cannot respond directly to the list, but obviously you have some
>> difference. It could be hardware related, OS related or Oracle related.
>> A few things to check-
>> 1) Are your disks on SAN or NFS / Ethernet ? Sometimes what seems to be
>> identical disks are really not the same on the backend disk array. Channel
>> speeds, ethernet differences etc.
>> - example- iSCSI on ethernet- one system has jumbo frames turned on, the
>> other does not, and the end result is a significant difference in IO
>> 2) Are the CPU's exactly identical ? Same exact model as shown in
>> /proc/cpuinfo ? Look at the reported cpu speed- is cpuspeed maybe
>> downshifting the cpu ?
>> 3) Does oracle understand about the underlying hardware ? See this as a
>> starter to compare oracle's knowledge of hardware- SELECT * FROM
>> SYS.AUX_STATS$;
>> - that might not be the only view to look at-
>> #3 has bit me many times when otherwise identical systems give different
>> execution plans.
>> 4) DBMS_STATS.GATHER_SYSTEM_STATS - proc to run various stats so Oracle
>> understands the hardware
>> 5) Otherwise identical systems- but installed memory is not the same.
>> One has support for NUMA, the other does not, or something goofy with
>> memory mirroring / interleaving. So memory access is not the same as one
>> machine appears much faster.
>>
>> Are they on VMs or something similar ? If that is the case, then other
>> workload on the system might be affecting you. I put a lot of emphasis on
>> PIO although your question is LIO, but you have to check everything.
>> ORACHK might be helpful as well to try to identify differences.
>>
>> -----oracle-l-bounce_at_freelists.org wrote: -----
>> To: ORACLE-L <oracle-l_at_freelists.org>
>> From: Henry Poras
>> Sent by: oracle-l-bounce_at_freelists.org
>> Date: 08/09/2017 05:47PM
>> Subject: LIO/sec per CPU limit? Is it Hardware or Oracle code?
>>
>>
>> I have two identical servers (or so I am told), but application work is
>> running 2-3 times slower on one than the other. Using Tanel's snapper, I
>> see that all active sessions are all on CPU. Viewing top shows me the same
>> thing, each session pegs a cpu. We also found that it wasn't particular SQL
>> that slowed down across servers, but it looked like everything was slow. A
>> select count(*) from dba_objects showed this behavior as did Jonathan
>> Lewis's kill_cpu script. This gave me something to test with. Running a
>> 10046, I saw the same amount of resource utilization (parse count, fetch
>> count, cr count, ...), no contention (wait events), but one server finished
>> 2.5 times faster than the other. Looking at session stats through snapper,
>> I see that the number of session logical reads per sec (~all of which are
>> consistent reads) is ~ 2.5 times higher on one server than the other. That
>> explains why it takes one longer to finish.
>>
>> So, now what?? Why is one server giving me 350k consistent gets per
>> second and the other is ~800k? Is it hardware? /proc/cpuinfo shows the same
>> cpu for each box. Is it hidden in the Oracle code path? I realize that not
>> all LIO are created equal, but how do I check this? I am running on
>> SE12.1.0.1
>>
>> Any and all thoughts welcome.
>>
>> Henry
>>
>>
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Aug 11 2017 - 19:27:46 CEST
