dbms_stats.gather_system_stats - crazy numbers

From: CRISLER, JON A <JC1706_at_att.com>
Date: Mon, 29 Jul 2013 04:56:05 +0000
Message-ID: <9F15274DDC89C24387BE933E68BE3FD30D9DAAB5_at_MISOUT7MSGUSR9D.ITServices.sbc.com>



I have been working on a pair of RAC systems that are otherwise identical, but sometimes give some crazy explain plans that force full table scans in situations where a FTS would never be done. It got to the point where putting in sql profiles and otherwise tweaking could not keep up with the problem, so I decided to dig into it from a different perspective.

I finally found that dbms_stats.gather_system_stats had collected some really strange numbers for CPUSPEEDNW (no workload) and CPUSPEED (workload), along with some really odd numbers for disk transfer speed.

System A had an absurdly low number for cpuspeednw and cpuspeed (1/4 expected numbers), and at the same time a disk throughput number that was 10-50 times higher than expected. System B had the opposite problem: Better than possible numbers for cpuspeed while disk transfer was at the default number due to some quirk in workload system stats gathering.

In order to get the systems to match, the workaround was to delete the system stats (dbms_stats.delete_system_stats) on both and then run it only with a noworkload: cpu and disk popped into the expected ranges. The two clusters are tuned exactly the same, have the exact same OS, hardware, SAN storage, etc. Slight variances are to be expected for sure in the numbers, but not huge differences. The system that had a really low cpu num and ultrastupid high IOTFRSPEED would force a FTS at the strangest times, and users would complain about performance. The system with better than possible cpu numbers and low IOTFRSPEED did not get complaints.

Here is an example

column sname format a20
column pname format a20
column pval2 format a20
select sname, pname , pval1, pval2
from sys.aux_stats$;

SNAME                PNAME                     PVAL1 PVAL2

-------------------- -------------------- ---------- --------------------
SYSSTATS_INFO STATUS COMPLETED SYSSTATS_INFO DSTART 07-27-2013 23:02 SYSSTATS_INFO DSTOP 07-27-2013 23:02 SYSSTATS_INFO FLAGS 1 SYSSTATS_MAIN CPUSPEEDNW 1249 (when bad this was either hi or low, like 300 or 10000) SYSSTATS_MAIN IOSEEKTIM 7 SYSSTATS_MAIN IOTFRSPEED 28662 (when bad this number was like 600000) SYSSTATS_MAIN SREADTIM SYSSTATS_MAIN MREADTIM SYSSTATS_MAIN CPUSPEED SYSSTATS_MAIN MBRC SNAME PNAME PVAL1 PVAL2
-------------------- -------------------- ---------- --------------------
SYSSTATS_MAIN MAXTHR SYSSTATS_MAIN SLAVETHR

Two bugs are similar but not exact hits for this problem-  high value for IOTFRSPEED can drive CBO to choose bad expensive plan (Doc ID 785799.1) Bug 9842771 - Wrong SREADTIM and MREADTIM statistics in AUX_STATS$ (Doc ID 9842771.8)

The hardware is typical very large servers, native Red Hat, fast SAN. It is possible that some Intel Turbo Boost or power saving feature is causing odd cpu numbers that mess with oracle's understanding of the hardware ?

model name      :        Intel(R) Xeon(R) CPU E7- 4820  _at_ 2.00GHz
stepping        : 2
cpu MHz         : 1997.886
cache size      : 18432 KB
--
http://www.freelists.org/webpage/oracle-l
Received on Mon Jul 29 2013 - 06:56:05 CEST

Original text of this message