Re: Exadata performance

From: Q A I S E R <qrasheed_at_gmail.com>
Date: Fri, 2 Feb 2018 15:49:50 -0600
Message-ID: <CAHTGq-TC+EC3F5Br6fYRKbSHrNYwguxJ+NASA1rjFwEQGOUg0g_at_mail.gmail.com>



It looks like the bad DIMM was on the database node, not on an Exadata storage cell. You do not need a script to monitor the hardware; configure Enterprise Manager instead. Oracle Enterprise Manager can be configured to monitor every component of Exadata. In addition, if you had ASR configured, it would have detected the memory fault automatically and opened a service request for you.
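
That said, if you want a quick ad-hoc check between EM polls, something along these lines could work. This is only a minimal sketch: it assumes ipmitool is available on the compute node (it normally is on Exadata) and that correctable-ECC events show up in the ILOM system event log with "Memory" and "Correctable" in the entry text, so adjust the match strings for your image.

#!/usr/bin/env python
"""Ad-hoc check of the local ILOM/BMC system event log for memory errors.

Minimal sketch only: assumes ipmitool is installed and that correctable-ECC
events appear in the SEL as "Memory"/"Correctable" entries. EM and ASR
remain the supported monitoring path.
"""
import subprocess
import sys

def memory_events():
    # "ipmitool sel list" prints one system-event-log entry per line.
    out = subprocess.check_output(["ipmitool", "sel", "list"],
                                  universal_newlines=True)
    return [line for line in out.splitlines()
            if "memory" in line.lower() and "correctable" in line.lower()]

if __name__ == "__main__":
    events = memory_events()
    for event in events:
        print(event)
    # Non-zero exit so cron or a wrapper script can alert/email on it.
    sys.exit(1 if events else 0)

Dropped into cron, the non-zero exit (or the printed SEL lines) can be used to send the email you were expecting.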

Your performance issue may very well be related to this event, since the SGA resides in memory.

As for the kipmi0 process using over 100% CPU, please see MOS note Kipmi0 Using 100% CPU (Doc ID 1235235.1).
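
If memory serves, the workaround in that note is to cap how long the kipmid kernel thread is allowed to busy-poll, via the ipmi_si module parameter kipmid_max_busy_us. Below is a minimal sketch of the runtime change; the 100-microsecond value is my assumption, so verify it against the note, and run it as root.

#!/usr/bin/env python
"""Throttle kipmi0/kipmid by capping its busy-poll time per cycle.

Sketch only: the value below is an assumption -- check the MOS note for the
value Oracle recommends. Requires root and a loaded ipmi_si module.
"""
PARAM = "/sys/module/ipmi_si/parameters/kipmid_max_busy_us"
MAX_BUSY_US = "100"  # microseconds per poll; assumed value, verify first

# Writing the sysfs parameter takes effect immediately; to make it persistent
# across reboots you would also add an options line for ipmi_si under
# /etc/modprobe.d.
with open(PARAM, "w") as f:
    f.write(MAX_BUSY_US + "\n")

print("kipmid_max_busy_us set to " + MAX_BUSY_US)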

Thanks,
--Qaiser

On Fri, Feb 2, 2018 at 2:02 PM, Glenn Travis <Glenn.Travis_at_sas.com> wrote:

> Over the holidays we experienced some very poor performance on one of our
> Exadata (X3-2 quarter rack) nodes. The other node was unaffected. We
> spent several days running AWR reports and other performance-related
> diagnostics. At the database level we observed high I/O and some complex,
> poorly performing SQL. At the server level we identified high CPU. We
> bounced the databases several times over the next few days to troubleshoot,
> but performance was only slightly better.
>
>
>
> We decided to peruse the hardware logs and looked at the system log on the
> ILOM for the node. We noticed we had 2 DIMMs reporting errors:
>
> Event Type - DIMM Service Required

> Subsystem - Memory

> Component - P0/D4 (CPU 0 DIMM 4) and P0/D5

> Message - The number of memory correctable errors has exceeded threshold
> limit. (Probability:100, UUID:7b4fe74e-4fb1-4a69-c966-b38d1cc8dab5,
> Resource:/SYS/MB/P0/D5)
>
>
>
> My question is: do you think the poor performance was related to the memory
> errors? Can you tell what state the memory was in from the errors?
>
>
>
> We also noticed at the server level (using top) that the [kipmi0] process
> was using over 100% CPU on a constant basis. Is this normal? Is it related
> to the memory errors?
>
> Also, the ora_dia0_<SID> process was using near 100% CPU during this poor
> performance event. Is this normal?
>
>
>
> We resolved the issue (or the issue went away) after we scheduled an
> outage and had Oracle replace the 2 bad DIMMs. Once the server was
> rebooted, performance returned to normal.
>
>
>
> We are wondering if there may have been something else going on, or whether
> this was solely related to hardware. We are also curious about the 2
> high-CPU processes.
>
>
>
> On a side note: does anyone have a script/program/command to run (at
> regular intervals) to check the state of the hardware? We usually get
> emailed about hardware issues, but apparently this one was not bad enough
> to trigger an email.
>
>
>
> Thanks all!
>
>
>
>
>
> *Glenn Travis*
>
> DBA ▪ Database Services
>
> IT Enterprise Solutions
>
> SAS Institute
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Feb 02 2018 - 22:49:50 CET
