Re: Exadata performance

From: Frits Hoogland <frits.hoogland_at_gmail.com>
Date: Fri, 9 Feb 2018 15:32:48 +0100
Message-Id: <528D541F-3C52-4418-A2F9-B787F33C5AFF_at_gmail.com>



The performance problem seems to have been caused by memory, but since no proof is provided that is still an assumption. Whenever uncorrectable memory errors are detected (typically via ECC, which is a hardware/memory feature; Exadata does use ECC memory), Linux is notified and marks the affected page ‘HardwareCorrupted’. This means that at the Linux layer, the HardwareCorrupted statistic in /proc/meminfo can be used to detect memory failure issues.
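
As a minimal sketch of how that statistic could be checked at regular intervals: on kernels built with memory-failure support, /proc/meminfo carries a HardwareCorrupted line with a value in kB. The Python below only parses that line; the function names are made up for illustration, and nothing here is Exadata-specific or an Oracle-supplied tool. Note that the counter reflects pages the kernel has marked as corrupted, so correctable errors that only the ILOM counts would probably not show up here.

```python
MEMINFO_PATH = "/proc/meminfo"


def hardware_corrupted_kb(meminfo_text):
    """Return the HardwareCorrupted value in kB, or None if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("HardwareCorrupted:"):
            # Line format: "HardwareCorrupted:     0 kB"
            return int(line.split()[1])
    return None


def read_hardware_corrupted_kb(path=MEMINFO_PATH):
    """Read the given meminfo file (Linux only) and parse the counter from it."""
    with open(path) as f:
        return hardware_corrupted_kb(f.read())
```

Called from cron, a non-zero result from read_hardware_corrupted_kb() could be used to trigger an email; a result of None means the kernel does not report the field.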

However, with Exadata you get an ILOM, which does hardware management. Why don’t you use that functionality? You can use EM and an agent that the ILOM notifies as its SNMP destination. You can also use the email option in the ILOM, which sends an email when it finds an issue. That is a very nice way; these emails are very clear.

In order to analyse whether the memory problems caused the performance degradation, you first need to establish how the badly performing and well performing periods differ with regard to CPU usage and waits at the Oracle level. Then look at the Linux level and see whether it caused swapping, for example, which would cause much extra CPU to be used outside of the database.
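
For the Linux-level part, one way to see whether swapping was going on is to sample the pswpin and pswpout counters in /proc/vmstat twice and look at the difference. This is only a sketch with hypothetical function names and an arbitrary 5 second interval; tools such as vmstat or sar report similar information.

```python
import time

VMSTAT_PATH = "/proc/vmstat"


def swap_counters(vmstat_text):
    """Return (pswpin, pswpout): pages swapped in and out since boot."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_rate(before, after, seconds):
    """Pages swapped in and out per second between two samples."""
    return ((after[0] - before[0]) / seconds,
            (after[1] - before[1]) / seconds)


def sample_swap_rate(interval=5.0, path=VMSTAT_PATH):
    """Take two samples 'interval' seconds apart (Linux only)."""
    with open(path) as f:
        before = swap_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = swap_counters(f.read())
    return swap_rate(before, after, interval)
```

Rates that stay at zero during the slow period would suggest the extra CPU was not caused by swapping, and the analysis has to continue elsewhere.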

Frits Hoogland

http://fritshoogland.wordpress.com frits.hoogland_at_gmail.com Mobile: +31 6 14180860

> On 2 Feb 2018, at 22:49, Q A I S E R <qrasheed_at_gmail.com> wrote:
>
>
> It looks like the DIMM was bad on the database node, not on an Exadata storage cell node. You do not need a script to monitor hardware; configure EM instead. Oracle Enterprise Manager can be configured to monitor all components of Exadata. In addition, if you had ASR configured, it would have automatically detected the memory fault and raised an ASR for you.
>
> Your performance issue may very well be related to this event as the SGA is created on memory.
>
> As for kipmi0 process using over 100%, please see MOS Kipmi0 Using 100% CPU (Doc ID 1235235.1).
>
> Thanks,
> --Qaiser
>
> On Fri, Feb 2, 2018 at 2:02 PM, Glenn Travis <Glenn.Travis_at_sas.com> wrote:
> Over the holidays we experienced very poor performance on one of our Exadata (X3-2 quarter rack) nodes. The other node was unaffected. We spent several days running AWR reports and other performance-related diagnostics. At the database level we observed high I/O and some complex, poorly performing SQL. At the server level we identified high CPU usage. We bounced the databases several times over the next few days to troubleshoot, but performance was only slightly better.
>
>
>
> We decided to peruse the hardware logs and looked at the system log on the ILOM for the node. We noticed we had 2 DIMMs receiving errors:
>
> Event Type - DIMM Service Required
>
> Subsystem – Memory
>
> Component – P0/D4 (CPU 0 DIMM 4) and P0/D5
>
> Message - The number of memory correctable errors has exceeded threshold limit. (Probability:100, UUID:7b4fe74e-4fb1-4a69-c966-b38d1cc8dab5, Resource:/SYS/MB/P0/D5
>
>
>
> My question is: do you think the poor performance is related to the memory issues/errors? Can you tell what state the memory was in from the errors?
>
>
>
> We also noticed at the server level (using top) that the [kipmi0] process was using over 100% CPU on a constant basis. Is this normal? Is this related to the memory errors?
>
> Also, the ora_dia0_<SID> process was using nearly 100% CPU during this poor performance event. Is this normal?
>
>
>
> We resolved the issue (or the issue went away) after we scheduled an outage and had Oracle replace the 2 bad DIMMs. Once the server was rebooted, performance returned to normal.
>
>
>
> We are wondering whether something else may have been going on, or whether this was solely related to hardware. We are also curious about the 2 high CPU processes.
>
>
>
> On a side note: Does anyone have a script/program/command to run (at regular intervals) to check the state of the hardware? We usually get emailed for hardware issues, but apparently this one was not bad enough to send an email.
>
>
>
> Thanks all!
>
>
>
>
>
> Glenn Travis
>
> DBA ▪ Database Services
>
> IT Enterprise Solutions
>
> SAS Institute
>
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Feb 09 2018 - 15:32:48 CET
