Exadata performance

From: Glenn Travis <Glenn.Travis_at_sas.com>
Date: Thu, 1 Feb 2018 17:59:56 +0000
Message-ID: <BL2PR05MB23535A63C05513D419F1B5149FFA0_at_BL2PR05MB2353.namprd05.prod.outlook.com>

Over the holidays we experienced very poor performance on one of our Exadata (X3-2 quarter rack) nodes. The other node was unaffected. We spent several days reviewing AWR reports and other performance diagnostics. At the database level we observed high I/O and some complex, poorly performing SQL. At the server level we saw high CPU usage. We bounced the databases several times over the next few days while troubleshooting, but performance improved only slightly.

We then decided to review the hardware logs and looked at the system log on the ILOM for the node. We noticed that 2 DIMMs were reporting errors:

Event Type - DIMM Service Required
Subsystem - Memory
Component - P0/D4 (CPU 0 DIMM 4) and P0/D5
Message - The number of memory correctable errors has exceeded threshold limit. (Probability:100, UUID:7b4fe74e-4fb1-4a69-c966-b38d1cc8dab5, Resource:/SYS/MB/P0/D5

My question is: do you think the poor performance was related to the memory errors? Can you tell what state the memory was in from the errors?

We also noticed at the server level (using top) that the [kipmi0] process was using over 100% CPU on a constant basis. Is this normal? Is it related to the memory errors? The ora_dia0_<SID> process was also using nearly 100% CPU during this poor performance event. Is that normal?

We resolved the issue (or the issue went away) after we scheduled an outage and had Oracle replace the 2 bad DIMMs. Once the server was rebooted, performance returned to normal.

We are wondering whether something else may have been going on, or whether this was solely hardware related. We are also curious about the 2 high-CPU processes.

On a side note: Does anyone have a script/program/command to run (at regular intervals) to check the state of the hardware? We usually get emailed for hardware issues, but apparently this one was not bad enough to send an email.
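
To make the request concrete, here is a rough sketch of the kind of check I have in mind, not a tested solution. It assumes ipmitool is installed on the node, that the script runs as root, and that Python 3.5 or later is available. It reads the IPMI System Event Log with `ipmitool sel elist` and reports any entries mentioning memory. The keyword list and the function names are my own placeholders, and I have not confirmed that correctable-error threshold events like ours actually show up in the SEL.

```python
#!/usr/bin/env python3
"""Sketch: scan the IPMI System Event Log for memory-related entries."""
import subprocess

# Assumed keywords; adjust to whatever wording your ILOM/BMC actually logs.
KEYWORDS = ("memory", "dimm", "ecc")


def find_suspect_events(sel_text, keywords=KEYWORDS):
    """Return the SEL lines that mention any keyword (case-insensitive)."""
    hits = []
    for line in sel_text.splitlines():
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(line.strip())
    return hits


def read_sel():
    """Run `ipmitool sel elist` and return its text output."""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise if ipmitool exits with an error
    )
    return result.stdout


def main():
    """Print matching events; return 1 if any were found, else 0."""
    events = find_suspect_events(read_sel())
    for event in events:
        print(event)
    return 1 if events else 0
```

Saved under a hypothetical name such as hwcheck.py, it could be run from cron with `python3 -c "import hwcheck; raise SystemExit(hwcheck.main())"`, so that a non-zero exit status (or any output) triggers an email.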

Thanks all!

Glenn Travis
DBA ▪ Database Services
IT Enterprise Solutions
SAS Institute

Received on Thu Feb 01 2018 - 18:59:56 CET
