Latency of "direct path reads"

From: Hameed, Amir <Amir.Hameed_at_xerox.com>
Date: Mon, 19 Aug 2013 16:45:27 +0000
Message-ID: <AF02C941134B1A4AB5F61A726D08DCED0DD6E0AC_at_USA7109MB012.na.xerox.net>



Hi Folks,
We migrated our single instance ERP database to a four-node RAC system about a week ago. As part of this migration, we also moved from SAN to NAS. Here is a high-level architecture of the environment: Ø RDBMS and Grid versions.2.0.3 with April PSU

Ø Each node has two 10Gbe paths to the NAS.

Ø Database is configured with DNFS.

Ø Storage vendor is EMC. NAS Data Movers are connected to the EMC VMAX storage.

Ø Database files are sitting on SSD drives with RAID-5 configuration.

Ø Redo log files are on RAID-10

Since migrating to RAC, we are seeing high latencies on some batch jobs. Jobs that used to run in 45 minutes are now taking a few hours to run. We have a copy of production that is setup the same way as production (same disk layout, same hardware, etc.) with the only difference that non-production is a three-node RAC environment. If I take a simple statement to force a full scan in parallel on a large table and run it against production and the production-sized test environment, the average wait of "direct path reads" from 10046 trace is almost three times higher in production than in non-production. Here is the statement that I am running:

select /*+ full(MMT) parallel(MMT,8) */ count(*) from apps.mtl_material_transactions MMT ;

Production
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------

Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.02       0.02          0         31          0           0
Fetch        2    185.89     888.70    7706356    7706545          0           1

------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 185.91 888.73 7706356 7706576 0 1 Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ reliable message 2 0.00 0.00 enq: KO - fast object checkpoint 4 0.00 0.00 SQL*Net message to client 2 0.00 0.00 Disk file operations I/O 117 0.00 0.00 direct path read 83856 3.03 465.14 db file sequential read 14 0.00 0.02 SQL*Net message from client 2 0.00 0.00 ******************************************************************************** Non-production call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 2 0 0 Execute 1 0.03 0.03 0 31 0 0 Fetch 2 148.36 224.13 7103122 7103268 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 148.39 224.17 7103122 7103301 0 1 Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ enq: KO - fast object checkpoint 6 0.00 0.00 reliable message 2 0.00 0.00 enq: PS - contention 52 0.00 0.01 SQL*Net message to client 2 0.00 0.00 Disk file operations I/O 106 0.00 0.00 direct path read 3773 1.49 67.53 SQL*Net message from client 2 845.16 845.16 ********************************************************************************

The raw trace file shows that some IOs in production are taking over a second to complete as shown below:

WAIT #18446744071452723008: nam='direct path read' ela= 1109814 file number25 first dba"2080 block cnt8 obj#r04 tim•0081082913

We have EMC and Oracle engaged and are trying to diagnose the issue. Since, this is DNFS and all IOs are going over the network layer, we had our network folks look at the network layer and according to them all is clean there, however I am not convinced. EMC is reporting that from the NAS side, they are seeing an average service response of 2-4ms.

My question is, is there a way to send a large packet from the RAC host side and gauge its latency? Ping can work for up to 32k size packet but because Oracle tries to do a 1M IO for direct path reads, I am trying to see if there is a tool available that can allow me to send larger packets and gauge its latency.

Any help will be appreciated.

Thanks,
Amir

--
http://www.freelists.org/webpage/oracle-l
Received on Mon Aug 19 2013 - 18:45:27 CEST

Original text of this message