
Tanel Poder

Linux, Oracle, Exadata and Hadoop.

Our take on the Oracle Database 12c In-Memory Option

Tue, 2014-06-10 15:26

Enkitec folks have been beta testing the Oracle Database 12c In-Memory Option over the past months, and recently the Oracle guys interviewed Kerry Osborne, Cary Millsap and me to get our opinions. In short, this thing rocks!

We can’t talk much about the technical details before Oracle 12.1.0.2 is officially out in July, but here’s the recorded interview that was published on the Oracle website as part of the In-Memory launch today:

Alternatively, go to the Oracle webpage:

Just scroll down to the Overview section that says: Video: Database Industry Experts Discuss Oracle Database In-Memory (11:10)

I might actually be even more excited about the In-Memory Option than I was about Exadata years ago. The In-Memory Option is not just a performance feature; it’s a simplifying feature too. So, now it’s OK to kill your performance problem with hardware, as long as you use it in a smart way :-)


Combining Bloom Filter Offloading and Storage Indexes on Exadata

Sat, 2014-05-17 22:15

Here’s a little-known feature of Exadata – you can use a Bloom filter computed from the join column of a table to skip disk I/Os against another table it is joined to. This is not the same as the Bloom filtering of datablock contents in the Exadata storage cells, but rather avoiding reading some storage regions from the disks completely.

So, you can use storage indexes to skip I/Os against your large fact table, based on a Bloom filter calculated from a small dimension table!

This is especially useful for dimensional star schemas, as your SQL statements might not have any direct predicates on your large fact tables at all – all results are determined by looking up the relevant dimension records and then performing a hash join to the fact table (whether you should have some direct predicates against the fact tables for performance reasons is a separate topic for some other day :-)

Let me show an example using the SwingBench Order Entry schema. The first output is from Oracle 11.2.0.3 BP21 on Cellsrv 12.1.1.1.0:

SQL> ALTER SESSION SET "_serial_direct_read"=ALWAYS;

Session altered.

SQL> SELECT
  2      /*+ LEADING(c)
  3          NO_SWAP_JOIN_INPUTS(o)
  4          INDEX_RS_ASC(c(cust_email))
  5          FULL(o)
  6          MONITOR
  7      */
  8      *
  9  FROM
 10      soe.customers c
 11    , soe.orders o
 12  WHERE
 13      o.customer_id = c.customer_id
 14  AND c.cust_email = 'florencio@ivtboge.com'
 15  /

CUSTOMER_ID CUST_FIRST_NAME                CUST_LAST_NAME  ...
----------- ------------------------------ --------------
  399999199 brooks                         laxton        

Elapsed: 00:00:55.81

You can ignore the hints in the query – I just used them to get the plan I wanted for my demo. I also forced serial full segment scans to use direct path reads (and thus attempt Smart Scans on Exadata) using the _serial_direct_read parameter.

Note that while I do have a direct filter predicate on the “small” CUSTOMERS table, I don’t have any predicates on the “large” ORDERS table, so there’s no filter predicate to offload to the storage layer for the ORDERS table (although column projection and HCC decompression can still be offloaded for that table too). Anyway, the query ran in over 55 seconds.

Let’s run this now with PARALLEL degree 2 and compare the results:

SQL> ALTER SESSION FORCE PARALLEL QUERY PARALLEL 2;

Session altered.

SQL> SELECT
  2      /*+ LEADING(c)
  3          NO_SWAP_JOIN_INPUTS(o)
  4          INDEX_RS_ASC(c(cust_email))
  5          FULL(o)
  6          MONITOR
  7      */
  8      *
  9  FROM
 10      soe.customers c
 11    , soe.orders o
 12  WHERE
 13      o.customer_id = c.customer_id
 14  AND c.cust_email = 'florencio@ivtboge.com'
 15  /

CUSTOMER_ID CUST_FIRST_NAME                CUST_LAST_NAME       
----------- ------------------------------ ---------------------
  399999199 brooks                         laxton               

Elapsed: 00:00:03.80

Now the query ran in less than 4 seconds. How did the same query run close to 15 times faster than in serial mode? Parallel degree 2 for this simple query should give me at most 4 slaves doing the work… a 15x speedup indicates that something else has changed as well.

The first thing I would normally suspect is that perhaps direct path reads were not attempted for the serial query (and Smart Scans did not kick in). That would certainly explain the big performance difference… but I did explicitly force serial direct path reads in my session. Anyway, let’s stop guessing and instead measure these experiments so we know for sure!

SQL Monitoring reports are a good starting point (I added the MONITOR hint to the query so that SQL Monitoring would kick in immediately even for serial queries).
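If you want to pull such a report yourself, here is a minimal sketch using DBMS_SQLTUNE.REPORT_SQL_MONITOR – it assumes you have the Tuning Pack licensed and that you plug in the SQL ID of your own statement (the &sql_id substitution variable is just a placeholder):

SET LONG 1000000 LONGCHUNKSIZE 1000000 PAGESIZE 0

SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
           sql_id       => '&sql_id',   -- the SQL ID you want to report on
           type         => 'TEXT',      -- 'HTML' and 'ACTIVE' formats also exist
           report_level => 'ALL'
       ) AS report
FROM   dual;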

Here’s the slow serial query:

Global Stats
====================================================================================================
| Elapsed |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Offload |
====================================================================================================
|      56 |      49 |     6.64 |        0.01 |     0.01 |     2 |     4M | 53927 |  29GB |  14.53% |
====================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2205859845)
==========================================================================================================================================
| Id |               Operation               |     Name      | Execs |   Rows   | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                       |               |       | (Actual) | Bytes | Offload |   (%)    |        (# samples)        |
==========================================================================================================================================
|  0 | SELECT STATEMENT                      |               |     1 |        1 |       |         |          |                           |
|  1 |   HASH JOIN                           |               |     1 |        1 |       |         |    42.86 | Cpu (24)                  |
|  2 |    TABLE ACCESS BY GLOBAL INDEX ROWID | CUSTOMERS     |     1 |        1 |       |         |          |                           |
|  3 |     INDEX RANGE SCAN                  | CUST_EMAIL_IX |     1 |        1 |       |         |          |                           |
|  4 |    PARTITION HASH ALL                 |               |     1 |     458M |       |         |          |                           |
|  5 |     TABLE ACCESS STORAGE FULL         | ORDERS        |    64 |     458M |  29GB |  14.53% |    57.14 | Cpu (29)                  |
|    |                                       |               |       |          |       |         |          | cell smart table scan (3) |
==========================================================================================================================================

So, the serial query issued 29 GB worth of I/O to the Exadata storage cells, but only 14.53% less data was sent back – not that great a reduction in the storage interconnect traffic. Also, it looks like all 458 million ORDERS rows were sent back from the storage cells (as the ORDERS table didn’t have any direct SQL predicates against it). Those rows were then fed to the HASH JOIN parent row source (DB CPU usage!), which threw all of the rows away but one. Talk about inefficiency!

Ok, this is the fast parallel query:

Global Stats
==============================================================================================================
| Elapsed | Queuing |   Cpu   |    IO    | Application |  Other   | Fetch | Buffer | Read  | Read  |  Cell   |
| Time(s) | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Offload |
==============================================================================================================
|    4.88 |    0.00 |    0.84 |     4.03 |        0.00 |     0.01 |     2 |     4M | 30056 |  29GB |  99.99% |
==============================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2121763430)
===============================================================================================================================================
| Id |                 Operation                  |     Name      | Execs |   Rows   | Read  |  Cell   | Activity |      Activity Detail      |
|    |                                            |               |       | (Actual) | Bytes | Offload |   (%)    |        (# samples)        |       
===============================================================================================================================================
|  0 | SELECT STATEMENT                           |               |     3 |        1 |       |         |          |                           |       
|  1 |   PX COORDINATOR                           |               |     3 |        1 |       |         |          |                           |       
|  2 |    PX SEND QC (RANDOM)                     | :TQ10001      |     2 |        1 |       |         |          |                           |       
|  3 |     HASH JOIN                              |               |     2 |        1 |       |         |          |                           |       
|  4 |      BUFFER SORT                           |               |     2 |        2 |       |         |          |                           |       
|  5 |       PX RECEIVE                           |               |     2 |        2 |       |         |          |                           |       
|  6 |        PX SEND BROADCAST                   | :TQ10000      |     1 |        2 |       |         |          |                           |       
|  7 |         TABLE ACCESS BY GLOBAL INDEX ROWID | CUSTOMERS     |     1 |        1 |       |         |          |                           |       
|  8 |          INDEX RANGE SCAN                  | CUST_EMAIL_IX |     1 |        1 |       |         |          |                           |       
|  9 |      PX PARTITION HASH ALL                 |               |     2 |     4488 |       |         |          |                           |       
| 10 |       TABLE ACCESS STORAGE FULL            | ORDERS        |    64 |     4488 |  29GB |  99.99% |   100.00 | cell smart table scan (4) |
===============================================================================================================================================

The fast parallel query still issued 29 GB of I/O, but only 0.01% of that data (compared to the amount of I/O issued) was sent back from the storage cells, so the offload efficiency for that table scan is 99.99%. Also, only a few thousand rows were returned from the full table scan (so the throw-away by the hash join later in the plan is smaller).

So, where does the big difference in runtime and offload efficiency metrics come from? It’s the same data and the same query (looking for the same values), so why do we get such different Offload Efficiency percentages?

The problem with looking at a single metric showing some percentage (of what?!) is that it hides a lot of detail. So, let’s get systematic and look into some detailed metrics :-)

I’ll use the Exadata Snapper for this purpose, although I could just list the V$SESSTAT metrics it uses as its source. ExaSnapper lists the various Exadata I/O metrics of the monitored session in one “chart”, in the same unit (MB) and scale – so it is easy to compare these different cases. If you do not know what Exadata Snapper is, check out this article and video.
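If you don’t have ExaSnapper at hand, you can look at a few of the same underlying statistics directly. The sketch below is a minimal version of that idea (not the full ExaSnapper logic) – it reports the cumulative values for the current session from V$MYSTAT, so to measure a single query you would snapshot these before and after running it and subtract:

SELECT sn.name
     , ROUND(ms.value / 1024 / 1024) AS mb
FROM   v$mystat ms
     , v$statname sn
WHERE  ms.statistic# = sn.statistic#
AND    sn.name IN ( 'physical read total bytes'
                  , 'cell physical IO bytes eligible for predicate offload'
                  , 'cell physical IO bytes saved by storage index'
                  , 'cell physical IO interconnect bytes returned by smart scan' );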

Here’s the slow serial query (you may need to scroll the output right to see the MB numbers):

SQL> SELECT * FROM TABLE(exasnap.display_snap(:t1,:t2));

NAME
-------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.81 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )
-------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO                    DB_PHYSIO_BYTES               |##################################################|    29291 MB    365 MB/sec
DB_LAYER_IO                    DB_PHYSRD_BYTES               |##################################################|    29291 MB    365 MB/sec
DB_LAYER_IO                    DB_PHYSWR_BYTES               |                                                  |        0 MB      0 MB/sec
AVOID_DISK_IO                  PHYRD_FLASH_RD_BYTES          |################################################# |    29134 MB    363 MB/sec
AVOID_DISK_IO                  PHYRD_STORIDX_SAVED_BYTES     |                                                  |        0 MB      0 MB/sec
REAL_DISK_IO                   SPIN_DISK_IO_BYTES            |                                                  |      157 MB      2 MB/sec
REAL_DISK_IO                   SPIN_DISK_RD_BYTES            |                                                  |      157 MB      2 MB/sec
REAL_DISK_IO                   SPIN_DISK_WR_BYTES            |                                                  |        0 MB      0 MB/sec
REDUCE_INTERCONNECT            PRED_OFFLOADABLE_BYTES        |##################################################|    29291 MB    365 MB/sec
REDUCE_INTERCONNECT            TOTAL_IC_BYTES                |##########################################        |    24993 MB    312 MB/sec
REDUCE_INTERCONNECT            SMART_SCAN_RET_BYTES          |##########################################        |    24993 MB    312 MB/sec
REDUCE_INTERCONNECT            NON_SMART_SCAN_BYTES          |                                                  |        0 MB      0 MB/sec
CELL_PROC_DEPTH                CELL_PROC_DATA_BYTES          |##################################################|    29478 MB    368 MB/sec
CELL_PROC_DEPTH                CELL_PROC_INDEX_BYTES         |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_TO_CLIENT_BYTES           |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_FROM_CLIENT_BYTES         |                                                  |        0 MB      0 MB/sec

DB_PHYSRD_BYTES says this session submitted 29291 MB of read I/O as far as the database layer sees it. PRED_OFFLOADABLE_BYTES says that all of this 29291 MB of I/O was attempted in a smart way (Smart Scan). SMART_SCAN_RET_BYTES tells us that 24993 MB worth of data was sent back from the storage cells as a result of the Smart Scan (about 14.5% less than the amount of I/O issued).

However, PHYRD_STORIDX_SAVED_BYTES shows zero – we couldn’t skip any I/Os when scanning the ORDERS table. We couldn’t use the storage indexes because we did not have any direct filter predicates on the ORDERS table.

Anyway, the story is different for the fast parallel query:

SQL> SELECT * FROM TABLE(exasnap.display_snap(:t1,:t2));     

NAME
-------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.81 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )              
-------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO                    DB_PHYSIO_BYTES               |##################################################|    29291 MB   2212 MB/sec
DB_LAYER_IO                    DB_PHYSRD_BYTES               |##################################################|    29291 MB   2212 MB/sec
DB_LAYER_IO                    DB_PHYSWR_BYTES               |                                                  |        0 MB      0 MB/sec
AVOID_DISK_IO                  PHYRD_FLASH_RD_BYTES          |#######################                           |    13321 MB   1006 MB/sec
AVOID_DISK_IO                  PHYRD_STORIDX_SAVED_BYTES     |###########################                       |    15700 MB   1186 MB/sec
REAL_DISK_IO                   SPIN_DISK_IO_BYTES            |                                                  |      270 MB     20 MB/sec
REAL_DISK_IO                   SPIN_DISK_RD_BYTES            |                                                  |      270 MB     20 MB/sec
REAL_DISK_IO                   SPIN_DISK_WR_BYTES            |                                                  |        0 MB      0 MB/sec
REDUCE_INTERCONNECT            PRED_OFFLOADABLE_BYTES        |##################################################|    29291 MB   2212 MB/sec
REDUCE_INTERCONNECT            TOTAL_IC_BYTES                |                                                  |        4 MB      0 MB/sec
REDUCE_INTERCONNECT            SMART_SCAN_RET_BYTES          |                                                  |        4 MB      0 MB/sec
REDUCE_INTERCONNECT            NON_SMART_SCAN_BYTES          |                                                  |        0 MB      0 MB/sec
CELL_PROC_DEPTH                CELL_PROC_DATA_BYTES          |#######################                           |    13591 MB   1027 MB/sec
CELL_PROC_DEPTH                CELL_PROC_INDEX_BYTES         |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_TO_CLIENT_BYTES           |                                                  |        0 MB      0 MB/sec
CLIENT_COMMUNICATION           NET_FROM_CLIENT_BYTES         |                                                  |        0 MB      0 MB/sec

DB_PHYSIO_BYTES is still the same (29 GB) as in the previous case – this database layer metric only knows about the amount of I/O it has requested, not what actually gets (or doesn’t get) done inside the storage cells. SMART_SCAN_RET_BYTES is only 4 MB (compared to almost 25 GB previously), so evidently there must be some early filtering going on somewhere in the lower layers.

And now to the main topic of this article – the PHYRD_STORIDX_SAVED_BYTES metric (which comes from the cell physical IO bytes saved by storage index statistic in V$SESSTAT) shows that we managed to completely avoid doing 15700 MB worth of I/O, thanks to the storage indexes. And this is without having any direct SQL filter predicates on the large ORDERS table! How is this possible? Keep reading :-)

Before going deeper, let’s look into the predicate section of these cursors.

The slow serial query:

--------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | E-Rows | Pstart| Pstop |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |        |       |       |
|*  1 |  HASH JOIN                          |               |      1 |       |       |
|   2 |   TABLE ACCESS BY GLOBAL INDEX ROWID| CUSTOMERS     |      1 | ROWID | ROWID |
|*  3 |    INDEX RANGE SCAN                 | CUST_EMAIL_IX |      1 |       |       |
|   4 |   PARTITION HASH ALL                |               |    506M|     1 |    64 |
|   5 |    TABLE ACCESS STORAGE FULL        | ORDERS        |    506M|     1 |    64 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   3 - access("C"."CUST_EMAIL"='florencio@ivtboge.com')

The fast parallel query:

-------------------------------------------------------------------------------------------
| Id  | Operation                                | Name          | E-Rows | Pstart| Pstop |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |               |        |       |       |
|   1 |  PX COORDINATOR                          |               |        |       |       |
|   2 |   PX SEND QC (RANDOM)                    | :TQ10001      |      1 |       |       |
|*  3 |    HASH JOIN                             |               |      1 |       |       |
|   4 |     BUFFER SORT                          |               |        |       |       |
|   5 |      PX RECEIVE                          |               |      1 |       |       |
|   6 |       PX SEND BROADCAST                  | :TQ10000      |      1 |       |       |
|   7 |        TABLE ACCESS BY GLOBAL INDEX ROWID| CUSTOMERS     |      1 | ROWID | ROWID |
|*  8 |         INDEX RANGE SCAN                 | CUST_EMAIL_IX |      1 |       |       |
|   9 |     PX PARTITION HASH ALL                |               |    506M|     1 |    64 |
|* 10 |      TABLE ACCESS STORAGE FULL           | ORDERS        |    506M|     1 |    64 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   8 - access("C"."CUST_EMAIL"='florencio@ivtboge.com')
  10 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))

See how the parallel plan has a storage() predicate with SYS_OP_BLOOM_FILTER in it – the Bloom filter :BF0000 will be compared to the hashed CUSTOMER_ID column values when scanning the ORDERS table in the storage cells. So, this shows that your session attempts to push the Bloom filter down to the storage layer.

However, merely seeing the Bloom filter storage predicate in the plan doesn’t tell you whether any I/Os were actually avoided thanks to the storage indexes and Bloom filters. The only way to know is to run the query and look at the cell physical IO bytes saved by storage index metric (or PHYRD_STORIDX_SAVED_BYTES in Exadata Snapper). Unfortunately it is not easy to measure this at the row source level when you have multiple Smart Scans happening in different parts of your execution plan (more about this in the next blog entry :-)

Anyway, how is Oracle able to avoid reading in blocks of one table based on the output of another table involved in the join?

The answer lies in the parameter _bloom_minmax_enabled. Its description says “enable or disable bloom min max filtering”. This parameter is true by default, and on Exadata it means that in addition to computing the Bloom filter bitmap (based on hashed values in the join column), we also keep track of the smallest and biggest value retrieved from the driving table of the join.
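To see the value and description of hidden parameters like this one, you can query the X$ fixed tables as SYS – a well-known sketch (undocumented views, so treat this as a test-system convenience rather than an official interface):

-- run as SYS: X$ tables are not visible to normal users
SELECT p.ksppinm  AS parameter
     , v.ksppstvl AS value
     , p.ksppdesc AS description
FROM   x$ksppi  p
     , x$ksppcv v
WHERE  p.indx = v.indx
AND    p.ksppinm IN ('_bloom_minmax_enabled', '_bloom_serial_filter');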

In our example, the Bloom filter bitmap of all matching CUSTOMER_IDs retrieved from the CUSTOMERS table (after any direct predicate filtering) would be sent to the storage cells, together with the smallest and biggest CUSTOMER_ID retrieved from that driving table of the join.

In our hand-crafted example, where only a single customer was taken from the driving table, only one bit in the Bloom filter would be set, and both the MIN and MAX CUSTOMER_ID of interest (in the joined ORDERS table) were set to 399999199 (I had about 400 million customers generated in my benchmark dataset).

So, knowing that we were looking only for CUSTOMER_IDs in the single-value “range” of 399999199, the Smart Scan could start skipping I/Os thanks to the storage indexes in memory! You can think of this as runtime BETWEEN predicate generation from the driving table of a join, which then gets offloaded to the storage cells and applied when scanning the other table in the join. I/Os can be avoided thanks to knowing the range of values we are looking for, and further row-level filtering can then be done using the usual Bloom filter bitmap comparison.
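To illustrate the idea (this is purely conceptual – it is not what Oracle literally executes), the offloaded scan of ORDERS behaves roughly as if a range predicate had been generated at runtime from the driving table:

-- conceptual only: the MIN/MAX of the Bloom filter's source values act like
-- a runtime-generated range predicate that the storage indexes can use
SELECT *
FROM   soe.orders o
WHERE  o.customer_id BETWEEN 399999199   /* MIN customer_id from CUSTOMERS */
                         AND 399999199;  /* MAX customer_id from CUSTOMERS */
-- The Bloom filter bitmap then filters individual rows in whatever storage
-- regions could not be skipped based on this min/max range.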

 

Note that the above examples are from Oracle 11.2.0.3 – in this version the Bloom filter pushdown should kick in only for parallel queries (although Jonathan Lewis has blogged about a case where it shows up for serial queries on 11.2.0.3 as well).

And this is the ultimate reason why the serial execution was so much slower and less efficient than the parallel run – in my demos, on 11.2.0.3 the Bloom filter usage kicked in only when running the query in parallel.

However, this has changed in Oracle 11.2.0.4 – now serial queries also frequently compute Bloom filters and push them down to the storage, both for the usual Bloom filtering and for I/O skipping by comparing the Bloom filter min/max values against the Exadata storage index memory structures.

This is an example from Oracle 11.2.0.4, the plan is slightly different because in this database my test tables are not partitioned (and I didn’t have any indexes on the tables):

------------------------------------------------------------------
| Id  | Operation                   | Name      | E-Rows |E-Bytes|
------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |        |       |
|*  1 |  HASH JOIN                  |           |      1 |   114 |
|   2 |   JOIN FILTER CREATE        | :BF0000   |      1 |    64 |
|*  3 |    TABLE ACCESS STORAGE FULL| CUSTOMERS |      1 |    64 |
|   4 |   JOIN FILTER USE           | :BF0000   |   4581K|   218M|
|*  5 |    TABLE ACCESS STORAGE FULL| ORDERS    |   4581K|   218M|
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("O"."CUSTOMER_ID"="C"."CUSTOMER_ID")
   3 - storage("C"."CUST_EMAIL"='florencio@ivtboge.com')
       filter("C"."CUST_EMAIL"='florencio@ivtboge.com')
   5 - storage(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"O"."CUSTOMER_ID"))

The Bloom bitmap (and the MIN/MAX values) would be computed in step #2 and would then be used by the full table scan in step #5 – with the filtering info pushed all the way down to the storage layer (assuming that the Smart Scan did kick in).

If you look at the Outline hints section of this plan, you’ll see that a PX_JOIN_FILTER hint has shown up there. This instructs the optimizer to set up and use the Bloom filters (and despite the name, it can now be used for serial queries too):

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('11.2.0.4')
      DB_VERSION('11.2.0.4')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$1")
      FULL(@"SEL$1" "C"@"SEL$1")
      FULL(@"SEL$1" "O"@"SEL$1")
      LEADING(@"SEL$1" "C"@"SEL$1" "O"@"SEL$1")
      USE_HASH(@"SEL$1" "O"@"SEL$1")
      PX_JOIN_FILTER(@"SEL$1" "O"@"SEL$1")
      END_OUTLINE_DATA
  */

So if the CBO doesn’t choose it automatically, you can use this hint to test whether your queries could potentially benefit from this feature (note that it controls the whole join filter propagation logic, not only the Exadata part). You can also use the _bloom_serial_filter parameter to disable this behavior on Exadata (not that you should).
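For example, a hand-hinted version of the earlier demo query might look like the sketch below – just an illustration of the hint placement; whether the join filter actually helps at the storage layer still depends on a Smart Scan kicking in:

SELECT /*+ LEADING(c) USE_HASH(o) PX_JOIN_FILTER(o) */
       *
FROM   soe.customers c
     , soe.orders    o
WHERE  o.customer_id = c.customer_id
AND    c.cust_email  = 'florencio@ivtboge.com';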

Note that I usually treat the storage index I/O savings as just an added performance benefit at the system level – something that keeps the disk devices a little less busy and therefore yields higher overall throughput. I do not make my applications’ performance overly dependent on storage indexes, as it’s not easy to control which columns end up in the storage index hashtables, and that may give you unpredictable results.
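If you want a rough feel for how much the storage indexes are saving on your whole instance, a minimal sketch is to compare the cumulative instance-wide statistic to the total read volume (both cumulative since instance startup):

SELECT name
     , ROUND(value / 1024 / 1024 / 1024) AS gb_since_startup
FROM   v$sysstat
WHERE  name IN ( 'physical read total bytes'
               , 'cell physical IO bytes saved by storage index' );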

That’s all – hopefully this shed some light on this cool combination of multiple different features (Smart Scans + Hash Joins + Bloom Filters + Storage Indexes).

 


Enkitec + Accenture = Even More Awesomeness!

Thu, 2014-05-15 10:30

Enkitec is the best consulting firm for hands-on implementation, running and troubleshooting of Oracle-based systems, especially engineered systems like Exadata. We have a truly awesome group of people here; many are the best in their field (just look at the list!!!).

This is why I am here.

This is also why Accenture approached us some time ago – and you may already have seen today’s announcement that Enkitec got bought!

We all are now part of Accenture and this opens up a whole lot of new opportunities. I think this is BIG, and I will explain how I see the future (sorry, no Oracle Database internals in this post ;-)

In my opinion the single most important detail of this transaction is that both Enkitec and the folks at Accenture realize that the reason Enkitec is so awesome is that awesome techies want to work here. And we don’t just want to keep it that way – we must keep it that way!

The Enkitec group will not be dissolved into Accenture. If it were, we would disappear like a drop in the ocean, and Accenture would have lost its investment. Instead, we will remain an island in the ocean, continuing to provide expert help for our existing and new customers – and, in the long term, helping Accenture build additional capability for the massive projects of their customers.

We will not have ten thousand people in our group. Instead, we will continue hiring (and retaining) people exactly the way we have been – organic growth by having only the best, like-minded people. The main difference is that now, with Accenture behind us, we can hire the best people globally, as we’ll have operations in over 50 countries. I understand that we likely won’t even double in size in the next few years – as we plan to stick to hiring only the best.

I think we will have a much, much wider reach now, showing how to do Oracle technology “our way” all around the world. With Accenture behind us, we will be navigating through even larger projects in larger businesses, influencing things earlier and more. And on a more personal note, I’m looking forward to all those 10 rack Exadata and 100TB In-Memory DB Option performance projects ;-)

See you at Enkitec E4 in June!

 


What the heck are the /dev/shm/JOXSHM_EXT_x files on Linux?

Fri, 2014-05-09 14:12

There was an interesting question in Oracle-L about the JOXSHM_EXT_* files in /dev/shm directory on Linux. Basically something like this:

$ ls -l /dev/shm/* | head
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_0_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_100_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:16 /dev/shm/JOXSHM_EXT_101_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:23 /dev/shm/JOXSHM_EXT_102_LIN112_1409029
-rwxrwx--- 1 oracle dba 4096 Apr 18 10:23 /dev/shm/JOXSHM_EXT_103_LIN112_1409029
-rwxrwx--- 1 oracle dba 36864 Apr 18 10:23 /dev/shm/JOXSHM_EXT_104_LIN112_1409029
...

There are a few interesting MOS articles about these files and how/when to get rid of those (don’t remove any files before reading the notes!), but none of these articles explain why these JOXSHM (and PESHM) files are needed at all:

  • /dev/shm Filled Up With Files In Format JOXSHM_EXT_xxx_SID_xxx (Doc ID 752899.1)
  • Stale Native Code Files Are Being Cached with File Names Such as: JOXSHM_EXT*, PESHM_EXT*, PESLD* or SHMDJOXSHM_EXT* (Doc ID 1120143.1)
  • Ora-7445 [Ioc_pin_shared_executable_object()] (Doc ID 1316906.1)

Here’s an explanation – a slightly more elaborate version of what I already posted on Oracle-L:

The JOX files are related to Oracle’s in-database JVM JIT compilation. So, instead of interpreting the JVM bytecode at runtime, Oracle compiles it to architecture-specific native binary code – just like compiling C code with something like gcc would do. So the CPUs can execute that binary code directly, without any interpretation layers in between.

Now the question is: how do we load that binary code into our own process address space, so that the CPUs can execute this stuff directly?

This is why the JOX files exist. When JIT compilation is enabled (it’s on by default in Oracle 11g), the Java code you access in the database is compiled to machine code and saved into the JOX files. Each Java class or method gets its own file (I haven’t checked which it is exactly). Your Oracle process then maps those files into its address space with a mmap() system call. So, any time this compiled Java code has to be executed, your Oracle process can just jump to the compiled method’s address (and return back later).

Let’s do a little test:

SQL> SHOW PARAMETER jit

PARAMETER_NAME                                               TYPE        VALUE
------------------------------------------------------------ ----------- -----
java_jit_enabled                                             boolean     TRUE

Java just-in-time compilation is enabled. Where the JOX files are placed by default is OS-specific, but on Linux they go to /dev/shm (the in-memory filesystem), unless you specify some other directory with the _ncomp_shared_objects_dir parameter (and you’re not hitting one of the related bugs).
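If you want to check both of these settings in one go, a sketch like the one below works – the hidden parameter lookup again requires SYS and the undocumented X$ views, so treat it as a convenience rather than an official interface:

SELECT p.ksppinm  AS parameter
     , v.ksppstvl AS value
FROM   x$ksppi  p
     , x$ksppcv v
WHERE  p.indx = v.indx
AND    p.ksppinm IN ('java_jit_enabled', '_ncomp_shared_objects_dir');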

So let’s run some Java code in the Database:

SQL> SELECT DBMS_JAVA.GETVERSION FROM dual;

GETVERSION
--------------------------------------------
11.2.0.4.0

After this execution a bunch of JOX files showed up in /dev/shm:

$ ls -l /dev/shm | head
total 77860
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_0_LIN112_229381
-rwxrwx--- 1 oracle dba        12288 May  9 22:13 JOXSHM_EXT_10_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_11_LIN112_229381
-rwxrwx--- 1 oracle dba         8192 May  9 22:13 JOXSHM_EXT_12_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_13_LIN112_229381
-rwxrwx--- 1 oracle dba         4096 May  9 22:13 JOXSHM_EXT_14_LIN112_229381
...

When I check that process’s address space with pmap, I see that some of these JOX files are indeed mapped into it:

oracle@oel6:~$ sudo pmap -x 33390
33390:   oracleLIN112 (LOCAL=NO)
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000008048000  157584   21268       0 r-x--  oracle
0000000011a2c000    1236     372      56 rw---  oracle
0000000011b61000     256     164     164 rw---    [ anon ]
0000000013463000     400     276     276 rw---    [ anon ]
0000000020000000    8192       0       0 rw-s-  SYSV00000000 (deleted)
0000000020800000  413696       0       0 rw-s-  SYSV00000000 (deleted)
0000000039c00000    2048       0       0 rw-s-  SYSV16117e54 (deleted)
00000000420db000     120     104       0 r-x--  ld-2.12.so
00000000420f9000       4       4       4 r----  ld-2.12.so
00000000420fa000       4       4       4 rw---  ld-2.12.so
00000000420fd000    1604     568       0 r-x--  libc-2.12.so
000000004228e000       8       8       8 r----  libc-2.12.so
0000000042290000       4       4       4 rw---  libc-2.12.so
0000000042291000      12      12      12 rw---    [ anon ]
0000000042296000      92      52       0 r-x--  libpthread-2.12.so
00000000422ad000       4       4       4 r----  libpthread-2.12.so
00000000422ae000       4       4       4 rw---  libpthread-2.12.so
00000000422af000       8       4       4 rw---    [ anon ]
00000000422b3000      12       8       0 r-x--  libdl-2.12.so
00000000422b6000       4       4       4 r----  libdl-2.12.so
00000000422b7000       4       4       4 rw---  libdl-2.12.so
00000000f63b9000       4       4       4 rwxs-  JOXSHM_EXT_88_LIN112_229381
00000000f63ba000      16      16      16 rwxs-  JOXSHM_EXT_91_LIN112_229381
00000000f63be000       4       4       4 rwxs-  JOXSHM_EXT_90_LIN112_229381
00000000f63bf000       4       4       4 rwxs-  JOXSHM_EXT_89_LIN112_229381

...

Note the x and s bits (in rwxs-) for the JOX mapped segments – the x means that the Linux virtual memory manager allows the contents of these mapped files to be executed directly by the CPU, and the s means it’s a shared mapping (other processes can map this binary code into their address spaces as well).

Oracle can also load some of its binary libraries into its address space with the dynamic loader’s dlopen() call, but I verified using strace that the JOX files are “loaded” into the address space with just a mmap() syscall:

33390 open("/dev/shm/JOXSHM_EXT_85_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63c2000
33390 close(8)                          = 0
33390 open("/dev/shm/JOXSHM_EXT_87_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63c0000
33390 close(8)                          = 0
33390 open("/dev/shm/JOXSHM_EXT_89_LIN112_229381", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = 8
33390 mmap2(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_SHARED, 8, 0) = 0xfffffffff63bf000
33390 close(8)                          = 0
...

Just to check if these JOX files really are compiled machine-code instructions:

$ file /dev/shm/JOXSHM_EXT_88_LIN112_229381
/dev/shm/JOXSHM_EXT_88_LIN112_229381: data
$

Oops – the file command doesn’t recognize any specific file format from the contents, as it’s a bare slice of machine code and doesn’t contain the normal object module metadata that .o files would have… let’s try to disassemble this file and see if it contains sensible instructions:

$ objdump -b binary -m i386 -D /dev/shm/JOXSHM_EXT_88_LIN112_229381 | head -30

/dev/shm/JOXSHM_EXT_88_LIN112_229381:     file format binary


Disassembly of section .data:

00000000 :
       0:	f7 15 1e 33 00 00    	notl   0x331e
       6:	00 00                	add    %al,(%eax)
       8:	55                   	push   %ebp
       9:	89 e5                	mov    %esp,%ebp
       b:	83 c4 c8             	add    $0xffffffc8,%esp
       e:	8b 45 10             	mov    0x10(%ebp),%eax
      11:	89 c1                	mov    %eax,%ecx
      13:	83 e1 f8             	and    $0xfffffff8,%ecx
      16:	89 4d e0             	mov    %ecx,-0x20(%ebp)
      19:	8b 4d 08             	mov    0x8(%ebp),%ecx
      1c:	89 45 f0             	mov    %eax,-0x10(%ebp)
      1f:	8b 45 0c             	mov    0xc(%ebp),%eax
      22:	89 5d ec             	mov    %ebx,-0x14(%ebp)
      25:	89 7d e8             	mov    %edi,-0x18(%ebp)
      28:	89 75 e4             	mov    %esi,-0x1c(%ebp)
      2b:	c7 45 f8 02 00 00 40 	movl   $0x40000002,-0x8(%ebp)
      32:	8b 91 1d 02 00 00    	mov    0x21d(%ecx),%edx
      38:	8b 7d 14             	mov    0x14(%ebp),%edi
      3b:	89 45 f4             	mov    %eax,-0xc(%ebp)
      3e:	8d 45 f0             	lea    -0x10(%ebp),%eax
      41:	83 c2 04             	add    $0x4,%edx
      44:	89 91 1d 02 00 00    	mov    %edx,0x21d(%ecx)
      4a:	39 d0                	cmp    %edx,%eax
...

This file contains machine code indeed!

So this is how Oracle approaches native JIT compilation for the in-database JVM. In 11g it’s similar to the PL/SQL native compilation too (you’d see various PESHM_ files in /dev/shm). Before 11g, Oracle actually generated intermediate C code for your PL/SQL and then invoked an OS C compiler on it, but in 11g it’s all self-contained in the database code. Pretty cool!

 


Oracle Memory Troubleshooting, Part 4: Drilling down into PGA memory usage with V$PROCESS_MEMORY_DETAIL

Wed, 2014-03-26 16:58

If you haven’t read them, here are the previous articles in the Oracle memory troubleshooting series: Part 1, Part 2, Part 3.

Let’s say you have noticed that one of your Oracle processes is consuming a lot of private memory. The V$PROCESS view has PGA_USED_MEM / PGA_ALLOC_MEM columns for this. Note that this view tells you what Oracle thinks it is using – how many allocated/freed bytes it has kept track of. While this doesn’t always reflect the true memory usage of a process, as non-Oracle-heap allocation routines and OS libraries may allocate (and leak) memory of their own, it’s a good starting point and usually enough.
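A minimal sketch for finding the top private memory consumers from V$PROCESS (joined to V$SESSION so you can see whose sessions they are):

SELECT s.sid
     , s.username
     , p.spid
     , ROUND(p.pga_used_mem  / 1024 / 1024) AS pga_used_mb
     , ROUND(p.pga_alloc_mem / 1024 / 1024) AS pga_alloc_mb
FROM   v$process p
     , v$session s
WHERE  s.paddr = p.addr
ORDER  BY p.pga_alloc_mem DESC;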

Then, the V$PROCESS_MEMORY view allows you to see a basic breakdown of that process’s memory usage – whether it is used for SQL, PL/SQL or Java, is unused (Freeable), or falls under “Other”. You can use either the smem.sql or pmem.sql script for this (they report v$process_memory for a SID or OS PID):

SQL> @smem 198
Display session 198 memory usage from v$process_memory....

       SID        PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED
---------- ---------- ---------- --------------- ---------- ---------- -------------
       198         43         17 Freeable           1572864          0
       198         43         17 Other              5481102                  5481102
       198         43         17 PL/SQL                2024        136          2024
       198         43         17 SQL              117805736  117717824     118834536

From the above output we see that this session has allocated over 100 MB of private memory for “SQL” reasons. This normally means SQL workareas, so we can break it down further by querying V$SQL_WORKAREA_ACTIVE, which shows all currently in-use cursor workareas in the instance. I’m using the wrka.sql script for convenience – and listing only my SID’s workareas:

SQL> @wrka sid=198
Show Active workarea memory usage for where sid=198...

   INST_ID        SID  QCINST_ID      QCSID SQL_ID        OPERATION_TYPE                  PLAN_LINE POLICY                   ACTIVE_SEC ACTUAL_MEM_USED MAX_MEM_USED WORK_AREA_SIZE NUMBER_PASSES TEMPSEG_SIZE TABLESPACE
---------- ---------- ---------- ---------- ------------- ------------------------------ ---------- ------------------------ ---------- --------------- ------------ -------------- ------------- ------------ ------------------------------
         1        198                       ff8v9qhv21pm5 SORT (v2)                               1 AUTO                           14.6        64741376    104879104       97623040             0   2253389824 TEMP
         1        198                       ff8v9qhv21pm5 HASH-JOIN                               6 AUTO                           14.8         1370112      1370112        2387968             0
         1        198                       ff8v9qhv21pm5 BUFFER                                 25 AUTO                           14.8        11272192     11272192       11272192             0

The ACTUAL_MEM_USED column above shows the memory currently used by this workarea (which happens to be the SORT (v2) operation in that cursor’s execution plan line #1). It was only about 64 MB at the time I got to query this view, but MAX_MEM_USED shows it was about 100 MB at its peak. This can happen with multipass operations, where the merge phase may use less memory than the sort phase – or, once the sort has completed and the row source is ready to start sending sorted rows back, not much memory is needed anymore for just buffering the blocks read from TEMP (the sort_area_size vs. sort_area_retained_size thing from the past).

For completeness, I also have a script called wrkasum.sql that summarizes the workarea memory usage of all sessions in an instance – so if you’re not interested in a single session, but rather in a summary of which operation types tend to consume the most memory, you can use that:

SQL> @wrkasum
Top allocation reason by PGA memory usage

OPERATION_TYPE      POLICY      ACTUAL_PGA_MB ALLOWED_PGA_MB    TEMP_MB NUM_PASSES     NUM_QC NUM_SESSIONS 
------------------- ----------- ------------- -------------- ---------- ---------- ---------- ------------ 
SORT (v2)           AUTO                   58            100       1525          0          1            1            
BUFFER              AUTO                   11             11                     0          1            1            
HASH-JOIN           AUTO                    1              2                     0          1            1

You may want to modify the script and change the GROUP BY to SQL_ID if you want to list the top workarea-memory-consuming SQL statements across the whole instance (or to any other column of interest – like QCINST_ID/QCSID).
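A minimal sketch of such a modification – summarizing the currently active workareas by SQL_ID instead of by operation type (the columns are the same ones shown in the wrka output above):

SELECT sql_id
     , ROUND(SUM(actual_mem_used) / 1024 / 1024) AS actual_pga_mb
     , ROUND(SUM(work_area_size)  / 1024 / 1024) AS allowed_pga_mb
     , ROUND(SUM(tempseg_size)    / 1024 / 1024) AS temp_mb
     , COUNT(*)                                  AS num_workareas
FROM   v$sql_workarea_active
GROUP  BY sql_id
ORDER  BY SUM(actual_mem_used) DESC;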

But what about the following example:

SQL> @pmem 27199
Display process memory usage for SPID 27199...

       SID SPID                            PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED     CON_ID
---------- ------------------------ ---------- ---------- --------------- ---------- ---------- ------------- ----------
      1516 27199                           120        198 Freeable            786432          0                        0
      1516 27199                           120        198 Other            842807461                842807461          0
      1516 27199                           120        198 PL/SQL              421064      77296        572344          0
      1516 27199                           120        198 SQL                2203848      50168       2348040          0

Most of the memory (over 800 MB) is consumed by the “Other” category?! Not that helpful, huh? V$SQL_WORKAREA_ACTIVE didn’t show anything either, as it deals only with SQL workareas and not with all the other possible reasons why an Oracle process might allocate memory.

So we need a way to drill down into the Other category and see which allocation reasons have taken most of this memory. Historically this was only doable with a PGA/UGA memory heapdump and by aggregating the resulting dumpfile. You have to use oradebug to get the target process to dump its own private memory breakdown, as it’s private memory and other processes cannot just read it directly. I have written about this in Part 1 of the Oracle memory troubleshooting series.

Update: an alternative to ORADEBUG is to use ALTER SESSION SET EVENTS ‘immediate trace name pga_detail_get level N’ where N is the Oracle PID of the process. 

However, starting from Oracle 10.2 you can get a similar detailed breakdown by querying V$PROCESS_MEMORY_DETAIL – no need for post-processing tracefiles! But when you just query it, the view does not return any rows:

SQL> SELECT * FROM v$process_memory_detail;

no rows selected

Again, this is for the abovementioned reason – your current process cannot just read the contents of some other process’s private memory; the OS ensures that. You have to ask the target process to populate V$PROCESS_MEMORY_DETAIL with its memory allocation breakdown. You can do this using the ORADEBUG DUMP PGA_DETAIL_GET command:

SQL> ORADEBUG SETMYPID
Statement processed.
SQL> ORADEBUG DUMP PGA_DETAIL_GET 49
Statement processed.

The number 49 above is the Oracle PID (v$process.pid) of the target process I want to examine. The oradebug PGA_DETAIL_GET command does not immediately make the target process report its usage – it merely sets a flag somewhere, which the target process checks when it is active. In other words, if the target process is idle or sleeping for a long time (due to some lock, for example), it won’t populate the V$ view with the required data. In my test environment, V$PROCESS_MEMORY_DETAIL got populated only after I ran another dummy command in the target session. This shouldn’t be an issue if you are examining a process that’s actively doing something (and not idle/sleeping for a long time).

The output below is from another dummy demo session that wasn’t using much of memory:

SQL> SELECT * FROM v$process_memory_detail ORDER BY pid, bytes DESC;

       PID    SERIAL# CATEGORY        NAME                       HEAP_NAME            BYTES ALLOCATION_COUNT HEAP_DES PARENT_H
---------- ---------- --------------- -------------------------- --------------- ---------- ---------------- -------- --------
        49          5 Other           permanent memory           pga heap            162004               19 11B602C0 00
        49          5 SQL             QERHJ Bit vector           QERHJ hash-joi      131168                8 F691EF4C F68F6F7C
        49          5 Other           kxsFrame4kPage             session heap         57736               14 F68E7134 11B64780
        49          5 SQL             free memory                QERHJ hash-joi       54272                5 F691EF4C F68F6F7C
        49          5 Other           free memory                pga heap             41924                8 11B602C0 00
        49          5 Other           miscellaneous                                   39980              123 00       00
        49          5 Other           Fixed Uga                  Fixed UGA heap       36584                1 F6AA44B0 11B602C0
        49          5 Other           permanent memory           top call heap        32804                2 11B64660 00
        49          5 Other           permanent memory           session heap         32224                2 F68E7134 11B64780
        49          5 Other           free memory                top call heap        31692                1 11B64660 00
        49          5 Other           kgh stack                  pga heap             17012                1 11B602C0 00
        49          5 Other           kxsFrame16kPage            session heap         16412                1 F68E7134 11B64780
        49          5 Other           dbgeInitProcessCtx:InvCtx  diag pga             15096                2 F75A8630 11B602C0
...

The BYTES column shows the sum of memory allocated from the private memory heap HEAP_NAME for the reason shown in the NAME column. If you want to know the average allocation (chunk) size in the heap, divide BYTES by ALLOCATION_COUNT.
For example, the top PGA memory user in that process is an allocation called “permanent memory”, 162004 bytes taken straight from the top-level “pga heap”. It probably contains all kinds of low-level runtime allocations that the process needs for its own purposes. It may be possible to drill down into the subheaps inside that allocation with the Oracle memory top-5 subheap dumping I have written about before.

The 2nd biggest memory user is in the SQL category – the “QERHJ Bit vector” allocation, 131168 bytes allocated in 8 chunks of ~16 kB each (on average). QERHJ should mean Query Execution Row-source Hash-Join, and the hash join bit vector is a hash join optimization (somewhat like a Bloom filter on hash buckets) – Jonathan Lewis has written about this in his CBO book.
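A minimal sketch that adds the average chunk size calculation mentioned above, once the target process (Oracle PID 49 here, as in the demo) has populated the view:

SELECT category
     , name
     , heap_name
     , bytes
     , allocation_count
     , ROUND(bytes / NULLIF(allocation_count, 0)) AS avg_chunk_bytes
FROM   v$process_memory_detail
WHERE  pid = 49            -- the target v$process.pid
ORDER  BY bytes DESC;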

I do have a couple of scripts that automate running the ORADEBUG command, waiting for a second so that the target process has a chance to publish its data in V$PROCESS_MEMORY_DETAIL, and then querying it. Check out smem_detail.sql and pmem_detail.sql.

Now, let’s look into a real example from a problem case – a stress test environment on Oracle 12c:

SQL> @smem 1516
Display session 1516 memory usage from v$process_memory....

       SID        PID    SERIAL# CATEGORY         ALLOCATED       USED MAX_ALLOCATED     CON_ID
---------- ---------- ---------- --------------- ---------- ---------- ------------- ----------
      1516        120        198 Freeable            786432          0                        0
      1516        120        198 Other            844733773                844733773          0
      1516        120        198 PL/SQL              421064      77296        572344          0
      1516        120        198 SQL                 277536      45904       2348040          0

The Other memory usage of a session has grown to over 800MB!

Let’s drill down deeper. The script warns that it’s experimental and asks you to press enter to continue as it’s using ORADEBUG. I haven’t seen any problems with it, but use it at your own risk (and stay away from critical background processes on production systems)!

SQL> @smem_detail 1516

WARNING! About to run an undocumented ORADEBUG command
for getting heap details.
This script is EXPERIMENTAL, use at your own risk!

Press ENTER to continue, or CTRL+C to cancel

PL/SQL procedure successfully completed.

STATUS
----------
COMPLETE

If the status above is not COMPLETE then you need to wait
for the target process to do some work and re-run the
v$process_memory_detail query in this script manually
(or just take a heapdump level 29 to get heap breakdown
in a tracefile)

       SID CATEGORY        NAME                       HEAP_NAME            BYTES ALLOCATION_COUNT
---------- --------------- -------------------------- --------------- ---------- ----------------
      1516 Other           permanent memory           qmxlu subheap    779697376           203700
      1516 Other           free memory                qmxlu subheap     25960784           202133
      1516 Other           XVM Storage                XVM subheap of     5708032               51
      1516 Other           free memory                session heap       2722944              598
      1516 Other           permanent memory           pga heap            681992               36
      1516 Other           qmushtCreate               qmtmInit            590256                9
      1516 Other           free memory                top uga heap        449024              208
      1516 Other           qmtmltAlloc                qmtmInit            389680             1777
      1516 Other           permanent memory           kolarsCreateCt      316960               15
      1516 Other           free memory                pga heap            306416               17
      1516 Other           miscellaneous                                  297120              105
      1516 Other           permanent memory           qmxtgCreateBuf      279536               73
      1516 Other           free memory                koh dur heap d      239312              134
      1516 Other           kxsFrame4kPage             session heap        232512               56
      1516 Other           permanent memory           qmcxdDecodeIni      228672               21
      1516 Other           permanent memory           qmxtigcp:heap       215936              730
      1516 Other           permanent memory           session heap        189472               28
      1516 Other           free memory                lpxHeap subhea      182760               32
      1516 Other           kfioRqTracer               pga heap            131104                1
      1516 Other           free memory                top call heap       129312                4
      1516 PL/SQL          recursive addr reg file    koh-kghu sessi      110592               10
      1516 Other           free memory                callheap            109856                4
      1516 Other           koh-kghu session heap      session heap         88272               36
      1516 Other           Fixed Uga                  pga heap             72144                1
      1516 PL/SQL          PL/SQL STACK               PLS PGA hp           68256                4
...

Well, there you go – the power of measuring and profiling. Most of that big memory usage comes from something called the qmxlu subheap. Now, while this name is cryptic and we don’t know what it means, we are already half-way there – we at least know what to focus on. A quick MOS search might just tell us, and if there are known bugs related to this memory leak, you might find what’s affecting you right away (as Oracle support analysts may have pasted some symptoms, patch info and workarounds into the bug note):

[Screenshot: My Oracle Support search results for the “qmxlu subheap” memory allocation reason]

Indeed, there are plenty of results in MOS, and when browsing through them to find the one matching our symptoms and environment most closely, I looked into this: ORA-4030 With High Allocation Of “qmxdpls_subheap” (Doc ID 1509914.1). It came up in the search because the support analyst had pasted a recursive subheap dump containing our symptom – “qmxlu subheap” – there:

Summary of subheaps at depth 2
5277 MB total:
 5277 MB commented, 128 KB permanent
 174 KB free (110 KB in empty extents),
   2803 MB, 1542119496 heaps:   "               "
   1302 MB, 420677 heaps:   "qmxlu subheap  "
    408 MB, 10096248 chunks:  "qmxdplsArrayGetNI1        " 2 KB free held
    385 MB, 10096248 chunks:  "qmxdplsArrayNI0           " 2 KB free held

In this note, the referenced bug had been closed as “not a bug”, hinting that the memory usage growth may be caused by an application issue (an application “object” leak) rather than an internal memory leak.

Cause:

The cause of this problem has been identified in:
unpublished Bug:8918821 – MEMORY LEAK IN DBMS_XMLPARSER IN QMXDPLS_SUBHEAP
closed as “not a bug”. The problem is caused by the fact that the XML document is created with XMLDOM.CREATEELEMENT, but after creation XMLDOM.FREEDOCUMENT is not called. This causes the XML used heaps to remain allocated. Every new call to XMLDOM.CREATEELEMENT will then allocate a new heap, causing process memory to grow over time, and hence cause the ORA-4030 error to occur in the end.

Solution:

To implement a solution for this issue, use XMLDOM.FREEDOCUMENT to explicitly free any explicitly or implictly created XML document, so the memory associated with that document can be released for reuse.

And indeed, in our case it turned out to be an application issue – the application did not free the XMLDOM documents after use, slowly accumulating more and more open document memory structures, using more memory and also more CPU time (judging by the ALLOCATION_COUNT figure in the smem_detail output above, the internal array used for managing the open document structures had grown to 203700 entries). Once the application object leak was fixed, the performance and memory usage problem went away.

Summary:

V$PROCESS_MEMORY_DETAIL allows you to conveniently dig deeper into process PGA memory usage. The alternative is to use Oracle heapdumps. A few more useful comments about it are in an old Oracle-L post.

Normally my process memory troubleshooting and drilldown sequence goes like this (usually only steps 1-2 are enough; 3-4 are rarely needed):

  1. v$process / v$process_memory / top / ps
  2. v$sql_workarea_active
  3. v$process_memory_detail or heapdump_analyzer
  4. pmap -x at OS level

#1, #2 and #3 above can show you “session”-level memory usage (assuming you are using dedicated servers, with a 1:1 relationship between a session and a process), and #4 can show you a different view into the real process memory usage from the OS perspective.

Even though you may see cryptic allocation reason names in the output, if reason X causes 95% of your problem, you only need to focus on finding out what X means and don’t need to waste time on anything else. If there’s an Oracle bug involved, a MOS search on the top memory consumer names will likely point you to the relevant bug right away.

Oracle troubleshooting is fun!
