Skip navigation.

Tanel Poder

Syndicate content
Linux, Oracle, Exadata and Hadoop.
Updated: 10 hours 55 min ago

Oracle In-Memory Column Store Internals – Part 1 – Which SIMD extensions are getting used?

Sun, 2014-10-05 23:51

This is the first entry in a series of random articles about some useful internals-to-know of the awesome Oracle Database In-Memory column store. I intend to write about Oracle’s IM stuff that’s not already covered somewhere else and also about some general CPU topics (that are well covered elsewhere, but not always so well known in the Oracle DBA/developer world).

Before going into further details, you might want to review the Part 0 of this series and also our recent Oracle Database In-Memory Option in Action presentation with some examples. And then read this doc by Intel if you want more info on how the SIMD registers and instructions get used.

There’s a lot of talk about the use of your CPUs’ SIMD vector processing capabilities in the Oracle inmemory module, let’s start by checking if it’s enabled in your database at all. We’ll look into Linux/Intel examples here.

The first generation of SIMD extensions in Intel Pentium world were called MMX. It added 8 new XMMn registers, 64 bits each. Over time the registers got widened, more registers and new features were added. The extensions were called Streaming SIMD Extensions (SSE, SSE2, SSSE3, SSE4.1, SSE4.2) and Advanced Vector Extensions (AVX and AVX2).

The currently available AVX2 extensions provide 16 x 256 bit YMMn registers and the AVX-512 in upcoming King’s Landing microarchitecture (year 2015) will provide 32 x 512 bit ZMMn registers for vector processing.

So how to check which extensions does your CPU support? On Linux, the “flags” column in /proc/cpuinfo easily provides this info.

Let’s check the Exadatas in our research lab:

Exadata V2:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

So the highest SIMD extension support on this Exadata V2 is SSE4.2 (No AVX!)

Exadata X2:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

Exadata X2 also has SSE4.2 but no AVX.

Exadata X3:

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
avx
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

The Exadata X3 supports the newer AVX too.

My laptop (Macbook Pro late 2013):
The Exadata X4 has not yet arrived to our lab, so I’m using my laptop as an example of a latest available CPU with AVX2:

Update: Jason Arneil commented that the X4 does not have AVX2 capable CPUs (but the X5 will)

$ grep "^model name" /proc/cpuinfo | sort | uniq
model name	: Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq
avx
avx2
popcnt
sse
sse2
sse4_1
sse4_2
ssse3

The Core-i7 generation supports everything up to the current AVX2 extension set.

So, which extensions is Oracle actually using? Let’s check!

As Oracle needs to run different binary code on CPUs with different capabilities, some of the In-Memory Data (kdm) layer code has been duplicated into separate external libraries – and then gets dynamically loaded into Oracle executable address space as needed. You can run pmap on one of your Oracle server processes and grep for libshpk:

$ pmap 21401 | grep libshpk
00007f0368594000   1604K r-x--  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so
00007f0368725000   2044K -----  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so
00007f0368924000     72K rw---  /u01/app/oracle/product/12.1.0.2/dbhome_1/lib/libshpksse4212.so

My (educated) guess is that the “shpk” in libshpk above stands for oS dependent High Performance [K]ompression. “s” prefix normally means platform dependent (OSD) code and this low-level SIMD code sure is platform and CPU microarchitecture version dependent stuff.

Anyway, the above output from an Exadata X2 shows that SSE4.2 SIMD HPK libraries are used on this platform (and indeed, X2 CPUs do support SSE4.2, but not AVX).

Let’s list similar files from $ORACLE_HOME/lib:

$ cd $ORACLE_HOME/lib
$ ls -l libshpk*.so
-rw-r--r-- 1 oracle oinstall 1818445 Jul  7 04:16 libshpkavx12.so
-rw-r--r-- 1 oracle oinstall    8813 Jul  7 04:16 libshpkavx212.so
-rw-r--r-- 1 oracle oinstall 1863576 Jul  7 04:16 libshpksse4212.so

So, there are libraries for AVX and AVX2 in the lib directory too (the “12” suffix for all file names just means Oracle version 12). The AVX2 library is almost empty though (and the nm/objdump commands don’t show any Oracle functions in it, unlike in the other files).

Let’s run pmap on a process in my new laptop (which supports AVX and AVX2 ) to see if the AVX2 library gets used:

$ pmap 18969 | grep libshpk     
00007f85741b1000   1560K r-x-- libshpkavx12.so
00007f8574337000   2044K ----- libshpkavx12.so
00007f8574536000     72K rw--- libshpkavx12.so

Despite my new laptop supporting AVX2, only the AVX library is used (the AVX2 library is named libshpkavx212.so). So it looks like the AVX2 extensions are not used yet in this version (it’s the first Oracle 12.1.0.2 GA release without any patches). I’m sure this will be added soon, along with more features and bugfixes.

To be continued …

Related Posts

My presentations at OOW 2014 (See you there!)

Mon, 2014-09-22 17:16

Here’s where I will hang out (and in some cases speak) during the OOW:

Sunday, Sep 28 3:30pm – Moscone South – 310

Monday, Sep 29 8:30am – 4:00pm - Creativity Museum

  • I will mostly hang out at the OakTableWorld satellite event and listen to the awesome talks there.

Tuesday, Sep 30 10:00am – Creativity Museum

  • I will speak about Hacking Oracle 12c for an hour at OakTableWorld (random stuff about the first things I researched when Oracle 12c was released)
  • I also plan to hang out there for most of the day, so see you there!

Wednesday, Oct 1 – 3:00pm – Jillian’s

  • I’ll be at Enkitec’s “office” (read: we’ll have beer) in Jillian’s (on 4th St between Mission/Howard) from 3pm onwards on Wednesday, so, come by for a chat.
  • Right after Enkitec’s office hours I’ll head to the adjacent room for the OTN Bloggers meetup and this probably means more beer & chat.

Thursday, Oct 2 – 10:45am – Moscone South – 104

  • Oracle In-Memory Database In Action
  • In this presentation Kerry and I will walk you through the performance differences when swithching from an old DW/reporting system (on a crappy I/O subsystem) all the way to having your data cached in Oracle’s In-Memory Column Store – with all the Oracle 12.1.0.2’s performance bells and whistles enabled. It will be awesome – see you there! ;-)

 

Related Posts

About index range scans, disk re-reads and how your new car can go 600 miles per hour!

Wed, 2014-09-17 02:56

Despite the title, this is actually a technical post about Oracle, disk I/O and Exadata & Oracle In-Memory Database Option performance. Read on :)

If a car dealer tells you that this fancy new car on display goes 10 times (or 100 or 1000) faster than any of your previous ones, then either the salesman is lying or this new car is doing something radically different from all the old ones. You don’t just get orders of magnitude performance improvements by making small changes.

Perhaps the car bends space around it instead of moving – or perhaps it has a jet engine built on it (like the one below :-) :

Anyway, this blog entry is a prelude to my upcoming Oracle In-Memory Database Option series and here I’ll explain one of the radical differences between the old way of thinking and modern (In-Memory / Smart Scan) thinking that allow such performance improvements.

To set the scope and and clarify what I mean by the “old way of thinking”: I am talking about reporting, analytics and batch workloads here – and the decades old mantra “if you want more speed, use more indexes”.

I’m actually not going to talk about the In-Memory DB option here – but I am going to walk you through the performance numbers of one index range scan. It’s a deliberately simple and synthetic example executed on my laptop, but it should be enough to demonstrate one important point.

Let’s say we have a report that requires me to visit 20% of rows in an orders table and I’m using an index range scan to retrieve these rows (let’s not discuss whether that’s wise or not just yet). First, I’ll give you some background information about the table and index involved.

My test server’s buffer cache is currently about 650 MB:

SQL> show sga

Total System Global Area 2147483648 bytes
Fixed Size                  2926472 bytes
Variable Size             369100920 bytes
Database Buffers          687865856 bytes
Redo Buffers               13848576 bytes
In-Memory Area           1073741824 bytes

The table I am accessing is a bit less than 800 MB in size, about 100k blocks:

SQL> @seg soe.orders

    SEG_MB OWNER  SEGMENT_NAME   SEGMENT_TYPE    BLOCKS 
---------- ------ -------------  ------------- -------- 
       793 SOE    ORDERS         TABLE           101504 

I have removed some irrelevant output from the output below, I will be using the ORD_WAREHOUSE_IX index for my demo:

SQL> @ind soe.orders
Display indexes where table or index name matches %soe.orders%...

TABLE_OWNER  TABLE_NAME  INDEX_NAME         POS# COLUMN_NAME     DSC
------------ ----------- ------------------ ---- --------------- ----
SOE          ORDERS      ORDER_PK              1 ORDER_ID
                         ORD_WAREHOUSE_IX      1 WAREHOUSE_ID
                                               2 ORDER_STATUS

INDEX_OWNER  TABLE_NAME  INDEX_NAME        IDXTYPE    UNIQ STATUS   PART TEMP  H  LFBLKS       NDK   NUM_ROWS      CLUF LAST_ANALYZED     DEGREE VISIBILIT
------------ ----------- ----------------- ---------- ---- -------- ---- ---- -- ------- --------- ---------- --------- ----------------- ------ ---------
SOE          ORDERS      ORDER_PK          NORMAL/REV YES  VALID    NO   N     3   15801   7148950    7148950   7148948 20140913 16:17:29 16     VISIBLE
             ORDERS      ORD_WAREHOUSE_IX  NORMAL     NO   VALID    NO   N     3   17860      8685    7148950   7082149 20140913 16:18:03 16     VISIBLE

I am going to do an index range scan on the WAREHOUSE_ID column:

SQL> @descxx soe.orders

Col# Column Name                    Null?      Type                      NUM_DISTINCT        Density  NUM_NULLS HISTOGRAM       NUM_BUCKETS Low Value                        High Value
---- ------------------------------ ---------- ------------------------- ------------ -------------- ---------- --------------- ----------- -------------------------------- --------------------------------
   1 ORDER_ID                       NOT NULL   NUMBER(12,0)                   7148950   .00000013988          0                           1 1                                7148950
...
   9 WAREHOUSE_ID                              NUMBER(6,0)                        999   .00100100100          0                           1 1                                999
...

Also, I enabled SQL trace and event 10298 – “ORA-10298: ksfd i/o tracing”, more about that later:

SQL> ALTER SESSION SET EVENTS '10298 trace name context forever, level 1';

Session altered.

SQL> EXEC SYS.DBMS_MONITOR.SESSION_TRACE_ENABLE(waits=>TRUE);

PL/SQL procedure successfully completed.

SQL> SET AUTOTRACE ON STAT

Ok, now we are ready to run the query! (It’s slightly formatted):

SQL> SELECT /*+ MONITOR INDEX(o, o(warehouse_id)) */ 
         SUM(order_total) 
     FROM 
         soe.orders o 
     WHERE 
         warehouse_id BETWEEN 400 AND 599;

Let’s check the basic autotrace figures:

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
    1423335  consistent gets
     351950  physical reads
          0  redo size
        347  bytes sent via SQL*Net to client
        357  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

What?! We have done 351950 physical reads?! This is 351950 blocks read via physical read operations. This is about 2.7 GB worth of IOs done just for this query! Our entire table size was under 800MB and the index size under 150MB. Shouldn’t indexes allow us to visit less blocks than the table size?!

Let’s dig deeper – by breaking down this IO number by execution plan line (using a SQL Monitoring report in this case):

Global Stats
================================================================
| Elapsed |   Cpu   |    IO    | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
================================================================
|      48 |      25 |       23 |     1 |     1M | 352K |   3GB |
================================================================

SQL Plan Monitoring Details (Plan Hash Value=16715356)
=============================================================================================================================================
| Id |               Operation                |       Name       | Execs |   Rows   | Read | Read  | Activity |       Activity Detail       |
|    |                                        |                  |       | (Actual) | Reqs | Bytes |   (%)    |         (# samples)         |
=============================================================================================================================================
|  0 | SELECT STATEMENT                       |                  |     1 |        1 |      |       |          |                             |
|  1 |   SORT AGGREGATE                       |                  |     1 |        1 |      |       |          |                             |
|  2 |    TABLE ACCESS BY INDEX ROWID BATCHED | ORDERS           |     1 |       1M | 348K |   3GB |    96.30 | Cpu (1)                     |
|    |                                        |                  |       |          |      |       |          | db file parallel read (25)  |
|  3 |     INDEX RANGE SCAN                   | ORD_WAREHOUSE_IX |     1 |       1M | 3600 |  28MB |     3.70 | db file sequential read (1) |
=============================================================================================================================================

So, most of these IOs come from accessing the table (after fetching relevant ROWIDs from the index). 96% of response time of this query was also spent in that table access line. We have done about ~348 000 IO requests for fetching blocks from this table. This is over 3x more blocks than the entire table size! So we must be re-reading some blocks from disk again and again for some reason.

Let’s confirm if we are having re-reads. This is why I enabled the SQL trace and event 10298. I can just post-process the tracefile and see if IO operations with the same file# and block# combination do show up.

However, using just SQL trace isn’t enough because multiblock read wait events don’t show all blocks read (you’d have to infer this from the starting block# and count#), the “db file parallel read” doesn’t show any block#/file# info at all in SQL Trace (as this “vector read” wait event encompasses multiple different block reads under a single wait event).

The classic single block read has the file#/block# info:

WAIT #139789045903344: nam='db file sequential read' ela= 448 file#=2 block#=1182073 blocks=1 obj#=93732 tim=156953721029

The parallel read wait events don’t have individual file#/block# info (just total number of files/blocks involved):

WAIT #139789045903344: nam='db file parallel read' ela= 7558 files=1 blocks=127 requests=127 obj#=93696 tim=156953729450

Anyway, because we had plenty of db file parallel read waits that don’t show all the detail in SQL Trace, I also enabled the event 10298 that gives us following details below (only tiny excerpt below):

...
ksfd_osdrqfil:fob=0xce726160 bufp=0xbd2be000 blkno=1119019 nbyt=8192 flags=0x4
ksfdbio:rq=0x7f232c4edb00 fob=0xce726160 aiopend=126
ksfd_osdrqfil:fob=0xce726160 bufp=0x9e61a000 blkno=1120039 nbyt=8192 flags=0x4
ksfdbio:rq=0x7f232c4edd80 fob=0xce726160 aiopend=127
ksfdwtio:count=127 aioflags=0x500 timeout=2147483647 posted=(nil)
...
ksfdchkio:ksfdrq=0x7f232c4edb00 completed=1
ksfdchkio:ksfdrq=0x7f232c4edd80 completed=0
WAIT #139789045903344: nam='db file parallel read' ela= 6872 files=1 blocks=127 requests=127 obj#=93696 tim=156953739197

So, on Oracle 12.1.0.2 on Linux x86_64 with xfs filesystem with async IO enabled and filesystemio_options = SETALL we get the “ksfd_osdrqfil” trace entries to show us the block# Oracle read from a datafile. It doesn’t show the file# itself, but it shows the accessed file state object address (FOB) in SGA and as it was always the same in the tracefile, I know duplicate block numbers listed in trace would be for the same datafile (and not for a block with the same block# in some other datafile). And the tablespace I used for my test had a single datafile anyway.

Anyway, I wrote a simple script to summarize whether there were any disk re-reads in this tracefile (of a select statement):

$ grep ^ksfd_osdrqfil LIN121_ora_11406.trc | awk '{ print $3 }' | sort | uniq -c | sort -nr | head -20
     10 blkno=348827
     10 blkno=317708
      9 blkno=90493
      9 blkno=90476
      9 blkno=85171
      9 blkno=82023
      9 blkno=81014
      9 blkno=80954
      9 blkno=74703
      9 blkno=65222
      9 blkno=63899
      9 blkno=62977
      9 blkno=62488
      9 blkno=59663
      9 blkno=557215
      9 blkno=556581
      9 blkno=555412
      9 blkno=555357
      9 blkno=554070
      9 blkno=551593
...

Indeed! The “worst” blocks have been read in 10 times – all that for a single query execution.

I only showed 20 top blocks here, but even when I used “head -10000″ and “head -50000″ above, I still saw blocks that had been read in to buffer cache 8 and 4 times respectively.

Looking into earlier autotrace metrics, my simple index range scan query did read in over 3x more blocks than the total table and index size combined (~350k blocks read while the table had only 100k blocks)! Some blocks have gotten kicked out from buffer cache and have been re-read back into cache later, multiple times.

Hmm, let’s think further: We are accessing only about 20% of a 800 MB table + 150 MB index, so the “working set” of datablocks used by my query should be well less than my 650 MB buffer cache, right? And as I am the only user in this database, everything should nicely fit and stay in buffer cache, right?

Actually, both of the arguments above are flawed:

  1. Accessing 20% of rows in a table doesn’t automatically mean that we need to visit only 20% blocks of that table! Maybe all of the tables’s blocks contain a few of the rows this index range scan needs? So we might need to visit all of that table’s blocks (or most of them) and extract only a few matching rows from each block. But nevertheless, the “working set” of required blocks for this query would include almost all of the table blocks, not only 20%. We must read all of them in at some point in the range scan.So, the matching rows in table blocks are not tightly packed and physically in correspondence with the index range scan’s table access driving order, but are potentially “randomly” scattered all over the table.This means that an index range scan may come back and access some data block again and again to get a yet-another row from it when the ROWID entries in index leaf blocks point there. This is what I call buffer re-visits(Now scroll back up and see what is that index’es clustering factor :-)

  2. So what, all the buffer re-visits should be really fast as the previously read block is going to be in buffer cache, right?Well, not really. Especially when the working set of blocks read is bigger than buffer cache. But even if it is smaller, the Oracle buffer cache isn’t managed using basic LRU replacement logic (since 8.1.6). New blocks that get read in to buffer cache will be put into the middle of the “LRU” list and they work their way up to the “hot” end only if they are touched enough times before someone manages to flush them out. So even if you are a single user of the buffer cache, there’s a chance that some just recently read blocks get aged out from buffer cache – by the same query still running – before they get hot enough. And this means that your next buffer re-visit may turn into a disk block re-read that we saw in the tracefiles.If you combine this with the reality of production systems where there’s a thousand more users trying to do what you’re doing, at the same time, it becomes clear that you’ll be able to use only a small portion of the total buffer cache for your needs. This is why people sometimes configure KEEP pools – not that the KEEP pool is somehow able to keep more blocks in memory for longer per GB of RAM, but simply for segregating the less important troublemakers from more important… troublemakers :)

 

So what’s my point here – in the context of this blog post’s title?

Let’s start from Exadata – over the last years it has given many customers order(s) of magnitude better analytics, reporting and batch performance compared to their old systems, if done right of course. In other words, instead of indexing even more, performing wide index range scans with millions of random block reads and re-reads, they ditched many indexes and started doing full table scans. Full table scans do not have such “scaling problems” like a wide index range scan (or a “wide” nested loop join driving access to another table). In addition you got all the cool stuff that goes really well with full scans – multiblock reads, deep prefetching, partition-wise hash joins, partition pruning and of course all the throughput and Smart Scan magic on Exadata).

An untuned complex SQL on a complex schema with lots of non-ideal indexes may end up causing a lot of “waste IO” (don’t have a better term) and similarly CPU usage too. And often it’s not simple to actually fix the query – as it may end up needing a significant schema adjustment/redesign that would require also changing the application code in many different places (ain’t gonna happen). With defaulting reporting to full table scans, you can actually eliminate a lot of such waste, assuming that you have a high-througput – and ideally smart – IO subsystem. (Yes, there are always exceptions and special cases).

We had a customer who had a reporting job that ran almost 2000x faster after moving to Exadata (from 17 hours to 30 seconds or something like that). Their first reaction was: “It didn’t run!” Indeed it did run and it ran correctly. Such radical improvement came from the fact that the new system – compared to the old system – was doing multiple things radically better. It wasn’t just an incremental tweak of adding a hint or a yet another index without daring to do more significant changes.

In this post I demoed just one of the problems that’s plaguing many of the old-school Oracle DW and reporting systems. While favoring full table scanning had always been counterintuitive for most Oracle shops out there, it was the Exadata’s hardware, software and also the geek-excitement surrounding it, what allowed customers to take the leap and switch from the old mindset to new. I expect the same from the Oracle In-Memory Database Option. More about this in a following post.

 

Related Posts