Tanel Poder

Linux, Oracle, Hadoop.

RAM is the new disk – and how to measure its performance – Part 3 – CPU Instructions & Cycles

Mon, 2015-11-30 00:45

If you haven’t read the previous parts of this series yet, here are the links: [ Part 1 | Part 2 ].

A Refresher

In the first part of this series I said that RAM access is the slow component of a modern in-memory database engine and for performance you’d want to reduce RAM access as much as possible. Reduced memory traffic thanks to the new columnar data formats is the most important enabler for the awesome In-Memory processing performance and SIMD is just icing on the cake.

In the second part I also showed how to measure the CPU efficiency of your (Oracle) process using the Linux perf stat command. How well your applications actually utilize your CPU execution units depends on many factors. The biggest one is your process’s cache efficiency, which depends on the CPU cache sizes and your application’s memory access patterns. Regardless of what OS CPU accounting tools like top or vmstat may show you, your “100% busy” CPUs may actually spend a significant share of their cycles internally idle, with a stalled pipeline, waiting for some event (like a memory line arriving from RAM) to happen.

Luckily there are plenty of tools for measuring what’s actually going on inside the CPUs, thanks to modern processors having CPU Performance Counters (CPC) built into them.

A key derived metric for understanding CPU efficiency is IPC (instructions per cycle). Years ago people actually talked about the inverse metric, CPI (cycles per instruction), as on average it took more than one CPU cycle to complete an instruction’s execution (again, due to the abovementioned reasons like memory stalls). However, thanks to today’s superscalar processors with out-of-order execution across a modern CPU’s multiple execution units – and with large CPU caches – a well-optimized application can execute multiple instructions per single CPU cycle, so it’s more natural to use the IPC (instructions per cycle) metric. With IPC, higher is better.

Here’s a trimmed snippet from the previous article, a process that was doing a fully cached full table scan of an Oracle table (stored in plain old row-oriented format):

Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz                     [33.33%]
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn [40.00%]
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle    [40.00%]
    58,627,393,395 stalled-cycles-backend    #   67.83% backend  cycles idle    [40.00%]
       256,440,384 cache-references          #    9.368 M/sec                   [26.67%]
       222,036,981 cache-misses              #   86.584 % of all cache refs     [26.66%]

      30.000601214 seconds time elapsed

The IPC of the above task is pretty bad – the CPU managed to complete only 0.37 instructions per CPU cycle. On average every instruction execution was stalled in the execution pipeline for 2.39 CPU cycles.
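
As a quick sanity check, these derived values can be recomputed from the raw counters above with simple division – a minimal sketch (awk is used just for the floating-point arithmetic):

CYCLES=86428653040; INSTRUCTIONS=32115412877; STALLED_FRONTEND=76697049420

awk -v c="$CYCLES" -v i="$INSTRUCTIONS" -v sf="$STALLED_FRONTEND" 'BEGIN {
  printf "IPC (insns per cycle)          : %.2f\n", i / c    # ~0.37
  printf "CPI (cycles per insn)          : %.2f\n", c / i    # ~2.69
  printf "stalled cycles per instruction : %.2f\n", sf / i   # ~2.39
}'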

Note: Various additional metrics can be used for drilling down into why the CPUs spent so much time stalling (like cache misses & RAM access). I covered the typical perf stat metrics in part 2 of this series, so I won’t go into more detail here.

Test Scenarios

The goal of my experiments was to measure the CPU efficiency of different data scanning approaches in Oracle – on different data storage formats. I focused only on data scanning and filtering, not joins or aggregations. I ensured that everything would be cached in Oracle’s buffer cache or in-memory column store for all test runs – so disk IO was not a factor here (again, read more about my test environment setup in part 2 of this series).

The queries I ran were mostly variations of this:

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

Although I was after the full table scanning speeds, I also added two examples of scanning through the entire table’s rows via index range scans. This allows me to show how inefficient index range scans can be when accessing a large part of a table’s rows, even when everything is cached in memory. Even though you see different WHERE clauses in some of the tests, they are all designed so that they go through all rows of the table (just using different access patterns and code paths).

The descriptions of test runs should be self-explanatory:


SELECT /*+ MONITOR INDEX(c(cust_postal_code)) */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_postal_code > '0';


SELECT /*+ MONITOR INDEX(c(cust_id)) */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_id > 0;


FROM customers_nopart c WHERE cust_id > 0;


FROM customers_nopart c WHERE cust_id > 0;


FROM customers_nopart c;


SELECT /*+ MONITOR */ COUNT(cust_valid) 
FROM customers_nopart_hcc_ql WHERE cust_id > 0

Note how all experiments except the last one are scanning the same physical table, just with different options (like index scan or in-memory access path) enabled. The last experiment is against a copy of the same table (same columns, same rows), just physically formatted in the HCC format (and fully cached in the buffer cache).

Test Results: Raw Numbers

It is not enough to just look at the CPU performance counters of the different experiments – they are too low-level. For the full picture, we also want to know how much work (like logical IOs etc.) the application was doing and how many rows were eventually processed in each case. I also verified, using the usual Oracle metrics, that I got exactly the desired execution plans and access paths and that no physical IOs or other wait events happened (see the log below).

Here’s the experiment log file with full performance numbers from SQL Monitoring reports, Snapper and perf stat:

I also put all these numbers (plus some derived values) into a spreadsheet. I’ve pasted a screenshot of the data below for convenience, but you can access the entire spreadsheet with its raw data and charts here (note that the spreadsheet has multiple tabs and configurable pivot charts in it):

Raw perf stat data from the experiments:

[Figure: raw perf stat data from the experiments]

Now let’s plot some charts!

Test Results: CPU Instructions

Let’s start from something simple and gradually work our way deeper. I will start by listing the task-clock-ms metric, which shows the CPU time usage of the Oracle process in milliseconds for each of my test table scans. This metric comes from the OS level, not from within the CPU:

[Figure: CPU time used for scanning the dataset (in milliseconds)]

As I mentioned earlier, I added two index (full) range scan based approaches for comparison. Looks like the index-based “full table scans” in the first and second columns use the most CPU time as the OS sees it (~120 and close to 40 seconds of CPU, respectively).

Now let’s see how many CPU instructions (how much work was “requested” from the CPU) the Oracle process executed for scanning the same dataset using different access paths and storage formats:

[Figure: CPU instructions executed for scanning the dataset]

Wow, the index-based approaches seem to be issuing several times more CPU instructions per query execution than any of the full table scans. Whatever loops the Oracle process is executing for processing the index-based query, it runs more of them. Or whatever functions it calls within those loops, the functions are “fatter”. Or both.
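
One way to see which functions those extra instructions are spent in is to profile the process with perf in its sampling mode (perf top / perf record – more about perf in part 2 of this series). A quick sketch, with the PID being a placeholder for the Oracle server process under test:

PID=34783                                 # placeholder - the Oracle server process PID (v$process.spid)

perf top -p "$PID"                        # live view of the hottest functions in that process

perf record -F 99 -p "$PID" -- sleep 30   # or sample for 30 seconds...
perf report                               # ...and inspect the profile afterwards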

Let’s look into an Oracle-level metric, session logical reads, to see how many buffer gets each test case is doing:

[Figure: buffer gets done for a table scan]


Wow, using the index with the bad clustering factor (1st bar) causes Oracle to do over 60M logical IOs, while the table scans do around 1.6M logical IOs each. Retrieving all rows of a table via an index range scan is super-inefficient here, given that the underlying table is only 1613824 blocks in size.

This inefficiency comes from index range scans having to re-visit the same datablocks multiple times (up to one visit per row, depending on the clustering factor of the index used). Each such buffer re-visit causes another logical IO and burns more CPU cycles, except in cases where Oracle has managed to keep the buffer pinned since the last visit. The index range scan with a good clustering factor needs to do far fewer logical IOs because, thanks to the more “local” clustered table access pattern, the re-visited buffers are much more likely to be found already looked-up and pinned (shown as the buffer is pinned count metric in V$SESSTAT).
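
If you want to verify this effect on your own system, both metrics are exposed in V$SESSTAT – here is a sketch of the kind of query you could run against the test session (the SID is a placeholder):

SID=123      # placeholder - the SID of the session running the test query

sqlplus -s / as sysdba <<EOF
col name format a30
SELECT sn.name, ss.value
FROM   v\$sesstat ss, v\$statname sn
WHERE  sn.statistic# = ss.statistic#
AND    ss.sid = $SID
AND    sn.name IN ('session logical reads', 'buffer is pinned count');
EOF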

Knowing that my test table has 69,642,625 rows in it, I can also derive an average CPU instructions per row processed metric from the total instruction amounts:

[Figure: average CPU instructions per row processed]

The same numbers in tabular form:

[Table: average CPU instructions per row processed]

Indeed, there seem to be radical code path differences (coming from the underlying data and cache structure differences) that make an index-based lookup use thousands of instructions per row processed, while an in-memory scan with a single predicate used only 102 instructions per row processed on average. The in-memory counting without any predicates didn’t need to execute any data comparison logic at all, so it could do its data access and counting with only 43 instructions per row on average.
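
These per-row figures are simply the total instruction counts divided by the row count. For example, taking the buffer-cache full table scan’s perf output from the top of this post (32,115,412,877 instructions) and the table’s 69,642,625 rows:

INSTRUCTIONS=32115412877; ROWS=69642625

awk -v i="$INSTRUCTIONS" -v r="$ROWS" \
  'BEGIN { printf "instructions per row: %.0f\n", i / r }'   # ~461 insns/row for the row-format scan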

So far I’ve shown you some basic stuff. As this article is about studying the full table scan efficiency, I will omit the index-access metrics from further charts. The raw metrics are all available in the raw text file and spreadsheet mentioned above.

Here, again, are the buffer gets of just the four full table scan test cases:

[Figure: buffer gets done for full table scans]

All test cases except the HCC-compressed table scan cause the same number of buffer gets (~1.6M), as this is the original table’s size in blocks. The HCC table is only slightly smaller – I didn’t get great compression with the QUERY LOW setting.

Now let’s check the number of CPU instructions executed by these test runs:

[Figure: CPU instructions executed for full table scans]

Wow, despite the table sizes and the number of logical IOs being relatively similar, the amount of machine code the Oracle process executes is wildly different! Remember, all my query is doing is scanning and filtering the data, followed by a basic COUNT(column) operation – no additional sorting or joining is done. The in-memory access paths (columns 3 & 4) get away with executing far fewer CPU instructions than the regular buffered tables in row format and HCC format (columns 1 & 2 in the chart).

All the above shows that not all logical IOs are equal. Depending on your workload and execution plans (how many block visits, how many rows extracted per block visit) and the underlying storage formats (regular row format, HCC in the buffer cache or compressed columns in the In-Memory column store), you may end up doing a very different amount of CPU work per row retrieved by your query.

This was true already before the In-Memory option and is even more noticeable with it. But more about this in a future article.

Test Results: CPU Cycles

Let’s go deeper. We have already looked at how many buffer gets and CPU instructions the process executed for the different test cases. Now let’s look at how much actual CPU time (in the form of CPU cycles) these tests consumed. I added the CPU cycles metric next to the instructions for that:

[Figure: CPU instructions and cycles used for full table scans]

Hey, what? How come the regular row-oriented block format table scan (TABLE BUFCACHE) takes over twice as many CPU cycles as the instructions it executed?

Also, how come all the other table access methods use noticeably fewer CPU cycles than the number of instructions they’ve executed?

If you paid attention to this article (and the previous ones) you’ll already know why. In the 1st example (TABLE BUFCACHE) the CPU must have been “waiting” for something a lot, with instructions spending multiple cycles “idle”, stalled in the pipeline, waiting for some event or necessary condition to happen (like a memory line arriving from RAM).

For example, if you are constantly waiting for “random” RAM lines due to memory structures that are inefficient for scanning (like Oracle’s row-oriented datablocks), the CPU will be bottlenecked by RAM access. The CPU’s internal execution units, other than the load-store units, would be idle most of the time. The OS top command would still show 100% utilization of a CPU by your process, but in reality you could squeeze much more out of your CPU if it didn’t have to wait for RAM so much.

In the other three examples above (columns 2–4) there apparently is no serious RAM (or other pipeline-stalling) bottleneck, as in all cases we are able to use the multiple execution units of a modern superscalar CPU to complete more than one instruction per CPU cycle. Of course further improvements might be possible, but more about this in a following post.

For now I’ll conclude this (lengthy) post with one more chart with the fundamental derived metric instructions per cycle (IPC):

[Figure: instructions per cycle (IPC) for full table scans]

The IPC metric is derived from the previously shown instructions and CPU cycles metrics by a simple division. Higher IPC is better, as it means that your CPU execution units are better utilized and get more done per cycle. However, as IPC is a ratio, you should never look at the IPC value alone – always look at it together with the instructions and cycles metrics. It’s better to execute 1 million instructions with an IPC of 0.5 than 1 billion instructions with an IPC of 3 – looking at IPC in isolation doesn’t tell you how much work was actually done. Additionally, you’d want to use application-level metrics that give you an indication of how much application work got done (I used Oracle’s buffer gets and rows processed metrics for this).
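
To spell the caveat out with numbers: the CPU cycles burnt equal instructions divided by IPC, so the lower-IPC workload from the example above is still far cheaper, simply because it executes 1000x fewer instructions:

awk 'BEGIN {
  printf "1M insns @ IPC 0.5 -> %d cycles\n", 1e6 / 0.5   # 2,000,000 cycles
  printf "1B insns @ IPC 3.0 -> %d cycles\n", 1e9 / 3.0   # ~333,333,333 cycles
}'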

Looks like there are at least two more parts left in this series (advanced metrics and a summary), but let’s see how it goes. Sorry for any typos, it’s getting quite late and I’ll fix ’em some other day :)


NB! Dates updated: After a 1.5 year break, this year’s only Advanced Oracle Troubleshooting training class (updated with Oracle 12c content) takes place on 14-18 December 2015 and 11-15 January 2016 (I had to reschedule the start from November to December). So sign up now if you want to learn new cool stuff!


My New Youtube Channel

Mon, 2015-11-23 22:30

I have created a new YouTube channel – and have uploaded some videos there already! Bookmark & subscribe here:

More stuff is coming over the next weeks & months :-)



Troubleshooting Another Complex Performance Issue – Oracle direct path inserts and SEG$ contention

Tue, 2015-11-10 18:35

Here’s an updated presentation I first delivered at Hotsos Symposium 2015.

It’s about lots of concurrent PX direct path insert and CTAS statements that, when clashing with another bug/problem, caused various gc buffer busy waits and enq: TX – allocate ITL entry contention. This got amplified thanks to running this concurrent workload on 4 RAC nodes:

When reviewing these slides, I see there’s quite a lot that needs to be said in addition to what’s on the slides, so this might just mean a (PowerPoint) hacking session some day!



SQL Monitoring in Oracle Database 12c

Thu, 2015-10-29 12:53

Here’s my latest OOW presentation – SQL Monitoring in Oracle Database 12c:

[direct link]

You can download all my scripts from





Connecting Hadoop and Oracle

Tue, 2015-10-27 18:06

Here are the slides of yesterday’s OakTableWorld presentation. They also include a few hints about what our hot new venture Gluent is doing (although bigger announcements will come later this year).

[direct link]

Also, if you are at Oracle OpenWorld right now, my other presentation about SQL Monitoring in 12c is tomorrow at 3pm in Moscone South 103. See you there!




My Oracle OpenWorld presentations

Fri, 2015-10-23 20:44

Oracle OpenWorld is just around the corner – I will have one presentation at OOW this year and another at the independent OTW event:

Connecting Oracle with Hadoop

Real-Time SQL Monitoring in Oracle Database 12c

  • Conference: OpenWorld
  • Time: Wednesday, 28 Oct, 3:00pm
  • Location: Moscone South 103
  • Abstract: Click here (sign up to guarantee a seat!)

I plan to hang out at the OTW venue on Monday and Tuesday, so see you there!




Advanced Oracle Troubleshooting v2.5 (with 12c stuff too)

Fri, 2015-10-09 01:58

It took a while (1.5 years since my last class – I’ve been busy!), but I am ready with my Advanced Oracle Troubleshooting training (version 2.5) that has plenty of updates, including some more modern DB kernel tracing & ASH stuff and of course Oracle 12c topics!

The online training will take place on 16-20 November & 14-18 December 2015 (Part 1 and Part 2).

The latest TOC is below:

Seminar registration details:

A notable improvement in AOT v2.5: attendees will now get downloadable video recordings of the sessions for personal use! So, no crappy streaming with a 14-day expiry date – you can download the video MP4 files straight to your computer or tablet and keep them forever!

I won’t be doing any other classes this year, but there will be some more (pleasant) surprises coming next year ;-)

See you soon!



RAM is the new disk – and how to measure its performance – Part 2 – Tools

Mon, 2015-09-21 02:20

[ part 1 | part 2 | part 3 ]

In the previous article I explained that the main requirement for high-speed in-memory data scanning is a column-oriented storage format for in-memory data. SIMD instruction processing is just icing on the cake. Let’s dig deeper. This is a long post – you’ve been warned.

Test Environment

I will cover full test results in the next article in this series. First, let’s look into the test setup, environment and what tools I used for peeking inside CPU hardware.

I was running the tests on a relatively old machine with 2 CPU sockets and a 6-core CPU in each socket (2s12c24t):

$ egrep "MHz|^model name" /proc/cpuinfo | sort | uniq -c
     24 cpu MHz		: 2926.171
     24 model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz

The CPUs support SSE4.2 SIMD extensions (but not the newer AVX stuff):

$ grep ^flags /proc/cpuinfo | egrep "avx|sse|popcnt" | sed 's/ /\n/g' | egrep "avx|sse|popcnt" | sort | uniq

Even though the /proc/cpuinfo output above shows the CPU clock frequency as 2.93GHz, these CPUs have the Intel Turbo Boost feature that allows some cores to run at up to 3.33GHz when not all cores are fully busy and the CPUs aren’t too hot.

Indeed, the turbostat command below shows that the CPU core executing my Oracle process was running at 3.19GHz frequency:

# turbostat -p sleep 1
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             6.43 3.02 2.93   0  93.57   0.00   0.00   59   0.00   0.00
 0   0   0   4.49 3.19 2.93   0  95.51   0.00   0.00   46   0.00   0.00
 0   1   1  10.05 3.19 2.93   0  89.95   0.00   0.00   50
 0   2   2   2.48 3.19 2.93   0  97.52   0.00   0.00   45
 0   8   3   2.05 3.19 2.93   0  97.95   0.00   0.00   44
 0   9   4   0.50 3.20 2.93   0  99.50   0.00   0.00   50
 0  10   5 100.00 3.19 2.93   0   0.00   0.00   0.00   59
 1   0   6   6.25 2.23 2.93   0  93.75   0.00   0.00   44   0.00   0.00
 1   1   7   3.93 2.04 2.93   0  96.07   0.00   0.00   43
 1   2   8   0.82 2.15 2.93   0  99.18   0.00   0.00   44
 1   8   9   0.41 2.48 2.93   0  99.59   0.00   0.00   41
 1   9  10   0.99 2.35 2.93   0  99.01   0.00   0.00   43
 1  10  11   0.76 2.36 2.93   0  99.24   0.00   0.00   44

I will come back to this CPU frequency turbo-boosting later when explaining some performance metrics.

I ran the experiments in Oct/Nov 2014, so I used a relatively early Oracle version with a bundle patch (19189240) for the in-memory stuff.

The test was deliberately very simple, as I was researching raw in-memory scanning and filtering speed and was not looking into join/aggregation performance. I ran the query below with different hints and parameters to change the access path options:

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

I used the CUSTOMERS table of the Swingbench Sales History schema. I deliberately didn’t use COUNT(*), but COUNT(col) on an actual column “cust_valid” that was nullable, so values in that actual column had to be accessed for correct counting.

Also, I picked the last column in the table, as accessing columns at the physical “end” of a row (in row-oriented storage format) causes more memory/cache accesses and CPU execution branch jumps, due to the run-length encoded structure of a row in a datablock. Of course this depends on the number of columns and the width of the row too, plus hardware characteristics like the cache line size (64 bytes on my machine).

Anyway, querying the last column helps to better illustrate what kind of overhead you may be suffering from when filtering that 500-column fact table using columns at the end of it.

SQL> @desc ssh.customers_nopart
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      CUST_ID                         NOT NULL NUMBER
    2      CUST_FIRST_NAME                 NOT NULL VARCHAR2(20)
    3      CUST_LAST_NAME                  NOT NULL VARCHAR2(40)
   22      CUST_EFF_TO                              DATE
   23      CUST_VALID                               VARCHAR2(1)

The table has 69,642,625 rows in it and its segment size is 1613824 blocks / 12608 MB on disk (the actual used space in it was slightly lower due to some unused blocks in the segment). I set the table’s PCTFREE to zero to use all the space in the blocks. I also created an HCC-compressed copy of the same table for comparison.
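
For reference, such a copy could be created with something along these lines – an illustrative sketch only, as the exact storage clauses may differ and HCC compression requires supported storage (e.g. Exadata):

sqlplus -s / as sysdba <<'EOF'
-- illustrative sketch of creating the QUERY LOW compressed copy of the test table
CREATE TABLE tanel.customers_nopart_hcc_ql
  PCTFREE 0
  COMPRESS FOR QUERY LOW
AS SELECT * FROM tanel.customers_nopart;
EOF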

SQL> @seg tanel.customers_nopart

-------- ------ -------------------------  ------------- ----------
   12608 TANEL  CUSTOMERS_NOPART           TABLE            1613824
    6416 TANEL  CUSTOMERS_NOPART_HCC_QL    TABLE             821248

I made sure that the test tables were completely cached in Oracle’s buffer cache to eliminate any physical IO component from the tests, and I also enabled in-memory columnar caching for the CUSTOMERS_NOPART table.
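
The general setup would look something like the sketch below – the INMEMORY attribute options and the warm-up scan are illustrative assumptions, not the exact statements used:

sqlplus -s / as sysdba <<'EOF'
-- enable the In-Memory column store for the test table (options are assumptions)
ALTER TABLE tanel.customers_nopart INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY HIGH;

-- warm the buffer cache with a conventional (non-in-memory) full table scan
SELECT /*+ FULL(c) NO_INMEMORY */ COUNT(*) FROM tanel.customers_nopart c;
EOF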

SQL> @imseg %.%

---------- ---------- ----- ------------- ------------------- ------------- ------
     12608       5913  100% TANEL         CUSTOMERS_NOPART    TABLE         COMPLE
---------- ----------
     12608       5913

CPU Activity Measurement Tools

In addition to the usual suspects (Oracle SQL Monitoring reports and Snapper), I used the awesome Linux tool called perf – but not in the typical way you might have used it in the past.

On Linux, perf can be used for profiling code executing on CPUs by sampling the instruction pointer and stack backtraces (perf top), but also for taking snapshots of internal CPU performance counters (perf stat). These CPU performance counters (CPC) tell us what happened inside the CPU during my experiments.

This way we can go way deeper than high-level tools like the top utility or the getrusage() syscall would ever allow us to go. We’ll be able to measure what physically happened inside the CPU: for example, for how many cycles the CPU core was actually doing useful work, pushing instructions further down the execution pipeline, versus being stalled, waiting for requested memory to arrive or for some other internal condition to come true. Also, we can estimate the amount of traffic between the CPU and main memory, plus CPU cache hits/misses at multiple cache levels.

Perf can also do CPC snapshotting and accounting at the OS process level. This means you can measure the internal CPU/memory activity of a single OS process under examination – and that was great for my experiment.

Note that these kinds of tools are nothing new; they’ve been around as CPU vendor code profilers ever since CPUs were instrumented with performance counters (though undocumented in the early days). Perf stat just makes this stuff easily accessible on Linux. For example, since Solaris 8 you could use cputrack for extracting similar process-level CPU counter “usage”; other platforms have their own tools.

I used the following command (-p specifies the target PID) for measuring internal CPU activity when running my queries:

perf stat -e task-clock,cycles,instructions,branches,branch-misses \
          -e stalled-cycles-frontend,stalled-cycles-backend \
          -e cache-references,cache-misses \
          -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses \
          -p 92106 sleep 30

On RHEL 6 equivalents (and later) you can use the perf stat -d option to get similarly detailed output without specifying all the counters separately – but I was on OEL 5.8. Also, different CPU versions support different performance counters. Read the manuals and start from the simpler stuff.
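
For reference, the simpler route on a newer kernel/perf looks like this (using the same placeholder PID as in the command above):

perf list                        # shows which hardware & cache events your perf/CPU combo exposes

PID=92106                        # placeholder - the target Oracle process
perf stat -d -p "$PID" sleep 30  # 30-second detailed counter snapshot, no manual event list needed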

Below is example output from one test run, where I ran a full table scan against the last column of a regular row-oriented table (all cached in the buffer cache) and took a perf stat snapshot of the entire SQL execution. Note that even though the table was also cached in Oracle’s in-memory column store, I had disabled its use with the NO_INMEMORY hint, so this full table scan was done entirely via the traditional Oracle buffer cache (no physical IOs!):

 Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz                     [33.33%]
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn [40.00%]
     7,386,220,210 branches                  #  269.828 M/sec                   [39.99%]
        22,056,397 branch-misses             #    0.30% of all branches         [40.00%]
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle    [40.00%]
    58,627,393,395 stalled-cycles-backend    #   67.83% backend  cycles idle    [40.00%]
       256,440,384 cache-references          #    9.368 M/sec                   [26.67%]
       222,036,981 cache-misses              #   86.584 % of all cache refs     [26.66%]
       234,361,189 LLC-loads                 #    8.562 M/sec                   [26.66%]
       218,570,294 LLC-load-misses           #   93.26% of all LL-cache hits    [ 6.67%]
        18,493,582 LLC-stores                #    0.676 M/sec                   [ 6.67%]
         3,233,231 LLC-store-misses          #    0.118 M/sec                   [ 6.67%]
     7,324,946,042 L1-dcache-loads           #  267.589 M/sec                   [13.33%]
       305,276,341 L1-dcache-load-misses     #    4.17% of all L1-dcache hits   [20.00%]
        36,890,302 L1-dcache-prefetches      #    1.348 M/sec                   [26.66%]

      30.000601214 seconds time elapsed

I ran perf for 30 seconds for the above experiment: I kicked it off just before executing the Oracle SQL and it finished right after the SQL had completed.

Let’s go through some of the above metrics – top down. I’m explaining these metrics at a fairly high level and in the context of my experiment – fully measuring a single SQL execution in a single Oracle process:

Basic CPU Performance Counter Reference
  1. task-clock (~27373 milliseconds)
    – This is a software event and shows how much time the target Linux task (my Oracle process) spent running on CPU during the SQL execution, as far as the OS scheduler knows (roughly 27 seconds on CPU).
    – So while perf took a 30-second snapshot of my process, my test SQL completed a couple of seconds sooner and therefore didn’t run on CPU for all 30 seconds. That explains the “0.912 CPUs utilized” derived metric above.
  2. cycles – (86B cycles)
    – This hardware metric shows how many CPU cycles my process (running a SQL statement) consumed during the perf runtime.
    – Dividing 86B CPU cycles by ~27 CPU seconds shows that the CPU core must have operated at around 3.1 GHz (on average) during my SQL run (see the sketch after this list for how such derived values fall out of the raw counters).
    – Remember, earlier in this article I used turbostat to show how these 2.93 GHz CPU cores happened to be running at 3.19 GHz frequency thanks to turbo-boost!
  3. instructions – (32B instructions)
    – This hardware metric shows how many instructions the CPU managed to successfully execute (and retire). This is where things get interesting:
    – It’s worth mentioning that modern CPUs are superscalar and pipelined. They have multiple internal execution units, can have multiple instructions (decoded into µops) in flight in the pipeline, and can have memory loads & stores happening concurrently and possibly out of order – instruction-level parallelism and data-level parallelism.
    – Dividing the 32B executed instructions by the 86B CPU cycles shows that we managed to execute only ~0.37 instructions per CPU cycle (IPC) on average!
    – Inverting this number gives 86B/32B = 2.69 Cycles Per Instruction (CPI). So, on average, every CPU instruction took ~2.69 CPU cycles to execute! We’ll get to the “why” part later.
  4. branches – (7.3B branches)
    – This hardware metric shows how many branches the executed code took.
    – A branch is basically a jump (an unconditional JMP instruction, or a conditional jump like JZ, JNZ and many more – this is how basic IF/THEN/ELSE, CASE and various LOOP statements work at the CPU level).
    – The more decision-points in your code, the more branches it takes.
    – Branches are like speedbumps in a CPU execution pipeline, obstructing the execution flow and prefetching due to the uncertainty of which branch will be taken.
    – That’s why features like branch prediction with speculative execution are built into modern CPUs to alleviate this problem.
    – Knowing that we scanned through roughly 70M rows in this table, this is over 100 branches taken per row scanned (and counted)!
    – Oracle’s traditional block format rows are stored in run-length encoded format, where you know where the following column starts only after reading (and testing) the previous column’s length byte(s). The more columns you need to traverse, the more branches you’ll take per row scanned.
  5. branch-misses – (22M, 0.3% of all branches)
    – This hardware metric shows how many times the CPU branch predictor (described above) mispredicted which branch would be taken, causing a pipeline stall.
    – Correctly predicting branches (where will the code execution jump next?) is good, as this allows the CPU to speculatively execute upcoming instructions and prefetch the data required by them.
    – However, the branch predictor doesn’t always predict the future correctly, despite various advancements in modern branch prediction, like branch history tables and branch target buffers etc.
    – In a branch misprediction case, the mispredicted branch’s state has to be discarded, the pipeline flushed and the correct branch’s instructions fetched & fed into the start of the execution pipeline (in short: mispredictions waste CPU cycles).
  6. stalled-cycles-frontend – (~76.7B cycles, 88.7% of all cycles)
    – This hardware metric shows for how many cycles the front-end of the CPU was stalled, not producing new µops into the pipeline for backend execution.
    – The front-end of an Intel pipelined CPU is basically the unit that fetches the good old x86/x86_64 instructions from L1 instruction cache (or RAM if needed), decodes these to RISC-like µops (newer CPUs also cache those µops) and puts these into backend instruction queue for execution.
    – The front-end also deals with remembering taken branches and branch prediction (decoding and sending the predicted branch’s instructions into the backend).
    – The frontend can stall for various reasons, like instruction cache misses (waiting for memory lines containing instructions to arrive from RAM or a lower-level cache), branch mispredictions, or simply because the backend cannot accept more instructions into its pipeline due to some bottleneck there.
  7. stalled-cycles-backend – (~58.6B cycles, 67.8% of all cycles)
    – This hardware metric shows for how many cycles the back-end of the CPU was stalled instead of advancing the pipeline.
    – The backend of the CPU is where actual computation on data happens – any computation referencing main memory (not only registers) will have to wait until the referenced memory locations have arrived in CPU L1 cache
    – A common reason for backend stalls is due to waiting for a cache line to arrive from RAM (or a lower level cache), although there are many other reasons.
    – To reduce memory-access related stalls, the program should be optimized to do less memory accesses or switch to more compact data structures to avoid loading data it doesn’t need.
    – Simpler, more predictable data structures also help as the CPU hardware prefetcher may detect an “array scan” and start prefetching required memory lines in advance.
    – In the context of this blog series – sequentially scanning and filtering a column of a table’s data is good for reducing memory access related CPU stalls. Walking through random pointers of linked lists (cache buffers chains) and skipping through row pointers in blocks, plus many columns’ length bytes before getting to your single column of interest causes memory access related stalls.
  8. cache-references – (256M references)
    – Now we get into a series of CPU cache traffic related metrics, some of these overlap
    – This metric shows how many Last Level Cache accesses (both read and write) were done.
    – The memory location that the CPU tried to access was not in a higher-level (L1/L2) cache, thus the lowest cache, the Last Level Cache, was checked.
    – Last Level Cache, also called LLC or Lower Level Cache or Longest Latency Cache is usually L3 cache on modern CPUs (although there are some hints that some perf versions still report L2 cache as LLC). I need to read some more perf source code to figure this out, but for this experiment’s purposes it doesn’t matter much if it’s L2 or L3. If I scan through a multi-GB table, it won’t fit into either level cache anyway.
  9. cache-misses – (222M misses)
    – This metric shows how many times the cache reference could not be satisfied by the Last Level Cache and therefore RAM access was needed.
  10. LLC-loads – (234M loads)
    – The following 4 metrics just break down the above two in more detail.
    – This metric shows how many times a cache line was requested from LLC as it wasn’t available (or valid) in a higher level cache.
  11. LLC-load-misses – (218M misses)
    – This metric shows how many LLC loads could not be satisfied from the LLC and therefore RAM access was needed.
  12. LLC-stores – (18M stores)
    – This metric shows how many times a cache line was written into the LLC.
  13. LLC-store-misses – (3M misses)
    – This metric shows how many times we had to first read the cache line into LLC before completing the LLC write.
    – This may happen due to partial writes (for example: the cache line size is 64 bytes, the line is not currently present in the LLC and the CPU tries to write into the first 8 bytes of it).
    – This metric may also get incremented for other cache-coherency related reasons, where the store fails because other CPU(s) currently own the memory line and have locked and modified it since it was loaded into the current CPU’s cache.
  14. L1-dcache-loads – (7300M loads)
    – The following 3 metrics are similar as above, but for the small (but fast) L1 cache.
    – This metric shows how many times the CPU attempted to load a cache line from L1-data cache into a register.
    – The dcache in the metric name means data accesses from memory (icache means instruction cache – memory lines fetched from L1I cache with instructions for execution).
    – Note how the L1D cache loads metric is way higher than LLC-loads (7300M vs 234M) as many of the repeated tight loops over small internal memory structures can be satisfied from L1 cache.
  15. L1-dcache-load-misses – (305M misses)
    – This metric shows how many data cache loads from L1D cache couldn’t be satisfied from that cache and therefore a next (lower) level cache was needed.
    – If you are wondering how come the L1D cache load misses is much larger than the LLC-loads (305M vs 234M – shouldn’t they be equal), one explanation is that as there’s a L2 cache between L1 & L3, some of the memory accesses got satisfied in L2 cache (and some more explanations illustrating the complexity of CPU cache metrics are here).
  16. L1-dcache-prefetches – (37M prefetches)
    – This metric shows how many cache lines the CPU prefetched via the L1D hardware prefetcher (the DCU prefetcher).
    – Usually this simple prefetcher just fetches the cache line next to the “currently” accessed one.
    – It would be interesting to know if this prefetcher is smart enough to prefetch previous cache lines as regular row-formatted Oracle datablocks are filled from bottom up (this does not apply to the column-oriented stuff).
    – If the full table scan code walks the block’s row directory so that it jumps to the bottom of the block first and works its way upwards, this means that some memory accesses will look like scanning backwards – and may affect prefetching.
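
As referenced in the cycles bullet above, here is a minimal sketch of how the derived values that perf prints (effective clock speed, idle-cycle percentages, cache miss ratio) fall out of the raw counters – the inputs are copied from the perf stat output earlier in this post:

TASK_CLOCK_MS=27373.819908
CYCLES=86428653040
STALLED_FRONTEND=76697049420
STALLED_BACKEND=58627393395
CACHE_REFS=256440384
CACHE_MISSES=222036981

awk -v ms="$TASK_CLOCK_MS" -v c="$CYCLES" \
    -v sf="$STALLED_FRONTEND" -v sb="$STALLED_BACKEND" \
    -v cr="$CACHE_REFS" -v cm="$CACHE_MISSES" 'BEGIN {
  printf "effective clock speed : %.3f GHz\n", c / (ms / 1000) / 1e9          # ~3.157 GHz
  printf "frontend cycles idle  : %.2f%% of all cycles\n", 100 * sf / c       # ~88.74%
  printf "backend cycles idle   : %.2f%% of all cycles\n", 100 * sb / c       # ~67.83%
  printf "cache miss ratio      : %.2f%% of all cache refs\n", 100 * cm / cr  # ~86.58%
}'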


I hope this is a useful reference for measuring what’s going on inside a CPU. This is actually pretty basic stuff in the modern CPU world – there’s much more that you can measure in CPUs (via raw performance counters, for example) and also different tools that you can use, like Intel VTune. It’s not trivial though, as at such a low level even different CPU models by the same vendor may have different meanings (and numbering & flags) for their performance counters.

I won’t pretend to be a CPU & cache coherency expert; however, these basic metrics and my understanding of them look to be correct enough for comparing different Oracle access paths and storage formats (more about this in the next parts of the series).

One bit of warning: it’s best to run these experiments on a bare-metal server, not in a virtual machine. This is a low-level measurement exercise and in a VM you could suffer from all kinds of additional noise, plus some of the hardware counters would not be available to perf. Some hypervisors do not allow the guest OS to access hardware performance counters by default. One interesting article (by Frits Hoogland) about running perf in VMs is here.

Ok, enough writing for today! I actually started this post more than a month ago and it got way longer than planned. In the next part of this series I will interpret this post’s full table scan SQL CPU metrics using the above reference (and explain where the bottleneck/inefficiency is). And in Part 4 I’ll show you all the metrics from a series of experiments – testing the memory access efficiency of different Oracle data access paths (indexes vs. full table scan vs. HCC vs. in-memory column store).

Update: I have corrected a couple of typos – we had 86B CPU cycles and 32B instructions instead of the 86M/32M I had mistakenly typed in before. The instructions-per-cycle ratio calculation and the point remain the same though. Thanks to Mark Farnham for letting me know.



We are hiring!

Thu, 2015-09-17 15:32

Gluent – where I’m a cofounder & CEO – is hiring awesome developers and (big data) infrastructure specialists in the US and UK!

We are still in stealth mode, so won’t be detailing publicly what exactly we are doing ;-)

However, it is evident that modern data platforms (for example Hadoop), with their scalability, affordability at scale and freedom to use many different processing engines on open data formats, are turning enterprise IT upside down.

This shift has already been going on for years in large internet & e-commerce companies and small startups, but now the shockwave is reaching traditional enterprises too. And every single one of them must accept it in order to stay afloat and win in the new world.

Do you want to be part of the new world?


