Tanel Poder

Subscribe to Tanel Poder feed
Linux, Oracle, Hadoop.
Updated: 7 hours 45 min ago

Gluent New World #04: Next Generation Oracle Database Architectures using Super-Fast Storage with James Morle

Mon, 2016-06-13 11:30

It’s time to announce the 4th episode of Gluent New World webinar series by James Morle! James is a database/storage visionary and has been actively contributing to Oracle database scene for over 20 years – including his unique book Scaling Oracle 8i that gave a full-stack overview of how different layers of your database platform worked and performed together.

The topic for this webinar is:

When the Rules Change: Next Generation Oracle Database Architectures using Super-Fast Storage

Speaker:

  • James Morle has been working in the high performance database market for 25 years, most of which has been spent working with the Oracle database. After 15 years running Scale Abilities in the UK, he now leads the Oracle Solutions at DSSD/EMC in Menlo Park.

Time:

  • Tue, Jun 21, 2016 12:00 PM – 1:15 PM CDT

Abstract:

  • When enabled with revolutionary storage performance capabilities, it becomes possible to think differently about physical database architecture. Massive consolidation, simplified data architectures, more data agility and reduced management overhead. This presentation, based on the DSSD D5 platform, includes performance and cost comparison with other platforms and shows how extreme performance is not only for extreme workloads.

Register here:

This should be fun! As usual, I’ll be asking some questions myself and the audience can ask questions too. See you soon!

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

Gluent New World #03: Real Time Stream Processing in Modern Enterprises with Gwen Shapira

Mon, 2016-05-16 12:01

Update: Added links to video recording and slides below.

It’s time to announce the 3rd episode of Gluent New World webinar series! This time Gwen Shapira will talk about Kafka as a key data infrastructure component of a modern enterprise. And I will ask questions from an old database guy’s viewpoint :)

Apache Kafka and Real Time Stream Processing

Video recording & slides:

Speaker:

  • Gwen Shapira (Confluent)
  • Gwen is a system architect at Confluent helping customers achieve
    success with their Apache Kafka implementation. She has 15 years of
    experience working with code and customers to build scalable data
    architectures, integrating relational and big data technologies. She
    currently specializes in building real-time reliable data processing
    pipelines using Apache Kafka. Gwen is an Oracle Ace Director, an
    author of “Hadoop Application Architectures”, and a frequent presenter
    at industry conferences. Gwen is also a committer on the Apache Kafka
    and Apache Sqoop projects. When Gwen isn’t coding or building data
    pipelines, you can find her pedaling on her bike exploring the roads
    and trails of California, and beyond.

Time:

Abstract:

  • Modern businesses have data at their core, and this data is
    changing continuously. How can we harness this torrent of continuously
    changing data in real time? The answer is stream processing, and one
    system that has become a core hub for streaming data is Apache Kafka.This presentation will give a brief introduction to Apache Kafka and
    describe it’s usage as a platform for streaming data. It will explain
    how Kafka serves as a foundation for both streaming data pipelines and
    applications that consume and process real-time data streams. It will
    introduce some of the newer components of Kafka that help make this
    possible, including Kafka Connect, framework for capturing continuous
    data streams, and Kafka Streams, a lightweight stream processing
    library. Finally it will describe some of our favorite use-cases of
    stream processing and how they solved interesting data scalability
    challenges.

Register:

See you soon!

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

Gluent New World #02: SQL-on-Hadoop with Mark Rittman

Thu, 2016-04-07 10:02

Update: The video recording of this session is here:

Slides are here.

Other videos are available at Gluent video collection.

It’s time to announce the 2nd episode of the Gluent New World webinar series!

The Gluent New World webinar series is about modern data management: architectural trends in enterprise IT and technical fundamentals behind them.

GNW02: SQL-on-Hadoop : A bit of History, Current State-of-the-Art, and Looking towards the Future

Speaker:

  • This GNW episode is presented by no other than Mark Rittman, the co-founder & CTO of Rittman Mead and an all-around guru of enterprise BI!

Time:

  • Tue, Apr 19, 2016 12:00 PM – 1:15 PM CDT

Abstract:

Hadoop and NoSQL platforms initially focused on Java developers and slow but massively-scalable MapReduce jobs as an alternative to high-end but limited-scale analytics RDBMS engines. Apache Hive opened-up Hadoop to non-programmers by adding a SQL query engine and relational-style metadata layered over raw HDFS storage, and since then open-source initiatives such as Hive Stinger, Cloudera Impala and Apache Drill along with proprietary solutions from closed-source vendors have extended SQL-on-Hadoop’s capabilities into areas such as low-latency ad-hoc queries, ACID-compliant transactions and schema-less data discovery – at massive scale and with compelling economics.

In this session we’ll focus on technical foundations around SQL-on-Hadoop, first reviewing the basic platform Apache Hive provides and then looking in more detail at how ad-hoc querying, ACID-compliant transactions and data discovery engines work along with more specialised underlying storage that each now work best with – and we’ll take a look to the future to see how SQL querying, data integration and analytics are likely to come together in the next five years to make Hadoop the default platform running mixed old-world/new-world analytics workloads.

Register:

 

If you missed the last GNW01: In-Memory Processing for Databases session, here are the video recordings and slides!

See you soon!

 

 

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

Gluent Demo Video Launch

Wed, 2016-03-30 13:58

Although we are still in stealth mode (kind-of), due to the overwhelming requests for information, we decided to publish a video about what we do :)

It’s a short 5-minute video, just click on the image below or go straight to http://gluent.com:

Gluent Demo video

And this, by the way, is just the beginning.

Gluent is getting close to 20 people now, distributed teams in US and UK – and we are still hiring!

 

 

 

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

GNW01: In-Memory Processing for Databases

Mon, 2016-03-28 00:39

Hi, it took a bit longer than I had planned, but here’s the first Gluent New World webinar recording!

You can also subscribe to our new Vimeo channel here – I will announce the next event with another great speaker soon ;-)

A few comments:

  • Slides are here
  • I’ll figure a good way to deal with offline follow-up Q&A later on, after we’ve done a few of these events

If you like this stuff, please share it too – let’s make this series totally awesome!

 

 

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

Gluent New World: In-Memory Processing for Databases

Mon, 2016-03-14 14:52

As Gluent is all about gluing together the old world and new world in enterprises, it’s time to announce the Gluent New World webinar series!

The Gluent New World sessions cover the important technical details behind new advancements in enterprise technologies that are arriving into mainstream use.

These seminars help you to stay current with the major technology changes that are inevitably arriving into your company soon (if not already). You can make informed decisions about what to learn next – to still be relevant in your profession also 5 years from now.

Think about software-defined storage, open data formats, cloud processing, in-memory computation, direct attached storage, all-flash and distributed stream processing – and this is just a start!

The speakers of this series are technical experts in their field – able to explain in detail how the technology works internally, which fundamental changes in the technology world have enabled these advancements and why it matters to all of you (not just the Googles and Facebooks out there).

I picked myself as the speaker for the first event in this series:

Gluent New World: In-Memory Processing for Databases

In this session, Tanel Poder will explain how the new CPU cache-efficient data processing methods help to radically speed up data processing workloads – on data stored in RAM and also read from disk! This is a technical session about internal CPU efficiency and cache-friendly data structures, using Oracle Database and Apache Arrow as examples.

Time:

  • Tue, Mar 22, 2016 1:00 PM – 2:00 PM CDT

Register:

After registering, you will receive a confirmation email containing information about joining the webinar.

See you soon!

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

My BIWA Summit Presentations

Tue, 2016-01-26 17:01

Here are the two BIWA Summit 2016 presentations I delivered today. The first one is a collection of high level thoughts (and opinions) of mine and the 2nd one is more technical:

 

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

Public Appearances H1 2016

Sat, 2016-01-09 21:53

Here’s where I’ll hang out in the following months:

26-28 January 2016: BIWA Summit 2016 in Redwood Shores, CA

10-11 February 2016: RMOUG Training Days in Denver, CO

25 February 2016: Yorkshire Database (YoDB) in Leeds, UK

6-10 March 2016: Hotsos Symposium, Dallas, TX

10-14 April 2016: IOUG Collaborate, Las Vegas, NV

  • Beer session: Not speaking myself but planning to hang out on a first couple of conference days, drink beer and attend Gluent colleague Maxym Kharchenko‘s presentations

24-26 April 2016: Enkitec E4, Barcelona, Spain

18-19 May 2016: Great Lakes Oracle Conference (GLOC) in Cleveland, OH

  • I plan to submit abstracts (and hope to get some accepted :)
  • The abstract submission is still open until 1st February 2016

2-3 June 2016: AMIS 25 – Beyond the Horizon near Leiden, Netherlands

  • This AMIS 25th anniversary event will take place in a pretty cool location – an old military airport hangar (and abstract submission is still open :)
  • Update: I unfortunately had to cancel my speaking plans at the AMIS event 

5-7 June 2016: Enkitec E4, Dallas, TX

 

As you can see, I have changed my “I don’t want to travel anymore” policy ;-)

 

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

Gluent launch! New production release, new HQ, new website!

Fri, 2015-12-04 12:23

I’m happy to announce that the last couple of years of hard work is paying off and the Gluent Offload Engine is production now! After beta testing with our early customers, we are now out of complete stealth mode and are ready talk more about what exactly are we doing :-)

Check out our new website and product & use case info here!

Follow us on Twitter:

We are hiring! Need to fill that new Dallas World HQ ;-) Our distributed teams around the US and in London need more helping hands (and brains!) too.

You’ll be hearing more of us soon :-)

Paul & Tanel just moved in to Gluent World HQPaul & Tanel just moved in to Gluent World HQ

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

RAM is the new disk – and how to measure its performance – Part 3 – CPU Instructions & Cycles

Mon, 2015-11-30 00:45

If you haven’t read the previous parts of this series yet, here are the links: [ Part 1 | Part 2 ].

A Refresher

In the first part of this series I said that RAM access is the slow component of a modern in-memory database engine and for performance you’d want to reduce RAM access as much as possible. Reduced memory traffic thanks to the new columnar data formats is the most important enabler for the awesome In-Memory processing performance and SIMD is just icing on the cake.

In the second part I also showed how to measure the CPU efficiency of your (Oracle) process using a Linux perf stat command. How well your applications actually utilize your CPU execution units depends on many factors. The biggest factor is your process’es cache efficiency that depends on the CPU cache size and your application’s memory access patterns. Regardless of what the OS CPU accounting tools like top or vmstat may show you, your “100% busy” CPUs may actually spend a significant amount of their cycles internally idle, with a stalled pipeline, waiting for some event (like a memory line arrival from RAM) to happen.

Luckily there are plenty of tools for measuring what’s actually going on inside the CPUs, thanks to modern processors having CPU Performance Counters (CPC) built in to them.

A key derived metric for understanding CPU-efficiency is the IPC (instructions per cycle). Years ago people were actually talking about the inverse metric CPI (cycles per instruction) as on average it took more than one CPU cycle to complete an instruction’s execution (again, due to the abovementioned reasons like memory stalls). However, thanks to today’s superscalar processors with out-of-order execution on a modern CPU’s multiple execution units – and with large CPU caches – a well-optimized application can execute multiple instructions per a single CPU cycle, thus it’s more natural to use the IPC (instructions-per-cycle) metric. With IPC, higher is better.

Here’s a trimmed snippet from the previous article, a process that was doing a fully cached full table scan of an Oracle table (stored in plain old row-oriented format):

Performance counter stats for process id '34783':

      27373.819908 task-clock                #    0.912 CPUs utilized
    86,428,653,040 cycles                    #    3.157 GHz                     [33.33%]
    32,115,412,877 instructions              #    0.37  insns per cycle
                                             #    2.39  stalled cycles per insn [40.00%]
    76,697,049,420 stalled-cycles-frontend   #   88.74% frontend cycles idle    [40.00%]
    58,627,393,395 stalled-cycles-backend    #   67.83% backend  cycles idle    [40.00%]
       256,440,384 cache-references          #    9.368 M/sec                   [26.67%]
       222,036,981 cache-misses              #   86.584 % of all cache refs     [26.66%]

      30.000601214 seconds time elapsed

The IPC of the above task is pretty bad – the CPU managed to complete only 0.37 instructions per CPU cycle. On average every instruction execution was stalled in the execution pipeline for 2.39 CPU cycles.

Note: Various additional metrics can be used for drilling down into why the CPUs spent so much time stalling (like cache misses & RAM access). I covered the typical perf stat metrics in the part 2 of this series so won’t go in more detail here.

Test Scenarios

The goal of my experiments was to measure the number CPU-efficiency of different data scanning approaches in Oracle – on different data storage formats. I focused only on data scanning and filtering, not joins or aggregations. I ensured that everything would be cached in Oracle’s buffer cache or in-memory column store for all test runs – so disk IO was not a factor here (again, read more about my test environment setup in part 2 of this series).

The queries I ran were mostly variations of this:

SELECT COUNT(cust_valid) FROM customers_nopart c WHERE cust_id > 0

Although I was after testing the full table scanning speeds, I also added two examples of scanning through the entire table’s rows via index range scans. This allows me to show how inefficient index range scans can be when accessing a large part of a table’s rows even when all is cached in memory. Even though you see different WHERE clauses in some of the tests, they all are designed so that they go through all rows of the table (just using different access patterns and code paths).

The descriptions of test runs should be self-explanatory:

1. INDEX RANGE SCAN BAD CLUSTERING FACTOR

SELECT /*+ MONITOR INDEX(c(cust_postal_code)) */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_postal_code > '0';

2. INDEX RANGE SCAN GOOD CLUSTERING FACTOR

SELECT /*+ MONITOR INDEX(c(cust_id)) */ COUNT(cust_valid)
FROM customers_nopart c WHERE cust_id > 0;

3. FULL TABLE SCAN BUFFER CACHE (NO INMEMORY)

SELECT /*+ MONITOR FULL(c) NO_INMEMORY */ COUNT(cust_valid) 
FROM customers_nopart c WHERE cust_id > 0;

4. FULL TABLE SCAN IN MEMORY WITH WHERE cust_id > 0

SELECT /*+ MONITOR FULL(c) INMEMORY */ COUNT(cust_valid) 
FROM customers_nopart c WHERE cust_id > 0;

5. FULL TABLE SCAN IN MEMORY WITHOUT WHERE CLAUSE

SELECT /*+ MONITOR FULL(c) INMEMORY */ COUNT(cust_valid) 
FROM customers_nopart c;

6. FULL TABLE SCAN VIA BUFFER CACHE OF HCC QUERY LOW COLUMNAR-COMPRESSED TABLE

SELECT /*+ MONITOR */ COUNT(cust_valid) 
FROM customers_nopart_hcc_ql WHERE cust_id > 0

Note how all experiments except the last one are scanning the same physical table just with different options (like index scan or in-memory access path) enabled. The last experiment is against a copy of the same table (same columns, same rows), but just physically formatted in the HCC format (and fully cached in buffer cache).

Test Results: Raw Numbers

It is not enough to just look into the CPU performance counters of different experiments, they are too low level. For the full picture, we also want to know how much work (like logical IOs etc) the application was doing and how many rows were eventually processed in each case. Also I verified that I did get the exact desired execution plans, access paths and that no physical IOs or other wait events happened using the usual Oracle metrics (see the log below).

Here’s the experiment log file with full performance numbers from SQL Monitoring reports, Snapper and perf stat:

I also put all these numbers (plus some derived values) into a spreadsheet. I’ve pasted a screenshot of the data below for convenience, but you can access the entire spreadsheet with its raw data and charts here (note that the spreadsheet has multiple tabs and configurable pivot charts in it):

Raw perf stat data from the experiments:

oracle scan test results.png

Now let’s plot some charts!

Test Results: CPU Instructions

Let’s start from something simple and gradually work our way deeper. I will start from listing the task-clock-ms metric that shows the CPU time usage of the Oracle process in milliseconds for each of my test table scans. This metric comes from the OS-level and not from within the CPU:

task-clock-ms.pngCPU time used for scanning the dataset (in milliseconds)

As I mentioned earlier, I added two index (full) range scan based approaches for comparison. Looks like the index-based “full table scans” seen in first and second columns are using the most CPU-time as the OS sees it (~120 and close to 40 seconds of CPU respectively).

Now let’s see how many CPU instructions (how much work “requested” from CPU) the Oracle process executed for scanning the same dataset using different access paths and storage formats:

oracle table scan instructions clean.pngCPU instructions executed for scanning the dataset

Wow, the index-based approaches seem to be issuing multiple times more CPU instructions per query execution than any of the full table scans. Whatever loops the Oracle process is executing for processing the index-based query, it runs more of them. Or whatever functions it calls within those loops, the functions are “fatter”. Or both.

Let’s look into an Oracle-level metric session logical reads to see how many buffer gets it is doing:

oracle buffer gets clean.pngBuffer gets done for a table scan

 

Wow, using the index with bad clustering factor (1st bar) causes Oracle to do over 60M logical IOs, while the table scans do around 1.6M of logical IOs each. Retrieving all rows of a table via an index range scan is super-inefficient, given that the underlying table size is only 1613824 blocks.

This inefficiency is due to index range scans having to re-visit the same datablocks multiple times (up to one visit per row, depending on the clustering factor of the index used). This would cause another logical IO and use more CPU cycles for each buffer re-visit, except in cases where Oracle has managed to keep a buffer pinned since last visit. The index range scan with a good clustering factor needs to do much fewer logical IOs as given the more “local” clustered table access pattern, the re-visited buffers are much more likely found already looked-up and pinned (shown as the buffer is pinned count metric in V$SESSTAT).

Knowing that my test table has 69,642,625 rows in it, I can also derive an average CPU instructions per row processed metric from the total instruction amounts:

instructions per row.png

The same numbers in tabular form:

Screen Shot 2015-11-30 at 00.38.12

Indeed there seem to be radical code path differences (that come from underlying data and cache structure differences) that make an index-based lookup use thousands of instructions per row processed, while an in-memory scan with a single predicate used only 102 instructions per row processed on average. The in-memory counting without any predicates didn’t need to execute any data comparison logic in it, so could do its data access and counting with only 43 instructions per row on average.

So far I’ve shown you some basic stuff. As this article is about studying the full table scan efficiency, I will omit the index-access metrics from further charts. The raw metrics are all available in the raw text file and spreadsheet mentioned above.

Here are again the buffer gets of only the four different full table scan test cases:

oracle buffer gets table scan only.pngBuffer gets done for full table scans

All test cases except the HCC-compressed table scan cause the same amount of buffer gets (~1.6M) as this is the original table’s size in blocks. The HCC table is only slightly smaller – didn’t get great compression with the query low setting.

Now let’s check the number CPU instructions executed by these test runs:

oracle table scan only instructions.pngCPU instructions executed for full table scans

Wow, despite the table sizes and number of logical IOs being relatively similar, the amount of machine code the Oracle process executes is wildly different! Remember, all that my query is doing is just scanning and filtering the data followed with a basic COUNT(column) operation – no additional sorting, joining is done. The in-memory access paths (column 3 & 4) get away with executing much fewer CPU instructions than the regular buffered tables in row-format and HCC format (columns 1 & 2 in the chart).

All the above shows that not all logical IOs are equal, depending on your workload and execution plans (how many block visits, how many rows extracted per block visit) and underlying storage formats (regular row-format, HCC in buffer cache or compressed columns in In-Memory column store), you may end up doing a different amount of CPU work per row retrieved for your query.

This was true before the In-Memory option and even more noticeable with the In-Memory option. But more about this in a future article.

Test Results: CPU Cycles

Let’s go deeper. We already looked into how many buffer gets and CPU instructions the process executed for the different test cases. Now let’s look into how much actual CPU time (in form of CPU cycles) these tests consumed. I added the CPU cycles metric to instructions for that:

instructions and cycles.pngCPU instructions and cycles used for full table scans

Hey, what? How come the regular row-oriented block format table scan (TABLE BUFCACHE) takes over twice more CPU cycles compared to its instructions executed?

Also, how come all the other table access methods use noticeably less CPU cycles than the number of instructions they’ve executed?

If you paid attention to this article (and previous ones) you’ll already know why. In the 1st example (TABLE BUFCACHE) the CPU must have been “waiting” for something a lot, instructions having spent multiple cycles “idle”, stalled in the pipeline, waiting for some event or necessary condition to happen (like a memory line arriving from RAM).

For example, if you are constantly waiting for the “random” RAM lines you want to access due to inefficient memory structures for scanning (like Oracle’s row-oriented datablocks), the CPU will be bottlenecked by RAM access. The CPU’s internal execution units, other than the load-store units, would be idle most of the time. The OS top command would still show you 100% utilization of a CPU by your process, but in reality you could squeeze much more out of your CPU if it didn’t have to wait for RAM so much.

In the other 3 examples above (columns 2-4), apparently there is no serious RAM (or other pipeline-stalling) bottleneck as in all cases we are able to use the multiple execution units of modern superscalar CPUs to complete more than one instruction per CPU cycle. Of course more improvements might be possible, but more about this in a following post.

For now I’ll conclude this (lengthy) post with one more chart with the fundamental derived metric instructions per cycle (IPC):

instructions per cycle.png

The IPC metric is derived from the previously shown instructions and CPU cycles metrics by a simple division. Higher IPC is better as it means that your CPU execution units are more utilized, it gets more done. However, as IPC is a ratio, you should never look into the IPC value alone, always look into it together with instructions and cycles metrics. It’s better to execute 1 Million instructions with IPC of 0.5 than 1 Billion instructions with an IPC of 3 – but looking into IPC in isolation doesn’t tell you how much work was actually done. Additionally, you’d want to use your application level metrics that give you an indication of how much application work got done (I used Oracle’s buffer gets and rows processed metrics for this).

Looks like there’s at least 2 more parts left in this series (advanced metrics and a summary), but let’s see how it goes. Sorry for any typos, it’s getting quite late and I’ll fix ’em some other day :)

 

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)