Skip navigation.

BI & Warehousing

Data Integration Tips: GoldenGate on RAC – Action Script for Windows

Rittman Mead Consulting - Fri, 2014-06-06 14:29

I want to briefly interrupt my blog series on GoldenGate and ODI – A Perfect Match in 12c… to provide another Data Integration Tip that was found during a recent client engagement. I was tasked with installing GoldenGate 11g into a 2-node RAC, which is a pretty straight-forward process and well documented by the Oracle Data Integration product team. The challenge is that the client’s Oracle RAC environment was installed on Windows Server 2003.

The Scenario

Working through the GoldenGate setup and configuration process for RAC / Clusterware, most of the same approach applies to either Unix/Linux or Windows (maybe a future blog post?). There is also an example action script towards the bottom, but it’s written in shell script..and that just won’t work in Windows! For those unfamiliar with Clusterware, when an application is added as a resource, Clusterware will use the action script to perform specific functions, or actions, against the application. These actions, Check, Start, Stop, etc., will often times run commands through GoldenGate’s GGSCI application, ensuring the Manager process is running, starting the Manager, etc.

We decided to convert the shell script example to batch command script, as this is about as native as it gets in Windows. Everything was going well with the conversion until we reached this function and the highlighted code below.

#call_ggsci is a generic routine that executes a ggsci command
call_ggsci () {
  log "entering call_ggsci"
  cd ${GGS_HOME}

  ggsci_output=`${GGS_HOME}/ggsci << EOF

  log "got output of : $ggsci_output"

The script will simply change directory to the GoldenGate home, run the GGSCI application, execute the command (passed in as an argument), and then exit GGSCI. This is all wrapped within two EOF “tags”, indicating that the formatting in the script, such as hard returns, etc, should remain when the external command is run. It’s known as a heredoc, and we found that it’s not really possible to do in a batch script.

When attempting similar code in the batch script, without the heredoc, we ran into an issue. GGSCI attempted to run all of the commands on the same line, which failed miserably. We needed another approach to ensure the script would execute properly.

Here’s the Tip…

The goal was to find a way to run a GGSCI command on a single line. The solution: use GGSCI to execute GoldenGate obey files. Rather than attempt to place hard returns in the batch command script, we simply placed all of the necessary commands in an obey file and passed the obey command and script file location it into the function.

call :call_ggsci %GGS_HOME%\diroby\StartMgr.oby


 REM set ggsci_command variable with the passed-in value
 SET ggsci_command=%~1
 call :log "ggsci_command: %ggsci_command%"

 REM log whether or not the GGS_HOME path exists
 if exist %GGS_HOME% (
  call :log "%GGS_HOME% exists"
 ) else (
  call :log "%GGS_HOME% does not exist"

The obey file could be as complex as you like, but in this case it simply starts the manager process.


This approach works very well, but we did end up with the additional obey files to maintain. Next time around, I’ll probably use a scripting language such as Groovy or Python, as either should work just fine on Linux or Windows.

Look for more Data Integration Tips on the blog, as Rittman Mead always coming up with innovative solutions to interesting challenges!

Categories: BI & Warehousing

Rittman Mead Announce New Partnerships with Cloudera and Amazon Web Services

Rittman Mead Consulting - Thu, 2014-06-05 06:00

To complement our strategic partnership with Oracle Corporation, Rittman Mead are very pleased to announce two new technology partnerships to complement our Oracle-based services; with Cloudera, the industry-leading Hadoop software vendor, and with Amazon Web Services, the infrastructure-as-a-service provider that’s become the standard for deploying highly-available, secure cloud applications. As business intelligence, analytics and data warehousing systems extend onto the cloud and to encompass big data, these two new partnerships will allow us to offer best-of-breed services in these area whilst complementing our main technology partnership with Oracle. So why Cloudera, why Amazon Web Services, and why now?

Extending BI from the EDW to the “Enterprise Data Hub”

If you’re a regular reader of this blog you’ll probably have read several articles by myself and other authors about cloud, Hadoop and big data. Whilst the vast majority of BI & DW systems we put together for clients today are based around regular database and application data sources, increasingly customers are asking us to help them bring non-traditional, NoSQL and Hadoop-based data sources into their BI platform, adding what’s becoming termed “enterprise data hubs” alongside the more traditional enterprise data warehouses. We’re also seeing more interest in deploying these systems into the cloud – either driven by the economics of the cloud vs. on-premise systems, or because other key line-of-business systems are moving into the cloud and it makes sense to deploy your BI there too. And Oracle, of course, have noticed this trend, with support being added to OBIEE and ODI for reporting against, and loading data out of/into Hadoop data sources, and Oracle themselves due to launch Oracle Business Intelligence Cloud Services in the next couple of months.


Oracle of course already have some Hadoop-based products out in the market, including the very-impressive Oracle Big Data Appliance and their Oracle Big Data Connectors, and we think these products have the potential to make a massive impact on the types of project and solutions we can offer customers today. But at Rittman Mead we like to get ahead of the technology curve, and so we decided to in addition partner directly with Cloudera, makers of the Hadoop software within Big Data Appliance but a significant software vendor in themselves, to ensure our team is at the forefront of delivering cutting-edge big data solutions based around Cloudera CDH, Apache Spark and Cloudera Impala.

Over the past few months we’ve started to deliver projects and PoCs for customers around Cloudera Hadoop technology and Oracle Engineered Systems, Database and BI tools, and what’s emerged as our core competence is the ability to bring together Oracle’s chosen Hadoop distribution and tools with the core Oracle technologies our customers use. We’ve developed systems using Oracle’s Big Data Connectors, to for example analyse huge datasets across multiple Hadoop nodes using Oracle R Advanced Analytics for Hadoop, and we’re currently working with a couple of customers’ Oracle DBA teams who are in the process of adopting Oracle Big Data Appliance. Most importantly, as official Cloudera Partners we’ve got access to their technical and solutions architecture resources, giving us a similar same level of technical backup as we have access to on our core Oracle projects.

BI Moves into the Cloud, for Improved Business Agility and TCO

The other big trend that’s driving a lot of innovation in the BI industry is “cloud”. If you were at one of our BI Forum events in Brighton and Atlanta this year, you’ll know that cloud is front-and-centre in Oracle’s product strategy at the moment, both in terms of making BI available as part of the wider Oracle Public Cloud, but also as a way of accelerating innovation and making more of the product self-service. What’s now been officially named “Oracle BI Cloud Service” (BICS) was officially launched at last year’s Openworld and we’re about to help beta-test the product for Oracle prior to its launch later in the year, and we’re expecting BICS to be particularly attractive to existing OBIEE customers looking to quickly spin-up departmental BI environments without the need for IT to get involved.

But as I covered in a couple of blog posts earlier in the year, BICS functionality is likely to be limited in its initial incarnation and many customers are going to be looking to run “full” OBIEE in the cloud, along with a full Oracle database and an ETL infrastructure using tools such as ODI or Informatica, and for these types of customer a more complete cloud solution will be needed – which is why we’ve also partnered with Amazon Web Services, in our view by far the best cloud service provider on the market and the platform behind companies such as Netflix and Dropbox.

Logo amazon aws

We’ve been long-term users of Amazon AWS since around three or four years ago, initially running our training servers on the platform but more recently, undertaking internal and customer development work on their “virtual private cloud” platform as part of our new “Extreme BI in the Cloud” initiative. As techies, we appreciate the power and flexibility of the Amazon AWS API which we use to script and automate much of our “DevOps” work, and we’re finding more and more of our customers are moving to AWS independently, due to their excellent uptime and levels of customer service, and the wide ecosystem of AWS-native and partner products such as Amazon Redshift, Amazon Elastic MapReduce, Amazon EC2 and S3, and Attunity Cloudbeam. Again, we’ve partnered officially with Amazon AWS so that we can train our team-up and access technical and solutions architecture resources, and as with the Cloudera partnership, our particular specialisation is in deploying Oracle-based technologies onto the Amazon AWS platform.

So does this mean that Rittman Mead are de-emphasising our Oracle Partnership or otherwise taking our attention away from solutions built-on OBIEE, ODI, Essbase and the Oracle Database? Far from it; the core of the company will always be around Oracle technology, but by partnering with two best-of-breed complementary organisations – Cloudera, for Oracle-centric Hadoop and big data solutions, and Amazon AWS, for cloud and hybrid deployments – we can continue to offer customers the most innovative and industry-leading solutions as BI evolves from its database and on-premise roots to cover big data and cloud deployments.

Look out for our our blog over the coming months as we cover integrating Oracle technology with Cloudera’s Hadoop platform and Oracle Big Data Appliance, and look out for products and services where we combine Oracle and Cloudera technology in innovative ways, creating cutting edge solutions for our customers both on-premise, in the Amazon and Oracle clouds, or as hybrid on-premise/cloud deployments.

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 2: Journalizing Knowledge Module

Rittman Mead Consulting - Wed, 2014-06-04 16:35

This is the second post in a blog series on the integration between Oracle Data Integrator (ODI) 12c and GoldenGate 12c. The first post focused on the latest, yet-to-be-released, Oracle Information Management Reference Architecture and some high-level features new to the 12c versions of each product. We also began working through the details of the GoldenGate installation and configuration, specifically the JAgent setup, which is necessary for communication between ODI and GoldenGate during the setup of “online” journalizing. In this post, we’ll look at the new features of the Journalizing Knowledge Module “JKM Oracle to Oracle Consistent (OGG Online)” and get started on the modifications to the JKM that must occur for us to load the Raw Data Reservoir (RDR – acronym coined by Nick Hurt in the comments of my first post…I like it!) and Foundation Layer simultaneously.

ODI Journalizing

Before I get into editing the JKM, let me briefly go through an introduction of ODI Journalizing, aka Change Data Capture (CDC). In case you missed it earlier, ODI CDC is implemented using a Journalized Knowledge Module. The JKM generates the infrastructure for the CDC, creating journal tables that store the change rows and views that provide access to the change rows for use in ODI Mappings. ODI CDC can be implemented using various capture processes, such as triggers on source tables, timestamps on rows, or mining of the database logs via Oracle Streams or, in our case, Oracle GoldenGate. A great explanation of ODI JKMs and how they work, written by Christophe Dupupet of the Oracle Data Integration A-Team, can be found here.

Import ODI JKMs

The integration between GoldenGate and ODI begins with a Journalized Knowledge Module, in our example we’ll be using the “JKM Oracle to Oracle Consistent (OGG Online)”. The JKM, delivered with ODI, will create the GoldenGate parameter files, configure GoldenGate process groups (and start them up), and generate the ODI CDC tables and views. The JKM, added to the source Model in 12c, uses the ODI metadata to generate the GoldenGate parameter file mappings. This alone saves quite a bit of manual work and reduces possible typos caused by human error.

JKM Oracle to Oracle Consistent (OGG Online)

In the previous post, I mentioned the new capability of the JKM that allows for an “online” integration between ODI and GoldenGate. But, there are many other new features that need to be described, so we’ll walk through those here.

ODI Tool: OdiOggCommand

The JKM uses an undocumented ODI Tool called OdiOggCommand in the target command of some tasks that are executed only when in “online” mode. This tool has different values for a parameter called OPERATION.

EXECUTECMD: executes various commands, such as Add Extract, within GGSCI
EXECUTEOBEY: runs the OBEY command against an obey file in GGSCI
DEFGEN: generates the source definitions file by executing the DEFGEN command
SAVEFILE: uploads the parameter and obey files to the GoldenGate installation directory


I imagine the code behind this ODI Tool is simply executing command line calls to GGSCI, DEFGEN, etc. It would be great to see some Oracle documentation on this one!

GoldenGate Topology

The “online” aspect of the JKM requires that a Data Server, Physical Schema and Logical Schema all be setup under the GoldenGate technology in ODI. The Data Server contains the location and connection information for communicating with the JAgent on either the source or target GoldenGate server. Under that, a Physical Schema must be setup for each extract (the pump is included) and replicat process group that will be implemented on that server.

GoldenGate Physical Schema

In this screenshot, we have a Physical Schema with the process type set to Capture, also called the extract. Here we set up the directory path on the source to the location where captured transactions from the source database logs will be stored in GoldenGate’s own log files, called trail files. The remote trail file directory, setup on the target server and accessed by the pump process in order to move transactions from the source trail to the target trail, must also be added. Additional options such as trail file size (how large a trail file can get before rolling over to the next file) are also available to be set. Just below the Capture Process Properties are the Additional Options, parameters that can be added to the extract or replicat parameter files based on specific needs of the solution.

GoldenGate Physical Schema Additional Options

Here we can add options to handle different aspects of the extract or replicat, such as TRANLOGOPTIONS on the extract. Once I add this option, I get a template of the additional parameters I can configure. For example, I may want to access my transaction logs stored in ASM via the database rather than directly from ASM. The DBLOGREADER option lets me utilize the ASM API in the database, simplifying my access to the logs.


When I add the additional option to the capture physical schema, it will be generated as a part of the extract parameter file. This helps to drastically reduce the amount of manual editing that had to occur after the parameter files were generated in the 11g version, and is a great addition to the JKM.

JKM Setup and Options

The JKM is actually applied to the source ODI Model, which is a grouping of logical table structures called Datastores. On the Journalizing tab in the Model, we first set the type (Consistent Set) and choose the appropriate JKM.

GoldenGate Process Selection

In the GoldenGate Process Selection section, we choose the Capture and Delivery Logical Schemas that were setup in the ODI Topology. If none have been created, but the Data Server for the source and target GoldenGate installations exist, we can choose to create the Physical and Logical Schemas directly from the Model by clicking the Create button. This is a nifty way to separate the system administration role (setting up the Data Server) from the metadata management or developer role.

The JKM also has a set of Options with configurable values, some being required and others optional. The number of Options has been reduced by quite a bit in the 12c JKM. This makes sense, as we’ve seen that additional metadata such as the trail file location and trail file size are set elsewhere. We’ll go through these in more detail later on when setting the values in our example.

JKM Options

That covers the updated features in the JKM, now let’s talk Knowledge Module customization.

JKM Customization

In the first post, I described how we want to extract the source data once and replicate it into both the Foundation Layer and Raw Data Reservoir in parallel. The JKM is set up to load the ODI CDC Framework (the fully replicated table and J$ table) out of the box, but not the Foundation table. In case you’re unfamiliar with the purpose of the Foundation layer, the idea is to store all transactional history from the source tables by converting every transaction into an insert, and tracking the type of change (insert / update / delete), commit date, and commit SCN. With this information stored from the beginning of the data warehouse, it can be used for historical drill-throughs from the dimensional model, or to completely reload a star schema – including all history. With the stage set, let’s look at what we’re going to change in the JKM.

1. Add New Option “APPLY_FOUNDATION”
This option, when true, will allow the Start Journal process to generate the source-to-foundation mapping statement in the Replicat (apply) process.

2. Add New Option “FND_LSCHEMA”
The Logical Schema name for the Foundation layer schema.

3. Add New Task “Create apply prm (4) RM”
This task will create the source-to-foundation mapping code and add it to the replicat parameter file.

4. Set Option on “Create apply prm (4) RM” Task
Set the execution Options to have APPLY_FOUNDATION as the only checked option. This will determine whether or not that task will execute when Start Journal is run.

5. Edit Task “Execute apply commands online RM”
We’ll comment out the “start replicat …” command, as we’ll first need to complete an initial load of the source data to the target.

Now, I did say I would get into the editing of the JKM in this post, but after describing the new features of the Knowledge Module, I’ll save the actual details for the part 3 of the series! Up next, editing the JKM and building the necessary ODI objects so we can start journalizing.

Categories: BI & Warehousing

Fun with SQL - Silver Pockets Full

Chet Justice - Wed, 2014-06-04 16:13
Silver Pockets Full, send this message to your friends and in four days the money will surprise you. If you don't, well, a pox on your house. Or something like that. I didn't know what it was, I just saw this in my FB feed:

Back in November, I checked to see the frequency of having incremental numbers in the date, like 11/12/13 (my birthday) and 12/13/14 (kate's birthday). I don't want to hear how the rest of the world does their dates either, I know (I now write my dates like YYYY/MM/DD on everything, just so you know, that way I can sort it...or something).

Anyway, SQL to test out the claim of once every 823 years. Yay SQL.

OK, I'm not going to go into the steps necessary because I'm lazy (and I'm just lucky to be writing here), so here it is:
select *
to_char( d, 'yyyymm' ) year_month,
count( case
when to_char( d, 'fmDay' ) = 'Saturday' then 1
else null
end ) sats,
count( case
when to_char( d, 'fmDay' ) = 'Sunday' then 1
else null
end ) suns,
count( case
when to_char( d, 'fmDay' ) = 'Friday' then 1
else null
end ) fris
select to_date( 20131231, 'yyyymmdd' ) + rownum d
from dual
connect by level <= 50000
group by
to_char( d, 'yyyymm' )
where fris = 5
and sats = 5
and suns = 5
So over the next 50,000 days, this happens 138 times. I'm fairly certain that doesn't rise to the once every 823 years claim. But it's cool, maybe.
YEAR_MONTH       SATS       SUNS       FRIS
---------- ---------- ---------- ----------
201408 5 5 5
201505 5 5 5
201601 5 5 5
201607 5 5 5
201712 5 5 5
128 more occurrences...
214607 5 5 5
214712 5 5 5
214803 5 5 5
214908 5 5 5
215005 5 5 5

138 rows selected
I'm not the only dork that does this either, here's one in perl. I'm sure there are others, but again, I'm lazy.
Categories: BI & Warehousing

How We Deliver Agile OBIEE Projects – Introducing ExtremeBI

Rittman Mead Consulting - Wed, 2014-06-04 04:56

Most OBIEE projects that we see are delivered through some sort of “waterfall” method, where requirements are defined up-front, there’s several stages of development, one or more major releases at the end, and any revision to requirements takes the form of a change request. These work well where requirements can be defined upfront, and can be reassuring to customers when they want to agree a fixed-price up-front with every subsequent change clearly costed. But, as with the development world in general, some customers are starting to look at “agile” methods for delivering BI projects, where requirements emerge over the course of a project, there isn’t so much of a fixed design or specification at the start, but instead the project adds features or capabilities in response to what are called “user stories”, making it more likely in-the-end that what ends-up getting delivered is more in-line with what users want – and where changes and additions to requirements are welcomed, rather than extra-cost change requests.

OBIEE naturally lends itself to working in an agile manner, through the three-layer nature of the repository (RPD); by separating the physical representation of the source data from how it is then presented to the end-users, you can start from the off with the dimensional model that’s your end goal, and then over time evolve the back-end physical layer from pointing directly at the source system to instead point at a data warehouse or OLAP cube. In fact, I covered this approach back in 2008 in a blog post called “A Future Oracle OBIEE Architecture” where I positioned OBIEE’s BI Server as a “business logic layer”, and speculated that at some point in the future, OBIEE might be able to turn the logical > physical mappings in the RPD into actual ODI mappings and transformation.


In the end, although OBIEE’s aggregate persistence feature gave us the ability to spin-off aggregate tables and cubes from the RPD logical model, full ETL “push-down” never came although you can see traces of it if you have a good poke around the DLLs and directories under the BI Server component. What did happen though was Exadata; with Exadata, features such as SmartScan, and its ability to do joins between normalised tables much faster than regular databases meant that it became possible to report directly against an OLTP schema, or a ODS-like foundation layer, only adding ETL to build a performance star schema layer if it was absolutely necessary. We covered this in a series of posts on Agile Data Warehousing with Exadata, and the focus of this method was performance – by adding Exadata, and the metadata flexibility in OBIEE’s RPD, we could deliver agile projects where Exadata gave us the performance even when we reported directly against a third-normal form data source.




And this approach worked well for our customers; if they’d invested in Exadata, and were open to the idea of agile, iterative development, we could typically deliver a working system in just a few months, and at all times what the users got was what they’d requested in their user story backlog. But there were still ways in which we could improve this method; not everyone has access to Exadata, for example, and reporting directly against a source system makes it tricky to add DW features like history, and surrogate keys, so recently we introduced the successor to this approach, in the form of an OBIEE development method we called “ExtremeBI”. Building our previous agile work, ExtremeBI introduced an integration element, using GoldenGate and ODI to replicate in real time any source systems we were interested in to the DW foundation layer, add the table metadata that DW systems expect, and then provide a means to transform the logical to physical RPD mappings into ODI ETL specifications.


But in a way, all the technical stuff is by-the-by; what this means in practice for customers is that we deliver working systems from the first iteration; initially, by reporting directly against a replicated copy of their source system (with replication and metadata enhancement by GoldenGate, and optionally ODI),and then over subsequent iterations adding more end-user functionality, OR hardened ODI ETL code, all the while driven by end-user stories and not some technical design signed-off months ago and which no longer reflects what users actually want.


What we’ve found though from several ExtremeBI customer engagements, is that it’s not just down to the technology and how well ODI, OBIEE and GoldenGate work; the major factors in successful projects are firstly, having the project properly pre-qualified at the start; not every project, and not every client, suits agile working, and agile works best if you’re “all in” as opposed to just agreeing to work in sprints but still having a set-in-stone set of requirements which have to be met at a certain time. The second important success factor is proper project organisation; we’ve grown from just a couple of guys with laptops back in 2007 to a fully-fledged, end-to-end development organisation, with full-time delivery managers,a managed services desk and tools such as JIRA, and you need to have this sort of thing in place, particularly a project management structure that’s agile-friendly and a good relationship with the customer where they’re fully-signed up to the agile approach. As such, we’ve found the most success where we’ve used ExtremeBI for fairly technically-savvy customers, for example a MIS department, who’ve been tasked with delivering something for reasonable price and over a short amount of months, who understand that not all requirements can be delivered, but really want their system to get adopted, delight their customer and focus its features on what’s important to end-users.

As well as processes and a method, we’ve also developed utilities and accelerators to help speed-up the initial setup, and ensure the initial foundation and staging layers are built consistently, with GoldenGate mappings already put in place, and ready for our developers to start delivering reports against the foundation layer, or use these foundation-layer tables as the basis of a data mart or warehouse build-out. The screenshot below shows this particular tool, built using Groovy and run from within the ODI Studio user interface, where the developer selects a set of source tables from an ODI model, and then the utility builds out the staging and foundation layers automatically, typically saving days over the manual method.


We’ve also built custom KMs for ExtremeBI, including one that uses Oracle Database’s flashback query feature to pull historical transactions from the UNDO log, as an alternative to Oracle Streams or Oracle GoldenGate when these aren’t available on the project.

All together, using Rittman Mead’s ExtremeBI method along with OBIEE, ODI and optionally GoldenGate has meant we’ve been able to deliver working OBIEE systems for customers over just a few months, typically for a budget less than £50k. Coupled with cloud hosting, where we can get the customer up-and-running immediately rather than having to wait for their IT department to provision servers, we think this the best way for most OBIEE11g projects to be delivered in the future. If you’re interested, we’ve got more details on our “ExtremeBI in the Cloud” web page, or you can contact me via email – – if you’d like to discuss it more,

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 1: Getting Started

Rittman Mead Consulting - Fri, 2014-05-30 14:13

Over the years, I’ve blogged quite a bit about integration between Oracle Data Integrator and GoldenGate, and how to make it all work with the Oracle Reference Architecture. With the release of the 12c versions of ODI and GoldenGate last October, and a soon-to-be-updated reference architecture, it’s time to write a few posts on the subject again.

Getting Started with 12c

First, let me describe the new Journalizing Knowledge Module (JKM) that has been introduced in ODI 12c for integration with GoldenGate: JKM Oracle to Oracle Consistent (OGG Online). This JKM now allows GoldenGate to be setup in an “online” mode, meaning the GoldenGate parameter files and process groups will be configured and installed on the source and target GoldenGate servers. ODI communicates with the GoldenGate JAgent to perform the installation. The “offline” mode still exists as it did in the 11g version of the JKM, in which the parameter files, etc. are created in a temporary location, and then manually moved to the source and target. I’ll use the “online” JKM throughout this series of posts.

Another change to how Journalizing is implemented for GoldenGate in ODI is the Model to which the JKM is applied. In ODI 11g, the GoldenGate JKM was always applied to the Model containing the target tables, leaving the source table metadata completely out of the picture. This made sense, as GoldenGate handled everything on the source side. Now in ODI 12c, the source tables are reverse engineered and the JKM applied to the source Model. This allows the source table to be used in a single mapping for both the initial load and incremental load of the performance layer through the use of ODI 12c deployment specifications. The target, or fully replicated table, is no longer necessary in the metadata. We’ll talk through this concept in more detail later on.

Reference Architecture Update

Finally, before we get into the details, let’s go over the latest (yet to be released, I might add) version of the Oracle Information Management Reference Architecture. It was first presented by Stewart Bryson and Andrew Bond (Oracle) at the Rittman Mead BI Forum in Brighton (which is why I was given the OK to mention it pre-release!).

Information Management - Logical View

This latest reference architecture is not much different than previous versions. The main difference that pertains to this blog post is that the Staging Layer has been renamed the Raw Data Reservoir. If you look through the presentation by Stewart and Andrew, you’ll see that the many of the principles remain the same. They also describe more about an Agile approach to implementing the reference architecture, using GoldenGate, OBIEE against transactional schemas, and eventually ETL development using ODI, a methodology we here at Rittman Mead use with our clients, and call ExtremeBI. Look for the official release of the latest Information Management Reference Architecture in the next couple of months.

In this blog series, we’ll look at how to load the Raw Data Reservoir and Foundation Layer using GoldenGate, and subsequently load the Access and Performance Layer with Oracle Data Integrator Mappings. If you recall from my 11g posts on the subject, we don’t need to load these layers in sequence. GoldenGate will allow the extract of source data once and replication to multiple targets in parallel.


Now that we’ve gone through some of the updated concepts for 12c and the Reference Architecture, let’s look at the high-level steps that must be taken in order to implement GoldenGate and ODI integration.

  • Install GoldenGate on the source and target servers – including JAgent configuration
  • Edit the “JKM Oracle to Oracle Consistent OGG (Online)” Knowledge Module (enable Foundation Layer load)
  • Setup ODI Topology (database schema and GoldenGate connections)
  • Setup and start Journalizing on the source Model
  • Develop Mappings for initial load and incremental load
  • Perform initial load and start replication

There is quite a lot of detail to add to these steps, so let’s get right into it.

GoldenGate 12c Installation and JAgent Configuration

The install of GoldenGate 12c is pretty straight-forward, so I don’t plan on going into much detail here. A step-by-step guide can be found on the DBASolved blog, with the installation setting up and starting the manager process and creating the necessary subdirectories. We then need to configure the JAgent on both the source and target GoldenGate installations, enabling ODI to communicate with GoldenGate during the “online” JKM start journalizing process, which will automatically configure and start the GoldenGate process groups. Setting up the JAgent for ODI integration is essentially the same as if you were setting up Oracle Enterprise Manager integration with GoldenGate.

First, you’ll notice that the file jagent.prm exists in the dirprm directory after installation completes. This parameter file will be used by the jagent process once started in GGSCI.


Next, we need to enable monitoring of GoldenGate by adding an entry, ENABLEMONITORING, to the GLOBALS file. Create the GLOBALS file (with no extension) in the GoldenGate home directory and open it in your favorite text editor. Simply add the line to enable monitoring, close and save the file.

Edit GLOBALS file

To allow secure communication between ODI and GoldenGate, we must create an Oracle Wallet with a password for JAgent. From the GoldenGate install directory, run the password agent. This will create the cwallet.sso file in the dirwlt subdirectory.

Create Oracle wallet

We’re almost there! Now we need to make a slight change to the file in the cfg directory under the GoldenGate home. Edit this file and make the following changes:

Set the agent type to Oracle Enterprise Manager. If left as the default OGGMON, the JAgent will attempt to register with the Monitor server, which most likely is not installed.


Under the JMX Username, add the line to signify that SSL is not being used by JAgent (unless SSL is actually being used!).


Finally, ensure the JAgent port is unique across all installations, and on the server in general. In my example, I’m using a single Virtual Machine to host both the source and target GoldenGate installations, so I need to be careful about which ports are in use.


Before starting the JAgent, go ahead and stop, then start the Manager process to ensure all changes have been initialized. Then, start the JAgent and check to ensure both Manager and JAgent are running. That completes the GoldenGate installation and JAgent configuration.

Start JAgent

If you want to just skip all of this installation work and get right to it, you can always download the Prebuilt Machine for Oracle Data Integrator 12c. It’s a VirtualBox VM with ODI 12c and GoldenGate 12c already installed and configured to work with the Getting Started Guide. The JAgent is already configured on the source and target GoldenGate installations, making it easy to get up and running. This is a great resource that the Oracle Data Integration product team has provided, and it sounds like they plan to continue adding to it in the future.

In Part 2 of the blog post series, we’ll edit the JKM to enable parallel load of the Raw Data Reservoir and Foundation Layer, as well as begin setup of the ODI Topology and metadata.

Categories: BI & Warehousing

Visual Regression Testing of OBIEE with PhantomCSS

Rittman Mead Consulting - Fri, 2014-05-23 09:20

Earlier this year I wrote a couple of blogs posts (here and here) discussing the topic of automated Regression Testing and OBIEE. One of the points that I was keen make was that OBIEE is a stack of elements and depending on the change being tested, it may be sensible to focus on certain elements in the stack instead of all of it. For example, if you are changing the RPD, there is little value in doing a web-based test when you can actually test for the vast majority of regressions using the nqcmd tool alone.

I also argued that testing the front end of OBIEE using tools such as Selenium is difficult to do comprehensively, it can be inflexible, time-consuming and in some cases just not a sensible use of effort. These tools work around the idea of parsing the web page that is served up and checking for presence (or absence) of a particular piece of text or an element on a web page. So for example, you could run a test and tell it to fail if it finds the text “Error” on the page, or you could say only pass the test if some known-content is present, such as a report title or data figure. This type of testing is prone to a great deal of false-negatives, because to efficiently build any kind of test case you must focus on something to check for in the page, but you cannot code for every possible error or failure. It is also usually based heavily on the internal IDs of elements on the page in locating the ‘something’ to check for. As the OBIEE Document Object Model (DOM) is undocumented code, Oracle are at presumably at liberty to change it whenever they feel like it, and thus any tests written based on it may fail. Finally, OBIEE 11g still defaults to serving up graphs as Flash objects, which Selenium et al just cannot handle, and so cannot be tested.

So, what do we do about regression testing the OBIEE front end?

What do we need to test in the front end?

There is still a strong case for regression testing the OBIEE front end. Analyses get changed, Dashboards break, permissions are updated – all these things can cause errors or problems for the end user, but which are something that testing further down the OBIEE stack (using something like nqcmd) will not cover.

Consider a simple dashboard:

If one of the dashboard pages that are linked to in the central section get moved in the Presentation Catalog, then this happens:

OK, so Invalid Link Path: is pretty easy to code in as an error check into Selenium. But, what about if the permissions on an analysis used in the dashboard get changed and the user can no longer access it when running the dashboard?

This is a different problem altogether. We need to check for the absence of something. There’s no error, there just isn’t the analysis that ought to be present. One way around this would be to code for the presence of the analysis title text or content – but that is not going to scale nor be maintainable to do for every dashboard being tested.

Another thing that is important to check in the front end is that authorisations are enforced as they should be. That is, a user can see the dashboards that they should be able to, and that they cannot see the ones they’re not. Changes made in the LDAP directory holding users and their groups, or a configuration change in the Application Roles, could easily mean that a user can no longer see the dashboards they should be able to. We could code for this specific issue using something like Web Services to programatically check each and every actual permission – but that could well be overkill.

What I would like to introduce here is the idea of testing OBIEE for regressions visually - but automated, of course.

Visual Regression Testing

Driven by the huge number of applications that are accessed solely on the web (sorry, “Cloud”), a new set of tools have been developed to support the idea of testing web pages for regressions visually. Instead of ‘explaining’ to the computer specifically what to look for in a page (no error text, etc), visual regression testing uses a process to compare images of a web page, comparing a baseline to a sample taken afterwards. This means that the number of false-negatives (missing genuine errors because the test didn’t detect them) drops drastically because instead of relying on coding a test program to parse the Document Object Model (DOM) of an OBIEE web page (which is extremely complex), instead it is simply considering if two snapshots of the resulting rendered page look the same.

The second real advantage of this method is that typically the tools (including the one I have been working with and will demonstrate below, PhantomCSS) are based on the actual engine that drives the web browsers in use by real end-users. So it’s not a case of parsing the HTML and CSS that the web server sends us and trying to determine if there’s a problem or not – it is actually rendering it the same as Chrome etc and taking a snapshot of it. PhantomCSS uses PhantomJS, which uses the engine that Safari is built on, WebKit.

Let’s Pretend…

So, we’ve got a tool – that I’ll demonstrate shortly – that can programatically fetch and snapshot OBIEE pages, and compare the snapshots to check for any changes. But what about graphs rendered in flash? These are a blindspot usually. Well here we can be a bit cheeky. If you pretend (in the User-Agent HTTP request header) to be an iPhone or iPad (devices that don’t support flash) then OBIEE obligingly serves up PNG graphs plus some javascript to do the hover tooltips. Because it’s a PNG image that means that it will be rendered correctly in our “browser”, and so included in the snapshot for comparison.


Let’s see this scripting in action. Some clarification of the programs we’re going to use first:

  • PhantomJS is the core functionality we’re using, a headless browser sporting Javascript (JS) APIs
  • CasperJS provides a set of APIs on top of PhantomJS that make working with web page forms, navigation etc much easier
  • PhantomCSS provides the regression testing bit, taking snapshots and running code to compare them and report differences.

We’ll consider a simple CasperJS example first, and come on to PhantomCSS after. Because PhantomCSS uses CasperJS for its core interactions, it makes sense to start with the basics.

Here is a bare-bones script. It loads the login page for OBIEE, echoes the page title to the console, takes a snapshot, and exits:

var casper = require('casper').create();

casper.start('http://rnm-ol6-2:9704/analytics', function() {

I run it from the command line:

$ casperjs casper_example_01.js
Oracle Business Intelligence Sign In

As you can see, it outputs the title of the page, and then in the screenshots folder I have this:

I want to emphasise again to make clear why this is so useful: I ran this from the commandline only. I didn’t run a web browser, I didn’t take any snapshots by hand – it was all automatic.

Now, let’s build a bit of a bigger example, where we login to OBIEE and see what dashboards are available to us:

// Set the size of the browser window as part of the 
// Casper instantiation
var casper = require('casper').create({viewportSize: {
        width: 800,
        height: 600

// Load the login page
casper.start('http://rnm-ol6-2:9704/analytics', function() {

// Do login
  this.fill('form#logonForm', { NQUser: 'weblogic' ,
                                NQPassword: 'Password01'
                              }, true);
  this.echo('Logged into OBIEE','INFO')

// Now "click" the Dashboards menu
casper.then(function() {
  this.echo('Clicking Dashboard menu','INFO')'#dashboard');
  this.waitUntilVisible('div.HeaderPopupWindow', function() {

So I now get a screenshot of after logging in:

and after “clicking” the Dashboard menu:

The only bit of the script above that isn’t self-explanatory is where I am referencing elements. The references are as CSS3 selectors and are easily found using something like Chrome Developer Tools. Where the click on Dashboards is simulated, there is a waitUntilVisible function, which is crucial for making sure that the page has rendered fully. For a user clicking the menu, they’d obviously wait until it appears but computers work much faster so functions like this are important for reining them back.

To round off the CasperJS script, let’s add to the above navigating to a Dashboard, snapshotting it (with graphs!), and then logging out.

  this.echo('Navigating to GCBC Dashboard','INFO')
  casper.clickLabel('GCBC Dashboard');

casper.waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?dashboard', function() {
  casper.waitWhileVisible('div.AjaxLoadingOpacity', function() {
    casper.waitWhileVisible('div.ProgressIndicatorDiv', function() {

casper.then(function() {
  this.echo('Signing out','INFO')
  casper.clickLabel('Sign Out');

Again, there’s a couple of waitWhileVisible functions in there, necessary to get CasperJS to wait until the dashboard has rendered properly. The dashboard rendered is captured thus:


So now let’s see how we can use the above CasperJS code in conjunction with PhantomCSS to generate a viable regression test scenario for OBIEE.

The script remains pretty much the same, except CasperJS’s capture gets replaced with a phantomcss.screenshot based on an element (html for the whole page), and there’s some extra code “footer” to include that executes the actual test.

So let’s see how the proposed test method holds up to the examples above – broken links and disappearing reports.

First, we run the baseline capture, the “known good”. The console output shows that this is the first time it’s been run, because there are no existing images against which to compare:

In the screenshots folder is the ‘baseline’ image for each of the defined snapshots:

Now let’s break something! First off I’ll rename the target page for one of the links in the central pane of the dashboard, which will cause the ‘Invalid Link Path’ message to display.

Now I run the same PhantomCSS test again, and this time it tells me there’s a problem:

When an image is found to differ, a composite of the two highlighting the differences is created:

OK, so first test passed (or rather, failed), but arguably this could have been picked up simply by parsing the page returned from the OBIEE server for known error strings. But what about a disappearing analysis – that’s more difficult to ascertain from the page source alone.

Again, PhantomCSS picks up the difference, and highlights it nice and clearly in the generated image:

For the baseline image that you capture it would be against a “gold” version of a dashboard – no point including ad-hoc reports or dashboards under development. You’d also want to work with data that was unchanging, so where available a time filter fixed at a point in the past, rather than ‘current day’ which will be changing frequently.

Belts and Braces?

So visual regression testing is a great thing, but I think a hybrid approach, of parsing the page contents for text too, is worthwhile. CasperJS provides its own test APIs (which PhantomCSS uses), and we can write simple tests such as the following:

this.test.assertTextDoesntExist('Invalid Link Path', 'Check for error text on page');
this.test.assertTextDoesntExist('View Display Error', 'Check for error text on page');
phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

So check for a couple of well-known errors, and then snapshot the page too for subsequent automatic comparison. If an assertion is failed, it shows in the console:

This means that what is already be being done in Selenium (or for which Selenium is an assumed default tool) could even be brought into the same single test rig based around CasperJS/PhantomCSS.

Frame of Reference

The eagle-eyed of you will have noticed that the snapshots generated by PhantomCSS above are not the entire OBIEE webpage, whereas the ones from CasperJS earlier in this article are. That is because PhantomCSS deliberately wants to focus on an area of the page to test, identified using a CSS3 selector. So if you are testing a dashboard, then considering the toolbar is irrelevant and can only lead to false-positives.

phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

Similarly, considering the available dashboard list (to validate enforced authorisations) just needs to look at the list itself, not the rest of the page.  (and yes, that does say “Protals” – even developers have fat fingers sometimes ;-) )

phantomcss.screenshot('div.HeaderSharedProtals','Dashboard list');

Using this functionality means that the generated snapshots used for comparison can be done to exclude things like the alerts bar (which may appear or disappear between tests).

The Devil’s in the Detail

I am in no doubt that the method described above has definitely got its place in the regression testing arsenal for OBIEE. What I am yet to be fully convinced of is quite to what extent. My beef with Selenium et al is the level of detail one has to get in to when writing tests – identifying strings to test for, their location in the DOM, and so on. Yet above in my CasperJS/PhantomCSS examples, I have DOM selectors too, so is this just the same problem? At the moment, I don’t think so. For Selenium, to build a comprehensive test, you have to dissect the DOM for every single test you want to build. Whereas with CasperJS/PhantomCSS I think there is the need to write a basic framework for OBIEE (the basics of which are provided in this post; you’re welcome), which can then be parameterised based on dashboard name and page only. Sure, additional types of tests may need new code, but it would be more reusable.

Given that OBIEE doesn’t come with an out of the box test rig, whatever we build to test it is going to be bespoke, whether its nqcmd, Selenium, JMeter, LoadRunner, OATS, QTP, etc etc — the smart money is picking the option that will be the most flexible, more scalable, easiest to maintain, and take the least effort to develop. There is no one “program to rule them all” – an accurate, comprehensive, and flexible test suite is invariably going to utilise multiple components focussing on different areas.

In the case of regression testing – what is the aim of the testing? What are you looking to validate hasn’t broken after what kind of change?  If all that’s changed in the system is the DBAs adding some indexes or partitioning to the data, I really would not be going anywhere near the front end of OBIEE. However, more complex changes affecting the Presentation Catalog and the RPD can be well covered by this technique in conjunction with nqcmd. Visual regression testing will give you a pass/fail, but then it’s up to you to decipher the images, whereas nqcmd will give you a pass/fail but also an actual set of data to show what has changed.

Don’t forget that other great tool — you! Or rather, you and your minions, who can sit at OBIEE for 5 minutes and spot certain regressions that would take magnitudes of order greater in time to build a test to locate. Things like testing for UI/UX changes between OBIEE versions is something that is realistically handled manually. The testing of the dashboards can be automated, but faster than I can even type the requirement, let alone build a test to validate it – does clicking on the save icon bring up the save box? Well go click for yourself – done? Next test.


I have just scratched the surface of what is possible with headless browser scripting for testing OBIEE. Being able to automate and capture the results of browser interactions as we’ve seen above is hugely powerful. You can find the CasperJS API reference here if you want to find out more about how it is possible to interact with the web page as a “user”.

I’ve put the complete PhantomCSS script online here. Let me know in the comments section or via twitter if you do try it out!

Thanks to Christian Berg and Gianni Ceresa for reading drafts of this article and providing valuable feedback. 

Categories: BI & Warehousing

Trickle-Feeding Log Data into the HBase NoSQL Database using Flume

Rittman Mead Consulting - Wed, 2014-05-21 16:02

The other day I posted an article on the blog around using Flume to transport Apache web log entries from our website into Hadoop, with the final destination for the entries being an HDFS file – with the HDFS file essentially mirroring the contents of the webserver log file. Once you’ve set this transport mechanism up, you could create a Hive table over the HDFS files, for example, or further transform the data using Pig, Spark or some other mechanism.

When you load data into HDFS files though, there are a couple of things you need to be aware of; HDFS is optimised for large, streaming reads of files stored in very large disk blocks, with the classic use-case being MapReduce transformations that crunch large sets of incoming data and hand-off the results to another process. What it’s not good at is random retrievals of single file records, something you’ll notice if you try and return a single row from a Hive table request. Moreover, HDFS files are write-once, no updates or overwrites, which is why Hive only supports SELECTS and not UPDATES or DELETES. Altogether, whilst HDFS is great for landing and then processing large chunks of data, if you’re looking for more granular, database-type storage on Hadoop, you’ll need to think of something else.

And within the context of Cloudera Hadoop, that other thing is HBase, a “NoSQL” database that’s also open-source and runs on the Hadoop framework. Whilst you can work with HBase in similar ways to how you work with relational databases – you can create columns, load data into it, insert and update data and so forth – HBase and NoSQL are in lots of ways the complete opposite of relational databases like Oracle Database, as they trade-off things we normally take for granted but that have performance and scalability impacts – ACID transactions, the ability to support complex table relationships, very rich query languages and application support – for extreme scalability and flexibility. If you’re scared of losing your data then HBase is one of the better NoSQL databases, with strong (rather than “eventual”) consistency, automatic shading and lots of high-availability features, but it’s not designed for running your payroll (yet).

One reason we might want to land data in HBase or another NoSQL database, rather than in regular HDFS files, is if we then want to do fast individual record lookups within the landed data. Another reason would be HBase’s support for complex record types, making it easy to store for example nested XML datasets, and its ability – like the Endeca Server – to hold completely different sets of “columns” for each row in the database, and even version those rows giving us almost a “multi-dimensional” database. Internally, HBase stores data as key-value pairs giving it the ability to hold completely different data in each database row, and under the covers HBase data is in turn stored in indexed “StoreFiles” within HDFS, giving it HDFS’s scalability and access to the Hadoop framework, but adding fast random access to individual records.


Where HBase (and most NoSQL databases) get complicated though is that there’s no SQL*Developer or TOAD to create tables, and no SQL or PL/SQL to load and manipulate them – it’s all done through Java and custom code – this article by Lars George who gave the Hadoop Masterclass as last week’s BI Forum goes into a bit more detail, along with his HBase slides and his book, “HBase: The Definitive Guide”.

So let’s look at a simple example of loading Apache CombinedLogFormat log file entries into HBase, using Flume to transport and ingest the data from our webserver into Hadoop and put together again by Nelio Guimaraes from the RM team. We’ll start by defining the HBase table, which like regular relational tables has rows but which has the concept of column families and column qualifiers rather than just columns. In practice, a column family + qualifier name makes what we’d normally think of as a column, but crucially under the covers column within families are stored together on disk, like column-store relational databases, making them fast to query and randomly access. Like a spreadsheet or OLAP database each combination of row and column family/qualifier is called a “cell”, and moreover only populated cells are stored on disk, with the added bonus of cell entries being timestamped, giving us the ability to retrieve previous versions of cell entries, like the temporal query feature in Oracle Database 12c.


For more details on how HBase stores data, and how HBase schemas are defined, the white paper “Introduction to HBase Schema Design” by Cloudera’s Amandeep Khurana is a good reference point and introduction. So let’s go into the HBase shell and create a table to contain our log data; we’ll define as containing three column families (“common”,”http” and “misc”), with the actual column qualifiers defined at the point we load data into the table – one of the key features of HBase, and most NoSQL databases, is that you can introduce new columns into a store at the point of loading, just by declaring them, with each row potentially containing its own unique selection of columns – which is where Endeca Server gets its ability to store “jagged datasets” with potentially different attribute sets held for groups of rows.

[root@cdh5-node1 ~]# hbase shell
14/05/21 06:00:07 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell

hbase(main):001:0> list
0 row(s) in 2.8030 seconds

=> []
hbase(main):002:0> create 'apache_access_log', 
hbase(main):003:0* {NAME => 'common'},
hbase(main):004:0* {NAME => 'http'},
hbase(main):005:0* {NAME => 'misc'}
0 row(s) in 0.5460 seconds

In this example, the way we’re going to populate the HBase table is to use Flume; like the Flume and HDFS example the other day, we’ll use a “sink”, in this case a HBase sink, to take the incoming Flume activity off the channel and load it into the HBase table. Flume actually has two HBase sinks; one called HBaseSink which writes synchronously (more straightforward but slower) and another called AysncHBaseSink which writes asynchronously, potentially with higher overall throughput than synchronous writes and with full consistency even if there’s a failure (based on replaying the channel data), but with a slightly more complex serialisation approach. We’ll use the asynchronous sink in this example, and assuming you’ve already got the source configuration file set-up (see the previous blog post on Flume and HDFS for an example), the target Flume conf file in our case looked like this:

## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent-hbase.conf -n collector

collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind =
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2 mc3

## Channels ##
## Source writes to 3 channels, one for each sink
collector.channels = mc1 mc2 mc3


collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 1000

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 1000

collector.channels.mc3.type = memory
collector.channels.mc3.capacity = 1000

## Sinks ##
collector.sinks = LocalOut HadoopOut HbaseOut

## Write copy to Local Filesystem 
collector.sinks.LocalOut.type = file_roll = /var/log/flume-ng
collector.sinks.LocalOut.sink.rollInterval = 0 = mc1

## Write to HDFS
collector.sinks.HadoopOut.type = hdfs = mc2
collector.sinks.HadoopOut.hdfs.path = /user/root/flume-channel/%{log_type}/%d%m%Y
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

# HBase sink config 
collector.sinks.HbaseOut.type = org.apache.flume.sink.hbase.AsyncHBaseSink = mc3
collector.sinks.HbaseOut.table = apache_access_log
collector.sinks.HbaseOut.columnFamily = common
collector.sinks.HbaseOut.batchSize = 5000
collector.sinks.HbaseOut.serializer = com.hbase.log.util.AsyncHbaseLogEventSerializer
collector.sinks.HbaseOut.serializer.columns = common:rowKey,common:hostname,common:remotehost,common:remoteuser,common:eventtimestamp,http:requestmethod,http:requeststatus,http:responsebytes,misc:referrer,misc:agent

A few points to note:

  • The collector.sinks.HbaseOut.type setting determines the sink type we’ll use, in this case org.apache.flume.sink.hbase.AsyncHBaseSink
  • collector.sinks.HbaseOut.table sets the HBase table name we’ll load, “apache_access_log”
  • collector.sinks.HbaseOut.serializer.columns actually defines the column qualifiers, in this case mapping incoming serialised log file rows into a set of HBase column families and qualifiers
  • collector.sinks.HbaseOut.serializer is the most important bit – and tells HBase how to turn the incoming Flume data into HBase loads, through a Java program called the “serializer”.

And its this serializer, the Java program that does the actual loading of the HBase table, that’s the final piece of the jigsaw. There are standard templates to use when writing this piece of code, and in our case the serializer looked like this:

package com.hbase.log.util;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.SimpleHbaseEventSerializer.KeyType;
import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;

 * A serializer for the AsyncHBaseSink, which splits the event body into
 * multiple columns and inserts them into a row whose key is available in
 * the headers
 * Originally from
public class AsyncHbaseLogEventSerializer implements AsyncHbaseEventSerializer 
    private byte[] table;
    private byte[] colFam;
    private Event currentEvent;
    private byte[][] columnNames;
    private final List<PutRequest> puts = new ArrayList<PutRequest>();
    private final List<AtomicIncrementRequest> incs = new ArrayList<AtomicIncrementRequest>();
    private byte[] currentRowKey;
    private final byte[] eventCountCol = "eventCount".getBytes();
    // private String delim;

    public void initialize(byte[] table, byte[] cf) 
        this.table = table;
        this.colFam = cf;

    public void setEvent(Event event) 
        // Set the event and verify that the rowKey is not present
        this.currentEvent = event;
        String rowKeyStr = currentEvent.getHeaders().get("rowKey");
        //if (rowKeyStr == null) {
        //  throw new FlumeException("No row key found in headers!");
        //currentRowKey = rowKeyStr.getBytes();

    public String[] logTokenize(String event)

        String logEntryPattern = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"";
        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(event);

        if (!matcher.matches()) 
            System.err.println("Bad log entry (or problem with RE?):");
            return null;

        String[] columns = new String[matcher.groupCount()+1];
        columns[0]= Long.toString(System.currentTimeMillis());
        for (int i = 1; i <= matcher.groupCount(); i++) 
            columns[i] =;

        return columns;


    public List<PutRequest> getActions() 
        // Split the event body and get the values for the columns
        String eventStr = new String(currentEvent.getBody());
        long unixTime = System.currentTimeMillis();
        //String[] cols = eventStr.split(",");
        //String[] cols = eventStr.split(regEx);
        //String[] cols = eventStr.split("\\s+");
        //String[] cols = eventStr.split("\\t");
        //String[] cols = eventStr.split(delim);
        String[] cols = logTokenize(eventStr);
        String[] columnFamilyName;
        byte[] bCol;
        byte[] bFam;
        for (int i = 0; i < cols.length; i++) 
            //Generate a PutRequest for each column.
            columnFamilyName = new String(columnNames[i]).split(":");
            bFam = columnFamilyName[0].getBytes();
            bCol = columnFamilyName[1].getBytes();

            if (i == 0) 
                currentRowKey = cols[i].getBytes();
            //PutRequest req = new PutRequest(table, currentRowKey, colFam,
            //columnNames[i], cols[i].getBytes());
            PutRequest req = new PutRequest(table, currentRowKey, bFam,
            bCol, cols[i].getBytes());
        return puts;

    public List<AtomicIncrementRequest> getIncrements() 
        //Increment the number of events received
        incs.add(new AtomicIncrementRequest(table, "totalEvents".getBytes(), colFam, eventCountCol));
        return incs;

    public void cleanUp() 
        table = null;
        colFam = null;
        currentEvent = null;
        columnNames = null;
        currentRowKey = null;

    public void configure(Context context) 
        //Get the column names from the configuration
        String cols = new String(context.getString("columns"));
        String[] names = cols.split(",");
        columnNames = new byte[names.length][];
        int i = 0;
        for(String name : names) 
            columnNames[i++] = name.getBytes();
        //delim = new String(context.getString("delimiter"));

    public void configure(ComponentConfiguration conf) {}

HBase, rather than supporting the regular SELECT and INSERTS we’re used to with Oracle, instead uses “get” and “put” methods to retrieve, and store, data – along with “delete” and “scan”. The regular synchronous HBase sync uses these methods directly, taking data off the Flume channel and inserting it into the HBase table (or indeed, updating existing rows based on the row key), whilst the asychnronous method uses a layer in-between the incoming data and the write, allowing data (or “events”) to continue streaming in even if all the downstream data hasn’t get been committed. It’s this code though that maps each incoming bit of data – in this case, a parsed log file – to column families and qualifiers in the HBase table, and you’d need to write new code like this, or amend the exiting one, if you wanted to load other HBase tables in your Hadoop cluster – a long way from the point-and-click ETL approach we get with ODI, but a lot more flexible too (if that’s what you want).

Then it’s a case of compiling the Java code, like this:

mkdir com; mkdir com/hbase; mkdir com/hbase/log; mkdir com/hbase/log/util
vi com/hbase/log/util/
export CLASSPATH=/usr/lib/flume-ng/lib/*
javac com/hbase/log/util/
jar cf LogEventUtil.jar com
jar tf LogEventUtil.jar com
chmod 775 LogEventUtil.jar
cp LogEventUtil.jar /usr/lib/flume-ng/lib

Next, we had to run the following command before enabling Flume with this setup, because of an issue we found with Zookeeper stopping Flume working in this setup:

mv /etc/zookeeper/conf/zoo.cfg /etc/zookeeper/conf/zoo.cfg-unused

and finally, we start up the Flume target server agent, followed by the source one (again see the previous article for setting up the source Flume agent):

flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent-hbase.conf -n collector

Then, after a while, log data starts getting loaded into the HBase table. You can check on it using Hue, and the HBase Browser:


Or you can go back into the HBase shell and run the scan command to view the data, with each row representing a cell in the overall table storage:

hbase(main):001:0> scan 'apache_access_log'
ROW                   COLUMN+CELL                                               
 1400628560331        column=common:eventtimestamp, timestamp=1400628560350, val
                      ue=20/May/2014:15:28:06 +0000                             
 1400628560331        column=common:hostname, timestamp=1400628560335, value=89.
 1400628560331        column=common:remotehost, timestamp=1400628560336, value=-
 1400628560331        column=common:remoteuser, timestamp=1400628560338, value=-
 1400628560331        column=common:rowKey, timestamp=1400628560333, value=14006
 1400628560331        column=http:requestmethod, timestamp=1400628560352, value=
                      GET / HTTP/1.1                                            
 1400628560331        column=http:requeststatus, timestamp=1400628560356, value=
 1400628560331        column=http:responsebytes, timestamp=1400628560358, value=
 1400628560331        column=misc:agent, timestamp=1400628560377, value=Mozilla/
                      5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.
                      14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14     
 1400628560331        column=misc:referrer, timestamp=1400628560359, value=-    
 1400628560344        column=common:eventtimestamp, timestamp=1400628560383, val
                      ue=20/May/2014:15:28:06 +0000

This is all great, and a good starting point if you plan to process your data with other Java programs as the next step. But what if you want to view the data in a more convenient way, perhaps as a regular table? To do that you can use Hive again, this time using Hive’s HBase integration features to tell it the data is stored in HBase format, and to let it know how to display the various HBase column families and qualifiers. In our case, the DDL to create the corresponding Hive table looks like this:

DROP TABLE IF EXISTS hive_apache_access_log;
CREATE EXTERNAL TABLE hive_apache_access_log
unixtimestamp string,
eventtimestamp string,
hostname string,
remotehost string,
remoteuser string,
requestmethod string,
requeststatus string,
responsebytes string,
agent string,
referrer string
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,common:eventtimestamp,common:hostname,common:remotehost,common:remoteuser,http:requestmethod,http:requeststatus,http:responsebytes,misc:agent,misc:referrer')
TBLPROPERTIES ('' = 'apache_access_log');

giving us the ability, either from the Hive shell like this, or from tools like OBIEE and ODI, to query the NoSQL database and brings its data into more regular, relational data stores.

hive> select * from hive_apache_access_log;
1400628560331   20/May/2014:15:28:06 +0000   -   -   GET / HTTP/1.1  200 9054    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14 -
1400628560344   20/May/2014:15:28:06 +0000   -   -   GET /wp-content/plugins/crayon-syntax-highlighter/css/min/crayon.min.css?ver=2.5.0 HTTP/1.1 304 -   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14
1400628560345   20/May/2014:15:28:06 +0000   -   -   GET /wp-content/plugins/jetpack/modules/widgets/widgets.css?ver=20121003 HTTP/1.304 -   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14

We’ll be covering more on HBase, and Oracle’s NoSQL Database, in future articles on the blog.

Categories: BI & Warehousing

Photos and Presentation Downloads from the Rittman Mead BI Forum 2014

Rittman Mead Consulting - Mon, 2014-05-19 14:59

Well we’re back in Brighton, UK now after the second successful week of the Rittman Mead BI Forum 2014. In Week 1, we went to the Seattle Hotel in Brighton (read the recap here), and then straight-after we flew over to Atlanta, GA, to run the second week – and it was possibly even better than Brighton ;-)

Congratulations to Omri Traub, winner of the Atlanta Best Speaker award – the first person from Oracle to win, in fact – and to all of the other presenters who helped put together an excellent event. If you’re interested, I’ve uploaded a bunch of photos from both Brighton and Atlanta to Flickr, and you can view the BI Forum 2014 photo set here.


As usual, where we’ve got permission (or the PDF) from the presenter, we’re making all of the presentations available for download to both attendees and non-attendees – not everyone can make it to the event, but we don’t want you to miss-out. We’re also very grateful to Lars George and Cloudera for making their Hadoop Masterclass slides available too – thanks everyone.

Other than that – thanks again to everyone who attended, and hopefully we’ll see you all again next year!

Categories: BI & Warehousing

List of our Recent “Getting Started” Hadoop Articles

Rittman Mead Consulting - Sun, 2014-05-18 04:54

We’ve published a number of “getting started with Hadoop” articles over the past few months, but these aren’t always easy to find on the blog. I’ve therefore compiled a list of the more recent ones, which you’ll find below:

Categories: BI & Warehousing

Trickle-Feeding Log Files to HDFS using Apache Flume

Rittman Mead Consulting - Sun, 2014-05-18 03:20

In some previous articles on the blog I’ve analysed Apache webserver log files sitting on a Hadoop cluster using Hive, Pig and most recently, Apache Spark. In all cases the log files have already been sitting on the Hadoop cluster, SFTP’d to my local workstation and then uploaded to HDFS, the Hadoop distributed filesystem, using Hue, and the only way to add to them is to repeat the process and manually copy them across from our webserver. But what if I want these log files to be copied-across automatically, in a kind of “trickle-feed” process similar to how Oracle GoldenGate trickle-feeds database transactions to a data warehouse? Enter Apache Flume, a component within Hadoop and the Cloudera CDH4/5 distribution of Hadoop, which does exactly this.

Flume is an Apache project within the overall Hadoop ecosystem that provides a reliable, distributed mechanism for collecting aggregating and moving large amounts of log data. Similar to GoldenGate it has transaction collectors, mechanisms to reliably transmit data from source to target, and mechanisms to write those log events to a centralised data store, for example HDFS. It’s free and comes with Cloudera CDH, and coupled with something at the target end to then process and work with the incoming log entries, is a pretty powerful and flexible way to transmit log-type entries from (potentially) multiple source providers to a central Hadoop cluster.

To take our example, we’ve got a webserver that’s generating our Apache CombinedLogFormat log entries as users generate activity on the website. We then set up Flume agents on the source webserver, and then the Hadoop client node that’s going to receive the log entries, which then writes the log entries to HDFS just like any other file activity. The Flume agent on the webserver source “tail”s the Apache access.log file copying across entries as they’re made (more or less), so that the target HDFS log file copies are kept up to date with individual log entries, not just whole log files as they’re closed off, with the diagram below showing the overall schematic:

Flume Topology

Down at the Flume component level, Flume consists of agents, Java processes that sit on the source, target and any intermediate servers; channels, intermediate staging points that can persist log entries to disk, database or memory; and sinks, processes that take log transactions out of a channel and write them to disk. Flume is designed to be distributed and resilient, and won’t take the source down if the target Hadoop environment isn’t available; if this type of situation occurs, transactions will slowly fill-up the channel used by the source agent until such time as it runs out of space, and then further log transactions are lost until the target comes back up and the source agent’s channel regains some spare space. The diagram below, from the Cloudera blog about the latest generation of Flume (Flume NG, for “Next Generation”) shows the Flume product topology:


whilst the next diagram shows how Flume can collect and aggregate log entries from multiple servers, and then combine them into one log stream sent to a single target.


In our example, that’s all there is to it; in more complex examples, perhaps where the source is sending XML log entries, you’d need a downstream processor on the target platform to decode, deserialise or parse the incoming log files – Flume is just a transport mechanism and doesn’t do any transformation itself. You can also choose how the log entries are held by each of the agents’ channels; in the example we’re going to use, channel data is just held in-memory which is fast to run and setup, but you’d lose all of your data in the process if the server went down. Other, more production-level processes would persist the channel entries to file, or even a mySQL database.

For our setup we need to two agents, one on the source and one on the target server, each of which has its own configuration file. The source agent configuration file looks like this, with key entries called-out underneath it:

## Local instalation: /home/ec2-user/apache-flume
## configuration file location:  /home/ec2-user/apache-flume/conf
## bin file location: /home/ec2-user/apache-flume/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent

source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /etc/httpd/logs/access_log
source_agent.sources.apache_server.batchSize = 1
source_agent.sources.apache_server.channels = memoryChannel
source_agent.sources.apache_server.interceptors = itime ihost itype

source_agent.sources.apache_server.interceptors.itime.type = timestamp

source_agent.sources.apache_server.interceptors.ihost.type = host
source_agent.sources.apache_server.interceptors.ihost.useIP = false
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host

source_agent.sources.apache_server.interceptors.itype.type = static
source_agent.sources.apache_server.interceptors.itype.key = log_type
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined

source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on Hadoop Node
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro = memoryChannel
source_agent.sinks.avro_sink.hostname =
source_agent.sinks.avro_sink.port = 4545

  • Source is set to “apache_server”, i.e. an Apache HTTP server
  • The capture mechanism is the Linux “tail” command
  • Log entries are held by the channel mechanism in-memory, rather than to file or database
  • Timestamp is used by the source collector to tell which entries are new
  • The agent then sends the log entries to a corresponding Flume agent on the Hadoop Cluster, in this case an IP address that corresponds to my network’s external IP address, with Flume network traffic then NATted by my router to, the client node in my CDH4.6 Hadoop cluster running on VMWare.

The target server in my Hadoop cluster then has a corresponding configuration file set up, looking like this:

## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector

collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind =
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2


collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100

## Sinks ##
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem 
collector.sinks.LocalOut.type = file_roll = /var/log/flume-ng
collector.sinks.LocalOut.sink.rollInterval = 0 = mc1

## Write to HDFS
collector.sinks.HadoopOut.type = hdfs = mc2
collector.sinks.HadoopOut.hdfs.path = /user/root/flume-channel/%{log_type}/%y%m%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

Key entries in this log file are:

  • Apache AVRO is the file format we’re using to transmit the data, and Flume is working on port 4545
  • There’s two sink collector channels defined – “mc1” for writing file entries to the local server filesystem, and one to HDFS
  • The maximum number of events (log entries) Flume will store in the various channels (log entry persistence stores) is 100, meaning that if the target platform goes down and more than 100 log transactions back-up, then further ones will get lost until we can clear the channel down. Of course this limit can be increased, assuming there’s memory or disk spare.

I then SSH into the target Hadoop node and start the Flume agent, like this:

[root@cdh4-node1 ~]# flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector
Info: Including Hadoop libraries found via (/usr/bin/hadoop) for HDFS access
Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.6.1.jar from class path
14/05/18 18:15:29 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
14/05/18 18:15:29 INFO hdfs.BucketWriter: Creating /user/root/flume-channel/apache_access_combined/18052014/FlumeData.1400433329254.tmp

and then repeat the step for the source webserver, like this:

[ec2-user@ip-10-35-143-131 apache-flume]$ sudo bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent
Warning: JAVA_HOME is not set!
+ exec /usr/bin/java -Xmx20m -cp '/home/ec2-user/apache-flume/conf:/home/ec2-user/apache-flume/lib/*' -Djava.library.path= org.apache.flume.node.Application -f conf/flume-src-agent.conf -n source_agent

Finally, moving across to Hue I can see new log entries being written to the HDFS file system:


So there you go – simple transport of webserver log entries from a remote server to my Hadoop cluster, via Apache Flume – thanks again to Nelio Guimaraes from the RM team for setting the example up.

Categories: BI & Warehousing

Mobile App Designer mis-configuration error

Rittman Mead Consulting - Wed, 2014-05-14 09:21

I’ve been doing some work recently with OBIEE’s new Mobile App Designer (MAD). It’s a great bit of software that I’m genuinely impressed with, but it’s got its little v1 quirks, and helpful error messages are not its forte. I hit a MADdening (sorry) problem with it that Google and My Oracle Support both drew blanks on, so I’m posting it here in case it helps out others with the same problem.

Setting up MAD is a bit of a fiddly process involving patching OBIEE (regardless of the base version you’re on – hopefully in the future it will get rolled into the patchsets) and performing other bits of setup detailed in the documentation. The problem that I hit manifested itself twofold:

  1. Publishing an App to the Apps Library worked fine, but updating an existing App threw an error in the browser:
    Failed to publish /~weblogic/GCBC Mobile - Phone.xma:oracle.xdo.webservice.exception.AccessDeniedException: PublicReportService::executeUpdateTemplateForReport Failure: user has no access to report[/Apps Library//GCBC Mobile - Phone.xma] due to [Ljava.lang.StackTraceElement;@4e6106df
  2. Trying to subscribe to any App threw a generic error in the browser: “Error occurred while accessing server. Please contect administrator.” with the corresponding bipublisher.log showing: 
    [2014-05-13T16:49:53.449+01:00] [bi_server1] [WARNING] [] [oracle.xdo] [tid: 24] [userId: <anonymous>] [ecid: 3f3d2d8955322f32:2f756afc:145f4d10b2f:-8000-0000000000003eea,0] [APP: bimad#11.1.1] User (weblogic) with session id: q2fq8fkh66f85ghamsq164u9qs98itvnk0c826i is looking for object in biee path: /shared/Apps Library//GCBC.xma/_mreport.xma[[
    Object Error [Context: 0, code: QM3V3HLV, message: Invalid path (/shared/Apps Library//GCBC.xma/_mreport.xma) -- ]
    [2014-05-13T16:49:53.450+01:00] [bi_server1] [WARNING] [] [oracle.xdo] [tid: 24] [userId: <anonymous>] [ecid: 3f3d2d8955322f32:2f756afc:145f4d10b2f:-8000-0000000000003eea,0] [APP: bimad#11.1.1] oracle.xdo.XDOException: Target app not found in the repository :/Apps Library//GCBC.xma[[

One of my esteemed Rittman Mead colleagues, Francesco Tisiot, pointed out that the path referenced in the errors has a double slash in it. On checking my configuration, I had indeed fat-fingered one of the settings. APPS_LIBRARY_FOLDER_LOCAL is defined in the <DOMAIN_HOME>/config/bipublisher/repository/Admin/Configuration/xmlp-server-config.xml file, and mine looked like this:

<property name="APPS_LIBRARY_FOLDER_LOCAL" value="/Apps Library/"/>

All I needed to do was to remove the trailing slash after Library:

<property name="APPS_LIBRARY_FOLDER_LOCAL" value="/Apps Library"/>

After restarting the bimad application deployment all was good again with the MAD world and I could republish and subscribe to Apps happily.


Categories: BI & Warehousing

Exploring Apache Spark on the New BigDataLite 3.0 VM

Rittman Mead Consulting - Mon, 2014-05-12 20:34

The latest version of Oracle’s BigDataLite VirtualBox VM went up on OTN last week, and amongst other things it includes the latest CDH5.0 version of Cloudera Distribution including Hadoop, as featured on Oracle Big Data Appliance. This new version comes with an update to MapReduce, moving it to MapReduce 2.0 (with 1.0 still there for backwards-compatibility) and with YARN as the replacement for the Hadoop JobTracker. If you’re developing on CDH or the BigDataLite VM you shouldn’t notice any differences with the move to YARN, but it’s a more forward-looking, modular way of tracking and allocating resources on these types of compute clusters that also opens them up to processing models other than MapReduce.

The other new Hadoop feature that you’ll notice with BigDataLite VM and CDH5 is an updated version of Hue, the web-based development environment you use for creating Hive queries, uploading files and so on. As the version of Hue shipped is now Hue 3.5, there’s proper support for quoted CSV files in the Hue / Hive uploader (hooray) along with support for stripping the first, title, line, and an updated Pig editor that prompts you for command syntax (like the Hortonworks Pig editor).


This new version of BigDataLite also seems to have had Cloudera Manager removed (or at least, it’s not available as usual at http://bigdatalite:7180), with instead a utility provided on the desktop that allows you to stop and start the various VM services, including the Oracle Database and Oracle NoSQL database that also come with the VM. Strictly-speaking its actually easier to use than Cloudera Manager, but it’s a shame it’s gone as there’s lots of monitoring and configuration tools in the product that I’ve found useful in the past.


CDH5 also comes with Apache Spark, a cluster processing framework that’s being positioned as the long-term replacement for MapReduce. Spark is technically a programming model that allows developers to create scripts, or programs, that bring together operators such as filters, aggregators, joiners and group-bys using languages such as Scala and Python, but crucially this can all happen in-memory – making Spark potentially much faster than MapReduce for doing both batch, and ad-hoc analysis.

This article on the Cloudera blog goes into more detail on what Apache Spark is and how it improves over MapReduce, and this article takes a look at the Spark architecture and how it’s approach avoids the multi-stage execution model that MapReduce uses (something you’ll see if you ever do a join in Pig or Hive). But what does some basic Spark code look like, using the default Scala language most people associate Spark with? Let’s take a look at some sample code, using the same Rittman Mead Webserver log files I used in the previous Pig and Hive/Impala articles.

You can start up Spark in interactive mode, like you do with Pig and Grunt, by opening a Terminal session and typing in “spark-shell”:

[oracle@bigdatalite ~]$ spark-shell
14/05/12 20:56:50 INFO HttpServer: Starting HTTP Server
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 0.9.0

Using Scala version 2.10.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
14/05/12 20:56:56 INFO Slf4jLogger: Slf4jLogger started
14/05/12 20:56:56 INFO Remoting: Starting remoting
14/05/12 20:56:56 INFO Remoting: Remoting started; 

14/05/12 20:56:57 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140512205657-0002
14/05/12 20:56:57 INFO SparkILoop: Created spark context..
Spark context available as sc.


Spark has the concept of RDDs, “Resilient Distributed Datasets” that can be thought of as similar to relations in Pig, and tables in Hive, but which crucially can be cached in RAM for improved performance when you need to access their dataset repeatedly. Like Pig, Spark features “lazy execution”, only processing the various Spark commands when you actually need to (for example, when outputting the results of a data-flow”, so let’s run two more commands to load in one of the log files on HDFS, and then count the log file lines within it.

scala> val logfile = sc.textFile("logs/access_log")
14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) called with curMem=234759, maxMem=309225062
14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 75.5 KB, free 294.6 MB)
logfile: org.apache.spark.rdd.RDD[String] = MappedRDD[31] at textFile at <console>:15

scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
14/05/12 21:19:06 INFO SparkContext: Job finished: count at <console>:18, took 0.192536694 s
res7: Long = 154563

So the file contains 154563 records. Running the logfile.count() command again, though, brings back the count immediately as the RDD has been cached; we can explicitly cache RDDs directly in these commands if we like, by using the “.cache” method:

scala> val logfile = sc.textFile("logs/access_log").cache

So let’s try some filtering, retrieving just those log entries where the user is requesting our BI Apps 11g homepage (“/biapps11g/“):

scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403

Or I can create a dataset containing just those records that have a 404 error code:

scala> val errors = logfile.filter(_.contains("404"))
errors: org.apache.spark.rdd.RDD[String] = FilteredRDD[36] at filter at <console>:17
scala> errors.count()
res11: Long = 1577

Spark, using Scala as the language, has routines for filtering, joining, splitting and otherwise transforming data, but something that’s quite common in Spark is to create Java JAR files, typically from compiled Scala code, to encapsulate certain common data transformations, such as this Apache CombinedFormat Log File parser available on Github from @alvinalexander, author of the Scala Cookbook. Once you’ve compiled this into a JAR file and added it to your SPARK_CLASSPATH (see his blog post for full details, and from where the Spark examples below were taken from), you can start to work with the individual log file elements, like we did in the Hive and Pig examples where we parsed the log file using Regexes.

scala> import com.alvinalexander.accesslogparser._
import com.alvinalexander.accesslogparser._

scala> val p = new AccessLogParser
p: com.alvinalexander.accesslogparser.AccessLogParser = com.alvinalexander.accesslogparser.AccessLogParser@6d32bc14

Then I can access the HTTP Status Code using its own property, like this:

def getStatusCode(line: Option[AccessLogRecord]) = {
  line match {
    case Some(l) => l.httpStatusCode
    case None => "0"

log.filter(line => getStatusCode(p.parseRecord(line)) == "404").count 
res12: Long = 1233

Then we can use a similar method to retrieve all of the “request” entries in the log file where the user got a 404 error, starting off by defining two methods that will help with the request parsing – note the use of the :paste command which allows you to block-paste a set of commands into the scala-shell:

scala> :paste
// Entering paste mode (ctrl-D to finish)
def getRequest(rawAccessLogString: String): Option[String] = {
  val accessLogRecordOption = p.parseRecord(rawAccessLogString)
  accessLogRecordOption match {
    case Some(rec) => Some(rec.request)
    case None => None
def extractUriFromRequest(requestField: String) = requestField.split(" ")(1)
// Exiting paste mode, now interpreting.

getRequest: (rawAccessLogString: String)Option[String]
extractUriFromRequest: (requestField: String)String

Now we can run the code to output the URIs that have been generating 404 errors:

scala> :paste
// Entering paste mode (ctrl-D to finish)
log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)).count
val recs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_))
val distinctRecs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404")
                      .collect { case Some(requestField) => requestField }

So Spark is commonly-considered the successor to MapReduce, and you can start playing around with it on the new BigDataLite VM. Unless you’re a Java (or Scala, or Python) programmer, Spark isn’t quite as easy as Pig or Hive to get into, but the potential benefits over MapReduce are impressive and it’d be worth taking a look. Hopefully we’ll have more on Spark on the blog over the next few months, as we get to grips with it properly.

Categories: BI & Warehousing

The State of the OBIEE11g World as of May 2014

Rittman Mead Consulting - Mon, 2014-05-12 03:00

I’m conscious I’ve posted a lot on this blog over the past few months about hot new topics like big data, Hadoop and Oracle Advanced Analytics, and not so much about OBIEE, which traditionally has been the core of Rittman Mead’s business and what we’ve written about most historically. Part of this is because there’s a lot of innovative stuff coming out of the big data world, but a part of it is because there’s not been a big new OBIEE11g release this year, as we had last year with, before that, and so on. But there’s actually a lot interesting going on in the OBIEE11g world at the moment without a big headline release, and what with the Brighton RM BI Forum 2014 taking place last week and the product keynotes it gave us, I thought it’d be worth talking a look back at where we are in May 2014, where the innovation is happening and what’s coming up in the next few months for OBIEE.

Product Versions and Capabilities

As of the time of writing (May 11th 2014) we’re currently on the version of OBIEE, updated with a few patch sets since the original April 2013 release to include features such as Mobile App Designer. OBIEE saw a UI update to the new FusionFX theme, replacing the theme used from the release, and brought in new capabilities such as Hadoop/Hive integration as well as a bunch of “fit-and-finish” improvements, such that at the time I referred to it as “almost like 11g Release 2”, in terms of usability, features and general “ready-for-deployment” quality.


The other major new capability OBIEE brought in was better integration with Essbase and the Hyperion-derived products that are now included in the wider Oracle BI Foundation 11g package. Earlier versions of OBIEE gave you the ability to install Essbase alongside OBIEE11g for the purposes of aggregate persistence into Essbase cubes, but the release brought in a single combined security model for both Essbase and OBIEE, integration of EPM Workspace into the OBIEE environment and the re-introduction of Smartview as OBIEE (and Essbase’s) MS Office integration platform.

Outside of core OBIEE11g but complementing it, and the primary use-case for a lot of OBIEE customers, are the Oracle BI Applications and 2013 saw the release of Oracle BI Applications, followed just a few days ago by the latest update, OBIA What these new releases brought in was the replacement of Informatica PowerCenter by Oracle Data Integrator, and a whole new platform for configuring and running BI Apps ETL jobs based around JEE applications running in WebLogic Server. Whilst at the time of OBIA’s release most people (including myself) advised caution in using this new release and said most new customers should still use the old 7.9.x release stream – because OBIA 11g skills would be scarce and relatively speaking, it’d have a lot of bugs compared to the more mature 7.9.x stream – in fact I’ve only heard about 11g implementations since then, and they mostly seem to have gone well. OBIA came out in early May 2014 and seems to be mostly additional app content, bug fixes and Endeca integration, and there’s still no upgrade path or 11g release for Informatica users, but the 11g release of BI Apps seems to be a known-quantity now and Rittman Mead are getting a few implementations under our belt, too.

Oracle BI Cloud Service (BICS)

So that’s where we are now … but what about the future? As I said earlier, there hasn’t been a major release of OBIEE 11g this year and to my mind, where Oracle’s energy seems to have gone is the “cloud” release of OBIEE11g, previewed back at Oracle Openworld 2013 and due for release in the next few months. You can almost think of this as the “ release” for this year with the twist being it’s cloud-only, but what’ll be interesting about this version of OBIEE11g is that it’ll probably be updated with new functionality on a much more regular basis than on-premise OBIEE, as Oracle (Cloud) will own the platform and be in a much better position to push-through upgrades and control the environment than for on-premise installs.


Headline new capabilities in this cloud release will include:

  • Rapid provisioning, with environments available “at the swipe of a credit card” and with no need to download and install the software yourself
  • Built-in storage, with Oracle’s schema-as-a-service/ApEx database environment backing the product and giving you a place to store data for reporting
  • A consumer-style experience, with wizards and other helper features aimed at getting users familiar with on-premise OBIEE11g up and started on this new cloud version
  • Access to core OBIEE11g features such as Answers, dashboards, mobile and a web-based repository builder

It’s safe to say that “cloud” is a big deal for Oracle at the moment, and it’s probably got as much focus within the OBIEE development team as Fusion Middleware / Fusion Apps integration had back at the start of OBIEE 11g. Part of this is technology trends going on outside of BI, and OBIEE – customers are moving their IT platforms into the cloud anyway, so it makes sense for your BI to be there too, rather than being the only thing left back on-premise, but a bit part of it is the benefits it gives Oracle, and the OBIEE product team – they can own and control much more of the end-to-end experience, giving them control over quality and much more customers on the latest version, and of course the recurring revenues Oracle gets from selling software-as-a-service in the cloud are valued much higher by the market than the one-off license sales they’ve relied on in the past.

But for customers, too, running BI and OBIEE in the cloud brings quite a few potential benefits – both in terms of Oracle’s official “BI in the Cloud Service”, and the wider set of options when you consider running full OBIEE in a public cloud such as Amazon AWS – see my Collaborate’14 presentation on the topic on Slideshare. There’s none of the hassle and cost of actually setting up the software on your own premises, and then doing upgrades and applying patches over time – “empty calories” that have to be spent but don’t bring any direct benefit to the business.  OBIEE in the Cloud also promises to bring a bit of independence to the business from IT, as they’ll be able to spin-up cloud BI instances without having to go through the usual procurement/provisioning cycle, and it’ll be much easier to create temporary or personal-use OBIEE environments for tactical or short-lived work particularly as you’ll only have to license OBIEE for the users and months you actually need it for, rather than buying perpetual licenses which might then sit on the shelf after the immediate need has gone.

Data Visualization, and the Competition from Tableau

It’s probably safe to say that, when OBIEE came out back in 2010, its main competitors were other full-platform, big vendor BI products such as SAP Business Objects and IBM Cognos. Now, in 2014, what we’re hearing anecdotally and from our own sales activity around the product, the main competitor we hear OBIEE 11g coming up against is Tableau. Tableau’s quite a different beast to OBIEE – like QlikTech’s QlikView it’s primarily a desktop BI tool that over the years has been giving some server-based capabilities, but what it does well is get users started fast and give them the ability to create compelling and beautiful data visualisations, without spending days and weeks building an enterprise metadata layer and battling with their IT department.

Of course we all know that as soon as any BI tool gets successful, its inevitable that IT will have to get involved at some point, and you’re going to have to think about enterprise definitions of metrics, common dimensions and so forth, and it’s this area that OBIEE does so well, primarily (in my mind) selling well to the IT department, and with Oracle focusing most of their attention recently on the integration element of the BI world, making it easy to link your ERP and CRM applications to your BI stack, and the whole lot playing well with your corporate security and overall middleware stack. But none of that stuff is important to end users, who want a degree of autonomy from the IT department and something they can use to quickly and painlessly knock-together data visualisations in order to understand the data they’re working with.

So to my mind there’s two aspects to what Tableau does well, that OBIEE needs to have an answer for; ease of setting-up and getting started, and its ability to create data visualisations beyond the standard bar charts and line charts people most associate with OBIEE. And there’s a couple of initiatives already in place, and coming down the line, from Oracle that aim to address this first point; BI Publisher, for example, now gives users the option to create a report directly off-of data in the RPD without the intermediate requirement to create a separate data model, and presents a list of commonly-used report formats at report creation to make the process a bit more “one-stop”.


Another initiative that’ll probably come along first as part of the BI in the Cloud Service is personal data-mashups; what this is is a way for users to upload, from spreadsheets or CSV files, data that they want to add to their standard corporate metrics to allow them to produce reports that aren’t currently possible with the standard curated RPD from corporate IT. Metrics users add in this way will have their data stored (probably) in the catalog but marked in a way that it’s clear they’re not “gold-standard” ones, with the aim of the feature being to avoid the situation where users export their base data from OBIEE into Excel and then bring in the additional data there. It does beg a few questions in terms of where the data goes, how it all gets stored and how well it’d work on an on-premise install, but if you work on the basis that users are going to do this sort of thing anyway, it’s best they do it within the overall OBIEE environment than dump it all to Excel and do their worst there (so to speak).

Another even-more intriguing new product capability that’s coming along, and is technically possible with the current release, is the concept of “mini-apps”. Mini-apps are something Philippe Lion’s “SampleApp” team have been working on for a while now, and are extensions to core OBIEE that are enabled via Javascript and allow developers to create self-contained applications, including table creation scripts, to solve a particular user problem or requirement. This Youtube video from one of Philippe’s team goes through the basic concept, with custom Javascript used to unpack a mini-app setup archive and then create tables, and set up the analysis views, to support requirements such as linear regression and trend analysis.


It’s likely the BI Cloud Service will take this concept further and introduce a more formalised way of packaging-up BI mini-applications and deploying them quickly to the cloud, and also maybe introduce the concept of a BI App Store or Marketplace where pre-built analytic solutions can be selected and deployed faster even than if the user tried to built the same themselves using Excel (or Tableau, even).

Of course the other aspect to Tableau is its data visualisation capabilities, and while OBIEE improved in this area a bit – with trellis charts being introduced and a new visualisation suggestion engine – it’s probably fair to say that OBIEE 11g has dropped behind the industry-state-of-the-art in this area. What’s been interesting to see though, over the past twelve months, is the widespread adoption of technologies such as D3 and other third-part visualisation tools as additional ways to add graphs and other visuals to OBIEE, with Accenture’s Kevin McGinley showcasing the art of the possible on his blog recently (parts 1, 2 and 3) and presenting on this topic at the Atlanta Rittman Mead BI Forum later this week. Techniques such as those described by Kevin involve deploying separate third-party visualisation libraries such as D3 and Flot to the WebLogic server running OBIEE, and then calling those libraries using custom code contained within narrative views; while these aren’t as developer-friendly as built-in visualisation features in the tool, they do give you the ability to go beyond the standard graphs and tables provided by OBIEE 11g, as Tom Underhill from our team explained in a blog post on OBIEE11g and D3 back in 2013.


The upcoming 2014 OBIEE11g SampleApp will most probably feature some more third-party and externally-produced visualisations along these lines, including new HTML5 and Javascript integration capabilities for 11.1.1’7’s mapping feature:


and an example of integration ADF charts – which have far more options and capabilities that the subset used in OBIEE 11g – into the OBIEE dashboard. All of this is possible with OBIEE and standard Jdeveloper/ADF, with the video previewing the SampleApp PoC Demo going through the integration process at the end.


Community Development of OBIEE Best Practices, Techniques, Product Add-Ons

One of the advantages of OBIEE now being a mature and known product is that best practices are starting to emerge around deployment, development, performance optimisation and so-on around the product. For example, our own Stewart Bryson has been putting a lot of thought into agile development and OBIEE, and topics such as automated deployment of OBIEE RPDs using Git and scripting, giving us a more industry-standard way of building and deploying RPDs now that we’ve got the ability to work with repository metadata in a more atomic format. Robin Moffatt, also from Rittman Mead, has published many articles over the past few years on monitoring, measuring and testing OBIEE performance, again giving us a more industry-standard way of regression testing OBIEE reports and monitoring the overall OBIEE experience using open-source tools.

There’s even a third-party add-on industry for OBIEE, with Christian Screen’s / Art of BI’s “BI Teamwork” being the showcase example; OBIEE still doesn’t have any collaboration or social features included in the base product, unless you count wider integration with WebCenter as the answer for this, and Christian’s BI Teamwork product fills this gap by integrating collaboration, social and SaaS integration features into the core product including localisation into key overseas OBIEE markets.


Hadoop and Big Data Integration

You’ll probably have guessed from the amount of coverage we’ve given the topic on the blog over the past few months, but we think Hadoop and big data, and particularly the technologies that will spin-off from this movement, are quite a big deal and will revolutionise what we think-of as analytics and BI over the next few years. Most of this activity has taken place outside the core world of OBIEE using tools such as Cloudera Impala, R and Tableau as the default visualisation tool, but OBIEE will play a role too, primarily through its ability to incorporate big data insights and visualisations into the core enterprise semantic model and corporate dashboards.

What this means in-practice is that OBIEE needs to be able to connect to Hadoop data sources such as Hive and Impala, and also provide a means to incorporate, visualise and explore data from non-traditional sources such as NoSQL and document databases. OBIEE made a first step in this direction with its ability to use Apache Hive as a datasource, but this really is a minimal step-one in support for big data sources, as Hive is generally considered too-slow for ad-hoc query use and the HiveServer1 ODBC driver OBIEE ships with no longer being compatible with recent Cloudera Hadoop (CDH 4.5+) releases. What’s really needed is support for Impala – an in-memory version of Hive – as a datasource, something we hacked-together with a workaround but most probably coming as a supported data source in a future version of OBIEE. What would be very interesting though is support for document-style databases such as MongoDB, giving OBIEE (or most probably, Endeca) the capability to create 360 degree-views of customer activity, including unstructured data held in these NoSQL-style databases.

Exalytics and Engineered Systems

I’d almost forgotten Exalytics from this round-up, which is ironic given its prominence in Oracle BI product marketing over the past couple of years, but not all that surprising given the lack of real innovation around the product recently. There’s certainly been a number of Exalytics updates in terms of product certification – the graphic below shows the software evolution of Exalytics since launch, going up to autumn last year when we presented on it at Enkitec E4:


whilst the Exalytics hardware over the same period has seen RAM double, and SSD disk added to improve TimesTen and Essbase startup-times.


What Exalytics has lacked, though, is something game-changing that’s only available as part of this platform. There’s a central dilemma for Oracle over Exalytics; do they develop something for OBIEE that only works on OBIEE, that’s substantial and that they hold-back from the on-premise version, or do they largely release the same software for both Exalytics, and non-Exalytics OBIEE and rely on performance tweaks which are hard to quantify for customers, and are hard for Oracle salespeople to use as differentiation for the product. So far they’ve gone for the latter option, making Exalytics – if we’re honest – a bit underwhelming for the moment, but what would be really interesting is some capability that clearly can only be supported on Exalytics – some form of in-memory analysis or processing that needs 1TB+ of RAM for enterprise datasets, possibly based on an as-yet unreleased new analytic engine, maybe based on Essbase or Oracle R technology, maybe even incorporating something from Endeca (or even – left-field – something based on Apache Spark?)

My money however is on this differentiation growing over time, and Exalytics being used extensively by Oracle to power their BI in the Cloud Service, with less emphasis over time on on-premise sales of the products and more on “powered by Exalytics” cloud services. All of that said, my line with customers when talking about Exalytics has always been – you’re spending X million $/£ on OBIEE and the BI Apps, you might as well run it on the hardware its designed for, and which in the scheme of things is only a small proportion of the overall cost; the performance difference might not be noticeable now, but over time OBIEE will be more-and-more optimised for this platform, so you might as well be on it now and also take advantage of the manageability / TCO benefits.

So anyway, that’s my “state-of-the-nation” for OBIEE as I see it today – and if you’re coming along to the Atlanta RM BI Forum event later this week, there’ll be futures stuff from Oracle that we can’t really discuss on here, beyond the 3-6 month timeline, that’ll give you a greater insight into what’s coming in late 2014 and beyond.

Categories: BI & Warehousing

RM BI Forum 2014 Brighton is a Wrap – Now on to Atlanta!

Rittman Mead Consulting - Sun, 2014-05-11 17:34

I’m writing this sitting in my hotel room in Atlanta, having flown over from the UK on Saturday following the end of the Rittman Mead BI Forum 2014 in Brighton. I think it’s probably true to say that this year was our best ever – an excellent masterclass on the Wednesday followed by even-more excellent sessions over the two main days, and now we’re doing it all again this week at the Renaissance Atlanta Midtown Hotel in Atlanta, GA.

Wednesday’s guest masterclass was by Cloudera’s Lars George, and covered the worlds of Hadoop, NoSQL and big data analytics over a frantic six-hour session. Lars was a trooper; despite a mistake over the agenda where I’d listed his sessions as being just an hour each when he’d planned (and been told by me) that they were an hour-and-a-half each, he managed to cover all of  the main topics and take the audience through Hadoop basics, data loading and processing, NoSQL and analytics using Hive, Impala, Pig and Spark. Roughly half the audience had some experience with Hadoop with the others just being vaguely acquainted with it, but Lars was an engaging speaker and stuck around for the rest of the day to answer any follow-up questions.


For me, the most valuable parts to the session were Lars’ real-world experiences in setting up Hadoop clusters, and his views on what approaches were best to analyse data in a BI and ETL context – with Spark clearly being in-favour now compared to Pig and basic MapReduce. Thanks again Lars, and to Justin Kestelyn from Cloudera for organsising it, and I’ll get a second-chance to sit through it again at the event in Atlanta this week.

The event itself proper kicked-off in the early evening with a drinks reception in the Seattle bar, followed by the Oracle keynote and then dinner. Whilst the BI Forum is primarily a community (developer and customer)-driven event, we’re very pleased to have Oracle also take part, and we traditionally give the opening keynote over to Oracle BI Product Management to take us through the latest product roadmap. This year, Matt Bedin from Oracle came over from the States to deliverer the Brighton keynote, and whilst the contents aren’t under NDA there’s an understanding we don’t blog and tweet the contents in too much detail, which then gives Oracle a bit more leeway to talk about futures and be candid about where their direction is (much like other user group events such as BIWA and ODTUG).


I think it’s safe to say that the current focus for OBIEE over the next few months is the new BI in the Cloud Service (see my presentation from Collaborate’14 for more details on what this contains), but we were also given a preview of upcoming functionality for OBIEE around data visualisation, self-service and mobile – watch this space, as they say. Thanks again to Matt Bedin for coming over from the States to delver the keynote, and for his other session later in the week where he demo’d BI in the Cloud and several usage scenarios.

We were also really pleased to be joined by some some of the top OBIEE, Endeca and ODI developers around the US and Europe, including Michael Rainey (Rittman Mead) and Nick Hurt (IFPI), Truls Bergensen, Emiel van Bockel (CB), Robin Moffatt (Rittman Mead), Andrew Bond (Oracle) and Stewart Bryson (Rittman Mead), and none-other than Christian Berg, an independent OBIEE / Essbase developer who’s well-known to the community through his blog and through his Twitter handle, @Nephentur – we’ll have all the slides from the sessions up on the blog once the US event is over, and congratulations to Robin for winning the “Best Speaker” award for Brighton for his presentation “No Silver Bullets: OBIEE Performance in the Real World”.


We had a few special overseas guests in Brighton too; Christian Screen from Art of BI Software came across (he’ll be in Atlanta too this week, presenting this time), and we were also joined by Oracle’s Reiner Zimmerman, who some of you from the database/DW-side will known from the Oracle DW Global Leaders’ Program. For me though one of the highlights was the joint session with Oracle’s Andrew Bond and our own Stewart Bryson, where they presented an update to the Oracle Information Management Reference Architecture, something we’ve been developing jointly with Andrew’s team and which now incorporates some of our thoughts around the agile deployment of this type of architecture. More on this on the blog shortly, and look out for the white paper and videos Andrew’s team are producing which should be out on OTN soon.


So that’s it for Brighton this year – and now we’re doing it all again in Atlanta this week at the Renaissance Atlanta Midtown Hotel. We’ve got Lars George again delivering his masterclass, and an excellent – dare I say it, even better than Brighton’s – array of sessions including ones on Endeca, the In-Memory Option for the Oracle Database, TimesTen, OBIEE, BI Apps and Essbase. There’s still a few places left so if you’re interested in coming, you can book here and we’ll see you in Atlanta later this week!


Categories: BI & Warehousing

Adding Geocoding Capabilities to Pig through Custom UDFs

Rittman Mead Consulting - Sun, 2014-05-04 15:28

In the previous two posts in this series, I’ve used Hive and Pig to process and analyse data from log files generated by the Rittman Mead Blog webserver. In the article on Hive, I created a relational table structure over a directory containing the log files, using a regular expression SerDe to split each log line up into its constituent parts (IP address, page requested, status code and so on). I then brought in another table of data, containing IP address ranges and countries they were assigned to, so that I could determine what parts of the world were accessing our site over time.

In the second example, I took the same log files but processed them this time using Apache Pig. I used Pig’s dataflow-style approach to loading and analysing data to progressively filter, pivot and analyse the log file dataset, and then joined it to an export of pages and authors from WordPress, the CMS that we used to run the website, so I could determine who wrote the most popular blog articles over the period covered by the logs.

But the first example, where I joined the log file data to the geocoding table, had a bit of an issue that only came-up when I tested it with a larger set of data than I used at the end of that article. In the article example, I limited the amount of log file rows to just five, at the time to keep the example readable on the blog, but when I tried it later on with the full dataset, the query eventually failed with an out-of-memory error from the Hadoop cluster. Now in practice, I could probably have increased the memory (java heap space) or otherwise got the query through, but geo-tagging my data in this way – as a big table join, and using an in-memory database engine (Impala) to do it – probably isn’t the most sensible way to do a single value lookup as part of a Hadoop transformation – instead,  this is probably something better done through what’s called a “user-defined function”.

Both Hive and Pig support used-defined functions (UDFs), and a quick Google search brought up one for Hive called GeocodeIP, on Github, that looks like it might do the job. Sticking with Pig for the moment though, we thought this might be a good opportunity to see how UDFs for Pig are created, and so my colleague, Nelio Guimaraes, put together the following example to walk through how a typical one might be created. Before start though, a bit of background.

The problem we have is that we need to match an IP address in a webserver log file with an IP address range in a lookup table. For example, the IP address in the lookup table might be, any the lookup database would have an IP address range from, say, to which allocates to a particular country – Poland, for example. The lookup database itself comes from Maxmind, and there’s a formula they use to convert IP addresses to integers, like this:


so that you can do a simple BETWEEN in an SQL join to locate the range that matches the incoming IP address.


Except Pig, like Hive, can’t normally support non-equijoins, which leads us to UDFs and other approaches to getting the country for our IP address. Pig, again like Hive, is however extensible and its relatively easy to add Pig UDFs either yourself, or through UDF libraries like Pig’s Piggybank. The best language to write UDFs in is Java as it gives access to the largest amount of Pig native functionality (such as the ability to write custom data loaders and unloaders), but for what we’re doing Python will be fine, so let’s put one together in Python to do our IP address Geocoding and show how the process works.

Going back to the Pig scripts we put together yesterday, they started-off by declaring a relation that loaded the raw log files in from a directory in HDFS, and then used another relation to parse the first one via a regular expression, so we had each of the log file elements in its own column, like this:

raw_logs =LOAD '/user/root/logs/' USING TextLoader AS (line:chararray);
logs_base = FOREACH raw_logs
 '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
 remoteAddr: chararray, remoteLogname: chararray, user: chararray,
 time: chararray, request: chararray, status: int, bytes_string: chararray,
 eferrer: chararray, browser: chararray

At this point we can run-off a list of the top 5 browser type based on page access, for example:

grunt> by_browser_group = GROUP logs_base BY browser;                              
grunt> by_browser_group_count = FOREACH by_browser_group  
>> GENERATE group, COUNT($1);
grunt> by_browser_group_sorted = ORDER by_browser_group_count BY $1 DESC;
grunt> by_browser_group_top_5 = LIMIT by_browser_group_sorted 5;
grunt> dump by_browser_group_top_5
(Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36,675161)
(Mozilla/5.0 (compatible; - free monitoring service;,371078)
(Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0,201861)
(Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0,167851)

See the previous post, and this other blog article, for more background on how you do grouping and aggregating in Pig, if you’re not familiar with the syntax.

But what if we want to group by country? That’s where the geocoding comes in, and the Pig UDF we’re going to create. As we’re going to create this example using Python, we’ll be using the pygeoip Python API that you install through the pip, the Python package manager, and the GeoLite Country database (.dat) file from Maxmind, who make this basic version available for download free, then follow these steps to set the Python UDF up:

1. On the master node on your Hadoop cluster (i’m using a three-node CDH4.6 cluster) where Pig and Hive run, install pip, and then download the pygeoip API, like this:

pip install pygeoip

2. Copy the GeoIP.dat file to somewhere on the Hadoop master node, for example /home/nelio/. Make a note of the full path to the GeoIP.dat file, and then copy the file to the same location on all of the worker nodes – there’s probably a way to cache this or otherwise automatically distribute the file (suggestions welcome), but for now this will ensure that each worker node can get access to a copy of this file.

3. Using a text editor, create the python script that will provide the Python UDF, like this, substituting the path to your GeoIP.dat file if it’s somewhere else. Once done, save the file as to the same directory (/home/nelio, in this example) – note that this file only needs to go on the main, master node in your cluster, and Pig/MapReduce will distribute it to the worker nodes when we register it in our Pig script, later on.


import sys
import pygeoip

def getCountry(ip):
    gi = pygeoip.GeoIP('/home/nelio/GeoIP.dat')
    country = gi.country_name_by_addr(ip) 
    return country

Note that the sys.path.append line in the script is so that Jython knows to look in the place were the new python module, pygeoip, when starting up.

4. Let’s start another Pig session now and try and use this UDF. I exit back to the OS command prompt, and change directory to where I’ve stored the python file and the GeoIP.dat file, and start another Grunt shell session:

[root@cdh4-node1 ~]# cd /home/nelio
[root@cdh4-node1 nelio]# pig

Now if I want to use this Python Pig UDF in a Grunt shell session, I need to register it as an scripting UDF either at the Grunt shell or in a Pig script I run, so let’s start with that, and then bring in the log data as before:

grunt> register '' using jython as pythonGeoIP;
grunt> raw_logs =LOAD '/user/root/logs/' USING TextLoader AS (line:chararray);
grunt> logs_base = FOREACH raw_logs
>>         (
>>                 REGEX_EXTRACT_ALL
>>                 (
>>                         line,
>>                         '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
>>                 )
>>         )
>>         AS
>>         (
>>                 remoteAddr: chararray, remoteLogname: chararray, user: chararray,
>>                 time: chararray, request: chararray, status: int, bytes_string: chararray,
>>                 eferrer: chararray, browser: chararray
>>         );

I’ll now define a new relation (alias) projecting just the IP addresses from the combined log files, and then filter that into another alias so we only have valid IP addresses to work with:

grunt> ipaddress =  FOREACH logs_base GENERATE remoteAddr;
grunt> clean_ip  = FILTER ipaddress BY (remoteAddr matches '^.*?((?:\\d{1,3}\\.){3}\\d{1,3}).*?$');

Now we get to use the Python UDF. We’ll pass to it these IP addressees, with the UDF then returning the country that the IP address is located in, based on Maxmind’s IP address ranges.

grunt> country_by_ip = FOREACH clean_ip GENERATE pythonGeoIP.getCountry(remoteAddr);
2014-05-05 05:22:50,141 [MainThread] INFO  org.apache.pig.scripting.jython.JythonFunction - Schema 'country:chararray' defined for func getCountry
grunt> describe country_by_ip
country_by_ip: {country: chararray}

So this function will have converted all of the IP addresses in the logs to country names; let’s now group, count, order and select the top five from that list.

grunt> group_by_country = GROUP country_by_ip BY country;
grunt> count_by_country = FOREACH group_by_country GENERATE FLATTEN(group) as country, COUNT(country_by_ip) AS (hits_per_country:long);
grunt> order_by_access = ORDER count_by_country BY hits_per_country DESC;
grunt> top5_country = LIMIT order_by_access 5;

Of course all this has done so far is set-up a data-flow in Pig, telling it how to move the data through the pipeline and arrive at the final output I’m interested in; let’s now run the process by using the “dump” command:

grunt> dump top5_country
(United States,2012311)
(United Kingdom,459422)

So that’s a simple example of a Pig UDF, in this instance written in Python. There’s other ways to extend Pig beyond UDFs – Pig Streaming is the obvious alternative, where the entire relation goes through the streaming interface to be processed and then output back into Pig, and hopefully we’ll cover this at some point in the future – or then again, maybe it’s now time to take a proper look at Spark.

Categories: BI & Warehousing

Simple Hadoop Dataflows using Apache Pig and CDH4.6

Rittman Mead Consulting - Fri, 2014-05-02 00:36

The other day I took some logs from the Apache webserver that runs the Rittman Mead website, and analysed them using Hadoop CDH5, Apache Hive and Impala to get some basic metrics on number of hits per month, where the hits came from and so on. Hive and Impala are great for analysing data sitting on HDFS on a Hadoop cluster, but like SQL compared to PL/SQL or C++, everything you do is declarative and set-based whereas sometimes, you want to build up your dataset using a dataflow-type approach, particularly if you’ve come from a programming vs. a data warehousing background.

If you’ve been looking at Hadoop for a while, you’ll probably therefore know there’s another basic high-level-language approach to querying Hadoop data to accompany Hive, and it’s called “Pig”. Pig, like Hive, is an Apache project and provides an engine for creating and executing data flows, in parallel, on Hadoop. Like Hive, jobs you create in Pig eventually translate into MapReduce jobs (with the advantages and disadvantages that this brings), and has concepts that are similar – but just that little bit different – to relational flows such as filters, joins and sorts.

It’s often called a “procedural” language (as opposed to Hive’s declarative language), but really it’s not – it’s a “data flow language” that has you specifically set out the data flow as the main part of a Pig program, rather than it being a by-product of the if/then/elses and control structures of a procedural language. For people like me that comes from an Oracle data warehousing background, in most cases we’d feel more comfortable using Hive’s set-based transformations to do our data loading and transformation on Hadoop, but in some cases – particularly when you’re querying data interactively, building up a data pipeline and working with nested data sets – it can be more appropriate.

Connecting to the Pig Console, and Pig Execution Options

Iteratively examining and analysing data from webserver log files is a great example of where Pig could be useful, as you naturally hone-down and pivot the data as you’re looking at it, and in-effect you’re looking to create a data pipeline from the raw logs through to whatever summary tables or files you’re looking to create. So let’s go back to the same input log files I used in the previous post on Hive and Impala, and this time bring them into Pig. 

Within CDH (Cloudera Distribution including Hadoop) you can run Pig scripts either interactively from the Pig command-line shell, called “Grunt”, or you can submit them as workflow jobs using the Hue web interface and the Oozie workflow scheduler; the advantage when you’re starting to working with the interactive Grunt shell is that you can run your commands one-by-one and examine the metadata structures that you create along the way, so let’s use that approach first and move onto batch scheduling later on.

I’ll start by SSH’ing into one of the CDH4.6 nodes and starting the Grunt shell:

officeimac:~ markrittman$ ssh root@cdh4-node1
root@cdh4-node1's password: 
Last login: Sat May  3 06:38:18 2014 from
[root@cdh4-node1 ~]# pig
2014-05-03 06:44:39,257 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.0-cdh4.6.0 (rexported) compiled Feb 26 2014, 03:01:22
2014-05-03 06:44:39,258 [main] INFO  org.apache.pig.Main - Logging error messages to: /root/pig_1399095879254.log
2014-05-03 06:44:39,301 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2014-05-03 06:44:39,663 [main] WARN  org.apache.hadoop.conf.Configuration - is deprecated. Instead, use fs.defaultFS
2014-05-03 06:44:39,663 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://
2014-05-03 06:44:40,392 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at:
2014-05-03 06:44:40,394 [main] WARN  org.apache.hadoop.conf.Configuration - is deprecated. Instead, use fs.defaultFS

Even from within the Grunt shell, there’s two ways I can then run Pig. The default way is to have Grunt run your Pig commands as you’d expect, converting them in the end to MapReduce jobs which then run on your Hadoop cluster. Or, you can run in “local mode”, which again uses MapReduce but only runs on the machine you’re logged in to and only single-threaded, but can often be faster for when you’re just playing around with a local dataset and you want to see results fast (you can turn on local mode by adding an ‘-x local’ flag when starting Grunt). In my example, I’m going to run Grunt in regular MapReduce mode though anyway.

Loading and Parsing the Weblog Files

I then define my first pig relation, analogous to a relational table and technically, a named Pig “bag”, like this:

grunt> raw_logs =LOAD '/user/root/logs/' USING TextLoader AS (line:chararray);

Compared to the Pig table DDL script in the previous article example I posted, we declare the incoming dataset much more programmatically – the first row of the script creates a relation called “raw_logs”, analogous to a table in Hive, and declares it as having a single column (“line:array”) that maps onto a directory of files in HDFS (“/user/root/logs”). You can ask Pig (through the Pig command-line client, which I’m using now) to list-out the structure of this relation using the “describe” command:

grunt> describe raw_logs;
raw_logs: {line: chararray}

In this form the logs aren’t too useful though as each row contains all the data we want, as a single field. To take a look at what we’re working with currently, let’s create another relation that limits down the dataset to just five rows, and use the DUMP command to display the relation’s data on the screen:

grunt> raw_logs_limit_5 = LIMIT raw_logs 5;
grunt> DUMP raw_logs_limit_5
2014-05-03 16:15:13,260 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
( - - [03/Apr/2014:20:56:34 +0000] "GET /wp-content/uploads/2012/01/Partial-Hybrid-Model-e1327470743307.png HTTP/1.1" 200 8432 "" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)")
( - - [08/Apr/2014:20:13:45 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 - "" "Mozilla/5.0 (Windows NT 5.1; rv:26.0) Gecko/20100101 Firefox/26.0")
( - - [11/Apr/2014:06:08:10 +0000] "GET /wp-content/plugins/featured-content-gallery/scripts/ HTTP/1.1" 200 6075 "" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)")
( - - [11/Apr/2014:06:08:10 +0000] "GET /wp-content/plugins/featured-content-gallery/scripts/ HTTP/1.1" 200 492 "" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)")
( - - [18/Mar/2014:14:11:52 +0000] "GET /wp-content/uploads/2013/08/NewImage14.png HTTP/1.1" 200 217190 "" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)")

What I’ve omitted for clarity in the above output is the MapReduce console output – what you’ll see if you run this in MapReduce mode is the process starting up, and then running, to retrieve 5 rows effectively at random from the whole set of log files, process them through the Map > Shuffle > Reduce process and then return them to the Grunt shell.

What would be really good though, of course, is if we could split these single log row columns into multiple ones, one for each part of the log entry. In the Hive example I posted the other day, I did this through a Hive “SerDe” that used a regular expression to split the file, and I can do something similar in Pig; Pig has a function called REGEX_EXTRACT_ALL that takes a regular expression and creates a column for each part of the expression, and so I can use it in conjunction with another relational operator, GENERATE FLATTEN, to take the first set of data, run it through the regular expression and come out with another set of data that’s been split as I want it:

logs_base = FOREACH raw_logs
            '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
        remoteAddr: chararray, remoteLogname: chararray, user: chararray,  
        time: chararray, request: chararray, status: chararray, bytes_string: chararray, 
        referrer: chararray, browser: chararray

grunt> describe logs_base
logs_base: {remoteAddr: chararray,remoteLogname: chararray,user: chararray,time: chararray,request: chararray,status: chararray,bytes_string: chararray,referrer: chararray,browser: chararray}

GENERATE in Pig tells it to create (or “project”( some columns out of an incoming dataset; FLATTEN eliminates any nesting the resulting output (we’ll see more of FLATTEN and nesting in a moment). Notice how the DESCRIBE command afterwards now shows individual columns for the log elements, rather than just one single “line:chararray” column.

Using Pig to Interactively Filter the Dataset

So now we’ve got a more useful set of rows and columns in the Pig relation, and like an Oracle table, unless we do something to order them later, they’re effectively held in random order. Something we can do now is filter the dataset, for example creating another relation containing just those log entries where the request 404’d, and the further filter that dataset to those 404’d requests that were made by users using IE6:

grunt> logs_404 = FILTER logs_base BY status == ‘404';
grunt> logs_404_ie6 = FILTER logs_404 BY browser == 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)';

So how many of our website users are on IE6 and getting page not available errors? To find out, I create another relation that groups the entries up in a single row, and then generates a count of those rows that were aggregated:

grunt> logs_404_ie6_count = FOREACH (GROUP logs_404_ie6 ALL) GENERATE COUNT(logs_404_ie6);
grunt> DUMP logs_404_ie6_count

and I can do a similar thing all of the 404’s:

grunt> logs_404_count = FOREACH (GROUP logs_404 ALL) GENERATE COUNT(logs_404);
grunt> dump logs_404_count

You can see these Pig scripts running in CDH’s Cloudera Manager web application, with the screenshot below showing one of them at the point where 92% of the Mapper parts have completed, waiting to hand-off to the Reducers; the console output in Grunt will show you the status too, the output of which I removed from the above two statements for clarity.


Grouping, Subsetting and Aggregating Data using Pig

How we generate counts and other aggregates is interesting in Pig. Pig has a relational operator called GROUP as we’ve seen before, and when you GROUP a relation by a column, or a group of columns, it creates a new relation that contains two columns; one called “group” that has the same datatype as whatever you grouped on (or a “tuple” made up of multiple columns, if you grouped-on more than one column), and a second column that’s named after whatever you grouped, i.e. the original relation. To take an example, if we grouped the logs_base relation on status code, you’d see the following if you then describe the resulting relation:

grunt> by_status = GROUP logs_base BY status;
grunt> describe by_status
by_status: {group: chararray,logs_base: {(remoteAddr: chararray,remoteLogname: chararray,user: chararray,time: chararray,request: chararray,status: chararray,bytes_string: chararray,referrer: chararray,browser: chararray)}}

What’s interesting though about a pig GROUP, and conceptually different to SQL (and therefore Hive)’s GROUP BY, is that this second column is actually in Pig terms a “bag”, a bag of rows (or “tuples”) that are unaltered compared to the original relation, i.e. they’ve not been aggregated up by the grouping, but are still in their same detail-level. So Pig gives you, apart from its step-by-step data flow method of working with data, this ability to group data whilst still preserving the detail of the individual grouped rows, leaving any summation or other aggregation step to something you do afterwards. So for example, if I wanted to see how many 200s, 404’s and so on my log file dataset contained in total, I then tell Pig to iterate through these bags, project out the columns I’m interested in (in this case, just the status) and also perform aggregation over the grouping buckets specified in the GROUP relational operator:

grunt> by_status_count = FOREACH by_status GENERATE FLATTEN (group) as status, COUNT(logs_base);
grunt> dump by_status_count

So in that example, we told Pig to list out all of the groupings (i.e. the distinct list of status codes), and then run a count of rows against each of those groupings, giving us the output we’re interested in. We could, however, not aggregate those rows at this point though and instead treat each “bucket” formed by the grouping as a sub-selection, allowing us to, for example, investigate in more detail when and why the 301 errors – “Moved Permanently” – were caused. Let’s use that now to find out what the top 10 requests were that led to HTTP 301 errors, starting by creating another relation that just contains the ‘301’ group:

grunt> by_status_301 = FILTER by_status BY group == '301';
grunt> describe by_status_301

by_status_301: {group: chararray,logs_base: {(remoteAddr: chararray,remoteLogname: chararray,user: chararray,time: chararray,request: chararray,status: chararray,bytes_string: chararray,referrer: chararray,browser: chararray)}}

Looking at the structure of the relation this has created though, you can see that the rows we’ve grouped are all contained within a single tuple called “logs_base”, and to do anything interesting with that data we’ll need to flatten it, which takes that tuple and un-nests it:

grunt> by_status_301_flattened = FOREACH by_status_301
grunt> describe by_status_301_flattened
by_status_301_flattened: {group: chararray,logs_base::remoteAddr: chararray,logs_base::remoteLogname: chararray,logs_base::user: chararray,logs_base::time: chararray,logs_base::request: chararray,logs_base::status: chararray,logs_base::bytes_string: chararray,logs_base::referrer: chararray,logs_base::browser: chararray}

Notice also how I referenced the two columns in the by_status_301 relation by positional notation ($0 and $1)? This is handy when either you’ve not got a proper schema defined for your data (all part of the “pigs eat anything” approach for Pig, in that it even handles data you don’t yet have a formal schema for), or when it’s just easier to refer to a column by position than work out it’s formal name.

So now we’ve got our list of log entries that have recorded HTTP 301 “permanently moved” error messages, let’s use another relation to project just the columns we want – the date and the requests – and also use some Pig string functions to extract the day, month and year along, and also split the request field up into its constituent method, URI and protocol fields:

grunt> by_status_301_date_and_urls = FOREACH by_status_301_flattened                                                                   
>> GENERATE SUBSTRING(time,3,6) as month,                                                                                          
>> SUBSTRING(time,7,11) as year, FLATTEN(STRSPLIT(request,' ',5)) AS (method:chararray, request_page:chararray, protocol:chararray);
grunt> describe by_status_301_date_and_urls
by_status_301_date_and_urls: {month: chararray,year: chararray,method: chararray,request_page: chararray,protocol: chararray}
grunt> by_status_date_and_urls_group = GROUP by_status_301_date_and_urls BY (year,month);                                              

grunt> describe by_status_date_and_urls_group
by_status_date_and_urls_group: {group: (year: chararray,month: chararray),by_status_301_date_and_urls: {(month: chararray,year: chararray,method: chararray,request_page: chararray,protocol: chararray)}}
grunt> by_status_date_and_urls_group_count = FOREACH by_status_date_and_urls_group
>> GENERATE FLATTEN(group) as (year,month),                                   
>> COUNT(by_status_301_date_and_urls);

All of these statements just set-up the data flow, and no actual processing takes place until we choose to dump, or store, the results of the data flow – which again makes Pig great for iteratively building-up a data flow, or in BI terms maybe an ETL flow, before finally pulling the trigger at the end and generating the end result. Let’s do that now using the dump command:

grunt> dump by_status_date_and_urls_group_count     
Job Stats (time in seconds):
Successfully read 5341613 records (1448814961 bytes) from: "/user/root/logs"
Successfully stored 3 records (63 bytes) in: "hdfs://"
Total records written : 3
Total bytes written : 63
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 1
Total records proactively spilled: 221601
Job DAG:
2014-05-04 00:33:50,193 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 169 time(s).
2014-05-04 00:33:50,193 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-05-04 00:33:50,194 [main] INFO - Key [pig.schematuple] was not set... will not generate code.
2014-05-04 00:33:50,199 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-05-04 00:33:50,199 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

So we had around 85k “page permanently moved” errors in April, only a few in February, and a much larger amount in March 2014. So which web page requests in March 2014 were the biggest cause of this error? Let’s focus on just that month and list out the top ten page requests that hit this error:

grunt> mar_2014_urls = FILTER by_status_301_date_and_urls BY (year == '2014' AND month == 'Mar');
grunt> mar_2014_301_url_group = GROUP mar_2014_urls BY request_page;                             
grunt> describe mar_2014_301_url_group
mar_2014_301_url_group: {group: chararray,mar_2014_urls: {(month: chararray,year: chararray,method: chararray,request_page: chararray,protocol: chararray)}}
grunt> mar_2014_301_url_group_count = FOREACH mar_2014_301_url_group
>> GENERATE FLATTEN(group) as request,                          
>> COUNT(mar_2014_urls) as num;
grunt> mar_2014_301_url_group_count_sorted = ORDER mar_2014_301_url_group_count by num DESC;
grunt> mar_2014_301_url_group_count_limited = LIMIT mar_2014_301_url_group_count_sorted 10;
grunt> dump mar_2014_301_url_group_count_limited

Joining Datasets in Pig 

So far we’ve worked with just a single set of data – the Apache weblog files that we’ve then filtered, subsetted, parsed, analysed and so forth. But what would be really interesting though, would be if we can bring in some additional, reference or other lookup data to help us make more sense of the log activity on our website. One of the motivators for the people behind Pig, right at the start, was to give Hadoop the ability to join datasets, which up until then was really hard to do with just Java and MapReduce; as we’ll see later on there are still a lot of restrictions on how these joins take place, but Pig gives you the ability to join two or more datasets together, which we’ll do now in another example where we’ll look at the most popular blog posts, and blog authors, over the period covered by our logs.

Let’s start by taking the full set of logs, parsed into the separate elements of the log file entry, and add in additional columns for month and the request elements:

grunt> page_requests = FOREACH logs_base                      
>> GENERATE SUBSTRING(time,3,6) as month,                   
>> FLATTEN(STRSPLIT(request,' ',5)) AS (method:chararray, request_page:chararray, protocol:chararray);
grunt> describe page_requests;
page_requests: {month: chararray,method: chararray,request_page: chararray,protocol: chararray}

One thing you’re taught with Pig is “project early, and often”, so let’s remove the method and protocol columns from that dataset and then filter the remaining page requests to remove those that are blank or aren’t blog post requests:

grunt> page_requests_short = FOREACH page_requests
>> GENERATE $0,$2;
grunt> page_requests_short_filtered = FILTER page_requests_short BY (request_page is not null AND SUBSTRING(request_page,0,3) == '/20');
grunt> page_requests_count = FOREACH (GROUP page_requests_short_filtered ALL) GENERATE COUNT (page_requests_short_filtered);
grunt> dump page_requests_count

Let’s now reduce that list down to the top ten page requests, the way we did before with pages causing 301 errors:

grunt> page_request_group = GROUP page_requests_short_filtered BY request_page;
grunt> page_request_group_count = FOREACH page_request_group GENERATE $0, COUNT(page_requests_short_filtered) as total_hits;
grunt> page_request_group_count_sorted = ORDER page_request_group_count BY $1 DESC;
grunt> page_request_group_count_limited = LIMIT page_request_group_count_sorted 10;
grunt> dump page_request_group_count_limited

Not bad. What would be even better though, would be if I could retrieve the full names of these posts in WordPress, on which our website runs, and also the author name. I’ve got text file export file of post names, URLs and authors that’s been previously exported from our WordPress install, so let’s declare another relation to hold initially the raw rows from that file, like this:

grunt> raw_posts = LOAD '/user/root/posts/' USING TextLoader AS (line:chararray);

Then split that file by the semicolon that delimits each of the entries (author, post name etc):

grunt> posts_line = FOREACH raw_posts 
>> (
>> STRSPLIT(line,';',10)
>> )
>> AS 
>> (
>> post_id: chararray, title: chararray, post_date: chararray,
>> type: chararray, author: chararray, post_name: chararray,
>> url_generated: chararray
>> );
grunt> describe posts_line
posts_line: {post_id: chararray,title: chararray,post_date: chararray,type: chararray,author: chararray,post_name: chararray,url_generated: chararray}

I’ll now take that relation and project just the columns I’m interested in:

grunt> posts_and_authors = FOREACH posts_line
>> GENERATE title,author,post_name,CONCAT(REPLACE(url_generated,'"',''),'/') AS (url_generated:chararray);
grunt> describe posts_and_authors
posts_and_authors: {title: chararray,author: chararray,post_name: chararray,url_generated: chararray}

Now I’ll do the join, and then take that join and use it to generate a combined list of pages and who wrote them:

grunt> pages_and_authors_join = JOIN posts_and_authors BY url_generated, page_request_group_count_limited BY group;
grunt> pages_and_authors = FOREACH pages_and_authors_join GENERATE url_generated, post_name, author, total_hits;
grunt> top_pages_and_authors = ORDER pages_and_authors BY total_hits DESC;

and then finally, output the joined set of data to a comma-separated file in HDFS:

grunt> STORE top_pages_and_authors into '/user/root/top-pages-and-authors.csv' USING PigStorage(‘,');

Once that’s run, I can use Grunt’s “cat” command to output the contents of the file I just created:

grunt> cat /user/root/top-pages-and-authors.csv                                                      
/2014/03/obiee-dashboard-prompt-at-least-one-mandatory/,"obiee-dashboard-prompt-at-least-one-mandatory","Gianni Ceresa",4610
/2012/03/obiee-11g-security-week-connecting-to-active-directory-and-obtaining-group-membership-from-database-tables/,"obiee-11g-security-week-connecting-to-active-directory-and-obtaining-group-membership-from-database-tables","Mark Rittman",3528
/2013/04/upgrading-obiee-to-11-1-1-7/,"upgrading-obiee-to-11-1-1-7","Robin Moffatt",2963
/2014/04/bi-forum-2014-preview-no-silver-bullets-obiee-performance-in-the-real-world/,"bi-forum-2014-preview-no-silver-bullets-obiee-performance-in-the-real-world","Robin Moffatt",2605
/2014/03/the-secret-life-of-conditional-formatting-in-obiee/,"the-secret-life-of-conditional-formatting-in-obiee","Gianni Ceresa",2579
/2012/03/obiee-11g-security-week-subject-area-catalog-and-functional-area-security-2/,"obiee-11g-security-week-subject-area-catalog-and-functional-area-security-2","Mark Rittman",2410
/2014/03/introducing-obi-metrics-agent-an-open-source-obiee-metrics-collector/,"introducing-obi-metrics-agent-an-open-source-obiee-metrics-collector","Robin Moffatt",2321
/2014/03/using-oracle-r-enterprise-to-analyze-large-in-database-datasets/,"using-oracle-r-enterprise-to-analyze-large-in-database-datasets","Mark Rittman",2309
/2014/03/using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm/,"using-sqoop-for-loading-oracle-data-into-hadoop-on-the-bigdatalite-vm","Mark Rittman",2240
/2012/03/obiee-11-1-1-6-new-dashboard-analysis-and-reporting-features/,"obiee-11-1-1-6-new-dashboard-analysis-and-reporting-features","Mark Rittman",2160

But What About More Complex Joins and Transformations … Enter, Pig Extensibility

This is of course great, but going back to my previous Hive example I also managed to geo-code the log file entries, converting the IP addresses into country names via a lookup to a geocoding database. What made that example “interesting” though was the need to join the Hive table of log posts to the geocode table via a BETWEEN, or > and < than operators, so that I could locate each IP address within the ranges given by the geocoding database – and the reason it got interesting was that Hive can only do equi-joins, not non-equijoins or joins involving greater than, BETWEEN and so on. Impala *could* do it, and on a small set of input rows – five in my example – it worked fine. Try and scale the Impala query up to the full dataset though, and the query fails, because it runs out of memory; and that’s potentially the issue with Impala, and set-based queries, as Impala does everything in-memory, and most Hadoop systems are designed for fast I/O, not lots of memory. 

So can Pig help here? Well, it’s actually got the same limitation – non-equijoins are actually quite difficult to do in Hadoop because of the way MapReduce works, but where Pig could perhaps help is through its extensibility – you can stream Pig data, for example IP addresses, through Perl and Python scripts to return the relevant country, or you can write Pig UDFs – User-Defined Functions – to return the information we need in a similar way to how PL/SQL functions in Oracle let you call-out to arbitrary functions to return the results of a more complex look-up. But this is also where things get a bit more complicated, so we’ll save this to the next post in this series, where I’ll also be joined by my colleague Nelio who’s spent the best part of this last week VPNd into my VMWare-based Hadoop cluster getting this last example working.

Categories: BI & Warehousing

New ODI12c Article, and Details of our Inaugural ODI12c Course in Brighton, May 12th-14th 2014

Rittman Mead Consulting - Thu, 2014-05-01 09:11


Oracle have just published the May/June 2014 edition of Oracle Magazine, and my business analytics column this time round is on the new 12c release of Oracle Data Integrator.

In “Go with the Flow” I look at how this new editor supports OWB-style multi-step mappings, and how new features like “deployment specifications” allow you to choose different load strategies depending on whether you’re doing a full, or an incremental load. On the same topic, you might also want to take a look at my colleague Stewart Bryson’s recent article on the Oracle Technology Network, “Making the Move from Oracle Warehouse Builder to Oracle Data Integrator 12c”, where he takes an in-depth look at what’s involved in migrating from, and interoperating with, Oracle Warehouse Builder, and what’s in-store for OWB developers when they upgrade to the new 12c release of ODI.

This is actually excellent timing, as we’re just about to launch our new ODI12c training, with our initial course being a three-day ODI12c bootcamp that’s running for the first time in Brighton, UK, from May 12th -14th 2014. Based on this latest release of Oracle Data Integrator, this three-day course assumes no prior knowledge and takes you through everything you need to know to get started with ODI12c.

From setting up the topology through to creating mapping, packages and load plans, this course features modules and labs covering many aspects of ODI 12c functionality. As with all of our courses, we bring our trainer to you and teach all of your team, together, how to make the most of Oracle’s premier data integration tool, with one of our experienced consultants leading the sessions and sharing their project experience.

This first run of the course will be taught by the course author, Oracle ACE Edel Kammermann, accompanied by Jerome Francoisse, our lead beta-tester for ODI12c and speaker at events such as Oracle Openworld and RMOUG training days. If you’ve been looking to get trained-up on the new 12c release of Oracle Data Integrator, this is an excellent opportunity to learn the basics in just three days, down in sunny Brighton in May! Course details are as follows:

  • Duration : 3 Days
  • Course Delivery : Instructor-led with labs, on-site at customer location
  • Who Should Attend : Developers, consultants, project managers, technical leads, DBAs
  • Prerequisites : None

Detailed Course Agenda :

  • Getting Started with ODI 12c
  • ODI Topology
  • ODI Projects
  • Models and Datastores
  • Data Quality in a Model
  • Introduction to ODI Mappings
  • ODI Procedures, Variables, Sequences, and User Functions
  • Advanced ODI Mappings
  • ODI Packages
  • Scenarios in ODI
  • The ODI Debugger
  • ODI Load Plans

To book a place on the course, running in Brighton, UK on May 12th-14th 2014, just click on this link - we’ll be running the course in the US shortly afterwards. Finally, if you’ve got any questions about this course or any of our other OBIEE, ODI, Oracle BI Apps or OWB courses, just drop us a line at

Categories: BI & Warehousing

Previewing TimesTen, Endeca and Oracle DW Sessions at the Brighton BI Forum 2014

Rittman Mead Consulting - Thu, 2014-05-01 02:00

It’s under a week now to the first of the two Rittman Mead BI Forum 2014 events, with Brighton running next week at the Hotel Seattle and then Atlanta the week after, at the Renaissance Atlanta Midtown Hotel. Earlier in the week I went through a more detailed agenda for the Lars George Cloudera Hadoop Masterclass, and the week before Stewart covered-off some of the Oracle sessions at the Atlanta event, but as a final preview of this series I just wanted to talk about three session running at next week’s Brighton event.


Someone I’ve got to know pretty well over the last year is Oracle’s Chris Jenkins, who’s the face of TimesTen development in the UK. I first ran into Chris, and his colleague Susan Cheung, late last year when I posted a series of articles on TimesTen vs. Essbase ASO, and then Chris presented alongside myself and Peak Indicators’ Tony Heljula at last year’s Oracle Openworld, on TimesTen on Exalytics Best Practices. Chris kindly agreed to come along to the Brighton BI Forum and share some of his tips and techniques on TimesTen development, and also answer some of the questions from members of the audience implementing TimesTen as part of their OBIEE setup. Over to Chris:

“Since the launch of Exalytics TimesTen has been at it’s heart delivering high performance access to relational data to support the ‘speed of thought’ experience. But it hasn’t all been plain sailing; each use case has its own specific challenges and correct configuration, adopting best operational practice and properly tuning the TimesTen database to support the workload are essential to getting the best results. When working with customers I often come across situations where things are not setup quite as well as they might be or where a less than optimal approach has been adopted, and this can negatively affect performance or manageability.

In my session I will highlight the most common pitfalls and show how to avoid them. I will also discuss best practices for operation and data loading and look at how to optimise the TimesTen database for your workload. And of course there is the opportunity to ask questions! By the end of the session I hope that you will have a good understanding of how to get the best out of TimesTen for your particular use case.”


Another speaker speaking for the first time at the BI Forum is Truls Bergersen, but Truls will of course be well-known to the European user group community through his work with the Norwegian Oracle User Group, who run the excellent OUGN conference cruise each year around April. Truls has been working with Oracle’s BI and data warehousing tools for many years, but more recently has been taking a close look at Endeca Information Discovery, the search and data discovery tool Oracle added to their BI portfolio a couple of years ago. According to Truls …

“It’s been almost two and a half years now, since Oracle acquired Endeca,and in that period the tool has been given a few enhancements. E.g.improvements have been done to the look-and-feel of the UI, it has beenadded support for loading JSON and OBI presentation tables, and the toolcan now be installed on Weblogic. My two favorite things, however, are theself service provisioning and eBS extensions.

The goal of my presentation is to give the audience a good overview of thetool from a data architect’s point of view, and how the tool fits in withand extends your existing BI platform. I will not go into details aboutinstallation and other too technical aspects, but rather look at thetool’s capabilities from a data point of view – how can Endeca utilizeOBIEE and visa versa, what can be done in terms of self-service, etc.”


Finally, we’re really pleased to be joined by none other than Reiner Zimmerman, who heads-up Oracle’s Data Warehouse Global Leaders’ Program. Rittman Mead are one of the European partner sponsors of the DW Global Leaders Forum, which brings together the top customers and partners working with Oracle’s data warehousing, engineered systems and big data products several times a year in EMEA, the Americas and APAC.  Reiner’s also the person most likely to take the “last man standing” award from our own Borkur and Ragnar, so before that happens, over to Reiner:

“The DW & Big Data Global Leaders program is an Oracle development driven strategic customer program to establish a platform for Oracle DW and Big Data customers to discuss best practices and experience with Oracle Product Management and Product Development and our Associate Partners like Rittman Mead.

Our current focus is Big Data and Advanced Analytics and we seek to create best practices around Big Data architectures in terms of Manageability, High Availability and Monitoring Big Data Systems. Learn what the program is, what is can bring to you, how you can participate and what other customers are doing.”

The Rittman Mead Brighton BI Forum 2014 runs next week (May 7th-9th 2014) at the Hotel Seattle, Brighton, and there’s still a few places left if you register now. Straight-after, we’re going over to Atlanta to do it all again at the Renaissance Midtown Atlanta Hotel, with full details of the event agendas here, and the event homepage including registration instructions, here. Hopefully see some of you in Brighton or Atlanta later in May!

Categories: BI & Warehousing

Extended Visualisation of OBIEE Performance Data with Grafana

Rittman Mead Consulting - Tue, 2014-04-29 12:00

Recently I wrote about the new obi-metrics-agent tool and how it enables easy collection of DMS data from OBIEE into whisper, the time-series based database behind graphite. In this post I’m going to show two things that take this idea further:

  1. How easy it is to add other data into Graphite
  2. How to install and use Grafana, a most excellent replacement for the graphite front-end.
Collecting data in Graphite

One of the questions I have been asked about using Graphite for collecting and rendering OBIEE DMS metrics is a very valid one : given that OBIEE is a data visualisation tool, and that it usually sits alongside a database, where is the value in introducing another tool that apparently duplicates both data storage and visualisation.

My answer is that it is horses for courses. Graphite has a fairly narrow use-case but what it does it does superbly. It lets you throw any data values at it (as we’re about to see) over time, and rapidly graph these out alongside any other metric in the same time frame.

You could do this with OBIEE and a traditional RDBMS, but you’d need to design the database table, write a load script, handle duplicates, handle date-time arithmetic, build and RPD, build graphs – and even then, you wouldn’t have some of the advanced flexibility that I am going to demonstrate with Grafana below.

Storing nqquery.log response times in Graphite

As part of my Rittman Mead BI Forum presentation “No Silver Bullets – OBIEE Performance in the Real World”, I have been doing a lot of work examining some of the internal metrics that OBIEE exposes through DMS and how these correlate with the timings that are recorded in the BI Server log, nqquery.log, for example:

[2014-04-21T22:36:36.000+01:00] [OracleBIServerComponent] [TRACE:2] [USER-33] [] [ecid: 11d1def534ea1be0:6faf73dc:14586304e07:-8000-00000000000006ca,0:1:9:6:102] [tid: e4c53700] [requestid: c44b002c] [sessionid: c44b0000] [username: weblogic] -------------------- 
Logical Query Summary Stats: Elapsed time 5, Response time 2, Compilation time 0 (seconds) [[

Now, flicking back and forth between the query log is tedious with a single-user system, and as soon as you have multiple reports running it is pretty much impossible to track the timings from the log with data points in DMS. The astute of you at this point will be wondering about Usage Tracking data, but for reasons that you can find out if you attend the Rittman Mead BI Forum I am deliberately using nqquery.log instead.

Getting data in to Graphite is ridiculously easy. Simply chuck a metric name, value, and timestamp, at the Graphite data collector Carbon, and that’s it. You can use whatever method you want for sending it, here I am just using the Linux commandline tool NetCat (nc):

echo " 3 `date +%s`"|nc localhost 2003

This will log the value of 3 for a metric for the current timestamp (date +%s). Timestamps are in Unix Time, which is the number of seconds since 1st Jan 1970. You can specify historical values for your metric too:

echo " 3 1386806400"|nc localhost 2003

Looking in Graphite we can see the handle of test values I just sent through appear:

Tip: if you don’t see your data coming through, check out the logs in ~/graphite/storage/log/carbon-cache/carbon-cache-a/ (assuming you have Graphite installed in ~/graphite)

So, we know what data we want (nqquery.log timings), and how to get data into Graphite (send the data value to Carbon via nc). How do we bring the two together? We do this in the same way that many Linux things work, and that it using pipes to join different commands together, each doing one thing and one thing well. The above example demonstrates this – the output from echo is redirected to nc.

To extract the data I want from nqquery.log I am using grep to isolate the lines of data that I want, and then gawk to parse the relevant data value out of each line. The output from gawk is then piped to nc just like above. The resulting command looks pretty grim, but is mostly a result of the timestamp conversion into Unix time:

grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$23);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.elapsed",$23,e}'|nc localhost 2003

An example of the output of the above is:

nqquery.logical.response 29 1395766983
nqquery.logical.response 22 1395766983
nqquery.logical.response 22 1395766983
nqquery.logical.response 24 1395766984
nqquery.logical.response 86 1395767047
nqquery.logical.response 10 1395767233
nqquery.logical.response 9 1395767233

which we can then send straight to Carbon.

I’ve created additional versions for other available metrics, which in total gives us:

# This will parse nqquery.log and send the following metrics to Graphite/Carbon, running on localhost port 2003
#       nqquery.logical.compilation
#       nqquery.logical.elapsed
#       nqquery.logical.response
#       nqquery.logical.rows_returned_to_client
#       nqquery.physical.bytes
#       nqquery.physical.response
#       nqquery.physical.rows
# NB it parses the whole file each time and sends all values to carbon. 
# Carbon will ignore duplicates, but if you're working with high volumes 
# it would be prudent to ensure the nqquery.log file is rotated 
# appropriately.

grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$23);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.elapsed",$23,e}'|nc localhost 2003
grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$26);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.response",$26,e}'|nc localhost 2003
grep Elapsed nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.compilation",$29,e}'|nc localhost 2003
grep "Physical query response time" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.response",$(NF-4),e}'|nc localhost 2003
grep "Rows returned to Client" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.logical.rows_returned_to_client",$(NF-1),e}'|nc localhost 2003
grep "retrieved from database" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);sub(/\,/,"",$(NF-9));split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.rows",$(NF-9),e}'|nc localhost 2003
grep "retrieved from database" nqquery.log |gawk '{sub(/\[/,"",$1);sub(/\]/,"",$1);split($1,d,"-");split(d[3],x,"T");split(x[2],t,":");split(t[3],tt,".");e=mktime(d[1] " " d[2] " " x[1] " " t[1] " " t[2] " " tt[1]);print "nqquery.physical.bytes",$(NF-7),e}'|nc localhost 2003

Now I run this script, it scrapes the data out of nqquery.log and sends it to Carbon, from where I can render it in Graphite:

or even better, Grafana:


Grafana is an replacement for the default Graphite front-end, written by Torkel Ödegaard and available through the very active github repository.

It’s a great way to very rapidly develop and explore dashbaords of data sourced from Graphite. It’s easy to install too. Using SampleApp as an example, setup per the obi-metrics-agent example, do the following:

# Create a folder for Grafana
mkdir /home/oracle/grafana
cd /home/oracle/grafana
# Download the zip from
# Unzip it and rearrange the files
mv grafana-1.5.3/* .
# Create & update the config file
cp config.sample.js config.js
sed -i -e 's/8080/80/g' config.js
# Add grafana to apache config
sudo sed -i'.bak' -e '/Alias \/content/i Alias \/grafana \/home\/oracle\/grafana' /etc/httpd/conf.d/graphite-vhost.conf 
sudo service httpd restart 

# Download ElasticSearch from
cd /home/oracle/grafana
# Run elasticsearch
nohup /home/oracle/grafana/elasticsearch-1.1.1/bin/elasticsearch &
# NB if you get an out of memory error, it could be a problem with the JDK available. Try installing java-1.6.0-openjdk.x86_64 and adding it to the path.

At this point you should be able to go to the URL on your sampleapp machine http://localhost/grafana/ and see the Grafana homepage.

One of the reasons I like working with Grafana so much is how easy it is to create very smart, interactive dashboards. Here’s a simple walkthrough.

  1. Click on the Open icon and then New to create a new dashboard
  2. On the new dashboard, click Add a panel to this row, set the Panel Type to Graphite, click on Add Panel and then Close.

  3. Click on the title of the new graph and select Edit from the menu that appears. In the edit screen click on Add query and from the select metric dropdown list define the metric that you want to display

    From here you can add additional metrics to the graph, or add graphite functions to the existing metric. I described the use of functions in my previous post about OBIEE and Graphite
  4. Click on Back to Dashboard at the top of the screen, to see your new graph in place. You can add rows to the dashboard, resize graphs, and add new ones. One of the really nice things you can do with Grafana is drag to zoom a graph, updating the time window shown for the whole page:

    You can set dashboards to autorefresh too, from the time menu at the top of the screen, from where you can also select pre-defined windows.
  5. When it comes to interacting with the data being able to click on a legend entry to temporarily ‘mute’ that metric is really handy.

This really is just scratching the surface of what Grafana can do. You can see more at the website, and a smart YouTube video.


Here I’ve shown how we can easily put additional, arbitrary data into Graphite’s datastore, called Whisper. In this instance it was nqquery.log data that I wanted to correlate with OBIEE’s DMS data, but I’ve also used it very successfully in the past to overlay the number of executing JMeter load test users with other data in graphite.

I’ve also demonstrated Grafana, a tool I strongly recommend if you do any work with Graphite. As a front-end it is an excellent replacement for the default Graphite web front end, and it’s very easy to use too.

Categories: BI & Warehousing