Skip navigation.

Rittman Mead Consulting

Syndicate content
Delivering Oracle Business Intelligence
Updated: 17 hours 8 min ago

Rittman Mead at ODTUG KScope’14, Seattle

Fri, 2014-06-20 03:42


Next week is ODTUG KScope’14 in Seattle, USA, and Rittman Mead will be delivering a number of presentations over the week. Coming over from Europe we have Jérôme Françoisse and myself, and we’ll be joined by Michael Rainey, Daniel Adams, Charles Elliott from Rittman Mead in the US. We’re also very pleased to be co-presenting with Nicholas Hurt from IFPI who some of you might know from the Brighton BI Forum this year, who’s talking with Michael Rainey about their Oracle data warehouse upgrade story.

Here’s details of the Rittman Mead sessions and speakers during the week:

  • Daniel Adams : “Hands-On Training: OBIEE Data Visualization: The “How” and the “Why”?” – Monday June 23rd 2014 1.15pm – 3.30pm, Room 2A/2B
  • Jérôme Françoisse : “Oracle Data Integrator 12c New features” – Monday June 23rd 1.15pm – 2.15pm, Room 616/615
  • Mark Rittman : “Deploying OBIEE in the Cloud: Options and Deployment Scenarios” – Monday June 23rd 3.45pm – 4.45pm, Room 615/616
  • Mark Rittman : “OBIEE, Hadoop and Big Data Analysis” – Tuesday June 24th 11.15am – 12.15pm, Room 602/603/604
  • Michael Rainey and Nicholas Hurt (IFPI) : “Real-Time Data Warehouse Upgrade: Success Stories” – Tuesday June 24th 2014 2pm – 3pm, Room 614
  • Charles Elliott : “OBIEE and Essbase – The Circle of Life” – Wednesday June 25th 11.15am – 12.15pm
  • Mark Rittman : “TimesTen as your OBIEE Analyic “Sandbox” – Wednesday June 25th 3.15pm – 4.15pm, Room 615/616

We’ll also be around the event and Seattle all week, so if you’ve got any questions you’d like answered, or would like to talk to us about how we could help you with an Oracle BI, data integration, data warehousing or big data implementation, stop one of us for a chat or drop me a line at

Categories: BI & Warehousing

SampleApp v406 – Automatic startup of OBIEE

Tue, 2014-06-17 00:20

Last week Oracle released v406 of SampleApp. SampleApp is a one-stop-shop for a demonstration of pretty much any conceivable thing that OBIEE is capable of, and then some more on top of it. It is a VirtualBox appliance with everything on the one machine (Database, OBIEE, Endeca, TimesTen, Essbase, etc), and demonstrates basic analysis building techniques through to dashboarding, Mobile App Designer, analysis visualisations with D3, ADF, JavaScript, backend hackery with undocumented NQS calls (hi Christian!), and much, much more.

So, SampleApp is awesome, but … there’s no such thing as absolute perfection ;-) One of the things that is still missing from it is the automatic start/stop of OBIEE when you bootup/shutdown the machine respectively. Setting it up is easy to do, as I demonstrate below. I’ve also put my cheatsheet for whipping SampleApp into line for my day-to-day use, focussed on the commandline and automation (because respectively that’s where I spend most of my time on a server and because I’m lazy).

The OBIEE init.d service script that I demonstrate here is available for use on any Linux installation of OBIEE. For more information, see the Rittman Mead public scripts github repository here:

Before we get started I’m assuming here that you’ve:

  • downloaded the 28GB worth of zip files
  • Unpacked them using 7zip
  • Found 70+GB of disc space free and imported the OVF into VirtualBox
  • Started up the VM

There’s full instructions in the SampleApp_QuickDeploymentGuide–406.pdf, available in the SampleApp v406 Documentation download from the same page as the SampleApp image itself.

So to make OBIEE start automatically, the first thing we need to do is make sure that the database (where the RCU schemas are) is doing the same, by setting it up as a service (init.d). This is based on this article if you want more details about quite how it works, but for know you just need to copy and paste this whole code block into the command prompt on SampleApp to create the necessary files. If you can’t copy & paste between your host and the Virtualbox guest, just go to this blog from Firefox within the SampleApp VM itself.
(Or if you know what you’re doing, SSH onto the server and paste the text into an SSH client on your host machine.)

Copy and paste this whole code block into the command prompt:

# Create startup/shutdown script files
mkdir -p /home/oracle/scripts
chown oracle.oinstall /home/oracle/scripts

# Start Listener
lsnrctl start

# Start Database
sqlplus / as sysdba << EOF


# Stop Database
sqlplus / as sysdba << EOF

# Stop Listener
lsnrctl stop

# Make them executable
chmod u+x /home/oracle/scripts/ /home/oracle/scripts/
chown oracle.oinstall /home/oracle/scripts/ /home/oracle/scripts/

# Create service script
# chkconfig: 345 90 25
# description: Oracle auto start-stop script.
# Set ORA_OWNER to the user id of the owner of the 
# Oracle database software.


case "\$1" in
        # Start the Oracle databases:
        # The following command assumes that the oracle login 
        # will not prompt the user for any values
        su - \$ORA_OWNER -c "/home/oracle/scripts/ >> /home/oracle/scripts/startup_shutdown.log 2>&1"
        touch /var/lock/subsys/dbora
        # Stop the Oracle databases:
        # The following command assumes that the oracle login 
        # will not prompt the user for any values
        su - \$ORA_OWNER -c "/home/oracle/scripts/ >> /home/oracle/scripts/startup_shutdown.log 2>&1"
        rm -f /var/lock/subsys/dbora
sudo mv /tmp/dbora /etc/init.d/dbora
sudo chown root. /etc/init.d/dbora

# Make the service script executable
sudo chmod 750 /etc/init.d/dbora

# Associate the dbora service with the appropriate run levels and set it to auto-start using the following command.
sudo chkconfig --add dbora

On SampleApp v406 there is an Oracle 12c container database (CDB), with two “pluggable” databases (PDB) within it. Assuming you’ve not started the database yet, trying to connect to one of the PDBs for the RCU schema will fail:

[oracle@demo ~]$ sqlplus BIEE_BIPLATFORM/Oracle123@localhost:1521/pdborcl

SQL*Plus: Release Production on Tue Jun 17 03:03:51 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

ORA-12541: TNS:no listener

Now run the service start (which will happen automatically at boot)

sudo service dbora start

And check the status again:

[oracle@demo ~]$ sqlplus BIEE_BIPLATFORM/Oracle123@localhost:1521/pdborcl

SQL*Plus: Release Production on Tue Jun 17 03:06:12 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Tue Jun 17 2014 03:02:09 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options


Now we can set up OBIEE as a service that can start automatically at boot time. This is using a script that I wrote and is shared on the Rittman Mead public scripts github repository here: To install it on SampleApp v406 needs a couple of changes for the environment paths and dependencies etc, which I’ve incorporated in this code block. Again, copy and paste the whole thing into a command prompt on SampleApp.

# lsof is used in the script and isn't installed by default, so let's install it: 
sudo yum -y install lsof

# Now fetch the init.d script itself
sudo wget --no-check-certificate -O /etc/init.d/obiee

# Update the FMW_HOME path in the script
# If you're doing this manually, you just neeed to change the line 
# "FMW_HOME=" and put in the FMW_HOME path for your installation. 
# In the case of SampleApp v406 it is /app/oracle/biee
sudo sed -i -e 's/FMW_HOME=.*$/FMW_HOME=\/app\/oracle\/biee/g' /etc/init.d/obiee

# Make the script executable
sudo chmod 750 /etc/init.d/obiee

You should now be able to run the following status command to see if OBIEE is running or not:

sudo service obiee status

and if it’s not running, start it up:

sudo service obiee start

To shut it down:

sudo service obiee stop

You can read more about the script here.

There’s one more step to run to make OBIEE start automatically at bootup:

sudo chkconfig --add obiee

Now that the service scripts are in place, restart the machine and check they work:

sudo shutdown -r now

When the VM restarts, it may appear to “hang” on boot – the progress bar will show and “Oracle Linux Server 6.5”, but no desktop. That’s because the services are starting up, and you can switch to the console view to see this. On my Mac I press Shift-Function-Backspace to do this, it may vary on other host operating systems (try ALT + d on Windows):

Once the desktop appears you should be able to launch Firefox and go straight to OBIEE, up and running

There are some additional tweaks I usually make to any new server I work with:

  1. Install screen (don’t know what screen is? Read this!):
    sudo yum -y install screen

    Note I’m using sudo which this version of SampleApp thankfully has configured for the oracle user – previous versions didn’t IIRC.
  2. Configure screen with a nice statusbar:

    cat > ~/.screenrc <

    I’m using the bash “here document” functionality here to embed the contents of a document in a command statement. It means I can cut and paste it from my clipboard (if you’re on a Mac, you really should check out Alfred, which has a clipboard history manager, so I always have this screenrc and many other snippets available to paste within just a couple of key presses). Cut and paste a single command is easier (and thus less error-prone) than explaining what content to edit into which text file with which text editor.

  3. Set up SSH keys. I wrote about SSH keys previously on this blog (in fact, all of the things on this checklist are covered there). Put simply, it makes logging in much faster and removes the problem of fat-fingering the password:

    As with the screen configuration in the previous step, I use a snippet of code from my clipboard, pasted into the shell of any new server I’m working on:

    # setup ssh key
    mkdir ~/.ssh
    cat >> ~/.ssh/authorized_keys <<EOF
    ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDMvNjJ8y1a3+NR5OEZHjAd6HhPFdkOI4mDiIIegbRDG+ABhTQmE80DuDe0vp4tex7qefS1u5FG3Xf+hmKDBzKFyegpC+9oEmrtIPkuW7b1Uf/f7Dr6EjXbyA6x1LGEu3THCrGVfw5Mkakqzr43ZfGEFf8k0RiMroHQbp3vOcR+5y0Q132Tbrr0GPd0RPuj4cVOC7QQ6jXSs7TZWepzrcqEpB4M1+N3xS/jEQ5aiCY2FTDOUVj6Y4c0Ogg93bLos6JltPKznq08KXy0ZW/rWDindAmeL0ZAMuU12Qv1ehQZJsxjUwq3jz4yNMgXs5ba3nSvMAr7ZVRzbK5nLl7+eAN3 zaphod@beeblebrox
    chmod -R 700 ~/.ssh

    As a side node, that’s my public SSH key there. One’s public ssh key is just that – public. If you want to use it, go ahead, it just means I’ll be able to login to your server. That’s because I have the other half of the key-pair; the private key, and that is the one that should not be shared, because with it you can access any server that has been configured with the public key. Private SSH keys should be treated just as your most important passwords (and indeed, you should use a passphase when creating your private SSH key, to add a layer of security to it, so that it can only be used if the passphrase is known).

The next step is Installing obi-metrics-agent, Graphite, and collectl, but that’s a matter for another blog post :-)

Categories: BI & Warehousing

Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture Pt2. – Delivering the Data Factory

Mon, 2014-06-16 04:00

In my previous post on our updated Oracle Information Management Reference Architecture, jointly-developed with Oracle’s Enterprise Architecture team, we went through a conceptual and logical view of the information architecture, introducing new concepts like the Raw Data Reservoir, the Data Factory and the Discovery Lab. I said that the Data Factory, Data Reservoir, Enterprise Information Store and Reporting, together with the Discovery Lab, comprised the Information Platform, which could be used to create data applications (raw data plus the data factory), or more wide-ranging Information Solutions, with the Discovery Lab often acting as the place where new ideas and models were prototyped, and algorithms developed that could be productionized in the Execution area of the overall architecture.


And this is typically where most information architectures leave you; with a good starting point and an overall schematic, but without any real guidance on how to deliver it. For example, how do you develop small pieces of functionality that may often be throwaway, and how do you take on larger projects? Is it possible to develop this type of architecture in an agile way, and if so, can the data reservoir part be delivered agile too? In the collaboration with Oracle’s Enterprise Architecture team this is where we made our major contribution, taking the ideas behind our ExtremeBI agile development approach and using them to deliver this updated Information Management architecture. Let’s go back a few steps though and think about why an agile, or at least a different, way of developing BI content is needed, and what part of the overall architecture are we developing for?

Within this overall architecture, the bit that developers such as ourselves typically add value for is the “data factory”; the part of the architecture that acts as the conduit, or interface, between the schema-on-read raw data reservoir and the schema-on-write enterprise information store.


There’s other code to write too; interface and streaming code to get data into the event engine, for example, and more traditional ETL code to load data in from database and application stores. But what we’re most concerned about is the data models and reports that users request in their user stories and epics. If you take a look back at the logical view of our new information architecture, you can see that information can come in at any level in the architecture – raw data reservoir, foundation or access and performance, with different pros and cons for landing and accessing data at each layer, and the ability to take data from one layer and use it to load layers higher up in the layer hierarchy.


And this is where our agile development methodology, “ExtremeBI”, comes in. ExtremeBI, for structured relational sources, typically uses Oracle GoldenGate to trickle-feed data of interest directly into the foundation layer of this information management data architecture, with developers then using that layer plus the access and performance layer, if needed, to close-off user stories as quickly as possible, as shown in the diagram below.


In some cases it’s possible to model against just the foundation layer in OBIEE, where that foundation layer contains all the metadata and history columns that you need. One of the main parts of our ExtremeBI methodology is a technical process to make this modelling possible, and more importantly allow the underlying physical model to evolve in response to user stories about performance and analysis needs, without breaking existing reports or creating an unmanageable system.


We’ll be documenting this process as part of the series of white papers we’ll be co-authoring with Oracle, and we also bring utilities and accelerators to ExtremeBI customer projects, along with continuous integration and regression testing tools, to help get usable software in the hands of end-users as soon as possible.

But what about the parts of the data factory that source data from the raw data store? Can we use similar techniques, and tools like ODI and OBIEE, to deliver working software to users within a short time span? Well in practice it’s not quite so easy; typically, getting information out of schema-on-read database involves writing code, and its easy to fall back into bad habits where everything is scripted, code is brittle and fragile, and no-one feels confident in evolving the model over time.

We think though it’s possible to use ExtremeBI techniques with schema-on-read sources thought, when you combine them with tools like Flume or GoldenGate for trickle-feed extraction and loading, and new capabilities in ODI and OBIEE where Hadoop data can be accessed via Hive. Flume is the industry-standard mechanism for transporting log entries from source to the Hadoop cluster, and GoldenGate has the ability through its Java API to land data in HDFS or Hive, when the source is a relational database (see MOS Doc.IDs 1586188.1 (Hive) and 1586210.1 (HDFS) for example code).

Hive, though slow for ad-hoc query access against Hadoop data, has a range of Serializer/Deserializer utilities that can translate semi-structured and NoSQL sources into more regular columns and tables, with Cloudera Impala removing the response-time issue and rapidly matching Hive in terms of plugins and SerDes. ODI, as I demonstrated in a five-part series on the blog last week, has a number of Hadoop-native knowledge modules and also leverages Oracle Loader for Hadoop to move data out of the raw data reservoir and into the foundation and access + performance layers.

The important thing to understand though when working with data in the updated Information Management architecture, and reporting against schema-on-read and schema-on-write sources, is that the most effective way to deliver real value for end-users is to have two phases to your development work;


  • A discovery phase, typically done in the Discovery Lab sandbox or Rapid Development Sandboxes, that focuses on understanding the data and helping the users see where the value is, with less emphasis on governance and corporate standards, and
  • A commercial exploitation phase, when our understanding of the data has significantly increased and we start to add structure to the data, and where it’s an appropriate time to add governance and address data quality issues. Typically, this would be done in the Execution part of our architecture, taking our prototype design and adding it to the data factory as one of its interface routines.

We’ll be developing these ideas out further over the summer, with some white papers and a joint presentation at Openworld 2014 to formally launch it. In the meantime, keep an eye on the blog as we work through some of the key concepts.


Categories: BI & Warehousing

Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture Pt1. – Information Architecture and the “Data Factory”

Fri, 2014-06-13 05:39

One of the things at Rittman Mead that we’re really interested in, is the architecture of “information management” systems and how these change over time as thinking, and product capabilities, evolve. In fact we often collaborate with the Enterprise Architecture team within Oracle, giving input into the architecture designs they come up with, and more recently working on a full-blown collaboration with them to come up with a next-generation Information Management architecture. I these two posts I wanted to share some of our recent thinking in this area, looking first at our new proposed architecture, and then in the second post talking about how we’d use agile development methods, in-particular our “ExtremeBI” development approach, to deliver it.

But first, some history. Back in 2009 I blogged about a first-generation DW reference architecture which introduced a couple of new concepts, based on new capabilities from tools such as OBIEE plus some thinking we, and the Enterprise Architecture team at Oracle, had been doing over the years. This reference architecture introduced the concept of “Foundation” and “Access and Performance” layers, and recognised the reality that Kimball-type star schemas were great for querying but not so good for long-term, query-neutral storage of data, whilst Inmon-style EDW models were great as a long-term, process-neutral system of record, but not so good for running user queries on. This new architecture included both of these design approaches, with the foundation layer forming the “information management” layer and the access and performance layer being the “information access” layer. Most importantly, tools like OBIEE made it possible for enterprises to create metadata layers that potentially accessed all layers in this model, so users could query the foundation layer if needed as well as the access and performance layer, if the foundation layer was a better source of data for a particular reports.


A second revision to this model, a couple of years later, expanded on the original one and introduced another two interesting concepts, brought upon by the introduction of tools like Endeca Information Discovery, and the rise of unstructured and semi-structured data sources. This new architecture added unstructured and semi-structured sources into the model, and also introduced the concept of “sandboxes”, areas of the information management model that allowed more free-form, exploratory BI applications to be built.


But in-practice, this idea of “unstructured” and “semi-structured” sources wasn’t all that helpful. What really started to make an impact in the past couple of years is the increasing use of “schema-on-read” databases, where we trade-off the performance and defined structure of traditional relational 3NF and star schemas for the flexibility and “time-to-value” provided by key-value store databases. The Endeca Server is a good example of these types of database, where the Endeca Server allows rapid storage of loosely-associated datasets and tools like Endeca Studio then apply a structure to the data, at the time of analysis. Schema-on-read databases are great for fast, flexible access to datasets, but the cost of ETL is then borne by each system that accesses the data.

Probably the most well-known examples of schema-on-read sources though are Hadoop, and NoSQL databases. Coupled with their ability to store lots of detail-level data at relatively low cost, Hadoop and NoSQL databases have significantly affected the overall landscape for BI, data warehousing and business analytics, and we thought it was about time for a new reference architecture that fully-incorporated the capabilities and latest thinking around this area. Back at the start of 2014 myself, Jon Mead and Stewart Bryson met up with Oracle’s Andrew Bond in his team for a series of workshops, and what came out of it was an updated Information Management Architecture *and* a development methodology for delivering it. Let’s start off then by looking at this updated architecture from a conceptual view.


At a conceptual level, we build on this idea of sandbox environment and formally separate things out into the Execution area – business-as-usual, production and development areas – and an Innovation area, where we build on the idea of a sandbox and rename it the “Discovery lab”. The Discovery lab is where, for want of a better word, the “data scientists” work, with fewer constraints on development and whose inputs are events and data, and outputs are the discovery output that can be the prototype and inspiration for designs going into the execution area.

The main “engine” of the Execution area is our enterprise store of data, this time broken down into four areas:

  • A “data reservoir” where we store all incoming events and data at detail-level, typically on HDFS. This blog article by Oracle’s Jean-Pierre Dijcks sets out the concept of a data reservoir well, and I like this blog by Scaleabilities’ Jeff Needham where he makes the case for calling it a “data reservoir” that can ingest, process and analyse data rather than a “data lake”, which implies a passive store.
  • An Enterprise Data Store, analogous to the enterprise data warehouses we use today, and a reporting component, typically in our case OBIEE
  • Most importantly, the new concept of a “data factory”, a conduit between the data reservoir and the enterprise information store

Together, the execution and innovation layers form our “information platform”, with the event engine feeding real-time events into the platform and outputting them into the data reservoir, and traditional ETL routines loading structured data from the enterprise into the enterprise information store.

This conceptual architecture then permits several types of information application. For example, the data reservoir and the data factory together could support what we call “data applications”, applications working on semi-structured, large and low-granularity data sets such as those used for genomic analysis.


Other applications might be more traditional BI and data warehousing applications, but with the addition of data from the data reservoir and the analysis capabilities of Hadoop.


The discovery lab can be a standalone area, or the insights and discovery it outputs can be used as inputs into the main information platform. More event-based data will typically come in via the event engine, with its output going into the data reservoir and supporting “next-best-decision” applications like Oracle Real-Time Decisions.

Another way of looking at this architecture is from a logical perspective, in particular focusing on the data layers and access/loading processes to load them. The diagram below is our latest version of the two diagrams at the start of this article, and as you can see we’ve kept the data sources and BI element much the same, and kept the concept of the sandbox, in this case refined as the “discovery lab sandbox”.


What is different this time though is the middle bit; we’ve lost the staging area and replaced it with the raw data reservoir, added a “Rapid Development Sandbox”, and drawn the main layers as a slanted set of stacked areas. So why?

What we’re trying to show with the slanted data layers is the relative cost of data ingestion (loading), and the relative cost of accessing it (information interpretation). For the raw data reservoir, for example, there’s little cost in ingesting the data – maybe copy some files to HDFS, or use Flume or GoldenGate to capture log or transaction data to HDFS or Hive, but the cost is then borne in accession this typically “schema-on-read” data source. As you go up the stack, there’s a bit more work in landing data into the Foundation layer – ETL routines, GoldenGate routines, some data cleaning and constraint checking, for example – but it’s correspondingly easier to get data out. For the Access and Performance Layer there’s the most cost in getting data in, but then users have very little work to do when getting data out.

Data can move up the stack from Raw Data Reservoir to Foundation, or directly into Access and Performance, or it could be landed at levels above Raw Data Reservoir, for example in our ExtremeBI approach where we use GoldenGate to replicate source system tables directly into Foundation without going through a staging layer. The Rapid Development Sandboxes are there to support agile, iterative development, with the output from them either being the result in itself, or their designs and insights being used to create more formal projects and data structures.

From a more product-centric perspective, you can overlay these types of diagrams with specific schematics for example enterprises. For example, in the diagram below you can see Oracle NoSQL database being see with HDFS and the Oracle Big Data Connectors to capture and store events from Complex Event Processing, and then outputs from CEP being also fed into a more traditional, “high density data” store as well as directly into a decision engine.


So this all sounds great, but how do you build it? Do we have to use the (discredited) step-by-step, waterfall method to build this type of architecture, and in particular the key “data factory” element that provides connectivity between the Raw Data Reservoir and the Enterprise Information Store? And can we apply agile methods to big data sources, as well as regular databases and applications? Check back on Monday for our thoughts on how this should be done.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.5 : Bulk Unload to Oracle

Fri, 2014-06-13 04:00

All week I’ve been looking at what’s involved in moving data around Hadoop on the Oracle Big Data Appliance, using ODI12c to orchestrate the end-to-end process. Using web log data from the Rittman Mead website, I first landed the log data on HDFS using Apache Flume, then transformed the data in several stages using Hive transformations built using ODI mappings and knowledge modules. The links below give the previous articles in this series, where you can read how I transformed data using Hive, Sqoop and Python streaming:

At the end of all these ODI12c transformations, I’m left with a Hive table containing summarised web page accesses, enhanced with reference data on the page title and author, and with the entries geocoded with the name of the country associated with the page access. In many cases, this is where I’d leave the data, but quite often it’d be useful to then export the data out of Hadoop and into an Oracle database, so that I could load it into a data warehouse, or analyse it using a richer SQL dialect than I can get with HiveQL. I could use Sqoop to get the data out of Hive and into Oracle, but for larger exports in particular I’d get better unload performance by using Oracle Loader for Hadoop, one of the Oracle Big Data Connectors that typically come with Oracle Big Data Appliance.

There’s actually two Oracle Big Data Connectors that deal with getting data out of Hadoop and into Oracle; Oracle Direct Connector for HDFS (ODCH) gives you the ability to define an Oracle Database External Table over a HDFS file or Hive table, and is useful if you just want to access the data in Hadoop without actually loading it into an Oracle Database. 


The way ODCH works is that it adds a special type of adapter to the external table feature in Oracle Database, that gives the database the ability to stream HDFS files in a similar way to how external tables stream regular file data into the database. This makes ODCH very convenient if you just want access to HDFS or Hive data from an Oracle database, but it suffers from the same performance penalty as regular external flat file access, in that you can’t index the table and so forth; so if you want high-performance access to data held in a Hadoop system, and you need this access more than just occasionally, you’re better off copying the data right into the Oracle database, and this is where Oracle Loader for Hadoop comes in.

Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that uses MapReduce technology to sort, partition and organise the data extraction at the Hadoop end, and then fast-unload the data into the Oracle Database typically much faster than a Sqoop data transfer.


OLH is, however, a bit fiddly to setup, so what’s useful is that ODI has a knowledge module, IKM File/Hive to Oracle, that can set up both OLH and ODCH jobs. This knowledge module does have its own setup process that itself is a bit involved – mostly around environment variables on your Linux system – but once its done, each OLH data transfer is then very easy to setup, and as long as you don’t then go on to do any more transformation on the Oracle data post-unload, it’s covered by the ODI limited-use license you get when you buy it in conjunction with the BDA.

So the key things to get OLH working with ODI are firstly, to install OLH on the BDA node that’s running the ODI12c agent (or in my case, full ODI Studio), and then set up the Oracle user’s environment variables so OLH works, and ODI knows where to find any additional Hadoop / Hive libraries and configuration files to work with OLH. In my case, this meant adding the following entries to my .bashrc file:

export HIVE_HOME=/usr/lib/hive
export HADOOP_CLASSPATH=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/*:/etc/hive/conf:$HIVE_HOME/lib/hive-metastore-0.12.0-cdh5.0.1.jar:$HIVE_HOME/lib/libthrift.jar:$HIVE_HOME/lib/libfb303-0.9.0.jar:$HIVE_HOME/lib/hive-common-0.12.0-cdh5.0.1.jar:$HIVE_HOME/lib/hive-exec-0.12.0-cdh5.0.1.jar
export OLH_HOME=/home/oracle/oracle/product/oraloader-3.0.0-h2
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/java/jdk1.7.0_60
export ODI_HIVE_SESSION_JARS=/usr/lib/hive/lib/hive-contrib.jar
export ODI_OLH_JARS=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/ojdbc6.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-utility.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-mapping.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-collation.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/oraclepki.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/osdt_cert.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/osdt_core.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/commons-math-2.2.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/jackson-core-asl-1.8.8.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/jackson-mapper-asl-1.8.8.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/avro-1.7.3.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/avro-mapred-1.7.3-hadoop2.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/oraloader.jar,/usr/lib/hive/lib/hive-metastore.jar,/usr/lib/hive/lib/libthrift-0.9.0.cloudera.2.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/hive-common-0.12.0-cdh5.0.1.jar,/usr/lib/hive/lib/hive-exec.jar

where OLH was installed to /home/oracle/oracle/product/oraloader-3.0.0-h2, Hadoop was installed to /usr/lib/hadoop and Hive to /usr/lib/hive. ODI_HIVE_SESSION_JARS is optional, but if you do anything fancy with SerDes in your mappings you’re best adding this entry in, to the hive-contrib.jar file (or technically the symbolic link to it). Getting this right was the hardest part about setting this KM up, but once it’s done, it’s done.

Then, to set the ODI parts up, I first create a table in my Oracle database with matching datatypes for the Hive table i’m going to unload; varchar2 for string, number for int, etc, and then create entries in the ODI topology and Designer navigator. Then, I create a new mapping in ODI12c, where I copy the columns from my Hive table into the Oracle table, like this (note that I could map from an HDFS file instead, if I wanted to):


Now it’s a case of configuring the Physical part of the mapping. There’s two steps to this; first, you have to change the default LKM setting for the access table in the physical mapping, the ACCESS_F object in the TARGET_GROUP below:


By default this sets itself to LKM SQL to Oracle (as you’ve got an Oracle target), but we need to change this to a special one, LKM SQL Multi-Connect (below), that delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect IKM such as IKM File/Hive to Oracle. 



Then, and only once you’ve selected this LKM, you can select IKM File/Hive to Oracle for the target table KM.


There’s not much you need to enable with this KM to at least get started. Set OLH_OUTPUT_MODE to JDBC (you can enable OCI later on instead, for potentially faster unloading) and set MAPRED_OUTPUT_BASE_DIR to an HDFS directory that the agent’s OS user (typically “oracle”) can read and write to – in my case, I created one specifically for this KM, using Hue’s File System browser tool. Then, if all’s set up correctly you should be able to execute the mapping and view the results in the Operator navigator – note that in my case, it took several attempts before I finally got all the environment variable settings right. Once done, you should then be able to view the Hive table output in your Oracle table, using SQL Developer.


So now, we’ve got each step for our end-to-end process. The only thing left to do is to pull all of these steps, including the ODI Procedure to run the sqoop transfer, into an ODI package or load plan, so that we can run the steps in sequence:


Then, finally, I can run the whole process as a single execution, moving data from Flume capture through Hadoop ETL steps, to the unload into an Oracle table, all from within ODI12c.


So there you have it – Hadoop-native ETL processing, with the metadata-driven design, error handling and heterogenous connectivity provided by ODI12c.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.4 : Transforming Data using Python & Hive Streaming

Thu, 2014-06-12 03:30

This week I’m taking an in-depth look at ETL on the Oracle Big Data Appliance, using Oracle Data Integrator 12c to call the various bits of Hadoop functionality and orchestrate the whole process. So far, I’ve landed web log data into the Hadoop cluster using Flume, created a Hive table over the log data using Hive and the RegEx SerDe, then used further Hive transformations to join this data to other reference data, some of which came from an Oracle database via Sqoop. Here’s a complete listing of the posts so far, and the ones to come:

Our next transformation is a bit trickier though. The data we’re bringing in from our webserver has IP addresses recorded for each HTTP access, and what we want to do is use that IP address to identify the country that the website visitor was located in. Geocoding, as it’s called, is fairly easy to do if your data is stored in a database such as Oracle Database or you’re using a language like Python, with two steps needed to retrieve the IP address’s country:

1. First convert the IP address into a single integer, like this:

address = ''
( o1, o2, o3, o4 ) = address.split('.')
integer_ip =   ( 16777216 * o1 )
             + (    65536 * o2 )
             + (      256 * o3 ) 
           +              o4

2. Then, using a freely-downloadable database from MaxMind, look to see which range of numbers your converted IP address is in:

SELECT ip_country
FROM geoip
WHERE 2921648058 BETWEEN begin_ip_num AND end_ip_num

The only problem is that Hive doesn’t support anything other than equi-joins. In a previous blog post I got around this issue by using Impala rather than Hive, and in another one I used Pig, and a custom Python UDF, to do the geocoding outside of the main Pig data flow. I could use Pig and a Python UDF in this situation, creating an ODI Procedure to call Pig from the command-line, but instead I’ll use another ODI KM, IKM Hive Transform, to call a Python script but entirely controlled from within ODI.

To set this geocoding process up, I first create a new ODI mapping, with my current Hive table as the source, and another Hive table this time with an additional column for the IP address country, as the target. I then map what columns I can into the target, leaving just the country column without a mapping. Notice I’ve put the source table within an ODI12c dataset – this isn’t mandatory, it’s just something I decided to do at the time – it doesn’t impact on the rest of the process.


Now in a similar way to how I created the Python UDF to do the country lookup for Pig, I’ll create a variation of that script that expects columns of data as an input, outputs other columns of data, and does the geocoding within the script by calling the Python GeoIP API provided by MaxMind. Most importantly though, I’ll use it in conjunction with IKM Hive Transform’s ability to stream it’s Hive data through an arbitrary script, making use of the Hive TRANSFORM function and a feature called “hive streaming”. To keep things simple let’s first set it up outside of the KM, see how it works, and then configure the KM to use this python script as its transformation function.

Let’s start by looking at the source Hive table I’ll be starting with:

hive> describe access_per_post_categories;
hostname           string             None                
request_date       string             None                
post_id            string             None                
title              string             None                
author             string             None                
category           string             None                
Time taken: 0.112 seconds, Fetched: 6 row(s)
hive> select * from access_per_post_categories limit 5;
OK[02/Jun/2014:21:20:41 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion[02/Jun/2014:21:20:42 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion[02/Jun/2014:21:20:43 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion[02/Jun/2014:21:20:45 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion[02/Jun/2014:21:21:23 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
Time taken: 0.976 seconds, Fetched: 5 row(s)

The target table is exactly the same, except for the additional column for country:

hive> describe access_per_post_full;
hostname           string             None                
request_date       string             None                
post_id            string             None                
title              string             None                
author             string             None                
country            string             None                
category           string             None                
Time taken: 0.098 seconds, Fetched: 7 row(s)

The way IKM Hive Transform works is by streaming each incoming row of Hive data to the script registered with it, with each row’s columns tab-separated. In-turn, it expects the script to output similar rows of tab-separated values, each line terminated with a newline. My script, therefore, needs to accept an arbitrary number of incoming rows, parse each row into its constituent columns, call the MaxMind API to do the geocoding, and then output a tab-separated set of columns for each row it processes (a bit like pipelined table functions in Oracle PL/SQL).

So here’s the Python script that does this:

[root@bdanode1 tmp]# cat
import sys
import pygeoip
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')
for line in sys.stdin:
  line = line.rstrip()
  hostname,request_date,post_id,title,author,category = line.split('\t')
  country = gi.country_name_by_addr(hostname)
  print hostname+'\t'+request_date+'\t'+post_id+'\t'+title+'\t'+author+'\t'+country+'\t'+category

A few things to note about the script:

1. “Import sys” brings in the Python libraries that then allows me to read the input data, via sys.stdin.
2. I’ve previously installed the MaxMind GeoIP API, database and libraries, with the GeoIP.dat database stored alongside the script in /tmp. Import pygeoip brings in the Python library to use the API.
3. The bit that does the geocoding is “gi.country_name_by_addr”, which calls the MaxMind API. It’s this bit that replaces the need for a join with a BETWEEN clause in HiveQL.
4. To output data back to the calling Hive transformation, I just “print” it, using “\t” for tab.

One thing we’ll need to do though, similar to how we had to copy the JAR file used by Hive for the RegEx SerDe around all the nodes in the cluster, Is copy the GeoIP.dat file used by the MaxMind geocoding API to the same place on all the cluster nodes. We’ll also have to set the permissions on this file so it’s readable and writeable by the Hive/YARN process on each Hadoop node:

officeimac:~ markrittman$ ssh
The authenticity of host ' (' can't be established.
RSA key fingerprint is 1b:e7:ec:01:57:0c:09:4e:6d:19:08:a4:0d:df:00:e0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (RSA) to the list of known hosts.'s password: 
Last login: Fri Jun  6 16:55:44 2014 from
[root@bdanode2 ~]# ls /tmp/Geo* -l
-rwxrwxrwx 1 oracle oracle 687502 Jun  7 12:31 /tmp/GeoIP.dat

I also need to make sure the MaxMind geocoding Python API is installed into each node too:

python pip 
install pygeoip

Now I can run a HiveQL command and test out the script. I’ll start up a Hive shell session, and the first thing I need to do is register the script with Hive, distributing across the cluster and putting it in cache. Then I can select against the source table, using the TRANSFORM USING clause:

[oracle@bdanode1 ~]$ hive
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/06/09 20:48:01 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.12.0-cdh5.0.1.jar!/
hive> add file x-marsedit-filelocal:///tmp/;                                
Added resource: x-marsedit-filelocal:///tmp/
hive> select transform (hostname,request_date,post_id,title,author,category)
    > using ''                                              
    > as (hostname,request_date,post_id,title,author,category,country)      
    > from access_per_post_categories; 
...[02/Jun/2014:21:19:28 +0000] 14726 SmartView as the Replacement for BI Office with OBIEE Mark Rittman United States Oracle EPM[02/Jun/2014:21:18:58 +0000] 14476 Upgrading OBIEE to Robin Moffatt Israel Oracle BI Suite EE[02/Jun/2014:21:18:59 +0000] 14476 Upgrading OBIEE to Robin Moffatt Israel Oracle BI Suite EE
Time taken: 27.645 seconds, Fetched: 176 row(s)

So it looks like it’s working. Let’s move back to ODI now and reference the script within the IKM Hive Transform KM settings. Note that I enter the name and filesystem location of the script in the KM settings but I don’t use the TRANSFORM_SCRIPT option to directly key the script contents into the KM; this should work, and if you do so the KM will write the script contents out to the agent’s host filesystem at the start of the process, but the KM in this ODI12c release has an issue parsing these keyed-in scripts, so you’re best leaving this setting empty and the KM will just use the file already in the filesystem.


After running the mapping, I then go over to the Operator navigator and see that it’s run successfully. Looking at the code the KM generates, I can see it referencing our script in the HiveQL statement, with the script outputting the extra country column that’s then loaded into the target table along with the other columns.


Then finally, checking the target table, I can see that each row in my final Hive table has the country name alongside each log file entry, along with the IP address and other details.

So – now we’re at the stage where we’ve finished the processing with the Hadoop cluster, and I’d like to copy the final set of data into an Oracle database, so that I can analyse it using Oracle SQL and other tools. In the final installment in this series we’ll do just that, using IKM File/Hive to Oracle (OLH/ODCH) and Oracle Loader for Hadoop, one of the Oracle Big Data Connectors.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.3 : Enhance with Oracle Reference Data via Sqoop, and CKMs

Wed, 2014-06-11 04:02

In the first two posts in this series, I used the software on the Oracle Big Data Appliance 3.0 to ingest web log data from the Rittman Mead blog server, parse and load that data into a Hive table, and then join that table to another to add details on post and author. Links to the post in this series are below, and I’ll complete them as the series is posted this week:

In this next step, I’ve got some data sitting in an Oracle database that I’d like to use to enhance the data in the Hadoop cluster; as a reminder, the diagram below shows how I’m looking to move data through the system, and we’re currently at step number three:


The data in the Oracle tables contains categories that I can join to the page entries that have come through the ETL process, and the join is pretty simple: POST_ID = POST_ID; but how do I get access to the Oracle data – I need to register the Oracle source in the ODI topology and create a model to represent the table, but can I then just join that table to the Hive table and load the results into another Hive table? Let’s try.

The screenshot below shows the mapping that would use this approach, with two Oracle tables joining to the Hive table and another Hive table as the target; on the right-hand side of the screen is the join condition, across both sources:


The Execution view shows how ODI intends to run the transformation, with LKM SQL to SQL KMs used to load the Oracle tables and stage their data in Hive, and then IKM Hive Control Append used to load the joined dataset into another Hive table.


If you try and execute the mapping though, it fails – because the LKM SQL to SQL KM can’t work with Hive tables yet.


In fact. the various IKMs that come with Oracle Data Integrator Application Adapter for Hadoop (IKM file to Hive, IKM Hive Control Append etc) are a bit of a mix of IKMs and LKMs in that they contain extraction code, and integration code in the same KM, and none of the regular LKMs and IKMs will otherwise work with Hadoop sources. In fact, what we need to do with ODI at this point is actually land the Oracle data in Hive first, then do the join, which then begs the second question – how do we do that?

Currently, the only way to get Oracle data in to Hadoop is via IKM File to Hive (LOAD DATA), which involves an unnecessary extra step of exporting the Oracle data to a file, and then loading that file into HDFS and Hive. What we can do though is use sqoop, a tool within Hadoop to extract from, and load into, relational databases, something I covered on the blog a few weeks ago. Sqoop creates data loading and unloading jobs that run in parallel on the Hadoop cluster, and can use native JDBC drivers or even additional plugins such as Oraoop to make the process run faster (though Oracle Loader for Hadoop is considered the fastest way to unload to Oracle, if you’ve licensed the Big Data Adapters).

The only problem is that there’s no official support for sqoop in ODI, so no KMs that make use of it. What you can do though is create a command-line script to run sqoop and include that in an ODI procedure, which is what I’ll now do to bring in my Oracle data into Hive. To do this, I create an ODI procedure and add a single task, using the Operating System (command-shell) technology type, and use it to tell Sqoop to create me a Hive table based on an SQL SELECT statement against my Oracle database:


(note that I had to format the sqoop command, in practice, as one line, to get it to run – the above listing is so that you can see all of the code.)

Executing this procedure works OK, and thereafter I’ve got a single Hive table containing the joined dataset from Oracle.


Joining this new Hive table to the previous one containing the distinct set of page views is then fairly straightforward, but something I’d also like to do is stop any entries going into the rest of the ETL process where the calling IP address is a test one we use, “″. The way I can do this is to use the CKM Hive knowledge module and put a constraint on the hostname in the table I’ll be loading from the join, so that I can then use ODI’s flow control feature to divert those rows to an error table.


I also need to define a primary key for this table, something that’s mandatory when flow control is used. So let’s put the mapping together, joining the table I just brought in from Sqoop with the latest version of the weblog entries Hive table, loading into the Hive table I’ve just enabled the constraint for:


and then I enable flow control, and the CKM Hive check knowledge module, in the Physical mapping settings.


This is of course one of the benefits of using ODI to do your Hadoop data loading – you’ve got access to the data quality and error handling features that come with the tool. Then, when I execute the mapping and check with the Operator navigator, I can see the error handling process running, and afterwards in Hue I can see the contents of the new error table, which now contains those log entries where my test IP address was used, removing them from the target Hive table where they’d ordinarily have gone.


So that’s the third step in the ODI BDA ETL process complete. The next one’s a bit trickier though – I need to geocode the entries in the log table, assigning country names to each row based on where the IP address is located. More tomorrow.

Categories: BI & Warehousing

The BI Survey 14 – Have Your Voice Heard!

Wed, 2014-06-11 03:20

Long-term readers of this blog will know that we’ve supported for many years the BI Survey, an independent survey of BI tools customers and implementors. Rittman Mead have no (financial or other) interest in the BI Survey or its organisers, but we like the way it gathers in detailed data on which tools work best and when, and it’s been a useful set of data for companies such as Oracle when they prioritise their investment in tools such as OBIEE, Essbase and the BI Applications.

Here’s the invite text and link to the survey:

“We would like to invite you to participate in The BI Survey 14, the world’s largest annual survey of business intelligence (BI) users.

To take part in this year’s survey, visit:

As a participant, you will:

  • Receive a summary of the results from the full survey
  • Be entered into a draw to win one of ten $50 Amazon vouchers
  • Ensure that your experiences are included in the final analyses

BARC’s annual survey gathers input from thousands of organizations to analyze their buying decisions, implementation cycles and the benefits they achieve from using BI software.

The BI Survey 14 is strictly vendor-independent: It is not sponsored by any vendor and the results are analyzed and published independently.

You will be asked to answer questions on your usage of a BI product from any vendor. Your answers will be used anonymously and your personal details will not be passed on to software vendors or other third parties.

Business and technical users, as well as vendors and consultants, are all encouraged to participate.

The BI Survey 14 should take about 20 minutes to complete. For further information, please contact Jevgeni Vitsenko at BARC ( 

Click below to take part in The BI Survey 14:

Categories: BI & Warehousing

Rittman Mead Featured in Oracle In-Memory Option Launch

Tue, 2014-06-10 17:02

Today saw the official launch of the Oracle Database In-Memory Option, with Larry Ellison going through the product features and then reading out some quotes and testimonials from beta testers. Rittman Mead were part of the beta testing program, with several of our team testing out various scenarios where we ETL’d into it, used it with OBIEE and worked out what would be involved in “in-memory-enabling” some of our customer’s BI systems.

In fact, as we said in our quote for the launch, enabling Oracle Database for in-memory analysis was almost “boringly simple” – just enable the option, choose your tables, drop any OLTP indexes and you’re ready to go.


Of course, in practice you’ll need to think about which tables you’ll put into memory if RAM is limited, in some scenarios TimesTen might be a better option, and you’ll need to test your particular system and carefully consider whether you’ll keep particular indexes or materialised views, but we’re really excited about the In-Memory Option for Oracle Database as it’s got the potential to significantly improve query response times for users – and from what we’ve seen so far, it “just works”.

We’re still in the NDA period whilst beta testing goes on, but you can read more on the In-Memory Option on the Oracle website, and on the blog post I wrote when the feature was announced last Openworld. Once it goes GA look out for some in-depth articles on the blog around how it works, and details on how we’ll be able to help customers take advantage of this significant new Oracle Database feature.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.2 : Hive Table Joins, Aggregation and Loading

Tue, 2014-06-10 02:49

In this series of posts, I’m going to be looking at an end-to-end ETL process on Oracle Big Data Appliance, using Hadoop technologies to do the data manipulation and Oracle Data Integrator 12c to orchestrate the process. Over the five posts, I’ll be landing web server log data on the Hadoop cluster using Flume, loading that data into a Hive table, transforming and enhancing the dataset, and then loading the final dataset into an Oracle database using one of the Oracle Big Data Connectors. The first post in the list below has a schematic for the overall process, and over the rest of the week I’ll be adding the links in for the remaining posts in the series.

So in today’s step, I want to take the initial Hive table containing the full set of log data, and then do three things; first extract the page details out of the “request” column in the incoming Hive table, then join that to some reference data also in the BDA that’ll allow me to add details of the post author and title of the post, and finally aggregate and project just certain columns from the the dataset so I’ve got a Hive table of just page accesses by IP address, with the author and title details.

In fact this is the easiest out of the five steps to set up and bring together, as we’re just using basic ODI functionality, albeit with Hive tables rather than regular database tables. In the screenshot below you can see the incoming weblog hive table, with the individual columns split-out in the previous step via the RegEx SerDe, and I just join it to the table containing post and author details, apply a distinct on the join output and then load the results into a third Hive table.


The join between the two tables is bit complex, because I need to extract just the URL details out of the request field (which also contains the transport method and other metadata), but its no different that joining two regular database tables. Also, I did need to check the Generate ANSI Syntax checkbox, as Hive expects table joins to be in this format rather than the “Oracle” format.


Going over to the Physical part of the mapping, I set the KM to IKM Hive Control Append, turn on the feature to enable table truncation prior to load, and I’m ready to go.


Then, when I execute the mapping, I can see its progress in the Operator navigator, and the code view shows me the HiveQL commands that the KM has generated to move data around the BDA.


So far so good, and as I said, this was the easy bit. For the next transformation I want to bring in some additional reference data from an Oracle database, so the question for me is whether I need to land that data into the BDA before doing the transformation, or whether I can create a join between the Oracle table and my Hive table? Check back tomorrow for the next installment…

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.1 : Flume to Initial Hive Table

Mon, 2014-06-09 07:18

A few months ago I posted an article on the ODI12c examples in the new Oracle Big Data Appliance, and over the past few weeks I’ve been writing about the various components within the Cloudera CDH Hadoop stack, including Hive, Pig, Spark and Flume. Since then I’ve built out a virtualized example of an Oracle Big Data Appliance using the Oracle Big Data 3.0 software set, and I thought it’d be interesting to create an example ETL flow through the system showing how ODI could be used to initiate and control the process. As with any situation where you actually build a demo to do something specific, as opposed to just play around with the technology, you end up uncovering a few quirks and techniques that you wouldn’t have otherwise been aware of, so I’ll spend this week going through the various steps and calling out things others in the same situation might find useful – the steps I’ll go through are below, and I’ll add the links as the articles get published during the week;

As an overview, what I’ll be working with is a six-node Hadoop cluster running Cloudera CDH5, Oracle Big Data Connectors 3.0 and Oracle Data Integrator 12c, as you’d get with the current version of Oracle Big Data Appliance. Obviously BDA goes up to eighteen nodes as a maximum, and under the covers there’s lots more memory and much better networking that I was able to set up on VMWare ESXi, but what I’ve got is enough to prove the process. The diagram below shows the six nodes, and where the software is installed.


I took a couple of short-cuts with the setup; obviously each node has a lot less RAM than BDA’s 64GB per node, but the main node in the cluster (bdanode1) running the HDFS NameNode and all the ODI software got bumped-up to 28GB, with the next two allocated 12GB and the others 8GB – enough to work through a demo at least. I also ran ODI Studio on the bdanode1 as well, instead of setting it up on a separate client VM, mainly to avoid having to set up all the Hadoop and Hive libraries on another machine. Other than that though, its the same CDH distribution you get on BDA, the same version of ODI and the Big Data Connectors, and so on, with the following software versions and downloads being used:

Setup of the Hadoop cluster is really out of scope for these articles, except to say that with CDH5, I find it easier to select the (non-default, deprecated) Packages install type rather than the new Parcels type, as this new methods installs all of the Hadoop software on each node in a new place – /opt/cloudera – rather than the usual /usr/lib/hadoop, /usr/lib/hive and so forth, meaning that most configuration examples you’ll read point to the wrong place for your install. Parcels are Cloudera’s way forward for delivering software components (there’s advantages in terms of patching across the cluster) but if you’re getting started with the Hadoop platform, installing in the old location usually makes things easier to follow. Other than that, the two bits of configuration you need to do is firstly to tell ODI Studio where to find that various Hadoop libraries and configuration files; because I’ve installed Studio directly on the main Hadoop node, I could then just add that node’s file locations to the Oracle user’s $HOME/.odi/oracledi/userlib/additional_path.txt file, so that it looked like this:

Also to make use of Oracle Loader for Hadoop, one of the Oracle Big Data Connectors and something we’ll use at the end to bulk-unload data from Hadoop to an Oracle database, the second thing I’ll need to do is set a couple of environment variables in the “oracle” user’s .bashrc profile file pointing to where OLH is installed, and where the Hadoop and Hive libraries and configuration files are:

[oracle@bdanode1 userlib]$ cat $HOME/.bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
# User specific aliases and functions
export HADOOP_CLASSPATH=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/*:/etc/hive/conf
export OLH_HOME=/home/oracle/oracle/product/oraloader-3.0.0-h2

The scenario I’ll be working with is similar to the ones I covered on the blog recently, where I landed data from the Rittman Mead blog webserver into HDFS files using Hive, and then processed the files using Hive, Pig, Spark and so on. In this example though, I’ll use ODI to do the data manipulation where possible, but still use Hive and so forth under-the-covers to do the work. The diagram below shows the data flow that i’ll be looking to set up in the example:


So in this scenario incoming data is being landed in the Hadoop cluster by Flume, using the process outlined in this previous blog post. All Flume is is a transport mechanism; it doesn’t in itself have any processing ability (making it analogous to GoldenGate) and all it does it transport log file entries from one place to another, via Flume agents on either end. All we as developers need to be aware of is (a) where the log files will be landed, and (b) that Flume will keep continuously writing to these files until the source log file gets rotated – therefore I can’t assume a log file is completely written to when I read from it.

What this means in practice is that if I want to do incremental loads, I need to consider the fact that a file I’m reading to might have more data in it later on. There’s various solutions to this – principally having Flume write to HBase, rather than raw HDFS files, and then I read from the HBase database noting the last extraction point at the time – but to keep things simple I’ll just do a full-load each time the ETL run takes place, meaning that I don’t need to think about incremental loads throughout the system.

So the first thing I need to do is have ODI take the incoming files and load them into a Hive table. To do this I set up Topology entries for the incoming HFDS files, and here’s the first “gotcha” – to create a connection to HDFS files, you use the regular File technology, but you leave the JDBC driver type blank, and put the address of the HDFS NameNode in to the JDBC URL – which of course is technically invalid and won’t allow you to either test it, or reverse-engineer the file names in, but is a “hack” used by the IKM File to Hive KM to get the address of your HDFS NameNode (if you choose to source the incoming file from HDFS rather than the local filesystem).


Then, when you come to register the physical schemas to go with the new File technology data server, you’ll need to be aware that ODI just appends the final file name to the directory name when retrieving the file data – so if you want the connection to point to a directory, rather than just a file, you’ll need to set up the physical schema to be the directory “above” the one you’re interested in, and then set the file name later on to be that directory. In this example , I want the final file reference to point to hdfs://, a whole directory (HDFS aggregates all the files in a directory if you reference just the directory in an operation) rather than just a single log file. You can see the directory and the log files it contains in the Hue screenshot below:


I therefore set the physical schema to be hdfs://, and the file reference in the datastore model is set to the final directory name, like this:


If it seems a bit confusing, it’ll become clearer in a moment.

Now I need to go back to the Topology navigator and create a connection through to the Hive server on the Big Data Appliance VMs – in fact recent versions of CDH (CDH4, CDH5) swap out the old Hive Server for HiveServer2, so you’ll need to use the correct JDBC drivers (as supplied with CDH4/5) to connect to it, and also create the JDBC URL in the format jdbc:hive2://[machine name:port], as shown inn the screenshot below:


A quick note about security at this point; by default, most CDH4/5 clusters are set up as what’s called “unsecured”, which means that whilst you use a username and login to connect to Cloudera Manager, for example, by this default although Hive and Impala request user credentials when you connect, they don’t actually check the password against anything, for example your LDAP server. You can connect these tools to LDAP or Active Directory, and typically you’d combine this with Kerebos authentication between the various components and when you connect via Impala and Hive, and typically you’d also use Apache Sentry to provide role-based access control to the data within individual HDFS files. But by default, Hive will accept more or less anything as a password when you connect, but then you may hit issues later on when your HDFS files are owned by a different user to the one you connect as.

Where this manifests itself is when a Hive table has underlying HDFS files owned by, say, the “admin” user in CDH5 (because that’s how you connected to Hue to upload them), but then you connect as the “oracle” user through Hive to then manipulate the Hive table contents. In practice, what I try to do is create any Hive tables (using Hue) as the user I’m then going to connect in Hive to them as, which means you’ll most probably need to go into Hue and create a new user called “oracle” (if that’s what you’re running ODI as, and connecting through Hive as) before creating the Hive tables you’ll then import into the ODI topology.

So once you’ve done all that, you can go into the Designer navigator and reverse-engineer the definition of your Hive tables into datastore models. In my case, I’ve got a bunch of tables that I’ll be using throughout the whole ETL process.


Now it’s time for the first ODI mapping, to take the raw log file directory and load it into a Hive table. As it stands though, these raw log files are rows of just a single “column” of data – the log file entry in Apache CombinedLogFormat format. To make them useful to the rest of the ETL process I’ll somehow need to parse them into the individual log file elements, so I create a target Hive table that contains an entry for the raw log entry, and then columns for the various log file elements:


The way that I parse the log file is to use a feature within the IKM File to Hive (LOAD DATA) KM that allows me to specify a regular expressed Serde (Serializer-Deserializer) to parse the log file entry into its individual columns, like this (note that you’ll need to ensure the hive-contrib-* JAR file is available to all of your Hadoop nodes before using this SerDe)


In this instance, I want the KM to leave the source files in-place when doing the data load (Hive by default moves incoming source files into the /user/hive/warehouse directory area) as these files most probably haven’t been finished written-to by Flume yet, so I leave the EXTERNAL_TABLE value set to true (Hive external table, not Oracle external table) and make sure FILE_IS_LOCAL is set to FALSE, so that this KM knows to use the HDFS file location hack I set up in the topology earlier. Then, I just run the mapping and check that it’s worked OK:


and I can check from the Model pane in the Designer navigator that I’ve now got a Hive table of individually split-up log entry columns to work with, for the rest of the downstream ETL process:


So that’s the first stage done – next, I’ll be combining this Hive table with data from another one, using the IKM Hive Control Append KM.

Categories: BI & Warehousing

Rittman Mead at the Oracle Virtual Technology Summit 2014 – “The Architecture of Analytics: Big Time Big Data and Business Intelligence”

Mon, 2014-06-09 05:25

I’m pleased to announce that I’ll be presenting at the Oracle Technology Network (OTN) Virtual Technology Summit, an online event running over three days in early July. I’m speaking as part of the Middleware track, entitled “The Architecture of Analytics: Big Time Big Data and Business Intelligence”, along with fellow Oracle ACE Director Kevin McGinley, US BI Forum keynote speaker Richard Tomlinson, and by Tom Plunkett, Lead Author of the Oracle Big Data Handbook. It’s free to attend, it’s running in three timezones over the three days, and here’s the track details and session abstracts:

The Architecture of Analytics: Big Data and Business Intelligence

“More than just another hyped-up technological buzzword,  Big Data represents a dramatic shift not just in the amount of data to be dealt with, but also in how business intelligence is extracted from that data in order to inform critical business decisions. This OTN Virtual Technology Summit track will present a solution architect’s perspective on how business intelligence products in Oracle’s Fusion Middleware family and beyond fit into an effective big data architecture, and present insight and expertise from Oracle ACE Directors specializing in business Intelligence to help you meet your big data business intelligence challenges. Technologies covered in this track include Oracle Big Data Appliance, Oracle Data Integrator, Oracle Business Intelligence Enterprise Edition, Oracle Endeca, Oracle Data Warehouse, and Oracle Big Data Connectors, along with Hadoop, Hive, NoSQL, and MapReduce.”

Session Abstracts

1.Oracle Big Data Appliance Case Study: Using Big Data to Analyze Cancer-Genome Relationships [30:00]
by Tom Plunkett, Lead Author of the Oracle Big Data Handbook

This presentation takes a deep technical dive into the use of the Oracle Big Data Appliance in a project for the National Cancer Institute’s Frederick National Laboratory for Cancer Research.  The Frederick National Laboratory and the Oracle team won several awards for analyzing relationships between genomes and cancer subtypes with big data, including the 2012 Government Big Data Solutions Award, the 2013 Excellence.Gov Finalist for Innovation, and the 2013 ComputerWorld Honors Laureate for Innovation.

2.Getting Value from Big Data Variety [30:00]
by Richard Tomlinson, Director, Product Management, Oracle

Big data variety implies big data complexity. Performing analytics on diverse data typically involves mashing up structured, semi-structured and unstructured content. So how can we do this effectively to get real value? How do we relate diverse content so we can start to analyze it? This session looks at how we approach this tricky problem using Endeca Information Discovery.

3.How To Leverage Your Investment In Oracle Business Intelligence Enterprise Edition Within A Big Data Architecture [30:00]
By Oracle ACE Director Kevin McGinley

More and more organizations are realizing the value Big Data technologies contribute to the return on investment in Analytics.  But as an increasing variety of data types reside in different data stores, organizations are finding that a unified Analytics layer can help bridge the divide in modern data architectures.  This session will examine how you can enable Oracle Business Intelligence Enterprise Edition (OBIEE) to play a role in a unified Analytics layer and the benefits and use cases for doing so.

4.Oracle Data Integrator 12c As Your Big Data Data Integration Hub [90:00]
by Oracle ACE Director Mark Rittman

Oracle Data Integrator 12c (ODI12c), as well as being able to integrate and transform data from application and database data sources, also has the ability to load, transform and orchestrate data loads to and from Big Data sources. In this session, we’ll look at ODI12c’s ability to load data from Hadoop, Hive, NoSQL and file sources, transform that data using Hive and MapReduce processing across the Hadoop cluster, and then bulk-load that data into an Oracle Data Warehouse using Oracle Big Data Connectors.

We will also look at how ODI12c enables ETL-offloading to a Hadoop cluster, with some tips and techniques on real-time capture into a Hadoop data reservoir and techniques and limitations when performing ETL on big data sources.

Registration is free, and the form for all three events is online here. Look out for a preview of my big data ETL on BDA session over the next five days, and hopefully you’ll be able to make the VTS event and hear me talk about it in-person.

Categories: BI & Warehousing

Goodbye to Stewart…

Mon, 2014-06-09 05:00

We’re sad to announce that Stewart Bryson’s last day with Rittman Mead was last Friday, and he’s moving-on now to a new challenge. Five years ago Stewart co-founded Rittman Mead America with us, and most recently he’s been our Chief Innovation Officer, handing over the day-to-day running of our US business to Charles Elliott, our US Practice Manager.

One of the main contributions Stewart has made over the last few years is helping us develop the Rittman Mead Delivery Framework and the Rittman Mead Way. The Delivery Framework is a set of software, accelerators and processes that determine how Rittman Mead consultants go about projects (the Rittman Mead Way). Stewart has contributed greatly to our work around Extreme BI, version control and release management. These processes are now embedded in our delivery management team and framework, and as such will live on through the DNA of Rittman Mead, but in the meantime we wish him well going into the future.

Categories: BI & Warehousing

Data Integration Tips: GoldenGate on RAC – Action Script for Windows

Fri, 2014-06-06 14:29

I want to briefly interrupt my blog series on GoldenGate and ODI – A Perfect Match in 12c… to provide another Data Integration Tip that was found during a recent client engagement. I was tasked with installing GoldenGate 11g into a 2-node RAC, which is a pretty straight-forward process and well documented by the Oracle Data Integration product team. The challenge is that the client’s Oracle RAC environment was installed on Windows Server 2003.

The Scenario

Working through the GoldenGate setup and configuration process for RAC / Clusterware, most of the same approach applies to either Unix/Linux or Windows (maybe a future blog post?). There is also an example action script towards the bottom, but it’s written in shell script..and that just won’t work in Windows! For those unfamiliar with Clusterware, when an application is added as a resource, Clusterware will use the action script to perform specific functions, or actions, against the application. These actions, Check, Start, Stop, etc., will often times run commands through GoldenGate’s GGSCI application, ensuring the Manager process is running, starting the Manager, etc.

We decided to convert the shell script example to batch command script, as this is about as native as it gets in Windows. Everything was going well with the conversion until we reached this function and the highlighted code below.

#call_ggsci is a generic routine that executes a ggsci command
call_ggsci () {
  log "entering call_ggsci"
  cd ${GGS_HOME}

  ggsci_output=`${GGS_HOME}/ggsci << EOF

  log "got output of : $ggsci_output"

The script will simply change directory to the GoldenGate home, run the GGSCI application, execute the command (passed in as an argument), and then exit GGSCI. This is all wrapped within two EOF “tags”, indicating that the formatting in the script, such as hard returns, etc, should remain when the external command is run. It’s known as a heredoc, and we found that it’s not really possible to do in a batch script.

When attempting similar code in the batch script, without the heredoc, we ran into an issue. GGSCI attempted to run all of the commands on the same line, which failed miserably. We needed another approach to ensure the script would execute properly.

Here’s the Tip…

The goal was to find a way to run a GGSCI command on a single line. The solution: use GGSCI to execute GoldenGate obey files. Rather than attempt to place hard returns in the batch command script, we simply placed all of the necessary commands in an obey file and passed the obey command and script file location it into the function.

call :call_ggsci %GGS_HOME%\diroby\StartMgr.oby


 REM set ggsci_command variable with the passed-in value
 SET ggsci_command=%~1
 call :log "ggsci_command: %ggsci_command%"

 REM log whether or not the GGS_HOME path exists
 if exist %GGS_HOME% (
  call :log "%GGS_HOME% exists"
 ) else (
  call :log "%GGS_HOME% does not exist"

The obey file could be as complex as you like, but in this case it simply starts the manager process.


This approach works very well, but we did end up with the additional obey files to maintain. Next time around, I’ll probably use a scripting language such as Groovy or Python, as either should work just fine on Linux or Windows.

Look for more Data Integration Tips on the blog, as Rittman Mead always coming up with innovative solutions to interesting challenges!

Categories: BI & Warehousing

Rittman Mead Announce New Partnerships with Cloudera and Amazon Web Services

Thu, 2014-06-05 06:00

To complement our strategic partnership with Oracle Corporation, Rittman Mead are very pleased to announce two new technology partnerships to complement our Oracle-based services; with Cloudera, the industry-leading Hadoop software vendor, and with Amazon Web Services, the infrastructure-as-a-service provider that’s become the standard for deploying highly-available, secure cloud applications. As business intelligence, analytics and data warehousing systems extend onto the cloud and to encompass big data, these two new partnerships will allow us to offer best-of-breed services in these area whilst complementing our main technology partnership with Oracle. So why Cloudera, why Amazon Web Services, and why now?

Extending BI from the EDW to the “Enterprise Data Hub”

If you’re a regular reader of this blog you’ll probably have read several articles by myself and other authors about cloud, Hadoop and big data. Whilst the vast majority of BI & DW systems we put together for clients today are based around regular database and application data sources, increasingly customers are asking us to help them bring non-traditional, NoSQL and Hadoop-based data sources into their BI platform, adding what’s becoming termed “enterprise data hubs” alongside the more traditional enterprise data warehouses. We’re also seeing more interest in deploying these systems into the cloud – either driven by the economics of the cloud vs. on-premise systems, or because other key line-of-business systems are moving into the cloud and it makes sense to deploy your BI there too. And Oracle, of course, have noticed this trend, with support being added to OBIEE and ODI for reporting against, and loading data out of/into Hadoop data sources, and Oracle themselves due to launch Oracle Business Intelligence Cloud Services in the next couple of months.


Oracle of course already have some Hadoop-based products out in the market, including the very-impressive Oracle Big Data Appliance and their Oracle Big Data Connectors, and we think these products have the potential to make a massive impact on the types of project and solutions we can offer customers today. But at Rittman Mead we like to get ahead of the technology curve, and so we decided to in addition partner directly with Cloudera, makers of the Hadoop software within Big Data Appliance but a significant software vendor in themselves, to ensure our team is at the forefront of delivering cutting-edge big data solutions based around Cloudera CDH, Apache Spark and Cloudera Impala.

Over the past few months we’ve started to deliver projects and PoCs for customers around Cloudera Hadoop technology and Oracle Engineered Systems, Database and BI tools, and what’s emerged as our core competence is the ability to bring together Oracle’s chosen Hadoop distribution and tools with the core Oracle technologies our customers use. We’ve developed systems using Oracle’s Big Data Connectors, to for example analyse huge datasets across multiple Hadoop nodes using Oracle R Advanced Analytics for Hadoop, and we’re currently working with a couple of customers’ Oracle DBA teams who are in the process of adopting Oracle Big Data Appliance. Most importantly, as official Cloudera Partners we’ve got access to their technical and solutions architecture resources, giving us a similar same level of technical backup as we have access to on our core Oracle projects.

BI Moves into the Cloud, for Improved Business Agility and TCO

The other big trend that’s driving a lot of innovation in the BI industry is “cloud”. If you were at one of our BI Forum events in Brighton and Atlanta this year, you’ll know that cloud is front-and-centre in Oracle’s product strategy at the moment, both in terms of making BI available as part of the wider Oracle Public Cloud, but also as a way of accelerating innovation and making more of the product self-service. What’s now been officially named “Oracle BI Cloud Service” (BICS) was officially launched at last year’s Openworld and we’re about to help beta-test the product for Oracle prior to its launch later in the year, and we’re expecting BICS to be particularly attractive to existing OBIEE customers looking to quickly spin-up departmental BI environments without the need for IT to get involved.

But as I covered in a couple of blog posts earlier in the year, BICS functionality is likely to be limited in its initial incarnation and many customers are going to be looking to run “full” OBIEE in the cloud, along with a full Oracle database and an ETL infrastructure using tools such as ODI or Informatica, and for these types of customer a more complete cloud solution will be needed – which is why we’ve also partnered with Amazon Web Services, in our view by far the best cloud service provider on the market and the platform behind companies such as Netflix and Dropbox.

Logo amazon aws

We’ve been long-term users of Amazon AWS since around three or four years ago, initially running our training servers on the platform but more recently, undertaking internal and customer development work on their “virtual private cloud” platform as part of our new “Extreme BI in the Cloud” initiative. As techies, we appreciate the power and flexibility of the Amazon AWS API which we use to script and automate much of our “DevOps” work, and we’re finding more and more of our customers are moving to AWS independently, due to their excellent uptime and levels of customer service, and the wide ecosystem of AWS-native and partner products such as Amazon Redshift, Amazon Elastic MapReduce, Amazon EC2 and S3, and Attunity Cloudbeam. Again, we’ve partnered officially with Amazon AWS so that we can train our team-up and access technical and solutions architecture resources, and as with the Cloudera partnership, our particular specialisation is in deploying Oracle-based technologies onto the Amazon AWS platform.

So does this mean that Rittman Mead are de-emphasising our Oracle Partnership or otherwise taking our attention away from solutions built-on OBIEE, ODI, Essbase and the Oracle Database? Far from it; the core of the company will always be around Oracle technology, but by partnering with two best-of-breed complementary organisations – Cloudera, for Oracle-centric Hadoop and big data solutions, and Amazon AWS, for cloud and hybrid deployments – we can continue to offer customers the most innovative and industry-leading solutions as BI evolves from its database and on-premise roots to cover big data and cloud deployments.

Look out for our our blog over the coming months as we cover integrating Oracle technology with Cloudera’s Hadoop platform and Oracle Big Data Appliance, and look out for products and services where we combine Oracle and Cloudera technology in innovative ways, creating cutting edge solutions for our customers both on-premise, in the Amazon and Oracle clouds, or as hybrid on-premise/cloud deployments.

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 2: Journalizing Knowledge Module

Wed, 2014-06-04 16:35

This is the second post in a blog series on the integration between Oracle Data Integrator (ODI) 12c and GoldenGate 12c. The first post focused on the latest, yet-to-be-released, Oracle Information Management Reference Architecture and some high-level features new to the 12c versions of each product. We also began working through the details of the GoldenGate installation and configuration, specifically the JAgent setup, which is necessary for communication between ODI and GoldenGate during the setup of “online” journalizing. In this post, we’ll look at the new features of the Journalizing Knowledge Module “JKM Oracle to Oracle Consistent (OGG Online)” and get started on the modifications to the JKM that must occur for us to load the Raw Data Reservoir (RDR – acronym coined by Nick Hurt in the comments of my first post…I like it!) and Foundation Layer simultaneously.

ODI Journalizing

Before I get into editing the JKM, let me briefly go through an introduction of ODI Journalizing, aka Change Data Capture (CDC). In case you missed it earlier, ODI CDC is implemented using a Journalized Knowledge Module. The JKM generates the infrastructure for the CDC, creating journal tables that store the change rows and views that provide access to the change rows for use in ODI Mappings. ODI CDC can be implemented using various capture processes, such as triggers on source tables, timestamps on rows, or mining of the database logs via Oracle Streams or, in our case, Oracle GoldenGate. A great explanation of ODI JKMs and how they work, written by Christophe Dupupet of the Oracle Data Integration A-Team, can be found here.

Import ODI JKMs

The integration between GoldenGate and ODI begins with a Journalized Knowledge Module, in our example we’ll be using the “JKM Oracle to Oracle Consistent (OGG Online)”. The JKM, delivered with ODI, will create the GoldenGate parameter files, configure GoldenGate process groups (and start them up), and generate the ODI CDC tables and views. The JKM, added to the source Model in 12c, uses the ODI metadata to generate the GoldenGate parameter file mappings. This alone saves quite a bit of manual work and reduces possible typos caused by human error.

JKM Oracle to Oracle Consistent (OGG Online)

In the previous post, I mentioned the new capability of the JKM that allows for an “online” integration between ODI and GoldenGate. But, there are many other new features that need to be described, so we’ll walk through those here.

ODI Tool: OdiOggCommand

The JKM uses an undocumented ODI Tool called OdiOggCommand in the target command of some tasks that are executed only when in “online” mode. This tool has different values for a parameter called OPERATION.

EXECUTECMD: executes various commands, such as Add Extract, within GGSCI
EXECUTEOBEY: runs the OBEY command against an obey file in GGSCI
DEFGEN: generates the source definitions file by executing the DEFGEN command
SAVEFILE: uploads the parameter and obey files to the GoldenGate installation directory


I imagine the code behind this ODI Tool is simply executing command line calls to GGSCI, DEFGEN, etc. It would be great to see some Oracle documentation on this one!

GoldenGate Topology

The “online” aspect of the JKM requires that a Data Server, Physical Schema and Logical Schema all be setup under the GoldenGate technology in ODI. The Data Server contains the location and connection information for communicating with the JAgent on either the source or target GoldenGate server. Under that, a Physical Schema must be setup for each extract (the pump is included) and replicat process group that will be implemented on that server.

GoldenGate Physical Schema

In this screenshot, we have a Physical Schema with the process type set to Capture, also called the extract. Here we set up the directory path on the source to the location where captured transactions from the source database logs will be stored in GoldenGate’s own log files, called trail files. The remote trail file directory, setup on the target server and accessed by the pump process in order to move transactions from the source trail to the target trail, must also be added. Additional options such as trail file size (how large a trail file can get before rolling over to the next file) are also available to be set. Just below the Capture Process Properties are the Additional Options, parameters that can be added to the extract or replicat parameter files based on specific needs of the solution.

GoldenGate Physical Schema Additional Options

Here we can add options to handle different aspects of the extract or replicat, such as TRANLOGOPTIONS on the extract. Once I add this option, I get a template of the additional parameters I can configure. For example, I may want to access my transaction logs stored in ASM via the database rather than directly from ASM. The DBLOGREADER option lets me utilize the ASM API in the database, simplifying my access to the logs.


When I add the additional option to the capture physical schema, it will be generated as a part of the extract parameter file. This helps to drastically reduce the amount of manual editing that had to occur after the parameter files were generated in the 11g version, and is a great addition to the JKM.

JKM Setup and Options

The JKM is actually applied to the source ODI Model, which is a grouping of logical table structures called Datastores. On the Journalizing tab in the Model, we first set the type (Consistent Set) and choose the appropriate JKM.

GoldenGate Process Selection

In the GoldenGate Process Selection section, we choose the Capture and Delivery Logical Schemas that were setup in the ODI Topology. If none have been created, but the Data Server for the source and target GoldenGate installations exist, we can choose to create the Physical and Logical Schemas directly from the Model by clicking the Create button. This is a nifty way to separate the system administration role (setting up the Data Server) from the metadata management or developer role.

The JKM also has a set of Options with configurable values, some being required and others optional. The number of Options has been reduced by quite a bit in the 12c JKM. This makes sense, as we’ve seen that additional metadata such as the trail file location and trail file size are set elsewhere. We’ll go through these in more detail later on when setting the values in our example.

JKM Options

That covers the updated features in the JKM, now let’s talk Knowledge Module customization.

JKM Customization

In the first post, I described how we want to extract the source data once and replicate it into both the Foundation Layer and Raw Data Reservoir in parallel. The JKM is set up to load the ODI CDC Framework (the fully replicated table and J$ table) out of the box, but not the Foundation table. In case you’re unfamiliar with the purpose of the Foundation layer, the idea is to store all transactional history from the source tables by converting every transaction into an insert, and tracking the type of change (insert / update / delete), commit date, and commit SCN. With this information stored from the beginning of the data warehouse, it can be used for historical drill-throughs from the dimensional model, or to completely reload a star schema – including all history. With the stage set, let’s look at what we’re going to change in the JKM.

1. Add New Option “APPLY_FOUNDATION”
This option, when true, will allow the Start Journal process to generate the source-to-foundation mapping statement in the Replicat (apply) process.

2. Add New Option “FND_LSCHEMA”
The Logical Schema name for the Foundation layer schema.

3. Add New Task “Create apply prm (4) RM”
This task will create the source-to-foundation mapping code and add it to the replicat parameter file.

4. Set Option on “Create apply prm (4) RM” Task
Set the execution Options to have APPLY_FOUNDATION as the only checked option. This will determine whether or not that task will execute when Start Journal is run.

5. Edit Task “Execute apply commands online RM”
We’ll comment out the “start replicat …” command, as we’ll first need to complete an initial load of the source data to the target.

Now, I did say I would get into the editing of the JKM in this post, but after describing the new features of the Knowledge Module, I’ll save the actual details for the part 3 of the series! Up next, editing the JKM and building the necessary ODI objects so we can start journalizing.

Categories: BI & Warehousing

How We Deliver Agile OBIEE Projects – Introducing ExtremeBI

Wed, 2014-06-04 04:56

Most OBIEE projects that we see are delivered through some sort of “waterfall” method, where requirements are defined up-front, there’s several stages of development, one or more major releases at the end, and any revision to requirements takes the form of a change request. These work well where requirements can be defined upfront, and can be reassuring to customers when they want to agree a fixed-price up-front with every subsequent change clearly costed. But, as with the development world in general, some customers are starting to look at “agile” methods for delivering BI projects, where requirements emerge over the course of a project, there isn’t so much of a fixed design or specification at the start, but instead the project adds features or capabilities in response to what are called “user stories”, making it more likely in-the-end that what ends-up getting delivered is more in-line with what users want – and where changes and additions to requirements are welcomed, rather than extra-cost change requests.

OBIEE naturally lends itself to working in an agile manner, through the three-layer nature of the repository (RPD); by separating the physical representation of the source data from how it is then presented to the end-users, you can start from the off with the dimensional model that’s your end goal, and then over time evolve the back-end physical layer from pointing directly at the source system to instead point at a data warehouse or OLAP cube. In fact, I covered this approach back in 2008 in a blog post called “A Future Oracle OBIEE Architecture” where I positioned OBIEE’s BI Server as a “business logic layer”, and speculated that at some point in the future, OBIEE might be able to turn the logical > physical mappings in the RPD into actual ODI mappings and transformation.


In the end, although OBIEE’s aggregate persistence feature gave us the ability to spin-off aggregate tables and cubes from the RPD logical model, full ETL “push-down” never came although you can see traces of it if you have a good poke around the DLLs and directories under the BI Server component. What did happen though was Exadata; with Exadata, features such as SmartScan, and its ability to do joins between normalised tables much faster than regular databases meant that it became possible to report directly against an OLTP schema, or a ODS-like foundation layer, only adding ETL to build a performance star schema layer if it was absolutely necessary. We covered this in a series of posts on Agile Data Warehousing with Exadata, and the focus of this method was performance – by adding Exadata, and the metadata flexibility in OBIEE’s RPD, we could deliver agile projects where Exadata gave us the performance even when we reported directly against a third-normal form data source.




And this approach worked well for our customers; if they’d invested in Exadata, and were open to the idea of agile, iterative development, we could typically deliver a working system in just a few months, and at all times what the users got was what they’d requested in their user story backlog. But there were still ways in which we could improve this method; not everyone has access to Exadata, for example, and reporting directly against a source system makes it tricky to add DW features like history, and surrogate keys, so recently we introduced the successor to this approach, in the form of an OBIEE development method we called “ExtremeBI”. Building our previous agile work, ExtremeBI introduced an integration element, using GoldenGate and ODI to replicate in real time any source systems we were interested in to the DW foundation layer, add the table metadata that DW systems expect, and then provide a means to transform the logical to physical RPD mappings into ODI ETL specifications.


But in a way, all the technical stuff is by-the-by; what this means in practice for customers is that we deliver working systems from the first iteration; initially, by reporting directly against a replicated copy of their source system (with replication and metadata enhancement by GoldenGate, and optionally ODI),and then over subsequent iterations adding more end-user functionality, OR hardened ODI ETL code, all the while driven by end-user stories and not some technical design signed-off months ago and which no longer reflects what users actually want.


What we’ve found though from several ExtremeBI customer engagements, is that it’s not just down to the technology and how well ODI, OBIEE and GoldenGate work; the major factors in successful projects are firstly, having the project properly pre-qualified at the start; not every project, and not every client, suits agile working, and agile works best if you’re “all in” as opposed to just agreeing to work in sprints but still having a set-in-stone set of requirements which have to be met at a certain time. The second important success factor is proper project organisation; we’ve grown from just a couple of guys with laptops back in 2007 to a fully-fledged, end-to-end development organisation, with full-time delivery managers,a managed services desk and tools such as JIRA, and you need to have this sort of thing in place, particularly a project management structure that’s agile-friendly and a good relationship with the customer where they’re fully-signed up to the agile approach. As such, we’ve found the most success where we’ve used ExtremeBI for fairly technically-savvy customers, for example a MIS department, who’ve been tasked with delivering something for reasonable price and over a short amount of months, who understand that not all requirements can be delivered, but really want their system to get adopted, delight their customer and focus its features on what’s important to end-users.

As well as processes and a method, we’ve also developed utilities and accelerators to help speed-up the initial setup, and ensure the initial foundation and staging layers are built consistently, with GoldenGate mappings already put in place, and ready for our developers to start delivering reports against the foundation layer, or use these foundation-layer tables as the basis of a data mart or warehouse build-out. The screenshot below shows this particular tool, built using Groovy and run from within the ODI Studio user interface, where the developer selects a set of source tables from an ODI model, and then the utility builds out the staging and foundation layers automatically, typically saving days over the manual method.


We’ve also built custom KMs for ExtremeBI, including one that uses Oracle Database’s flashback query feature to pull historical transactions from the UNDO log, as an alternative to Oracle Streams or Oracle GoldenGate when these aren’t available on the project.

All together, using Rittman Mead’s ExtremeBI method along with OBIEE, ODI and optionally GoldenGate has meant we’ve been able to deliver working OBIEE systems for customers over just a few months, typically for a budget less than £50k. Coupled with cloud hosting, where we can get the customer up-and-running immediately rather than having to wait for their IT department to provision servers, we think this the best way for most OBIEE11g projects to be delivered in the future. If you’re interested, we’ve got more details on our “ExtremeBI in the Cloud” web page, or you can contact me via email – – if you’d like to discuss it more,

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 1: Getting Started

Fri, 2014-05-30 14:13

Over the years, I’ve blogged quite a bit about integration between Oracle Data Integrator and GoldenGate, and how to make it all work with the Oracle Reference Architecture. With the release of the 12c versions of ODI and GoldenGate last October, and a soon-to-be-updated reference architecture, it’s time to write a few posts on the subject again.

Getting Started with 12c

First, let me describe the new Journalizing Knowledge Module (JKM) that has been introduced in ODI 12c for integration with GoldenGate: JKM Oracle to Oracle Consistent (OGG Online). This JKM now allows GoldenGate to be setup in an “online” mode, meaning the GoldenGate parameter files and process groups will be configured and installed on the source and target GoldenGate servers. ODI communicates with the GoldenGate JAgent to perform the installation. The “offline” mode still exists as it did in the 11g version of the JKM, in which the parameter files, etc. are created in a temporary location, and then manually moved to the source and target. I’ll use the “online” JKM throughout this series of posts.

Another change to how Journalizing is implemented for GoldenGate in ODI is the Model to which the JKM is applied. In ODI 11g, the GoldenGate JKM was always applied to the Model containing the target tables, leaving the source table metadata completely out of the picture. This made sense, as GoldenGate handled everything on the source side. Now in ODI 12c, the source tables are reverse engineered and the JKM applied to the source Model. This allows the source table to be used in a single mapping for both the initial load and incremental load of the performance layer through the use of ODI 12c deployment specifications. The target, or fully replicated table, is no longer necessary in the metadata. We’ll talk through this concept in more detail later on.

Reference Architecture Update

Finally, before we get into the details, let’s go over the latest (yet to be released, I might add) version of the Oracle Information Management Reference Architecture. It was first presented by Stewart Bryson and Andrew Bond (Oracle) at the Rittman Mead BI Forum in Brighton (which is why I was given the OK to mention it pre-release!).

Information Management - Logical View

This latest reference architecture is not much different than previous versions. The main difference that pertains to this blog post is that the Staging Layer has been renamed the Raw Data Reservoir. If you look through the presentation by Stewart and Andrew, you’ll see that the many of the principles remain the same. They also describe more about an Agile approach to implementing the reference architecture, using GoldenGate, OBIEE against transactional schemas, and eventually ETL development using ODI, a methodology we here at Rittman Mead use with our clients, and call ExtremeBI. Look for the official release of the latest Information Management Reference Architecture in the next couple of months.

In this blog series, we’ll look at how to load the Raw Data Reservoir and Foundation Layer using GoldenGate, and subsequently load the Access and Performance Layer with Oracle Data Integrator Mappings. If you recall from my 11g posts on the subject, we don’t need to load these layers in sequence. GoldenGate will allow the extract of source data once and replication to multiple targets in parallel.


Now that we’ve gone through some of the updated concepts for 12c and the Reference Architecture, let’s look at the high-level steps that must be taken in order to implement GoldenGate and ODI integration.

  • Install GoldenGate on the source and target servers – including JAgent configuration
  • Edit the “JKM Oracle to Oracle Consistent OGG (Online)” Knowledge Module (enable Foundation Layer load)
  • Setup ODI Topology (database schema and GoldenGate connections)
  • Setup and start Journalizing on the source Model
  • Develop Mappings for initial load and incremental load
  • Perform initial load and start replication

There is quite a lot of detail to add to these steps, so let’s get right into it.

GoldenGate 12c Installation and JAgent Configuration

The install of GoldenGate 12c is pretty straight-forward, so I don’t plan on going into much detail here. A step-by-step guide can be found on the DBASolved blog, with the installation setting up and starting the manager process and creating the necessary subdirectories. We then need to configure the JAgent on both the source and target GoldenGate installations, enabling ODI to communicate with GoldenGate during the “online” JKM start journalizing process, which will automatically configure and start the GoldenGate process groups. Setting up the JAgent for ODI integration is essentially the same as if you were setting up Oracle Enterprise Manager integration with GoldenGate.

First, you’ll notice that the file jagent.prm exists in the dirprm directory after installation completes. This parameter file will be used by the jagent process once started in GGSCI.


Next, we need to enable monitoring of GoldenGate by adding an entry, ENABLEMONITORING, to the GLOBALS file. Create the GLOBALS file (with no extension) in the GoldenGate home directory and open it in your favorite text editor. Simply add the line to enable monitoring, close and save the file.

Edit GLOBALS file

To allow secure communication between ODI and GoldenGate, we must create an Oracle Wallet with a password for JAgent. From the GoldenGate install directory, run the password agent. This will create the cwallet.sso file in the dirwlt subdirectory.

Create Oracle wallet

We’re almost there! Now we need to make a slight change to the file in the cfg directory under the GoldenGate home. Edit this file and make the following changes:

Set the agent type to Oracle Enterprise Manager. If left as the default OGGMON, the JAgent will attempt to register with the Monitor server, which most likely is not installed.


Under the JMX Username, add the line to signify that SSL is not being used by JAgent (unless SSL is actually being used!).


Finally, ensure the JAgent port is unique across all installations, and on the server in general. In my example, I’m using a single Virtual Machine to host both the source and target GoldenGate installations, so I need to be careful about which ports are in use.


Before starting the JAgent, go ahead and stop, then start the Manager process to ensure all changes have been initialized. Then, start the JAgent and check to ensure both Manager and JAgent are running. That completes the GoldenGate installation and JAgent configuration.

Start JAgent

If you want to just skip all of this installation work and get right to it, you can always download the Prebuilt Machine for Oracle Data Integrator 12c. It’s a VirtualBox VM with ODI 12c and GoldenGate 12c already installed and configured to work with the Getting Started Guide. The JAgent is already configured on the source and target GoldenGate installations, making it easy to get up and running. This is a great resource that the Oracle Data Integration product team has provided, and it sounds like they plan to continue adding to it in the future.

In Part 2 of the blog post series, we’ll edit the JKM to enable parallel load of the Raw Data Reservoir and Foundation Layer, as well as begin setup of the ODI Topology and metadata.

Categories: BI & Warehousing