Skip navigation.

BI & Warehousing

Using HBase and Impala to Add Update and Delete Capability to Hive DW Tables, and Improve Query Response Times

Rittman Mead Consulting - Tue, 2015-05-19 16:21

One of our customers is looking to offload part of their data warehouse platform to Hadoop, extracting data out of a source system and loading it into Apache Hive tables for subsequent querying using OBIEE11g. One of the challenges that the project faces though is how to handle updates to dimensions (and in their case, fact table records) when HDFS and Hive are typically append-only filesystems; ideally writes to fact tables should only require INSERTs and filesystem appends but in this case they wanted to use an accumulating fact snapshot table, whilst the dimension tables all used SCD1-type attributes that had their values overwritten when updates to those values came through from the source system.

The obvious answer then was to use Apache HBase as part of the design, a NoSQL database that sits over HDFS but allows updates and deletes to individual rows of data rather than restricting you just to append/inserts. I covered HBase briefly on the blog a few months ago when we used it to store webserver log entries brought into Hadoop via Flume, but in this case it makes an ideal landing point for data coming into our Hadoop system as we can maintain a current-state record of the data brought into the source system updating and overwriting values if we need to. What was also interesting to me though was how well we could integrate this HBase data into our mainly SQL-style data processing; how much Java I’d have to use to work with HBase, and whether we could get OBIEE to connect to the HBase tables and query them directly (with a reasonable response time). In particular, could we use the Hive-on-HBase feature to create Hive tables over the HBase ones, and then query those efficiently using OBIEE, so that the data flow looked like this?


To test this idea out, I took the Flight Delays dataset from the OBIEE11g SampleApp & Exalytics demo data [PDF] and created four HBase tables to hold the data from them, using the BigDataLite 4.1 VM and the HBase Shell. This dataset has four tables:

  • FLIGHT_DELAYS – around 220m US flight records listing the origin airport, destination airport, carrier, year and a bunch of metrics (flights, late minutes, distance etc)
  • GEOG_ORIGIN – a list of all the airports in the US along with their city, state, name and so on
  • GEOG_DEST – a copy of the GEOG_ORIGIN table, used for filtering and aggregating on both origin and destination 
  • CARRIERS – a list of all the airlines associated with flights in the FLIGHT_DELAYS table

HBase is a NoSQL, key/value-store database where individual rows have a key, and then one or more column families made up of one or more columns. When you define a HBase table you only define the column families, and the data load itself creates the columns within them in a similar way to how the Endeca Server holds “jagged” data – individual rows might have different columns to each other and like MongoDB you can define a new column just by loading it into the database.

Using the HBase Shell CLI on the BigDataLite VM I therefore create the HBase tables using just these high-level column family definitions, with the individual columns within the column families to be defined later when I load data into them.

hbase shell
create 'carriers','details'
create 'geog_origin','origin'
create 'geog_dest','dest'
create 'flight_delays','dims','measures'

To get data into HBase tables there’s a variety of methods you can use. Most probably for the full project we’ll write a Java application that uses the HBase client to read, write, update and delete rows that are read in from the source application (see this previous blog post for an example where we use Flume as the source), or to set up some example data we can use the HBase Shell and enter the HBase row/cell values directly, like this for the geog_dest table:

put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles'
put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles'
put 'geog_dest','LAX','dest:city','Los Angeles, CA'
put 'geog_dest','LAX','dest:state','California'
put 'geog_dest','LAX','dest:id','12892'

and you can then use the “scan” command from the HBase shell to see those values stored in HBase’s key/value store, keyed on LAX as the key.

hbase(main):015:0> scan 'geog_dest'
ROW                                    COLUMN+CELL                                                                                                     
 LAX                                   column=dest:airport_name, timestamp=1432067861347, value=Los Angeles, CA: Los Angeles                           
 LAX                                   column=dest:city, timestamp=1432067861375, value=Los Angeles, CA                                                
 LAX                                   column=dest:id, timestamp=1432067862018, value=12892                                                            
 LAX                                   column=dest:state, timestamp=1432067861404, value=California                                                    
1 row(s) in 0.0240 seconds

For testing purposes though we need a large volume of rows and entering them all in by-hand isn’t practical, so this is where we start to use the Hive integration that now comes with HBase. For the BigDataLite 4.1 VM all you need to do to get this working is install the hive-hbase package using yum (after first installing the Cloudera CDH5 repo into /etc/yum.repos.d), load the relevant JAR files when starting your Hive shell session, and then create a Hive table over the HBase table mapping Hive columns to the relevant HBase ones, like this:

ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler.jar;
ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;
ADD JAR /usr/lib/hive/lib/hbase-client.jar;
ADD JAR /usr/lib/hive/lib/hbase-common.jar;
ADD JAR /usr/lib/hive/lib/hbase-hadoop-compat.jar;
ADD JAR /usr/lib/hive/lib/hbase-hadoop2-compat.jar;
ADD JAR /usr/lib/hive/lib/hbase-protocol.jar;
ADD JAR /usr/lib/hive/lib/hbase-server.jar;
ADD JAR /usr/lib/hive/lib/htrace-core.jar;
 (key string,
  carrier_desc string
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
("hbase.columns.mapping" = ":key,details:carrier_desc")
TBLPROPERTIES ("" = "carriers");
CREATE EXTERNAL TABLE hbase_geog_origin
 (key string,
  origin_airport_name string,
  origin_city string,
  origin_state string,
  origin_id string
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
("hbase.columns.mapping" = ":key,origin:airport_name,origin:city,origin:state,origin:id")
TBLPROPERTIES ("" = "geog_origin");
 (key string,
  dest_airport_name string,
  dest_city string,
  dest_state string,
  dest_id string
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
("hbase.columns.mapping" = ":key,dest:airport_name,dest:city,dest:state,dest:id")
TBLPROPERTIES ("" = "geog_dest");
CREATE EXTERNAL TABLE hbase_flight_delays
 (key string,
  year string,
  carrier string,
  orig string,
  dest string,
  flights tinyint,
  late   tinyint,
  cancelled bigint,
  distance smallint
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,measures:flights,measures:late,measures:cancelled,measures:distance")
TBLPROPERTIES ("" = "flight_delays");

Bulk loading data into these Hive-on-HBase tables is then just a matter of loading the source data into a regular Hive table, and then running INSERT INTO TABLE … SELECT commands to copy the regular Hive rows into the HBase tables via their Hive metadata overlays:

insert into table hbase_carriers                           
select carrier, carrier_desc from carriers;
insert into table hbase_geog_origin
select * from geog_origin;
insert into table hbase_geog_dest
select * from geog_dest;
insert into table hbase_flight_delays
select row_number() over (), * from flight_delays;

Note that I had to create a synthetic sequence number key for the fact table, as the source data for that table doesn’t have a unique key for each row – something fairly common for data warehouse fact table datasets. In fact storing fact table data into a HBase table is not a very good idea for a number of reasons that we’ll see in a moment, and bear-in-mind that HBase is designed for sparse datasets and low-latency inserts and row retrievals so don’t read too much into this approach yet.

So going back to the original reason for using HBase to store these tables, updating rows within them is pretty straightforward. Taking the geog_origin HBase table at the start, if we get the row for SFO at the start using a Hive query over the HBase table, it looks like this:

hive> select * from hbase_geog_origin where key = 'SFO'; 
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
SFO   San Francisco, CA: San Francisco   San Francisco, CA   California   14771
Time taken: 29.126 seconds, Fetched: 1 row(s)

To update that row and others, I can load a new data file into the Hive table using HiveQL’s LOAD DATA command, or INSERT INTO TABLE … SELECT from another Hive table containing the updates, like this:

insert into table hbase_geog_origin    
select * from origin_updates;

To check that the value has in-fact updated I can either run the same SELECT query against the Hive table over the HBase one, or drop into the HBase shell and check it there:

hbase(main):001:0> get 'geog_origin','SFO'
COLUMN                                 CELL                                                                                                           
 origin:airport_name                   timestamp=1432050681685, value=San Francisco, CA: San Francisco International                                  
 origin:city                           timestamp=1432050681685, value=San Francisco, CA                                                               
 origin:id                             timestamp=1432050681685, value=14771                                                                           
 origin:state                          timestamp=1432050681685, value=California                                                                      
4 row(s) in 0.2740 seconds

In this case the update file/Hive table changed the SFO airport name from “San Francisco” to “San Francisco International”. I can change it back again using the HBase Shell like this, if I want:

put 'geog_origin','SFO','origin:airport_name','San Francisco, CA: San Francisco'

and then checking it again using the HBase Shell’s GET command on that key value shows it’s back to the old value – HBase actually stores X number of versions of each cell with a timestamp for each version, but by default it shows you the current one:

hbase(main):003:0> get 'geog_origin','SFO'
COLUMN                                 CELL                                                                                                           
 origin:airport_name                   timestamp=1432064747843, value=San Francisco, CA: San Francisco                                                
 origin:city                           timestamp=1432050681685, value=San Francisco, CA                                                               
 origin:id                             timestamp=1432050681685, value=14771                                                                           
 origin:state                          timestamp=1432050681685, value=California                                                                      
4 row(s) in 0.0130 seconds

So, so far so good. We’ve got a way of storing data in Hive-type tables on Hadoop and a way of updating and amending records within them by using HBase as the underlying storage, but what are these tables like to query? Hive-on-HBase tables with just a handful of HBase rows return data almost immediately, for example when I create a copy of the geog_dest HBase table and put just a single row entry into it, then query it using a Hive table over it:

hive> select * from hbase_geog_dest2;
LAXLos Angeles, CA: Los AngelesLos Angeles, CACalifornia12892
Time taken: 0.257 seconds, Fetched: 1 row(s)

Hive in this case even with a single row would normally take 30 seconds or more to return just that row; but when we move up to larger datasets such as the flight delays fact table itself, running a simple row count on the Hive table and then comparing that to the same query running against the Hive-on-HBase version shows a significant time-penalty for the HBase version:

hive> select sum(cast(flights as bigint)) as flight_count from flight_delays;
Total jobs = 1
Launching Job 1 out of 1
Total MapReduce CPU Time Spent: 7 seconds 670 msec
Time taken: 37.327 seconds, Fetched: 1 row(s)

compared to the Hive-on-HBase version of the fact table:

hive> select sum(cast(flights as bigint)) as flight_count from hbase_flight_delays;
Total jobs = 1
Launching Job 1 out of 1
Total MapReduce CPU Time Spent: 1 minutes 19 seconds 240 msec
Time taken: 99.154 seconds, Fetched: 1 row(s)

And that’s to be expected; as I said earlier, HBase is aimed at low-latency single-row operations rather than full table scan, aggregation-type queries, so it’s not unexpected that HBase performs badly here, but the response time is even worse if I try and join the HBase-stored Hive fact table to one or more of the dimension tables also stored in HBase.

In our particular customer example though these HBase tables were only going to be loaded once-a-day, so what if we copy the current version of each HBase table row into a snapshot Hive table stored in regular HDFS storage, so that our data loading process looks like this:


and then OBIEE queries the snapshot of the Hive-on-HBase table joined to the dimension table still stored in HBase, so that the query side looks like this:


Let’s try it out by taking the original Hive table I used earlier on to load the hbase_flight_delays table. and join that to one of the Hive-on-HBase dimension tables; I’ll start first by creating a baseline response time by joining that source Hive fact table to the source Hive dimension table (also used earlier to load the corresponding Hive-on-HBase table):

select sum(cast( as bigint)) as flight_count, o.origin_airport_name from flight_delays f 
join geog_origin o on f.orig = o.origin                                                             
and o.origin_state = 'California'                                                                       
group by o.origin_airport_name; 
17638Arcata/Eureka, CA: Arcata
9146Bakersfield, CA: Meadows Field
125433Burbank, CA: Bob Hope
1653Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field
Time taken: 43.896 seconds, Fetched: 27 row(s)

So that’s just under 44 seconds to do the query entirely using regular Hive tables. So what if I swap-out the regular Hive dimension table for the Hive-on-HBase version, how does that affect the response time?

hive> select sum(cast( as bigint)) as flight_count, o.origin_airport_name from flight_delays f       
    > join hbase_geog_origin o on f.orig = o.key                                                        
    > and o.origin_state = 'California'                                                                 
    > group by o.origin_airport_name;
17638Arcata/Eureka, CA: Arcata
9146Bakersfield, CA: Meadows Field
125433Burbank, CA: Bob Hope
1653Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field
Time taken: 51.757 seconds, Fetched: 27 row(s)

That’s interesting – even though we used the (updatable) Hive-on-HBase dimension table in the query, the response time only went up a few seconds to 51, compared to the 44 when we used just regular Hive tables. Taking it one step further though, what if we used Cloudera Impala as our query engine and copied the Hive-on-HBase fact table into a Parquet-stored Impala table, so that our inward data flow looked like this:


By using the Impala MPP engine – running on Hadoop but directly reading the underlying data files, rather than going through MapReduce as Hive does – and in-addition storing its data in column-store query-orientated Parquet storage, we can take advantage of OBIEE’s new support for Impala and potentially bring the query response time even further. Let’s go into the Impala Shell on the BigDataLite 4.1 VM, update Impala’s view of the Hive Metastore table data dictionary, and then create the corresponding Impala snapshot fact table using a CREATE TABLE … AS SELECT Impala SQL command:

[oracle@bigdatalite ~]$ impala-shell
[bigdatalite.localdomain:21000] > invalidate metadata;
[bigdatalite.localdomain:21000] > create table impala_flight_delays
                                > stored as parquet
                                > as select * from hbase_flight_delays;

Now let’s use the Impala Shell to join the Impala version of the flight delays table with data stored in Parquet files, to the Hive-on-HBase dimension table created earlier within our Hive environment:

[bigdatalite.localdomain:21000] > select sum(cast( as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
                                > join hbase_geog_origin o on f.orig = o.key
                                > and o.origin_state = 'California'  
                                > group by o.origin_airport_name;
Query: select sum(cast( as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
join hbase_geog_origin o on f.orig = o.key
and o.origin_state = 'California'
group by o.origin_airport_name
| flight_count | origin_airport_name                                              |
| 31907        | Fresno, CA: Fresno Yosemite International                        |
| 125433       | Burbank, CA: Bob Hope                                            |
| 1653         | Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field |
Fetched 27 row(s) in 2.16s

Blimey – 2.16 seconds, compared to the best time of 44 seconds we go earlier when we just used regular Hive tables, let alone join to the dimension table stored in HBase. Let’s crank-it-up a bit and join another dimension table in, filtering on both origin and destination values:

[bigdatalite.localdomain:21000] > select sum(cast( as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
                                > join hbase_geog_origin o on f.orig = o.key
                                > join hbase_geog_dest d on f.dest = d.key
                                > and o.origin_state = 'California'  
                                > and d.dest_state = 'New York'
                                > group by o.origin_airport_name;
Query: select sum(cast( as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
join hbase_geog_origin o on f.orig = o.key
join hbase_geog_dest d on f.dest = d.key
and o.origin_state = 'California'
and d.dest_state = 'New York'
group by o.origin_airport_name
| flight_count | origin_airport_name                                   |
| 947          | Sacramento, CA: Sacramento International              |
| 3880         | San Diego, CA: San Diego International                |
| 4030         | Burbank, CA: Bob Hope                                 |
| 41909        | San Francisco, CA: San Francisco International        |
| 3489         | Oakland, CA: Metropolitan Oakland International       |
| 937          | San Jose, CA: Norman Y. Mineta San Jose International |
| 41407        | Los Angeles, CA: Los Angeles International            |
| 794          | Ontario, CA: Ontario International                    |
| 4176         | Long Beach, CA: Long Beach Airport                    |
Fetched 9 row(s) in 1.48s

Even faster. So that’s what we’ll be going with as our initial approach for the data loading and querying; load data into HBase tables as planned at the start, taking advantage of HBase’s CRUD capabilities but bulk-loading and initially reading the data using Hive tables over the HBase ones; but then, before we make the data available for querying by OBIEE, we copy the current state of the HBase fact table into a Parquet-stored Impala table, using Impala’s ability to work with Hive tables and metadata and create joins across both Impala and Hive tables, even when one of the Hive tables uses HBase as its underlying storage.

Categories: BI & Warehousing

Writing tips

Amardeep Sidhu - Tue, 2015-05-19 03:14

Tim Hall has written some brilliant posts about getting going with writing (blogs, whitepapers etc). This post is the result of inspiration from there only. Tim says that just get started with whatever Winking smile.

If you are into blogging and no so active or even if you aren’t you may want to take a look at all the posts to get some inspiration to document the knowledge you gain on day to day basis.

Here is an index to all the posts by Tim till now

Enjoy !

Categories: BI & Warehousing

OBIEE Now Supports HiveServer2 and Cloudera Impala

Rittman Mead Consulting - Sun, 2015-05-17 15:52

As you all probably know I’m a big fan of Oracle’s BI and Big Data products, but something I’ve been critical of is OBIEE11g’s lack of support for HiveServer2 connections to Hadoop clusters. OBIEE supported Hive connections using the older HiveServer1 protocol, but recent versions of Cloudera CDH4 and CDH5 use the HiveServer2 protocol by default and OBIEE wouldn’t connect to them; not unless you switched to the Windows version of OBIEE and used the Cloudera ODBC drivers instead, which worked but weren’t supported by Oracle.

OBIEE addresses this issue by shipping more recent DataDirect ODBC drivers for Hive, that are compatible with the HiveServer2 protocol used by CDH4 and CDH5 (check out this other article by Robin on general new features in Oracle only really support Hive connectivity for Linux installs of OBIEE, and the Linux version of OBIEE comes with the DataDirect ODBC drivers already installed and configured for use, all you have to do then is set up the ODBC connection in the odbc.ini file on Linux and install the Cloudera Hive ODBC drivers on your Windows workstation for the Admin too (the Hive ODBC drivers that Oracle supply on MOS still look like the old HIveServer1 version, though I could be wrong). To check that it all worked on this new version of OBIEE11g I therefore downloaded and installed the Windows Cloudera Hive ODBC drivers and set up the System DSN like this:


and set up a corresponding entry in the Linux OBIEE’s odbc.ini file, like this:


with the key thing being to make sure you have matching DSN names on both the Windows workstation (for the Admin tool initial datasource setup and table metadata import) and the Linux server (for the actual online connection to Hive from the BI Server, and subsequent data retrieval). One thing I did notice was that whilst I could connect to the Hive database server and set up the connection in the Admin tool, I couldn’t view any Hive tables and had to manually create them myself in the RPD Physical Layer – this could just be a quirk on my workstation install though so I wouldn’t read too much into it. Checking connectivity in the Admin tool then showed it connecting properly and retrieving data from Hive on the Hadoop cluster. I didn’t test Kerberos-authentication connections but I’m assuming it’d work, as the previous version of OBIEE on Linux just failed at this point anyway. The docs are here if you’d like to look into any more details, or check the full set of setup steps.


For Cloudera Impala connections, you’re directed in the docs to download the Windows Cloudera Impala ODBC drivers as Oracle don’t even ship them on MOS, but again the Linux install of OBIEE comes with DataDirect Impala drivers that are already setup and ready for use (note that if you upgrade from to rather than do the fresh install that I did for testing purposes, you’ll need to edit the opmn.xml file to register these updated DataDirect drivers). Then it’s a case of setting the Windows System DSN up for the initial metadata import, like this:


then creating a corresponding entry in the Linux server’s odbc.ini file, like this:


Note that the docs do mention the issue with earlier versions of Impala where the Impala server is expecting LIMIT clauses when using ORDER BY in Impala SQL queries, and gives a couple of workarounds to fix the issue and stop Impala expecting this clause; for more recent (CDH5+) versions of Impala this requirement is in-fact lifted and you can connect-to and use Impala without needing to make the configuration change mentioned in the doc (or use the workaround I mentioned in this earlier blog post). Checking connectivity in the Admin tool then shows the connection is making its way through OK, from the Windows environment to the Linux server’s ODBC connection:


and creating a quick report shows data returned as expected, and considerably quicker than with Hive.


As I said, I’ve not really tested either of these two connections using Kerberos or any edge-case setups, but connectivity seems to be working and we’re now in a position where OBIEE11g can properly connect to both Hive, and Impala, on recent CDH installs and of course the Oracle Big Data Appliance. Good stuff, now what about Spark SQL or ElasticSearch..?

Categories: BI & Warehousing

Presentation Slides and Photos from the Rittman Mead BI Forum 2015, Brighton and Atlanta

Rittman Mead Consulting - Sat, 2015-05-16 13:41

It’s now the Saturday after the two Rittman Mead BI Forum 2015 events, last week in Atlanta, GA and the week before in Brighton, UK. Both events were a great success and I’d like to say thanks to the speakers, attendees, our friends at Oracle and my colleagues within Rittman Mead for making the two events so much fun. If you’re interested in taking a look at some photos from the two events, I’ve put together two Flickr photosets that you can access using the links below:


We’ve also uploaded the presentation slides from the two events (where we’ve been given permission to share them) to our website, and you can download them including the Delivering the Oracle Information Management and Big Data Reference Architecture masterclass using the links below:

Delivering the Oracle Information Management & Big Data Reference Architecture (Mark Rittman & Jordan Meyer, Rittman Mead)

Brighton, May 7th and 8th 2015

Atlanta, May 14th and 15th 2015

Congratulations also to Emiel van Bockel and Robin Moffatt who jointly-won Best Speaker award at the Brighton event, and to Andy Rocha and Pete Tamsin who won Best Speaker in Atlanta for their joint session. It’s time for a well-earned rest now and then back to work, and hopefully we’ll see some of you at KScope’15, Oracle Openworld 2015 or the UKOUG Tech and Apps 2015 conferences later in 2015.

Categories: BI & Warehousing

What’s New in OBIEE for Systems Administrators and Developers

Rittman Mead Consulting - Thu, 2015-05-14 11:42

After over two years since the last major release of OBIEE, Oracle released version in May 2015. You can find the installers here and documentation here. is termed the “terminal release” of the 11g line, and the 12c version is already out in closed-beta. We’d expect to see patchsets for 11g to continue for some time covering bugs and any security issues, but for new functionality in 11g I would hazard a guess that this is pretty much it as Oracle concentrate their development efforts on OBIEE 12c and BICS, particularly Visual Analyser.

For both the end user and backend administrator/developer, OBIEE has brought with it some nice little touches, none of which are going to revolutionise the OBIEE world but many of which are going to make life with the tool just that little bit smoother. In this article we take a look at what brings for the sysadmin & developer.

BI Server Query Instrumentation and Usage Tracking

There are some notable developments here:

  1. Millisecond precision when logging events from the BI Server
  2. Usage Tracking now includes the physical query hash, which is what is also visible in the database, enabling end-to-end tracing
  3. User sessions can be tracked and summarised more precisely because session ID is now included in Usage Tracking.
  4. The execution of initialisation blocks is now also recorded, in a new Usage Tracking table called S_NQ_INITBLOCK.
Millisecond precision in BI Server logs

OBIEE writes the nqquery.log with millisecond precision for both the timestamp of each entry, and also the summary timings for a query execution (at last!). It also calls out explicitly “Total time in BI Server” which is a welcome addition from a time profiling/performance analysis point of view:

[2016-07-31T02:11:48.231-04:00 [...] Sending query to database named X0 - Airlines Demo Dbs (ORCL) (id: <<221516>>), connection pool named Aggr Connection, logical request hash 544131ec, physical request hash 5018e5db: [[  
[2016-07-31T02:12:04.31-04:00 [...] Query Status: Successful Completion  
[2016-07-31T02:12:04.31-04:00 [...] Rows 2, bytes 32 retrieved from database query id: <<221516>>  
[2016-07-31T02:12:04.31-04:00 [...] Physical query response time 2.394 (seconds), id <<221516>>  
[2016-07-31T02:12:04.31-04:00 [...] Physical Query Summary Stats: Number of physical queries 1, Cumulative time 2.394, DB-connect time 0.002 (seconds)  
[2016-07-31T02:12:04.31-04:00 [...] Rows returned to Client 2  
[2016-07-31T02:12:04.31-04:00 [...] Logical Query Summary Stats: Elapsed time 16.564, Total time in BI Server 16.555, Response time 16.564, Compilation time 0.768 (seconds), Logical hash 544131ec

One thing to notice here is the subsecond timestamp precision seems to vary between 2 and 3 digits, which may or may not be a bug.

Being able to see this additional level of precision is really important. Previously OBIEE recorded information by the second, which was fine if you were looking at query executions taking dozens of seconds or minutes – but hopefully our aspirations for systems performance are actually closer to the realms of seconds or subsecond. At this scale the level of precision in the timings really matters. On the assumption that OBIEE was rounding values to the nearest whole number, you’d see “0 seconds” for a Logical SQL compile (for example) that was maybe 0.499 seconds. Per query this is not so significant, but if those queries run frequently then cumulatively that time stacks up and would be useful to be properly aware of and target with optimisation if needed.

Usage Tracking changes

Usage Tracking has five new columns for each logical query recorded in S_NQ_ACCT:

  • ECID

The presence of SESSION_ID is very useful, because it means that user behaviour can be more accurately analysed. For example, within a session, how many reports does a user run? What is the median duration of a session? Note that the session here is the session as seen by the BI Server, rather than Presentation Services.

ECID is also very useful for being able to link data in Usage Tracking back to more detailed entries in nqquery.log. Note that an ECID is multipart and concanated with RID and you won’t necessarily get a direct hit on the ECID you find in Usage Tracking with that in nqquery.log, but rather a substring of it. In this example here the root ECID is 11d1def534ea1be0:20f8da5c:14d4441f7e9:–8000–0000000000001891,0:1:103 and the varying component of the relationship (RID) id 1 and 3 respectively:

Usage Tracking:

select ecid,session_id,start_dt,start_hour_min ,saw_src_path from biee_biplatform.s_nq_acct



[2015-05-12T08:58:38.704-04:00] [...] [ecid: 11d1def534ea1be0:20f8da5c:14d4441f7e9:-8000-0000000000001891,0:1:103:3] [...]  
-------------------- SQL Request, logical request hash:  
SET VARIABLE QUERY_SRC_CD='Report',SAW_DASHBOARD='/shared/02. Visualizations/_portal/2.11 Table Designs',SAW_DASHBOARD_PG='Conditional Format',SAW_SRC_PATH='/shared/02. Visualizations/Configured Visuals/Conditional Formats/CF based on a hidden column',PREFERRED_CURRENCY='USD';SELECT^M  
   0 s_0,^M  

In the above example note how the absence of a timezone in the Usage Tracking data is an impedance to accurate interpretation of the results, compared to nqquery.log which has a fully qualified timezone offset.

Usage Tracking changes – Physical Hash ID

As well as additions to the logical query table, there are two new columns for each physical query logged in S_NQ_DB_ACCT:


The implications of this are important – there is now native support in OBIEE for tracing OBIEE workloads directly down to the database (as discussed for OBIEE < here), because the PHYSICAL_HASH_ID is what OBIEE sets as the ACTION field when it connects to the database and is available in Oracle through both AWR, V$ views, and DBMS_MONITOR. For example, in V$SESSION the ACTION field is set to the physical hash:

SQL> select username,program,action 
  from v$session where lower(program) like 'nqs%';

USERNAME PROGRAM                                          ACTION  
-------- ------------------------------------------------ ---------  
BISAMPLE (TNS V1-V3)         5065e891  
BISAMPLE (TNS V1-V3)         2b6148b2  
BISAMPLE (TNS V1-V3)         8802f14e  
BISAMPLE (TNS V1-V3)         206c8d54  
BISAMPLE (TNS V1-V3)         c1c121a7

The ACTION is also available in many EM screens such as this one:

Now with OBIEE the physical hash – which was previously only available in the nqquery.log file – is available in S_NQ_DB_ACCT which can in turn be joined to S_NQ_ACCT to find out the logical request related to the physical query seen on the database. Cool huh!



This can be extended even further to associate AWR workload reports with specific OBIEE requests:


One little grumble (no pleasing some people…) – it would have been nice if Usage Tracking also stored:

  • Timings at millisecond precision as well
  • The number of bytes (rather than just row count)
  • A proper TIMESTAMP WITH TIME ZONE (rather than the weird triplet of TS/DT/HOUR_MIN)
  • “Total time in BI Server”

Who knows, maybe in 12c?…

Footnote – START_TS in Usage Tracking in

As a note for others who may hit this issue, my testing has shown that Usage Tracking in appears to have introduced a bug with START_TS (on both S_NQ_ACCT and S_NQ_DB_ACCT), in that it stores only the date, not date + time as it did in previous versions. For example:

    WHERE  ROWNUM < 2 
    START_TS            START_DT            START_HOUR_MIN   
    ------------------- ------------------- -----  
    2015-03-19 15:32:23 2015-03-19 00:00:00 15:32
    WHERE  ROWNUM < 2 
    START_TS            START_DT            START_HOUR_MIN   
    ------------------- ------------------- -----  
    2015-01-27 00:00:00 2015-01-27 00:00:00 10:41
Initialisation Block information in Usage Tracking

A new table, S_NQ_INITBLOCK, has been added to BIPLATFORM and holds details of when an init block ran, for which user, and importantly, how long it took. From a performance analysis point of view this is really valuable data and it’s good to seeing it being added to the diagnostic data captured to database with Usage Tracking.

From a glance at the data it looks like there’s a bit of a bonus logging going on, with user sign in/sign out also recorded (“SIGNNING ON/SIGNED ON/SIGNED OFF”).


Note that there is no MBean for Init Block Usage Tracking, so regardless of how you configure the rest of Usage Tracking, you need to go to NQSConfig.ini to enable this one.

Presentation Services Cursor Cache

Oracle have added some additional Administration functionality for viewing and managing sessions and the cursor cache in Presentation Services. These let you track and trace more precisely user sessions.

From the Administration Page in OBIEE the new options are:

  1. Set dynamic log level per session from manage sessions

  2. Filter cursor cache based on specific user sessions

  3. Change sort order of cursor cache

  4. Show Presentation Services diagnostics per cursor

  5. Download cursor cache list as CSV

Some of these are somewhat low-level and will not be used day-to-day, but the general move towards a more open diagnostics interface with OBIEE is really positive and I hope we see more of it in 12c… :-)

Command Line Aggregate Advisor

Only for use by those with an Exalytics licence, the Summary Advisor was previously available in the Windows Administration Tool only but can now be run from the command line:

[oracle@demo setup]$ nqaggradvisor -h

    nQAggrAdvisor -d <dataSource> -u <userName> -o <outputFile> -c <tupleInQuotes>  
                  [-p <password>] [-F <factFilter>] [-z <maxSizeAggr>] [-g <gainThreshold>]  
                  [-l <minQueryTime>] [-t <timeoutMinutes>] [-s <startDate>]  
                  [-e <endDate>] [-C <on/off>] [-M <on/off>] [-K <on/off>]

    -d      : Data source name  
    -u      : User name  
    -o      : Output aggregate persistence script file name  
    -c      : Aggregate persistence target - tuple in quotes: Fully qualified Connection pool, fully qualified schema name, capacity in MB  
    -p      : Password  
    -F      : Fact filter file name  
    -z      : Max size of any single aggregate in MB  
    -g      : Summary advisor will run until performance improvement for new aggregates drops below this value, default = 1  
    -l      : The minimum amount of query time accumulated per LTS in seconds, before it is included for analysis, default = 0  
    -t      : Max run time in minutes - 0 for unlimited, default = 0  
    -s      : Statistics start date  
    -e      : Statistics end date  
    -C      : Prefer optimizer estimates - on/off, default = off  
    -M      : Only include measures used in queries - on/off, default = off  
    -K      : Use surrogate keys - on/off, default = on

    nQAggrAdvisor -d "AnalyticsWeb" -u "Administrator" -p "ADMIN" -o "C:\temp\aggr_advisor.out.txt"  
        -c "DW_Aggr"."Connection Pool","DW_Aggr".."AGGR",1000

    nQAggrAdvisor -d "AnalyticsWeb" -u "Administrator" -p "ADMIN" -o "C:\temp\aggr_advisor.out.txt" -F "C:\temp\fact_filter.txt" -g 10  
        -c "TimesTen_instance1"."Connection Pool","dbo",2000 -s "2011-05-02 08:00:00" -e "2011-05-07 18:30:00"  -C on -M on -K off


HTML5 images

In previous versions of OBIEE graph images were rendered in Flash by default, and PNG on mobile devices. You could force it to use PNG for all images but would loose the interactivity (tooltips etc). Now in OBIEE you can change the default from Flash to HTML5. This removes the need for a Flash plugin and is generally the way that a lot of visualisations are done on the web nowadays. To my eye there’s no difference in appearance:

To use HTML5 graphs by default, edit instanceconfig.xml and under <Charts> section add:


Note that html5 is case-sensitive. The config file should look something like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>  
<WebConfig xmlns="">  

If Presentation Services doesn’t come back up when you restart it after making this change then check the stdout logfile console~coreapplication_obips1~1.log as well as the standard sawlog.log file, both of which you’ll find in $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/. The reason to check the console log file is that Presentation Services will refuse to start if the configuration supplied is invalid, and you’ll see an error message stating this here.

NQS ODBC functions

One for the Neos amongst you, a quick call of NQSGetSQLProcedures (as seen in SampleApp dashboard 7.90 NQS ODBC Procedures) and comparison with shows the following new & changed NQS ODBC calls. If this means nothing to you then it probably doesn’t need to, but if you’re interested in exploiting OBIEE functionality from all angles, documented or not, then these might be of interest. It goes without saying, these are entirely undocumented and unsupported, completely liable to change or be removed at any time by Oracle.

  • Added
    • NQSGetUserDefinedFunctions
    • NQSPAFIntegration
    • NQSSearchPresentationObjects
    • NQS_GetAllCacheEntries
    • NQS_GetOverallCacheInfo
    • NQS_GetRepositories
    • NQS_LoadNewBaseRP
    • NQS_LoadNewRPVersion
    • NQS_LockSessionAgainstAutoRPSwitchOver
    • NQS_SetRPDReadOnlyMode
    • NQS_SwitchOverThisSessionToNewRP
    • SAPurgeCacheBySubjectArea
    • SAPurgeCacheEntryByIDVector
    • SAPurgeXSACache
    • SASeedXSACache
  • Modified
    • NQSGetQueryLogExcerpt (additional parameter)
    • SAPurgeInternalCache (additional enum)
  • Removed
    • NQSChangeSelfPassword
Web Services

Web Services are one of the best ways to integrate with OBIEE programatically. You don’t need to be building heavy java apps just to use them – you can create and send the necessary SOAP messages from python or even just send it from bash with curl.

There are 2.5 new WSDLs – two new ones (v9, v10) plus v8 which has changed. The new services are:

  • KPIAssessmentService
  • ScorecardAssessmentService
  • ScorecardMetadataService
  • UserPersonalizationService

You’ll find documentation for the Web Services in the Integrator’s Guide.

User Image Upload

Users can now upload their own images for use in Title views, conditional formats, etc. From an administration point of view this means you’ll want to be keeping an eye on /root/shared/custom/images/ in the Presentation Catalog, either on disk and/or through the OBIEE Catalog View, switch to Admin and enable “Show Hidden Items”:


This new setting in NQSConfig.ini will warn users when they’re breaching defined query limits, but it won’t abort the query.

Pointless hacks

If you’re a geek like me, part of the fun of a new tool is simply poking around and seeing what’s new – not necessarily what’s useful. There’s plenty of great new stuff in, but let’s take a look at the “under the hood”, just Because Geek.

It was John Minkjan who first blogged several years about the xsd configuration schema files, and it is from this that we can find all the things that Presentation Services might be able to do – not just what it definitely can do, and not just what Oracle have documented that it can do. I wrote about some of these options a while back, and there are a few new ones in


  • EnableCloudBIEEHome sets the home page of OBIEE to be as it would be on BI Cloud Service (BICS). This is completely pointless since all the interesting stuff (Load Data, Model, Manage) is non-existent, even if it does give us a clue which application deployments are going to be supplying them (bimodeler and biserviceadministration respectively)

  • GridViews/ShowDataModels outputs a bunch of debug data in Answers Table or Pivot Views:

  • VirusScannerConfiguration – When a user uploads a custom image, this command will be called with it. For example, this simple script writes to a file the time and name of the file passed to it:

    echo '---' >> /tmp/log.txt  
    date >> /tmp/log.txt  
    echo $1 >> /tmp/log.txt

    If I save this as /tmp/ and add it to instanceconfig.xml:


    When I upload an image I get a row written to my log file. That in itself isn’t useful, but it could be a handy hook maybe from an auditing point of view, or indeed, virus scanning:

    [oracle@demo tmp]$ cat /tmp/log.txt  
    Wed May 20 16:01:47 EDT 2015  
Categories: BI & Warehousing

Some Oracle Big Data Discovery Tips and Techniques

Rittman Mead Consulting - Tue, 2015-05-12 05:49

I’ve been using Oracle Big Data Discovery for a couple of months now, and one of the sessions I’m delivering at this week’s Rittman Mead BI Forum 2015 in Atlanta is on Big Data Discovery Examples from the Field, co-presented with Tim Vlamis from Vlamis Software. Tim is going to concentrate on a data analysis and visualization example using a financial/trading dataset,  and I’m going to look at at some of the trickier, or less obvious aspects to the BDD development process that we’ve come across putting together PoCs and demos for customers. I’ll start first with the data ingestion and transformation part of BDD.

There’s two basic ways to get data into BDD’s DGraph engine; you can either use the Data Processing CLI command-line utility to sample, ingest and enrich Hive table data into the DGraph engine, or you can use the web-based data uploader to ingest data from a spreadsheet, text file or similar. For example, to load a Hive table called “bdd_test_tweets” into the DGraph engine using the command-line, you’d enter the commands:

[oracle@bigdatalite Middleware]$ cd BDD1.0/dataprocessing/edp_cli
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t bdd_test_tweets

Big Data Discovery would then read the Hive table metastore to get the table and column names, datatypes and file location, then spin-up a Spark job to sample, enrich and then load the data into the DGraph engine. If the Hive table has fewer than 1m rows the whole dataset gets loaded in, or the dataset is sampled if the number of Hive rows is greater than 1m. The diagram below shows the basic load, profile and enrich ingestion process.


There’s a couple of things to bear in-mind when you’re loading data into BDD in this way:

  • You can only load Hive tables, not Hive views, as the Spark loading process only works with full table definitions in the Hive metastore
  • If your Hive table uses a SerDe other than the ones that ship with Base CDH5, you’ll need to upload the SerDe into BDD’s EDP JAR file area in HDFS and update some JAR reference files before the import will work, as detailed in Chapter 3 of the Big Data Discovery Data Processing Guide doc
  • If you’ve installed BDD on a laptop or a smaller-than-usual Hadoop setup, you’ll need to make sure the SPARK_EXECUTOR_MEMORY value you set in bdd.conf file when you installed the product can be handled by the Hadoop cluster – by default SPARK_EXECUTOR_MEMORY is set to 48G for the install, but on my single laptop install I set it to 2G (after having first installed BDD, the data ingestion process didn’t work, and then I had to reinstall it with SPARK_EXECUTOR_MEMORY = 2G as the new setting)
  • If you installed an early copy of BDD you might also need to change the OLT_HOME value in the /localdisk/Oracle/Middleware/user_projects/domains/bdd_domain/bin/ file so that OLT_HOME=”/opt/bdd/olt” instead reads OLT_HOME=”/opt/bdd/edp/olt” – recent updates to the install files and installer correct this problem, but if it’s set wrong then the noun extraction part of the ingestion process won’t work either from the CLI, or from the BDD Studio Transformation screen
  • There’s also no current way to refresh or reload a BDD DGraph dataset, apart from deleting it from BDD and then re-importing it. Hopefully this, and the lack of Kerberos support, will be addressed in the next release

Another thing you might want to consider when providing datasets for use with BDD is whether you leave quotes around the column values, and whether you pre-strip out HTML tags from any text. Take for example the text file below, stored in CSV-delimited format:


The file contains three comma-separated fields per line; one with the IP address of the requesting user, the others with the page title and page content snippet, all three fields having quotes around their values due to the presence of commas in the content text. Loading this data into Hive using the Hue web interface gives us a table with quotes around all of the fields, as Hue (in this release) doesn’t strip quotes from CSV fields.


When I ingest this table into BDD using the Data Processing CLI, I’ve got just these three columns still with the quotes around the fields. I can easily remove the quotes by going into the Transformation screen and use Groovy transforms to strip the first and last characters from the fields, but this is more work for the user and I don’t benefit from the automatic enrichment that BDD can do when performing the initial ingest.


If, however, I replace the comma separator with a pipe symbol, and remove the quotes, like this:


and therefore use Hue’s ability to use pipe and other separators instead of commas (and quotes), my Hive table looks like this:


Now, when we ingest this table into BDD, we get six more derived attributes as the enrichment part of the ingestion process recognises the fields as containing IP addresses, text and so on. Presumably in the future BDD will have an option to ignore quotes around field values, but for now I tend to strip-out the quotes and uses pipes instead for my BDD ingestion files.


Similarly, with hive tables that contain fields with HTML content you can just load those fields into BDD as-is, and BDD will generally extract nouns and keywords and created derived fields for those. And whilst you can run Groovy transformations to strip-out the HTML tags (mostly), you’re then stuck with these derived columns that include HTML tag names – img, h2 and so on – in the keywords list. What I tend to do then is re-export the dataset with the content field stripped of the HTML tags, then re-ingest that table so I get a new keyword field with the HTML tags removed. What would be simpler though would be to strip-out the HTML tags before you load up the Hive table, so you didn’t have to do this round-trip to get rid of the HTML tag names from the noun keyword lists that are automatically generated during the ingest enrichment process.

Once you’ve got datasets loaded into BDD, something I didn’t quite get the value of when I first used BDD studio was the “scratchpad” feature. To take an example, in the masterclass session I bring in a table of tweets referencing Rittman Mead, and one of the attributes in the resulting BDD dataset is for the first hashtag mentioned in the tweet. I can select this attribute and click on the “Add to Scratchpad” link to add it into the BDD Studio scratchpad, like this:


The scratchpad then displays above the list of attributes for that dataset, and by default it shows a horizontal bar chart listing the number of times each hashtag in the dataset is referenced.


I could then, should I wish to, use the Refine By button to the left of the chart to filter-down (or “refine” in BDD and Endeca-speak) the chart to include just those tweets by a subset of Twitter users – in this case myself, Robin, Michael, Jerome and Edel.


I can also add other attributes to the scratchpad as well – for example, the Twitter handle for the person tweeting – so that we can turn the bar chart into a stacked bar chart with the Twitter handles used to show the breakdown of use of that hashtag by each of our staff.


You can also use these Scratchpad visualisations as the start of your main BDD Studio “Discover” dashboards, by pressing the Add to Discover page at the bottom right-hand corner of each visualization. In this way rather than creating your Discover dashboards from scratch each time, you can seed them with some starter graphs and data visualizations right from the dataset attribute views.


The last bit I’m going to talk about in the BI Forum session session are “dataset views”; by default, each dataset you create within a BDD project has just its own attributes within it, and if you use one of them to create a visualization in the Discovery section, you’ll not be able to use any of the attributes from your other datasets (although the faceted search feature above every BDD Studio page searches all datasets in your project and in the BDD Catalog, just like the Endeca Information Discovery “value searches” that I talked about in this older article. To use attributes from more than one dataset in a BDD Studio visualisation component you need to join them, similar to how you’d join tables in the OBIEE RPD.

To take the Tweets, Page Views and Page Content datasets I use in the BI Forum masterclass, consider a situation where I’d like to list out all o the tweets that reference our website, along with details of the page title, page category and other attributes that I can get from a second dataset that I pull from the website itself. To link these two datasets together I join them in BDD Studio using their common URL attribute (in reality I had to massage the datasets so that both URLs featured a trailing forward-slash (“/“) to make them join properly, but that’s another story)


If I then go to the Data Views tab within the Project Settings BDD Studio screen, I can see that two data views have been setup for this join; one (“rm_linked_tweets – linked”) leads on the RM Linked Tweets dataset (the tweets) and returns the 1547 tweets in that first dataset joined to pages in the Site Content dataset, the “site_content – linked” dataset starts from the 2229 records in the Site Content dataset and joins those records to the RM Linked Tweets dataset; you can then choose which one you want to use (or “drive off”) when you add components to the Discover dashboard part.


Where it gets interesting is when you add third, fourth datasets to the join. Depending on the way you join-in the third table affects the number of rows returned by the join; if join the web server logs dataset (“access_per_post_cat_authors”) to the Site Contents dataset the resulting three-way join view returns the 2229 rows driven by the entries in the Site Contents dataset, whereas if I join the tweets dataset to the web server logs dataset directly, so the tweets dataset joins first to the site contents dataset, and then separately to the web server logs dataset, like this:


then the resulting data view joining all three datasets return a row count equal to the rows in the tweets dataset driving it all.


The best way to work this all out in your head is to do what I did, and create a series of datasets with distinct row counts and join characteristics and then just test creating joins and viewing the resulting row count using the Preview button below the list of data views. To make things even more interesting you can choose, in the Discover page properties section, whether to left-outer join, equi-join or full-outer join a “primary” dataset used for the page with any it’s joined with, in our instance determining whether the full set of tweets are filtered by the list of pages they refer to (removing tweets that reference non-existant RM web pages in this example), or whether all tweets are returned regardless.


It’s actually pretty powerful stuff and you should avoid the temptation to pre-create all your joined datasets in Hive beforehand rather than use BDD Studio joins, as once you get your head around the concept it’s a flexible and straightforward way to join your datasets up in whatever way makes sense for your analysis – leading off of website hits for one type of analysis, and then pages referenced in tweets in the other, allowing yo to easily swap what would be the fact, and dimension tables in a regular relational database report.

That’s if for now though; it’s the day before the Atlanta Rittman Mead BI Forum and I need to to get myself to the venue and get set up for tomorrow’s masterclass with Jordan Meyer. To those of you that are coming along to Atlanta I look forward to seeing you, otherwise normal service will be resumed next week.

Categories: BI & Warehousing

Final Agenda for the Rittman Mead BI Forum 2015 Atlanta – Running this Week!

Rittman Mead Consulting - Mon, 2015-05-11 02:17

It’s the Monday before the Atlanta Rittman Mead BI Forum 2015, and delegates will start to arrive in Atlanta tomorrow and during the rest of the week. The first event in Brighton was excellent, and we’re hoping for something even better in Atlanta this week!

Safe travels for everyone coming to Atlanta, the official Twitter hashtag for the event is #biforum2015, and in the meantime here’s the final agenda for Atlanta’s BI Forum 2015:

Rittman Mead BI Forum 2015
Renaissance Atlanta Midtown Hotel, Atlanta, GA, USA 

Wednesday 13th May 2015

  • 9.00am – 9.30am Registration for Masterclass attendees
  • 9.30am – 12.30pm Masterclass Part 1
  • 12.30pm – 1.30pm Lunch
  • 1.30pm – 4.30pm Masterclass Part 2
  • 5.30pm – 6.30pm Drinks Reception in Renaissance Midtown Hotel Atlanta
  • 6.30pm – 7.30pm Oracle Keynote – Chris Lynskey
  • 7.30pm – 9.30pm Dinner at Renaissance Midtown Hotel Atlanta

Thursday 14th May 2015

  • 8.45am – 9.00am Welcome and Opening Comments
  • 9.00am – 9.45am Chris Lynskey (Oracle Corporation) – Looking Ahead to Oracle BI 12c and Visual Analyzer
  • 9.45am – 10.30am Robin Moffatt (Rittman Mead) – Smarter Regression Testing for OBIEE
  • 10.30am – 11.00am Coffee
  • 11.00am – 11.45pm Chris Lynskey (Oracle Corporation) – Big Data Discovery
  • 11.45am – 12.30pm Mark Rittman (Rittman Mead) and Tim Vlamis (Vlamis Software Solutions) – Big Data Discovery – Examples from the Field
  • 12.30pm – 1.30pm Lunch
  • 1.30pm – 2.30pm Day 1 Debate – “Self-Service BI – The Answer to Users’ Prayers, or the Path to Madness?”
  • 2.30pm – 3.15pm Tim German / Cameron Lackpour – Hybrid Mode – An Essbase Revolution
  • 3.15pm – 3.45pm Coffee
  • 3.45pm – 4.30pm Kevin McGinley (Red Pill Analytics) – Agile BI Applications: A Case Study
  • 6.00pm – 6.45pm Guest Speaker/Keynote – John Foreman – How Mailchimp used qualitative and quantitative analysis to build their next product
  • 7.00pm – 7.45pm Depart for dinner at restaurant
  • 8.00pm – 10.00pm Dinner at external venue

Friday 15th May 2015

  • 09.00am – 09.45am Stewart Bryson (Red Pill Analytics) – Supercharge BI Delivery with Continuous Integration
  • 09.45am – 10.30am Gabby Rubin (Oracle Corporation) – Solid Standing for Analytics in the Cloud
  • 10.30am – 11.15am Hasso Schaap (Qualogy) – Developing strategic analytics applications on OBICS PaaS
  • 11.15am – 11.30am Coffee
  • 11.30am – 12.15pm Andy Rocha and Pete Tamisin (Rittman Mead) – OBIEE Can Help You Achieve Your GOOOOOOOOOALS!
  • 12.15pm – 1.00pm Christian Screen (Sierra-Cedar) – 10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss
  • 1.00pm – 1.30pm Short Lunch
  • 1.30pm – 2.30pm Data Visualization Bake-off
  • 2.30pm – 3.15pm Sumit Sarkar (Progress Software) – NoSQL and OBIEE
  • 3.15pm – 3.45pm Closing Remarks, and Best Speaker Award

If you’re interested in coming along to the Rittman Mead BI Forum 2015 in Atlanta, GA, there are still spaces available with details of the event here, and the registration form here – otherwise we’ll look forward to seeing you all at the Renaissance Atlanta Midtown Hotel later this week.

Categories: BI & Warehousing

New OlapUnderground Utility - The Essbase Outline Viewer

Tim Tow - Fri, 2015-05-08 07:54
Have you ever wanted your users to see an Essbase database outline structure, but had trouble giving them access to Essbase Administrative Services ("EAS") for fear that they may have access to change things they shouldn't change?  Maybe you didn't want to give them EAS and have to deal with the version of Java installed on their machine?  We have the solution to your problems.  After a bit of fits and starts, we have finally released the OlapUnderground Essbase Outline Viewer!

The OlapUnderground Essbase Outline Viewer is a read-only user interface built specifically to view the contents of an Essbase outline.  It is built in Microsoft .NET technology and should run on any Windows-based computer without any special downloads.  It also uses the MaxL outline xml format and, as your Hyperion administrators can create and distribute the outline xml file, end users do not need an Essbase login to view the outline.

Here is a screenshot of the Outline Viewer:

Click image to see it full-size

You can download the Essbase Outline Viewer from our website at  Let us know what you think!

Categories: BI & Warehousing

Just Under a Week to go Until the Atlanta BI Forum 2015 – Places Still Available!

Rittman Mead Consulting - Thu, 2015-05-07 09:38

The Rittman Mead Brighton BI Forum 2015 is now underway, with presentations from Oracle, Rittman Mead, partners and customers on a range of topics around Oracle BI, DW and Big Data. So far this week we’ve had a one-day masterclass from myself and Jordan Meyer on Delivering the Oracle Information Management & Big Data Reference Architecture, sessions from Oracle on OBIEE12c, the new SampleApp for OBIEE, Big Data Discovery, BI Cloud Service and Visual Analyzer. We’ve also had sessions from the likes of Emiel van Bockel, Steve Devine, Christian Screen and others on Exalytics, data visualization, Oracle BI Apps and other topics – and a very entertaining debate on self-service BI.


… and we’re doing it all again in Atlanta, GA next week! If you’re interested in coming along to the Rittman Mead BI Forum 2015 in Atlanta, GA, there are still spaces available with details of the event here, and the registration form here. We’re running BI Forum 2015 in the Renaissance Hotel Midtown Atlanta, the masterclass with myself and Jordan Meyer runs on the Wednesday, with the event itself kicking-off with a reception, meal and keynote from Oracle on Wednesday evening, followed by the main event itself starting Thursday morning. Hopefully we’ll see some of you there…!


Categories: BI & Warehousing

One Day to the Brighton Rittman Mead BI Forum 2015 – Here’s the Agenda!

Rittman Mead Consulting - Tue, 2015-05-05 15:33

It’s the night before the Brighton Rittman Mead BI Forum 2015, and some delegates are already here ready for the masterclass tomorrow. Everyone else will either be arriving later in the day for the drinks reception, Oracle Keynote and dinner, or getting here early Thursday morning ready for the event proper. Safe travels for everyone coming down to Brighton, the official Twitter hashtag for the event is #biforum2015, and in the meantime here’s the final agenda for this week’s event:

Rittman Mead BI Forum 2015
Hotel Seattle, Brighton, UK 

Wednesday 6th May 2015

  • 10.00 – 10.00 Registration for Masterclass attendees
  • 10.30 – 12.30 Masterclass Part 1
  • 13.00 – 13.30 Lunch
  • 13.30 – 16.30 Masterclass Part 2
  • 18.00 – 19.00 Drinks Reception in Hotel Seattle Bar
  • 19.00 – 20.00 Oracle Keynote – Nick Tuson & Philippe Lions
  • 20.00 – 22.00 Dinner at Hotel Seattle

Thursday 7th May 2015

  • 08.45 – 09.00 Welcome and Opening Comments
  • 09.00 – 09.45 Steve Devine (Independent) – The Art and Science of Creating Effective Data Visualisations
  • 09.45 – 10.30 Chris Royles (Oracle Corporation) – Big Data Discovery
  • 10.30 – 11.00 Coffee
  • 11.00 – 11.45 Christian Screen (Sierra-Cedar) – 10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss
  • 11.45 – 12.30 Philippe Lions and Nick Tuson (Oracle Corporation) Looking Ahead to Oracle BI 12c and Visual Analyzer
  • 12.30 – 13.30 Lunch
  • 13.30 – 14.30 Day 1 Debate – “Self-Service BI – The Answer to Users’ Prayers, or the Path to Madness?”
  • 14.30 – 15.15 Emiel van Bockel (CB) Watch and see 12c on Exalytics
  • 15.15 – 15.45 Coffee
  • 15.45 – 16.30 Philippe Lions (Oracle Corporation) – Solid Standing for Analytics in the Cloud
  • 16.30 – 17.15 Manuel Martin Marquez (C.E.R.N.) – Governed Information Discovery: Data-driven decisions for more efficient operations at CERN
  • 18.00 – 18.45 Guest Speaker/Keynote – Reiner Zimmermann (Oracle Corporation) – Hadoop or not Hadoop …. this is the question
  • 19.00 – 20.00 Depart for dinner at restaurant
  • 20.00 – 22.00 Dinner at external venue

Friday 8th May 2015

  • 09.00 – 09.45 Daniel Adams (Rittman Mead) User Experience First: Guided information and attractive dashboard design
  • 09.45 – 10.30 André Lopes (Liberty Global) A Journey into Big Data and Analytics
  • 10.30 – 11.00 Coffee 
  • 11.00 – 11.45 Antony Heljula (Peak Indicators) – Predictive BI – Using the Past to Predict the Future
  • 11.45 – 12.30 Robin Moffatt (Rittman Mead) Data Discovery and Systems Diagnostics with the ELK stack
  • 12.30 – 13.00 Short Lunch
  • 13.00 – 14.00 Data Visualization Bake-off
  • 14.00 – 14.45 Gerd Aiglstorfer (G.A. itbs GmbH) Driving OBIEE Join Semantics on Multi Star Queries as User
  • 14.45 – 15.00 Closing Remarks, and Best Speaker Award

See you all at the Hotel Seattle, Brighton, tomorrow!

Categories: BI & Warehousing

So What’s the Real Point of ODI12c for Big Data Generating Pig and Spark Mappings?

Rittman Mead Consulting - Wed, 2015-04-29 00:30

Oracle ODI12c for Big Data came out the other week, and my colleague Jérôme Françoisse put together an introductory post on the new features shortly after, covering ODI’s new ability to generate Pig and Spark transformations as well as the traditional Hive ones. How this works is that you can now select Apache Pig, or Apache Spark (through pySpark, the Spark API through Python) as the implementation language for an ODI mapping, and ODI will generate one of those languages instead of HiveQL commands to run the mapping.


How this works is that ODI12c adds a bunch of new component-style KMs to the standard 12c ones, providing filter, aggregate, file load and other features that generate pySpark and Pig code rather than the usual HiveQL statement parts. Component KMs have also been added for Hive as well, making it possible now to include non-Hive datastores in a mapping and join them all together, something it was hard to do in earlier versions of ODI12c where the Hive IKM expected to do the table data extraction as well.

But when you first look at this you may well be tempted to think “…so what?”, in that Pig compiles down to MapReduce in the end, just like Hive does, and you probably won’t get the benefits of running Spark for just a single batch mapping doing largely set-based transformations. To my mind where this new feature gets interesting is its ability to let you take existing Pig and Spark scripts, which process data in a different, dataflow-type way compared to Hive’s set-based transformations and which also potentially also use Pig and Spark-specific function libraries, and convert them to managed graphical mappings that you can orchestrate and run as part of a wider ODI integration process.

Pig, for example, has the LinkedIn-originated DataFu UDF library that makes it easy to sessionize and further transform log data, and the Piggybank community library that extends Pig’s loading and saving capabilities to additional storage formats, and provides additional basic UDFs for timestamp conversion, log parsing and so forth. We’ve used these libraries in the past to process log files from our blog’s webserver and create classification models to help predict whether a visitor will return, with the Pig script below using the DataFu and Piggybank libraries to perform these tasks easily in Pig.

register /opt/cloudera/parcels/CDH/lib/pig/datafu.jar;
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar;

DEFINE Sessionize datafu.pig.sessions.Sessionize('60m');
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.9','0.95');
DEFINE VAR datafu.pig.VAR();
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

-- Import and clean logs
raw_logs = LOAD '/user/flume/rm_logs/apache_access_combined' USING TextLoader AS (line:chararray);

-- Extract individual fields
logs_base = FOREACH raw_logs
(REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')) AS
(remoteAddr: chararray, remoteLogName: chararray, user: chararray, time: chararray, request: chararray, status: chararray, bytes_string: chararray, referrer:chararray, browser: chararray);

-- Remove Bots and convert timestamp
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp|Bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*');

-- Remove uselesss columns and convert timestamp
clean_logs = FOREACH logs_base_nobots GENERATE CustomFormatToISO(time,'dd/MMM/yyyy:HH:mm:ss Z') as time, remoteAddr, request, status, bytes_string, referrer, browser;

-- Sessionize the data

clean_logs_sessionized = FOREACH (GROUP clean_logs BY remoteAddr) {
ordered = ORDER clean_logs BY time;
GENERATE FLATTEN(Sessionize(ordered))
AS (time, remoteAddr, request, status, bytes_string, referrer, browser, sessionId);

-- The following steps will generate a tsv file in your home directory to download and work with in R
store clean_logs_sessionized into '/user/jmeyer/clean_logs' using PigStorage('\t','-schema');

If you know Pig (or read my previous articles on this theme), you’ll know that pig has the concept of an “alias”, a dataset you define using filters, aggregations, projections and other operations against other aliases, with a typical pig script starting with a large data extract and then progressively whittling it down to just the subset of data, and derived data, you’re interested in. When it comes to script execution, Pig only materializes these aliases when you tell it to store the results in permanent storage (file, Hive table etc) with the intermediate steps just being instructions on how to progressively arrive at the final result. Spark works in a similar way with its RDDs, transformations and operations which either create a new dataset based off of an existing one, or materialise the results in permanent storage when you run an “action”. So let’s see if ODI12c for Big Data can create a similar dataflow, based as much as possible on the script I’ve used above.

… and in-fact it can. The screenshot below shows the logical mapping to implement this same Pig dataflow, with the data coming into the mapping as a Hive table, an expression operator creating the equivalent of a Pig alias based off of a filtered, transformed version of the original source data using the Piggybank CustomFormatToISO UDF, and then runs the results of that through an ODI table function that in the background transforms the data using Pig’s GENERATE FLATTEN command and a call to the DataFu Sessionize UDF.


And this is the physical mapping to go with the logical mapping. Note that all of the Pig transformations are contained within a separate execution unit, that contains operators for the expression to transform and filter the initial dataset, and another for the table function.


The table function operator runs the input fields through an arbitrary Pig Latin script, in this case defining another alias to match the table function operator name and using the DataFu Sessionize UDF within a FOREACH to first sort, and then GENERATE FLATTEN the same columns but with a session ID for user sessions with the same IP address and within 60 seconds of each other.


If you’re interested in the detail of how this works and other usages of the new ODI12c for Big Data KMs, then come along to the masterclass I’m running with Jordan Meyer at the Brighton and Atlanta Rittman Mead BI Forums where I’ll go into the full details as part of a live end-to-end demo. Looking at the Pig Latin that comes out of it though, you can see it more or less matches the flow of the hand-written script and implements all of the key steps.


Finally, checking the output of the mapping I can see that the log entries have been sessionized and they’re ready to pass on to the next part of the classification model.


So that to my mind is where the value is in ODI generating Pig and Spark mappings. It’s not so much taking an existing Hive set-based mapping and just running it using a different language, it’s more about being able to implement graphically the sorts of data flows you can create with Pig and Spark, and being able to get access to the rich UDF and data access libraries that these two languages benefit from. As I said, come along to the masterclass Jordan and I are running, and I’ll go into much more detail and show how the mapping is set up, along with other mappings to create an end-to-end Hadoop data integration process.

Categories: BI & Warehousing

Setting up Security and Access Control on a Big Data Appliance

Rittman Mead Consulting - Tue, 2015-04-28 14:05

Like all Oracle Engineered Systems, Oracle’s field servicing and Advanced Customer Services (ACS) teams go on-site once a BDA has been sold to a customer and do the racking, installation and initial setup. They will usually ask the customer a set of questions such as “do you want to enable Kerberos authentication”, “what’s the range of IP addresses you want to use for each of the network interfaces”, “what password do you want to use” and so on. It’s usually enough to get a customer going, but in-practice we’ve found most customers need a number of other things set-up and configured before they use the BDA in development and production; for example:

  • Integrating Cloudera Manager, Hue and other tools with the corporate LDAP directory
  • Setting up HDFS and SSH access for the development and production support team, so they can log in with their usual corporate credentials
  • Come up with a directory layout and file placement strategy for loading data into the BDA, and then moving it around as data gets processed
  • Configuring some sort of access control to the Hive tables (and sometimes HDFS directories) that users use to get access to the Hadoop data
  • Devising a backup and recovery strategy, and thinking about DR (disaster recovery)
  • Linking the BDA to other tools and products in the Oracle Big Data and Engineered Systems family; Exalytics, for example, or setting up ODI and OBIEE to access data in the BDA

The first task we’re usually asked to do is integrate Cloudera Manager, the web-based admin console for the Hadoop parts of the BDA, with the corporate LDAP server. By doing this we can enable users to log into Cloudera Manager with their usual corporate login (and restrict access to just certain LDAP groups, and further segregate users into admin ones and stop/start/restart services-type ones), and similarly allow users to log into Hue using their regular LDAP credentials. In my experience Cloudera Manager is easier to set up than Hue, but let’s look at a high-level at what’s involved.

LDAP Integration for Hue, Cloudera Manager, Hive etc

In our Rittman Mead development lab, we have OpenLDAP running on a dedicated appliance VM and a number of our team setup as LDAP users. We’ve defined four LDAP groups, two for Cloudera Manager and two for Hue, with varying degrees of access for each product.


Setting up Cloudera Manager is pretty straightforward, using the Administration > Settings menu in the Cloudera Manager web UI (note this option is only available for the paid, Cloudera Enterprise version, not the free Cloudera Express version). Hue security integration is configured through the Hue service menu, and again you can configure the LDAP search credentials, any LDAPS or certificate setup, and then within Hue itself you can define groups to determine what Hue features each set of users can use.


Where Hue is a bit more fiddly (last time I looked) is in controlling access to the tool itself; Cloudera Manager lets you explicitly define which LDAP groups can access the tool with other users then locked-out, but Hue either allows all authenticated LDAP users to login to the tool or makes you manually import each authorised user to grant them access (you can then either have Hue check-back to the LDAP server for their password each login, or make a copy of the password and store it within Hue for later use, potentially getting out-of-sync with their LDAP directory password version). In practice what I do is use the manual authorisation method but then have Hue link back to the LDAP server to check the users’ password, and then map their LDAP groups into Hue groups for further role-based access control. There’s a similar process for Hive and Impala too, where you can configure the services to authenticate against LDAP, and also have Hive use user impersonation so their LDAP username is passed-through the ODBC or JDBC connection and queries run as that particular user.

Configuring SSH and HDFS Access and Setting-up Kerberos Authentication

Most developers working with Hadoop and the BDA will either SSH (Secure Shell) into the cluster and work directly on one of the nodes, or connect into their workstation which has been configured as a Hadoop client for the BDA. If they SSH in directly to the cluster they’ll need Linux user accounts there, and if they go in via their workstation the Hadoop client installed there will grant them access as the user they’re logged-into the workstation as. On the BDA you can either set-up user accounts on each BDA node separately, or more likely configure user authentication to connect to the corporate LDAP and check credentials there.


One thing you should definitely do, either when your BDA is initially setup by Oracle or later on post-install, is configure your Hadoop cluster as a secure cluster using Kerberos authentication. Hadoop normally trusts that each user accessing Hadoop services via the Hadoop Filesystem API (FS API) is who they say they are, but using the example above I could easily setup an “oracle” user on my workstation and then access all Hadoop services on the main cluster without the Hadoop FS API actually checking that I am who I say I am – in other words the Hadoop FS API shell doesn’t check your password, it merely runs a “whoami” Linux command to determine my username and grants me access as them.


The way to address this is to configure the cluster for Kerberos authentication, so that users have to have a valid Kerberos ticket before accessing any secured services (Hive, HDFS etc) on the cluster. I covered this as part of an article on configuring OBIEE11g to connect to Kerberos-secured Hadoop clusters last Christmas and you can either do it as part of the BDA install, or later on using a wizard in more recent versions of CDH5, the Cloudera Hadoop distribution that the BDA uses.


The complication with Kerberos authentication is that your organization needs to have a Kerberos KDC (Key Distribution Center) server setup already, which will then link to your corporate LDAP or Active Directory service to check user credentials when they request a Kerberos ticket. The BDA installation routine gives you the option of creating a KDC as part of the BDA setup, but that’s only really useful for securing inter-cluster connections between services as it won’t be checking back to your corporate directory. Ideally you’d set up a connection to an existing, well-tested and well-understood Kerberos KDC server and secure things that way – but beware that not all Oracle and other tools that run on the BDA are setup for Kerberos authentication – OBIEE and ODI are, for example, but the current 1.0 version of Big Data Discovery doesn’t yet support Kerberos-secured clusters.

Coming-up with the HDFS Directory Layout

It’s tempting with Hadoop to just have a free-for-all with the Hadoop HDFS filesystem setup, maybe restricting users to their own home directory but otherwise letting them put files anywhere. HDFS file data for Hive tables typically goes in Hive’s own filesystem area /user/hive/warehouse, but users can of course create Hive tables over external data files stored in their own part of the filesystem.

What we tend to do (inspired by Gwen Shapira’a “Scaling ETL with Hadoop” presentation) is create separate areas for incoming data, ETL processing data and process output data, with developers then told to put shared datasets in these directories rather than their own. I generally create additional Linux users for each of these directories so that these can own the HDFS files and directories rather than individual users, and then I can control access to these directories using HDFS’s POSIX permissions. A typical user setup script might look like this:

[oracle@bigdatalite ~]$ cat 
sudo groupadd bigdatarm
sudo groupadd rm_website_analysis_grp
useradd mrittman -g bigdatarm
useradd ryeardley -g bigdatarm
useradd mpatel -g bigdatarm
useradd bsteingrimsson -g bigdatarm
useradd spoitnis -g bigdatarm
useradd rm_website_analysis -g rm_website_analysis_grp
echo mrittman:welcome1 | chpasswd
echo ryeardley:welcome1 | chpasswd
echo mpatel:welcome1 | chpasswd
echo bsteingrimsson:welcome1 | chpasswd
echo spoitnis:welcome1 | chpasswd
echo rm_website_analysis:welcome1 | chpasswd

whilst a script to setup the directories for these users, and the application user, might look like this:

[oracle@bigdatalite ~]$ cat 
set echo on
#setup individual user HDFS directories, and scratchpad areas
sudo -u hdfs hadoop fs -mkdir /user/mrittman
sudo -u hdfs hadoop fs -mkdir /user/mrittman/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/ryeardley
sudo -u hdfs hadoop fs -mkdir /user/ryeardley/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/mpatel
sudo -u hdfs hadoop fs -mkdir /user/mpatel/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/bsteingrimsson
sudo -u hdfs hadoop fs -mkdir /user/bsteingrimsson/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/spoitnis
sudo -u hdfs hadoop fs -mkdir /user/spoitnis/scratchpad
#setup etl directories
sudo -u hdfs hadoop fs -mkdir -p /data/rm_website_analysis/logfiles/incoming
sudo -u hdfs hadoop fs -mkdir /data/rm_website_analysis/logfiles/archive/
sudo -u hdfs hadoop fs -mkdir -p /data/rm_website_analysis/tweets/incoming
sudo -u hdfs hadoop fs -mkdir /data/rm_website_analysis/tweets/archive
#change ownership of user directories
sudo -u hdfs hadoop fs -chown -R mrittman /user/mrittman
sudo -u hdfs hadoop fs -chown -R ryeardley /user/ryeardley
sudo -u hdfs hadoop fs -chown -R mpatel /user/mpatel
sudo -u hdfs hadoop fs -chown -R bsteingrimsson /user/bsteingrimsson
sudo -u hdfs hadoop fs -chown -R spoitnis /user/spoitnis
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/mrittman
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/ryeardley
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/mpatel
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/bsteingrimsson
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/spoitnis
#change ownership of shared directories
sudo -u hdfs hadoop fs -chown -R rm_website_analysis /data/rm_website_analysis
sudo -u hdfs hadoop fs -chgrp -R rm_website_analysis_grp /data/rm_website_analysis

Giving you a directory structure like this (with the directories for Hive, Impala, HBase etc removed for clarity)


In terms of Hive and Impala data, there’s varying opinions on whether to create tables as EXTERNAL and store the data (including sub-directories for table partitions) in the /data/ HDFS area or let Hive store them in its own /user/hive/warehouse area – I tend to let Hive store them within its area as I use Apache Sentry to then control access to those Tables’s data.

Setting up Access Control for HDFS, Hive and Impala Data

At its simplest level, access control can be setup on the HDFS directory structure by using HDFS’s POSIX security model:

  • Each HDFS file or directory has an owner, and a group
  • You can add individual Linux users to a group, but an HDFS object can only have one group owning it

What this means in-practice though is you have to jump through quite a few hoops to set up finer-grained access control to these HDFS objects. What we tend to do is set RW access to the /data/ directory and subdirectories to the application user account (rm_website_analysis in this case), and RO access to that user’s associated group (rm_website_analysis_grp). If users then want access to that application’s data we add them to the relevant application group, and a user can belong to more than one group, making it possible to grant access to more than one application data area

[oracle@bigdatalite ~]$ cat ./ 
sudo -u hdfs hadoop fs -chmod -R 750 /data/rm_website_analysis
usermod -G rm_website_analysis_grp mrittman

making it possible for the main application owner to write data to the directory, but group members only have read access. What you can also now do with more recent versions of Hadoop (CDH5.3 onwards, for example) is define access control lists to go with individual HDFS objects, but this feature isn’t enabled by default as it consumes more namenode memory than the traditional POSIX approach. What I prefer to do though is control access by restricting users to only accessing Hive and Impala tables, and using Apache Sentry, or Oracle Big Data SQL, to provide role-based access control over them.

Apache Sentry is a project originally started by Cloudera and then adopted by the Apache Foundation as an incubating project. It aims to provide four main authorisation features over Hive, Impala (and more recently, the underlying HDFS directories and datafiles):

  • Secure authorisation, with LDAP integration and Kerberos prerequisites for Sentry enablement
  • Fine-grained authorisation down to the column-level, with this feature provided by granting access to views containing subsets of columns at this point
  • Role-based authorisation, with different Sentry roles having different permissions on individual Hive and Impala tables
  • Multi-tenant administration, with a central point of administration for Sentry permissions

From this Cloudera presentation on Sentry on Slideshare, Sentry inserts itself into the query execution process and checks access rights before allowing the rest of the Hive query to execute. Sentry is configured through security policy files, or through a new web-based interface introduced with recent versions of CDH5, for example.


The other option for customers using Oracle Exadata,Oracle Big Data Appliance and Oracle Big Data SQL is to use the Oracle Database’s access control mechanisms to govern access to Hive (and Oracle) data, and also set-up fine-grained access control (VPD), data masking and redaction to create a more “enterprise” access control system.


So these are typically tasks we perform when on-boarding an Oracle BDA for a customer. If this is of interest to you and you can make it to either Brighton, UK next week or Atlanta, GA the week after, I’ll be covering this topic at the Rittman Mead BI Forum 2015 as part of the one-day masterclass with Jordan Meyer on the Wednesday of each week, along with topics such as creating ETL data flows using Oracle Data Integrator for Big Data, using Oracle Big Data Discovery for faceted search and cataloging of the data reservoir, and reporting on Hadoop and NoSQL data using Oracle Business Intelligence 11g. Spaces are still available so register now if you’d like to hear more on this topic.

Categories: BI & Warehousing

Last Chance to Register for the Brighton Rittman Mead BI Forum 2015!

Rittman Mead Consulting - Tue, 2015-04-28 05:32

It’s just a week to go until the start of the Brighton Rittman Mead BI Forum 2015, with the optional one-day masterclass starting on Wednesday, May 6th at 10am and the event opening with a reception and Oracle keynote later in the evening. Spaces are still available if you want to book now, but we can’t guarantee places past this Friday so register now if you’re planning to attend.


As a reminder, here’s some earlier blog posts and articles about events going on at the Brighton event, and at the Atlanta event the week after:

We’re also running our first “Data Visualisation Challenge” at both events, where we’re asking attendees to create their most impressive and innovative data visualisation within OBIEE using the Donors Choose dataset, with the rule being that you can use any OBIEE or related technology as long as the visualisation runs with OBIEE and can respond to dashboard prompt controls. We’re also opening it up to OBIEE running as part of Oracle BI Cloud Service (BICS), so if you want to give Visual Analyser a spin within BICS we’d be interested in seeing the results.

Registration is still open for the Atlanta BI Forum event too, running the week after Brighton on the 13th-15th May 2015 at the Renaissance Atlanta Midtown hotel. Full details of both events are on the event homepage, with the registration links for Brighton and Atlanta given below.

  • Rittman Mead BI Forum 2015, Brighton –  May 6th – 8th 2015 
We look forward to seeing you all in Brighton next week, or Atlanta the week after – but remember to book soon, before we close registration!
Categories: BI & Warehousing

Using the ELK Stack to Analyse Donor’s Choose Data

Rittman Mead Consulting - Sat, 2015-04-25 13:35

Donor’s Choose is an online charity in America through which teachers can post details of projects that need funding and donors can give money towards them. The data from the charity since it began in 2000 is available to download freely here in several CSV datasets. In this article I’m going to show how to use the ELK stack of data discovery tools from Elastic to easily import some data (the donations dataset) and quickly start analysing it to produce results such as this one:

I’m assuming you’ve downloaded and unzipped Elasticsearch, Logstash and Kibana and made Java available if not already. I did this on a Mac, but the tools are cross-platform and should work just the same on Windows and Linux. I’d also recommend installing Kopf, which is an excellent plugin for the management of Elasticsearch.

CSV Data Ingest with Logstash

First off we’re going to get the data in to Elasticsearch using Logstash, after which we can do some analysis using Kibana.

To import the data with Logstash requires a configuration file which in this case is pretty straightforward. We’ll use the file input plugin, process it with the csv filter, set the date of the event to the donation timestamp (rather than now), cast a few fields to numeric, and then output it using the elasticsearch plugin. See inline comments for explanation of each step:

input {  
    file {  
        # This is necessary to ensure that the file is  
        # processed in full. Without it logstash will default  
        # to only processing new entries to the file (as would  
        # be seen with a logfile for a live application, but  
        # not static data like we're working with here)  
        start_position  => beginning  
        # This is the full path to the file to process.  
        # Wildcards are valid.  
        path =>  ["/hdd/ELK/data/opendata/opendata_donations.csv"]  

filter {  
        # Process the input using the csv filter.  
        # The list of column names I took manually from the  
        # file itself  
        csv {separator => ","  
                columns => ["_donationid","_projectid","_donor_acctid","_cartid","donor_city","donor_state","donor_zip","is_teacher_acct","donation_timestamp","donation_to_project","donation_optional_support","donation_total","dollar_amount","donation_included_optional_support","payment_method","payment_included_acct_credit","payment_included_campaign_gift_card","payment_included_web_purchased_gift_card","payment_was_promo_matched","via_giving_page","for_honoree","donation_message"]}

        # Store the date of the donation (rather than now) as the  
        # event's timestamp  
        # Note that the data in the file uses formats both with and  
        # without the milliseconds, so both formats are supplied  
        # here.  
        # Additional formats can be specified using the Joda syntax  
        # (  
        date { match => ["donation_timestamp", "yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd HH:mm:ss"]}  
        # ------------
        # Cast the numeric fields to float (not mandatory but makes for additional analysis potential)
        mutate {
        convert => ["donation_optional_support","float"]
        convert => ["donation_to_project","float"]
        convert => ["donation_total","float"]

output {  
        # Now send it to Elasticsearch which here is running  
        # on the same machine.  
        elasticsearch { host => "localhost" index => "opendata" index_type => "donations"}  

With the configuration file created, we can now run the import:

./logstash-1.5.0.rc2/bin/logstash agent -f ./logstash-opendata-donations.conf

This will take a few minutes, during which your machine CPU will rocket as logstash processes all the records. Since logstash was originally designed for ingesting logfiles as they’re created it doesn’t actually exit after finishing processing the file, but you’ll notice your machine’s CPU return to normal, at which point you can hit Ctrl-C to kill logstash.

If you’ve installed Kopf then you can see at a glance how much data has been loaded:

Or alternatively query the index using Elasticsearch’s API directly:

curl -XGET 'http://localhost:9200/opendata/_status?pretty=true'

    "opendata" : {  
      "index" : {  
        "primary_size_in_bytes" : 3679712363,  
      "docs" : {  
        "num_docs" : 2608803,

Note that Elasticsearch will take more space than the source data (in total the 1.2Gb dataset ends up taking c.5Gb)

Data Exploration with Kibana

Now we can go to Kibana and start to analyse the data. From the Settings page of Kibana add the opendata index that we’ve just created:

Go to Discover and if necessary click the cog icon in the top right to set the index to opendata. The time filter defaults to the last 15 minutes only, and if your logstash has done its job right the events should have the timestamp of the actual donation, so you need to click on the time filter in the very top right of the screen to change time period to, for example, Previous year. Now you should see a bunch of data:

Click the toggle on one of the events to see the full data for it, including things like the donation amount, the message with the donation, and geographical details of the donor. You can find details of all the fields on the Donor’s Choose website here.

Click on the fields on the left to see a summary of the data within, showing very easily that within that time frame and sample of 500 records:

  • two thirds of donations were in the 10-100 dollar range
  • four-fifths included the optional donation towards the running costs of Donor’s Choose.

You can add fields into the table itself (which by default just shows the complete row of data) by clicking on add for the fields you want:

Let’s save this view (known as a “Search”), since it can be used on a Dashboard later:

Data Visualisation with Kibana

One of my favourite features of Kibana is its ability to aggregate data at various dimensions and grains with ridiculous ease. Here’s an example: (click to open full size)

Now let’s amend that chart to show the method of donation, or the donation amount range, or both: (click to open full size)

You can also change the aggregation from the default “Count” (in this case, number of donations) to other aggregations including sum, median, min, max, etc. Here we can compare cheque (check) vs paypal as a payment method in terms of amount given:

Kibana Dashboards

Now let’s bring the visualisations together along with the data table we saw in the the Discover tab. Click on Dashboard, and then the + icon:

Select the visualisations that you’ve created, and then switch to the Searches tab and add in the one that you saved earlier. You’ve now got a data table showing all currently selected data, along with various summaries on it.

You can rearrange the dashboard by dragging each box around to suit. Once you’ve got the elements of the dashboard in place you can start to drill into your data further. To zoom in on a time period click and drag a selection over it, and to filter on a particular data item (for example, state in the “Top ten states” visualisation) click on it and accept the prompt at the top of the screen. You can also use the freetext search at the top of the screen (this is valid on the Discover and Visualize pages too) to search across the dataset, or within a given field.

Example Analysis

Let’s look at some actual data analyses now. One of the most simple is the amount given in donations over time, split by amount given to project and also as the optional support amount:

One of the nice things about Kibana is the ability to quickly change resolution in a graph’s time frame. By default a bar chart will use an “Auto” granularity on the time axis, updating as you zoom in and out so that you always see an appropriate level of aggregation. This can be overridden to show, for example, year-on-year changes:

You can also easily switch the layout of the chart, for example to show the percentage of the two aggregations relative to each other. So whilst the above chart shows the optional support amount increasing by the year, it’s actually remaining pretty much the same when taken as a percentage of the donations overall – which if you look into the definition of the field (“we encourage donors to dedicate 15% of each donation to support the work that we do.“) makes a lot of sense

Analysis based on text in the data is easy. You can use the Terms sub-aggregation, where here we can see the top five states in terms of donation amount, California consistently being the top of the table.

Since the Terms sub-aggregation shows the Top-x only, you can’t necessarily judge the importance of those values in relation to the rest of the data. To do this more specific analysis you can use the Filters sub-aggregation to use free-form searches to create buckets, such as here to look at how much those from NY and CA donated, vs all other states. The syntax is field:value to include it, and -field:value to negate it. You can string these expressions together using AND and OR.

A lot of the analysis generally sits well in the bar chart visualisation, but the line chart has a role to play too. Donations are grouped according to the value range (<10, between 10 and 100, > 100), and these plot out nicely when considering the number of donations made (rather than total value). Whilst the total donation in a time period is significant, so is the engagement with the donors hence the number of donations made is important to analyse:

As well as splitting lines and bars, you can split charts themselves, which works well when you want to start comparing multiple dimensions without cluttering up a single chart. Here’s the same chart as previously but split out with one line per instance. Arguably it’s clearer to understand, and the relative values of the three items can be better seen here than in the clutter of the previous chart:

Following on from this previous graph, I’m interested in the spike in mid-value ($10-$100) donations at the end of 2011. Let’s pull the graph onto a dashboard and dig into it a bit. I’ve saved the visualisation and brought it in with the saved Search (from the Discover page earlier) and an additional visualisation showing payment methods for the donations:

Now I can click and drag the time frame to isolate the data of interest and we see that the number of donations jumps eight-fold at this point:

Clicking on one of the data points drills into it, and we eventually see that the spike was attributable to the use of campaign gift cards, presumably issued with a value > $10 and < $100.



The simplicity described in this article comes at a cost, or rather, has its limits. You may well notice fields in the input data such as “_projectid”, and if you wanted to relate a donation to a given project you’d need to go and look that project code up manually. There’s no (easy) way of doing this in Elasticsearch – whilst you can easily bring in all the project data too and search on projectid, you can’t display the two (project and donation) alongside each other (easily). That’s because Elasticsearch is a document store, not a relational database. There are some options discussed on the Elasticsearch blog for handling this, none of which to my mind are applicable to this kind of data discovery (but Elasticsearch is used in a variety of applications, not just as a data store for Kibana, so in others cases it is more relevant). Given that, and if you wanted to resolve this relationship, you’d have to go about it a different way, maybe using the linux join command to pre-process the files and denormalise them prior to ingest with logstash. At this point you reach the “right tool/right job” decision – ELK is great, but not for everything :-)


If you need to reload the data (for example, when building this I reprocessed the file in order to define the numerics as such, rather than the default string), you need to :

  • Drop the Elasticsearch data:
    curl -XDELETE 'http://localhost:9200/opendata'
  • Remove the “sincedb” file that logstash uses to record where it last read from in a file (useful for tailing changing input files; not so for us with a static input file)
    rm ~/.sincedb*

    (better here would be to define a bespoke sincedb path in the file input parameters so we could delete a specific sincedb file without impacting other logstash processing that may be using sincedb in the same path)
  • Rerun the logstash as above


Categories: BI & Warehousing

RFM Analysis in Oracle BI Apps

Dylan's BI Notes - Fri, 2015-04-24 19:16
I wrote the article RFM Analysis earlier.  We recently posted a more detailed description about how Oracle BI Apps implements this concept in the product. Customer RFM Analysis RFA related customer attributes are good examples of aggregated performance metrics as described in this design tip from the Kimball Group Design Tip #53: Dimension Embellishments By putting this […]
Categories: BI & Warehousing

BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack

Rittman Mead Consulting - Fri, 2015-04-24 06:18

I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).

OBIEE Regression Testing

In Atlanta I’ll be talking about Smarter Regression testing for OBIEE. The topic of Regression Testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need to have a good way of rapidly testing changes made. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.

In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.

If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.

Data Discovery with the ELK Stack

My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets with data originating in Hadoop, and more recently have been analysing twitter activity using it. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:

The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:

By its nature of log streaming and full text search, ELK also lends itself well to near real time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.

The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!

Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.

Categories: BI & Warehousing

Is MERGE a bug?

Chet Justice - Wed, 2015-04-22 20:57
A few years back I pondered whether DISTINCT was a bug.

My premise was that if you are depending on DISTINCT to return a correct result set, something is seriously wrong with your table design. I was reminded of this again recently when I ran across Kent Graziano's post on Better Data Modeling: Are you making these 3 beginner mistakes in your data models?. Specifically:
Instead of that, you should be defining a natural, or business, key for every table in your system. A natural key is a an attribute or set of attributes (that occur naturally in the data set) required to uniquely identify a row in that table. In addition you should define a Unique Key Constraint on those attributes in the database. Then you can be sure you will not get any duplicate data into the tables.

CLARIFICATION: This point has caused a lot of questions and comments. To be clear, the mistake here is to have ONLY defined a surrogate key. i believe that even if using surrogate keys is the best solution for your design, you should ALSO define an alternate unique natural key. So why MERGE?

I learned about the MERGE statement in 2008. During an interview, Frank Davis asked me about when I would use it. I didn't even know what it was (and admitted that) but I went home that night and...wait...I think he asked me about multi table inserts. Whatever, credit is still going to Mr. Davis. Where was I? OK, so I had been working with Oracle for about 6 years at that point and I didn't know about it. My initial reaction was to use it everywhere (not really)! You know, shiny object and all. Look! Squirrel!

Why am I considering MERGE a bug? Let me be more specific. I was working with a couple of tables and had not written the API for them yet and a developer was writing some PL/SQL to update the records from APEX. In his loop he had a MERGE. I realized at that moment there was 1, no surrogate key and 2, no natural key defined (which ties in with Kent's comments up above). Upon realizing the developer was doing this, I knew immediately what the problem was (besides not using a PL/SQL API to nicely encapsulate the business logic). The table was poorly designed.

Easy fix. Update the table with a surrogate key and define a natural key. I was thankful for the reminder, I hadn't added the unique constraint yet. Of course had I written the API already I probably would have noticed the design error, either way, a win for design.

Now, there are perfectly good occasions to use the MERGE statement. Most of those, to me anyway, relate to legacy systems where you don't have the ability to change the underlying table structures (or it's just cost prohibitive) or ETL, where you want to load/update a dimension table in your data warehouse.

Noons, how's that? First time out in 10 months. Thanks for the push.
Categories: BI & Warehousing

Conformed Dimension and Data Mining

Dylan's BI Notes - Mon, 2015-04-20 20:48
I believe that Conformed Dimensions are playing a key roles in data mining.  Here is why: A conformed dimension can bring the data together from different subject area, and sometime, from different source system. The relevant data can be thus brought together.  Data Mining is a technique to find the pattern from the historical data. […]
Categories: BI & Warehousing

Data Integration Tips: ODI 12.1.3 – Convert to Flow

Rittman Mead Consulting - Thu, 2015-04-16 13:23

The many who have already upgraded Oracle Data Integrator from the 11g version to 12c probably know about this great feature called “convert to flow”. If not, well…here you go!

First, a little background on why I think this is an excellent bit of functionality. The ODI Upgrade Assistant will convert objects from 11g to 12c and it does a pretty decent job of it. When converting Interfaces, the upgrade process creates a Mapping in ODI 12c by taking the logical “mapping” layout and loading it into a Dataset object. I assumed the reason was because it wasn’t easy to convert an Interface directly to a full on flow-based mapping, which you typically would develop in ODI 12.1.3 rather than using the limited Dataset (only joins, filters, and lookups allowed). After the upgrade, you would then be stuck with loads of mappings that are not using the latest flow-based features and components.


Now, in ODI 12.1.3, we have the ability to convert our Dataset into the standard ODI 12c flow based components within the Mapping. With a right-click on the Dataset component, we can see the “Convert to Flow” option.


Select Convert to Flow and accept the warning that our Mapping will be changed forever…and boom! No more Dataset!

This is great for my individual Mappings, but now I want to convert my migrated Reusable Mapping Datasets to flow based components.


Wait, what? No option to Convert to Flow! It looks like the Reusable Mappings (which were upgraded from my ODI 11g Temporary Interfaces) cannot be converted to flow for some reason. Hmm… Well, let’s finish converting my Datasets to flow based components for the rest of my 342 upgraded Mappings…one-by-one. Yikes! Actually, we can find a better way to do this. Time to get Groovy with the ODI SDK!

Using Groovy, I can create a simple script to loop through all of my mappings, find the dataset, and call the convertToFlow function on that dataset component. Here’s a look at the guts of the script.

for (mapping in mappingsList){
  for (component in componentsList){

    java.util.List convertIssues = new ArrayList()
    blnConvert = 1
    try {
      blnConvert = component.convertToFlow(convertIssues)
      if (blnConvert) {
        for (item in convertIssues) {
          out.println item.toString()
    } catch (Exception e) {
    out.println e;
    out.println mapping.getName() + " had a dataset converted to flow."

Just remember to pass the results list object as a parameter to the convertToFlow call (and make sure the List object is properly instantiated as an ArrayList – as I was humbly reminded by David Allan via Twitter!). Once completed, you should be able to open each mapping and see that the dataset has been removed and only flow-based components exist.

Excellent, now we’ve completed our conversion in no time at all. But wait, what about those Reusable Mappings? Remember, we don’t have the right-click menu option to convert to flow as we did with the standard Mapping. Well, let’s see what our friend the ODI SDK has to say about that!

With a slight tweak to the code, replacing Mapping classes with ReusableMapping classes, we can batch convert our Reusable Mapping dataset components to flow based components in an instant. The reason it works via the API is due to the inheritance of the ReuseableMapping class. It inherits the same component methods from the interface oracle.odi.domain.mapping.IMapComponentOwner, which in turn have the same methods and functions, such as convertToFlow, as we had available in the Mapping class. I’m not quite sure why ODI Studio doesn’t expose “Convert to Flow” in the Reusable Mappings, but I’m sure it’s a simple fix we’ll see in an ODI 12c release down the road.

So there you have it, another Data Integration Tip from Rittman Mead – this time, a little help post-migration from ODI 11g to ODI 12c. If you would like more details on how Rittman Mead can help your migration of Oracle Data Integrator to the latest version, send us a note at We’d love to help!


Categories: BI & Warehousing

Oracle Data Integrator Enterprise Edition Advanced Big Data Option Part 1- Overview and install

Rittman Mead Consulting - Mon, 2015-04-13 14:54

Oracle recently announced Oracle Data Integrator Enterprise Edition Advanced Big Data Options as part of the new release of ODI. It includes various great new functionalities to work on an Hadoop ecosystem. Let’s have a look at the new features and how to install it on Big Data Lite 4.1 Virtual Machine.

Note that some of these new features, for example Pig and Spark support and use of Oozie, requires the new ODI EE Advanced Big Data Option license on-top of base ODI EE.

Pig and Spark support

So far ODI12c allowed us to use Hive for any Hadoop-based transformation. With this new release, we can now use Pig and Spark as well. Depending on the use case, we can choose which technology will give better performance and switch from one to another with very few changes. That’s the beauty of ODI – all you need is to do is create the logical dataflow in your mapping and choose your technology. There is no need to be a Pig Latin expert or a PySpark ninja, all of this will be generated for you! These two technologies are now available in the Topology, along with the Hadoop Data Server to define where lies the Data. You can also see some Loading Knowledge Modules for Pig and Spark.

Pig and Spark in ODI

Pig, as Mark wrote before, is a dataflow language. It makes it really appropriate with the new “flow paradigm” introduced in ODI 12c. The idea is to write a data pipeline in Pig Latin. That code will undercover create MapReduce jobs that will be executed.

Quoting Mark one more time, Spark is a cluster processing framework that can be used in different programming languages, the two most common being Python and Scala. It allows to do operation like filters, joins and aggregates. All of this can be done in-memory which can provides way better performance over MapReduce. The ODI team choose to use Python as a programming language for Spark so the Knowledge Modules will use PySpark.

New Hive Driver and LKMs

This release also brings significant improvements to the existing Hive technology. A new driver as been introduced under the name DataDirect Apache Hive JDBC Driver. It is actually the Weblogic Hive JDBC driver which aims at improving the performance and the stability.

New Hive Driver

New Knowledges Modules are introduced to benefit from this new driver and they are LKMs instead multi-connections IKMs as it use to be. Thanks to that, it can be combined with other LKMs into the same mapping which was not the case before.

Oozie Agent

Oozie is another Apache project and they define it as “a workflow scheduler system to manage Apache Hadoop jobs”. We can create workflow of different jobs in the Hadoop stack, and then schedule it at a certain time or trigger it when data becomes available.

What Oozie does is similar to the role of the ODI agent, and it’s now possible to use directly an existing Oozie engine instead of deploying a standalone agent on the hadoop cluster.

Oozie Engine

The Oozie engine will do what your ODI agent usually does – execution, scheduling, monitoring – but it is integrated in the Hadoop ecosystem. So we will be able to schedule and monitor our ODI jobs at the same place as all our other Hadoop jobs that we use outside of ODI. Oozie can also automatically retrieve the Hadoop logs. Also we lower the footprint because it doesn’t requires to install an ODI-specific component on the cluster. However, according to the white paper (link below), it looks like Load Plans are not supported. So the idea would be to execute the Load Plans with a standalone or JEE agent that will delegate the execution of Big Data-related scenarios to the Oozie Engine.

HDFS support in file-related ODI Tools

Most of the ODI tools handling files can also do it on HDFS now. So you can delete, move, copy files and folders. You can also append files and transfer it to HDFS via FTP. It’s even possible to detect when a file is created on HDFS. All you need to do is to indicate your Hadoop Logical Schema for source, target or both. In the following example I’m copying a file from the Unix filesystem to HDFS.


I think this is a huge step forward. If we want to use ODI 12c for our Hadoop data integration, it must be able to do everything end-to-end. The maintenance or administrative tasks such as archiving, deleting or copying should also be done using ODI. So far it was a bit tedious to created a shell script using hdfs dfs commands and then launch it using OdiOsCommand tool. Now we can directly use the file tools in a package or a procedure!

New mapping components : Jagged and Flatten

The two new components can be used in a Big Data context but also in your traditional data integration. The first one, Jagged, will pivot a set of key-value pairs into a columns with their values.

The Flatten components can be used with advanced files when you have nested attributes, like in JSON. Using a flatten component will generate more rows if needed to extract different values for a same attribute nested into another attribute.


You can see the detail of all the new features in the white paper “Advancing Big Data Integration” for ODI 12c.


How to install it?

This patch must be applied on top of an existing Oracle Data Integrator installation. It is not a bundled patch and it’s only related to Big Data Options so there is no point to install it if you don’t need its functionalities. Also make sure you are licensed for ODIEE Advanced Big Data Option if you plan to use Spark or Pig technology/KMs or execute your jobs using the Oozie engine.

To showcase this, I used the excellent –and free! – Big Data Lite 4.1 VM which already has ODI 12.1.3 and all the Hadoop components we need. So this example will be on an Oracle Enterprise Linux environment.

The first step is to download it from the OTN or My Oracle Support. Also make sure you close ODI Studio and shut down the agents. Then the README recommends to update OPatch and check the OUI. So let’s do that and also set some environment variables and unzip the ODI patch.

[oracle@bigdatalite ~]$ mkdir /home/oracle/bck
[oracle@bigdatalite ~]$ ORACLE_HOME=/u01/ODI12c/
[oracle@bigdatalite ~]$ cd $ORACLE_HOME
[oracle@bigdatalite ODI12c]$ unzip /home/oracle/Desktop/ -d $ORACLE_HOME 
[oracle@bigdatalite ODI12c]$ OPatch/opatch lsinventory -jre /usr/java/latest/
[oracle@bigdatalite ODI12c]$ export PATH=$PATH:/u01/ODI12c/OPatch/
[oracle@bigdatalite ODI12c]$ unzip -d /home/oracle/bck/ /home/oracle/Desktop/ 
[oracle@bigdatalite ODI12c]$ cd /home/oracle/bck/

This patch is actually composed of three piece. One of them, the second one, is only needed if you have an enterprise installation. If you have a standalone install, you can just skip it. Note that I always specify the JRE to be used by OPatch to be sure everything works fine.

[oracle@bigdatalite bck]$ unzip
[oracle@bigdatalite ODI12c]$ cd 20042369/
[oracle@bigdatalite 20042369]$ opatch apply -jre /usr/java/latest/
[oracle@bigdatalite 20042369]$ cd /home/oracle/bck/

 //[oracle@bigdatalite bck]$ unzip
 //[oracle@bigdatalite bck]$ cd 20674616/
 //[oracle@bigdatalite 20674616]$ opatch apply -jre /usr/java/latest/
 //[oracle@bigdatalite 20674616]$ cd /home/oracle/bck/

[oracle@bigdatalite bck]$ unzip 
[oracle@bigdatalite bck]$ cd 20562777/
[oracle@bigdatalite 20562777]$ opatch apply -jre /usr/java/latest/

Now we need to run the upgrade assistant that will execute some scripts to upgrade our repositories. But in Big Data Lite, the tables of the repository have been compressed, so we first need to uncompress them and rebuild the invalid indexes as David Allan pointed it out on twitter. Here are the SQL queries that will create the DDL statement you need to run if you are also using Big Data Lite VM :

 'alter table '||t.owner||'.'||t.table_name||' move nocompress;' q
 from all_tables t
 where owner = 'DEV_ODI_REPO'
 and table_name &lt;&gt; 'SNP_DATA';

select 'alter index '||owner||'.'||index_name||' rebuild tablespace '||tablespace_name ||';'
 from all_indexes
 where owner = 'DEV_ODI_REPO'
 and status = 'UNUSABLE';

Once it’s done we can start the upgrade assistant :

[oracle@bigdatalite 20562777]$ cd /u01/ODI12c/oracle_common/upgrade/bin
[oracle@bigdatalite bin]$ ./ua

Upgrade Assistant

The steps are quite straightforward so I’ll leave it to you. Here I selected Schemas, but if you have a standalone agent you will have to run it again and select “Standalone System Component Configurations” to upgrade the domain as well.

Before opening ODI Studio we will clear the JDev cache so we are sure everything looks nice.

[oracle@bigdatalite bin]$ rm -rf /home/oracle/.odi/system12.

We can now open ODI Studio. Don’t worry the version mentioned there and in the upgrade assistant is still but if you can see the new features it has been installed properly.

The last step is to go in the topology and change the driver used for all the Hive Data Server. As all the new LKMs use the new weblogic driver, we need to define the url instead of the existing one.  We simply select “DataDirect Apache Hive JDBC Driver” instead of the existing Apache driver.

And that’s it, we can now enjoy all the new Big Data features in ODI 12c! A big thanks to David Allan and Denis Gray for their technical and licensing help. Stay tuned as I will soon publish a second blog post detailing some features.

Categories: BI & Warehousing