
BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.4 : Transforming Data using Python & Hive Streaming

Rittman Mead Consulting - Thu, 2014-06-12 03:30

This week I’m taking an in-depth look at ETL on the Oracle Big Data Appliance, using Oracle Data Integrator 12c to call the various bits of Hadoop functionality and orchestrate the whole process. So far, I’ve landed web log data into the Hadoop cluster using Flume, created a Hive table over the log data using Hive and the RegEx SerDe, then used further Hive transformations to join this data to other reference data, some of which came from an Oracle database via Sqoop. Here’s a complete listing of the posts so far, and the ones to come:

Our next transformation is a bit trickier though. The data we’re bringing in from our webserver has IP addresses recorded for each HTTP access, and what we want to do is use that IP address to identify the country that the website visitor was located in. Geocoding, as it’s called, is fairly easy to do if your data is stored in a database such as Oracle Database or you’re using a language like Python, with two steps needed to retrieve the IP address’s country:

1. First convert the IP address into a single integer, like this:

address = '174.36.207.186'
( o1, o2, o3, o4 ) = address.split('.')
integer_ip = ( ( 16777216 * int(o1) )
             + (    65536 * int(o2) )
             + (      256 * int(o3) )
             +              int(o4) )   # = 2921648058

2. Then, using a freely-downloadable database from MaxMind, look to see which range of numbers your converted IP address is in:

SELECT ip_country
FROM geoip
WHERE 2921648058 BETWEEN begin_ip_num AND end_ip_num
LIMIT 1
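If you’re working in Python, the pygeoip API provided by MaxMind (which I’ll be using later in this post) wraps both of those steps up into a single lookup call – a quick, illustrative example, with the country shown being what I’d expect back for that address:

import pygeoip

gi = pygeoip.GeoIP('/tmp/GeoIP.dat')
print gi.country_name_by_addr('174.36.207.186')   # 'United States'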

The only problem is that Hive doesn’t support anything other than equi-joins. In a previous blog post I got around this issue by using Impala rather than Hive, and in another one I used Pig, and a custom Python UDF, to do the geocoding outside of the main Pig data flow. I could use Pig and a Python UDF in this situation too, creating an ODI Procedure to call Pig from the command-line, but instead I’ll use another ODI KM, IKM Hive Transform, to call a Python script, with the whole process controlled from within ODI.

To set this geocoding process up, I first create a new ODI mapping, with my current Hive table as the source, and another Hive table this time with an additional column for the IP address country, as the target. I then map what columns I can into the target, leaving just the country column without a mapping. Notice I’ve put the source table within an ODI12c dataset – this isn’t mandatory, it’s just something I decided to do at the time – it doesn’t impact on the rest of the process.

NewImage

Now in a similar way to how I created the Python UDF to do the country lookup for Pig, I’ll create a variation of that script that expects columns of data as an input, outputs other columns of data, and does the geocoding within the script by calling the Python GeoIP API provided by MaxMind. Most importantly though, I’ll use it in conjunction with IKM Hive Transform’s ability to stream its Hive data through an arbitrary script, making use of the Hive TRANSFORM function and a feature called “Hive streaming”. To keep things simple, let’s first set it up outside of the KM, see how it works, and then configure the KM to use this Python script as its transformation function.

Let’s start by looking at the source Hive table I’ll be starting with:

hive> describe access_per_post_categories;
OK
hostname           string             None                
request_date       string             None                
post_id            string             None                
title              string             None                
author             string             None                
category           string             None                
Time taken: 0.112 seconds, Fetched: 6 row(s)
hive> select * from access_per_post_categories limit 5;
OK
137.254.4.12[02/Jun/2014:21:20:41 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
137.254.4.12[02/Jun/2014:21:20:42 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
137.254.4.12[02/Jun/2014:21:20:43 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
137.254.4.12[02/Jun/2014:21:20:45 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
137.254.4.12[02/Jun/2014:21:21:23 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
Time taken: 0.976 seconds, Fetched: 5 row(s)

The target table is exactly the same, except for the additional column for country:

hive> describe access_per_post_full;
OK
hostname           string             None                
request_date       string             None                
post_id            string             None                
title              string             None                
author             string             None                
country            string             None                
category           string             None                
Time taken: 0.098 seconds, Fetched: 7 row(s)

The way IKM Hive Transform works is by streaming each incoming row of Hive data to the script registered with it, with each row’s columns tab-separated. In turn, it expects the script to output similar rows of tab-separated values, each line terminated with a newline. My script, therefore, needs to accept an arbitrary number of incoming rows, parse each row into its constituent columns, call the MaxMind API to do the geocoding, and then output a tab-separated set of columns for each row it processes (a bit like pipelined table functions in Oracle PL/SQL).

So here’s the Python script that does this:

[root@bdanode1 tmp]# cat add_countries.py
#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip
# open the MaxMind GeoIP database that has been copied to /tmp on each node
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')
# read the tab-separated rows that Hive streams in on stdin...
for line in sys.stdin:
  line = line.rstrip()
  hostname,request_date,post_id,title,author,category = line.split('\t')
  # ...look up the country for the IP address in the hostname column...
  country = gi.country_name_by_addr(hostname)
  # ...and write the row back out, tab-separated, with the country added
  print hostname+'\t'+request_date+'\t'+post_id+'\t'+title+'\t'+author+'\t'+country+'\t'+category

A few things to note about the script:

1. “import sys” brings in the Python standard library module that then allows me to read the input data, via sys.stdin.
2. I’ve previously installed the MaxMind GeoIP API, database and libraries, with the GeoIP.dat database stored alongside the script in /tmp. Import pygeoip brings in the Python library to use the API.
3. The bit that does the geocoding is “gi.country_name_by_addr”, which calls the MaxMind API. It’s this bit that replaces the need for a join with a BETWEEN clause in HiveQL.
4. To output data back to the calling Hive transformation, I just “print” it, using “\t” for tab – and it’s easy to sanity-check the script outside of Hive first, as shown below.
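Before registering the script with Hive, here’s a minimal way to smoke-test it from the command line – a sketch that assumes the script and GeoIP.dat are both sitting in /tmp – by piping one of the sample rows shown earlier through its stdin, exactly the way Hive streaming will:

import subprocess

# one tab-separated row, in the same column order as access_per_post_categories
sample = '\t'.join(['137.254.4.12', '[02/Jun/2014:21:20:41 +0000]', '2093',
                    'Hyperion Planning : Installing the Sample Planning Application in EPM 11.1',
                    'Mark Rittman', 'Hyperion']) + '\n'
p = subprocess.Popen(['python', '/tmp/add_countries.py'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, err = p.communicate(sample)
print out    # same row back, tab-separated, with a country column added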

One thing we’ll need to do though, similar to how we had to copy the JAR file used by Hive for the RegEx SerDe to all the nodes in the cluster, is to copy the GeoIP.dat file used by the MaxMind geocoding API to the same place on all the cluster nodes. We’ll also have to set the permissions on this file so it’s readable and writeable by the Hive/YARN process on each Hadoop node:

officeimac:~ markrittman$ ssh root@bdanode2.rittmandev.com
The authenticity of host 'bdanode2.rittmandev.com (192.168.2.231)' can't be established.
RSA key fingerprint is 1b:e7:ec:01:57:0c:09:4e:6d:19:08:a4:0d:df:00:e0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'bdanode2.rittmandev.com' (RSA) to the list of known hosts.
root@bdanode2.rittmandev.com's password: 
Last login: Fri Jun  6 16:55:44 2014 from 192.168.2.201
[root@bdanode2 ~]# ls /tmp/Geo* -l
-rwxrwxrwx 1 oracle oracle 687502 Jun  7 12:31 /tmp/GeoIP.dat
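Rather than copying the file and setting its permissions by hand on every node, a short script can push it out to the whole cluster – a minimal sketch, assuming passwordless SSH as root and the six bdanode1–6 hostnames used in this environment:

import subprocess

nodes = ['bdanode%d.rittmandev.com' % i for i in range(1, 7)]
for node in nodes:
    # copy the MaxMind database into /tmp on each node...
    subprocess.check_call(['scp', '/tmp/GeoIP.dat', 'root@%s:/tmp/GeoIP.dat' % node])
    # ...and open up the permissions so the Hive/YARN processes can read and write it
    subprocess.check_call(['ssh', 'root@%s' % node, 'chmod 777 /tmp/GeoIP.dat'])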

I also need to make sure the MaxMind geocoding Python API is installed on each node too:

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip install pygeoip

Now I can run a HiveQL command and test out the script. I’ll start up a Hive shell session, and the first thing I need to do is register the script with Hive, distributing it across the cluster and putting it in the distributed cache. Then I can select against the source table, using the TRANSFORM … USING clause:

[oracle@bdanode1 ~]$ hive
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/06/09 20:48:01 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
 
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.12.0-cdh5.0.1.jar!/hive-log4j.properties
hive> add file /tmp/add_countries.py;                                
Added resource: /tmp/add_countries.py
hive> select transform (hostname,request_date,post_id,title,author,category)
    > using 'add_countries.py'                                              
    > as (hostname,request_date,post_id,title,author,country,category)      
    > from access_per_post_categories; 
...
63.73.199.69[02/Jun/2014:21:19:28 +0000] 14726 SmartView as the Replacement for BI Office with OBIEE 11.1.1.7 Mark Rittman United States Oracle EPM
77.125.81.239[02/Jun/2014:21:18:58 +0000] 14476 Upgrading OBIEE to 11.1.1.7 Robin Moffatt Israel Oracle BI Suite EE
77.125.81.239[02/Jun/2014:21:18:59 +0000] 14476 Upgrading OBIEE to 11.1.1.7 Robin Moffatt Israel Oracle BI Suite EE
Time taken: 27.645 seconds, Fetched: 176 row(s)

So it looks like it’s working. Let’s move back to ODI now and reference the script within the IKM Hive Transform KM settings. Note that I enter the name and filesystem location of the script in the KM settings, but I don’t use the TRANSFORM_SCRIPT option to key the script contents directly into the KM; this should work – if you do, the KM writes the script contents out to the agent’s host filesystem at the start of the process – but the KM in this ODI12c release has an issue parsing these keyed-in scripts, so you’re best leaving that option empty and letting the KM use the file already on the filesystem.

NewImage

After running the mapping, I then go over to the Operator navigator and see that it’s run successfully. Looking at the code the KM generates, I can see it referencing our script in the HiveQL statement, with the script outputting the extra country column that’s then loaded into the target table along with the other columns.

NewImage

Then finally, checking the target table, I can see that each row in my final Hive table has the country name alongside each log file entry, along with the IP address and other details.

So – now we’re at the stage where we’ve finished the processing with the Hadoop cluster, and I’d like to copy the final set of data into an Oracle database, so that I can analyse it using Oracle SQL and other tools. In the final installment in this series we’ll do just that, using IKM File/Hive to Oracle (OLH/ODCH) and Oracle Loader for Hadoop, one of the Oracle Big Data Connectors.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.3 : Enhance with Oracle Reference Data via Sqoop, and CKMs

Rittman Mead Consulting - Wed, 2014-06-11 04:02

In the first two posts in this series, I used the software on the Oracle Big Data Appliance 3.0 to ingest web log data from the Rittman Mead blog server, parse and load that data into a Hive table, and then join that table to another to add details on post and author. Links to the post in this series are below, and I’ll complete them as the series is posted this week:

In this next step, I’ve got some data sitting in an Oracle database that I’d like to use to enhance the data in the Hadoop cluster; as a reminder, the diagram below shows how I’m looking to move data through the system, and we’re currently at step number three:

NewImage

The data in the Oracle tables contains categories that I can join to the page entries that have come through the ETL process, and the join is pretty simple: POST_ID = POST_ID. But how do I get access to the Oracle data? I need to register the Oracle source in the ODI topology and create a model to represent the table, but can I then just join that table to the Hive table and load the results into another Hive table? Let’s try.

The screenshot below shows the mapping that would use this approach, with two Oracle tables joining to the Hive table and another Hive table as the target; on the right-hand side of the screen is the join condition, across both sources:

NewImage

The Execution view shows how ODI intends to run the transformation, with LKM SQL to SQL KMs used to load the Oracle tables and stage their data in Hive, and then IKM Hive Control Append used to load the joined dataset into another Hive table.

NewImage

If you try and execute the mapping though, it fails – because the LKM SQL to SQL KM can’t work with Hive tables yet.

NewImage

In fact, the various IKMs that come with the Oracle Data Integrator Application Adapter for Hadoop (IKM File to Hive, IKM Hive Control Append etc.) are a bit of a mix of IKMs and LKMs, in that they contain both extraction and integration code in the same KM, and none of the regular LKMs and IKMs will otherwise work with Hadoop sources. What we need to do with ODI at this point is land the Oracle data in Hive first and then do the join – which raises the second question: how do we do that?

Currently, the only way to get Oracle data into Hadoop using the out-of-the-box KMs is via IKM File to Hive (LOAD DATA), which involves an unnecessary extra step of exporting the Oracle data to a file, and then loading that file into HDFS and Hive. What we can do though is use Sqoop, a tool within Hadoop used to extract from, and load into, relational databases, something I covered on the blog a few weeks ago. Sqoop creates data loading and unloading jobs that run in parallel on the Hadoop cluster, and can use native JDBC drivers or even additional plugins such as Oraoop to make the process run faster (though Oracle Loader for Hadoop is considered the fastest way to unload to Oracle, if you’ve licensed the Oracle Big Data Connectors).

The only problem is that there’s no official support for Sqoop in ODI, so no KMs that make use of it. What you can do though is create a command-line script to run Sqoop and include that in an ODI procedure, which is what I’ll now do to bring my Oracle data into Hive. To do this, I create an ODI procedure and add a single task, using the Operating System (command-shell) technology type, and use it to tell Sqoop to create a Hive table based on a SQL SELECT statement against my Oracle database:

NewImage

(note that I had to format the sqoop command, in practice, as one line, to get it to run – the above listing is so that you can see all of the code.)
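The Sqoop command itself is in the screenshot above; purely as an illustration of the general shape of such a call (the connection details, query and table names below are made up, not the ones from this environment), one way to keep it readable while still submitting it as a single line is to build it up and run it from a small Python script:

import subprocess

sqoop_cmd = [
    'sqoop', 'import',
    '--connect', 'jdbc:oracle:thin:@//ora-server:1521/orcl',   # illustrative connection
    '--username', 'BLOG_REFDATA', '--password', 'welcome1',    # illustrative credentials
    '--query', 'SELECT p.post_id, c.cat_name FROM posts p '
               'JOIN post_categories c ON p.post_id = c.post_id WHERE $CONDITIONS',
    '--target-dir', '/user/oracle/post_categories',
    '--hive-import', '--hive-table', 'post_categories',
    '-m', '1'
]
subprocess.check_call(sqoop_cmd)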

Executing this procedure works OK, and thereafter I’ve got a single Hive table containing the joined dataset from Oracle.

NewImage

Joining this new Hive table to the previous one containing the distinct set of page views is then fairly straightforward, but something I’d also like to do is stop any entries going into the rest of the ETL process where the calling IP address is a test one we use, “63.73.199.69”. The way I can do this is to use the CKM Hive knowledge module and put a constraint on the hostname in the table I’ll be loading from the join, so that I can then use ODI’s flow control feature to divert those rows to an error table.

NewImage

I also need to define a primary key for this table, something that’s mandatory when flow control is used. So let’s put the mapping together, joining the table I just brought in from Sqoop with the latest version of the weblog entries Hive table, loading into the Hive table I’ve just enabled the constraint for:

NewImage

and then I enable flow control, and the CKM Hive check knowledge module, in the Physical mapping settings.

NewImage

This is of course one of the benefits of using ODI to do your Hadoop data loading – you’ve got access to the data quality and error handling features that come with the tool. Then, when I execute the mapping and check with the Operator navigator, I can see the error handling process running, and afterwards in Hue I can see the contents of the new error table, which now contains those log entries where my test IP address was used, removing them from the target Hive table where they’d ordinarily have gone.

NewImage

So that’s the third step in the ODI BDA ETL process complete. The next one’s a bit trickier though – I need to geocode the entries in the log table, assigning country names to each row based on where the IP address is located. More tomorrow.

Categories: BI & Warehousing

The BI Survey 14 – Have Your Voice Heard!

Rittman Mead Consulting - Wed, 2014-06-11 03:20

Long-term readers of this blog will know that we’ve supported the BI Survey, an independent survey of BI tools customers and implementors, for many years. Rittman Mead have no (financial or other) interest in the BI Survey or its organisers, but we like the way it gathers detailed data on which tools work best and when, and it’s been a useful set of data for companies such as Oracle when they prioritise their investment in tools such as OBIEE, Essbase and the BI Applications.

Here’s the invite text and link to the survey:

“We would like to invite you to participate in The BI Survey 14, the world’s largest annual survey of business intelligence (BI) users.

To take part in this year’s survey, visit: https://digiumenterprise.com/answer/?link=1906-PHB5RT7V

As a participant, you will:

  • Receive a summary of the results from the full survey
  • Be entered into a draw to win one of ten $50 Amazon vouchers
  • Ensure that your experiences are included in the final analyses

BARC’s annual survey gathers input from thousands of organizations to analyze their buying decisions, implementation cycles and the benefits they achieve from using BI software.

The BI Survey 14 is strictly vendor-independent: It is not sponsored by any vendor and the results are analyzed and published independently.

You will be asked to answer questions on your usage of a BI product from any vendor. Your answers will be used anonymously and your personal details will not be passed on to software vendors or other third parties.

Business and technical users, as well as vendors and consultants, are all encouraged to participate.

The BI Survey 14 should take about 20 minutes to complete. For further information, please contact Jevgeni Vitsenko at BARC (jvitsenko@barc.de). 

Click below to take part in The BI Survey 14: https://digiumenterprise.com/answer/?link=1906-PHB5RT7V

Categories: BI & Warehousing

Rittman Mead Featured in Oracle In-Memory Option Launch

Rittman Mead Consulting - Tue, 2014-06-10 17:02

Today saw the official launch of the Oracle Database In-Memory Option, with Larry Ellison going through the product features and then reading out some quotes and testimonials from beta testers. Rittman Mead were part of the beta testing program, with several of our team testing out various scenarios where we ETL’d into it, used it with OBIEE and worked out what would be involved in “in-memory-enabling” some of our customer’s BI systems.

In fact, as we said in our quote for the launch, enabling Oracle Database for in-memory analysis was almost “boringly simple” – just enable the option, choose your tables, drop any OLTP indexes and you’re ready to go.

NewImage

Of course, in practice you’ll need to think about which tables you’ll put into memory if RAM is limited, in some scenarios TimesTen might be a better option, and you’ll need to test your particular system and carefully consider whether you’ll keep particular indexes or materialised views, but we’re really excited about the In-Memory Option for Oracle Database as it’s got the potential to significantly improve query response times for users – and from what we’ve seen so far, it “just works”.

We’re still in the NDA period whilst beta testing goes on, but you can read more on the In-Memory Option on the Oracle website, and on the blog post I wrote when the feature was announced last Openworld. Once it goes GA look out for some in-depth articles on the blog around how it works, and details on how we’ll be able to help customers take advantage of this significant new Oracle Database feature.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.2 : Hive Table Joins, Aggregation and Loading

Rittman Mead Consulting - Tue, 2014-06-10 02:49

In this series of posts, I’m going to be looking at an end-to-end ETL process on Oracle Big Data Appliance, using Hadoop technologies to do the data manipulation and Oracle Data Integrator 12c to orchestrate the process. Over the five posts, I’ll be landing web server log data on the Hadoop cluster using Flume, loading that data into a Hive table, transforming and enhancing the dataset, and then loading the final dataset into an Oracle database using one of the Oracle Big Data Connectors. The first post in the list below has a schematic for the overall process, and over the rest of the week I’ll be adding the links in for the remaining posts in the series.

So in today’s step, I want to take the initial Hive table containing the full set of log data, and then do three things: first, extract the page details out of the “request” column in the incoming Hive table; then join that to some reference data, also in the BDA, that’ll allow me to add details of the post author and title of the post; and finally aggregate and project just certain columns from the dataset, so I’ve got a Hive table of just page accesses by IP address, with the author and title details.

In fact this is the easiest of the five steps to set up and bring together, as we’re just using basic ODI functionality, albeit with Hive tables rather than regular database tables. In the screenshot below you can see the incoming weblog Hive table, with the individual columns split-out in the previous step via the RegEx SerDe, and I just join it to the table containing post and author details, apply a distinct on the join output and then load the results into a third Hive table.

NewImage

The join between the two tables is a bit more complex, because I need to extract just the URL details out of the request field (which also contains the transport method and other metadata), but it’s no different than joining two regular database tables. Also, I did need to check the Generate ANSI Syntax checkbox, as Hive expects table joins to be in this format rather than the “Oracle” format.
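To give a feel for what that extraction involves: the request field in a combined-format log entry holds something like “GET /2014/05/some-post/ HTTP/1.1”, and the join only needs the URL portion in the middle – in Python terms (purely as an illustration; the mapping expression does the equivalent in HiveQL) it’s nothing more than:

request = 'GET /2014/05/some-post/ HTTP/1.1'
page_url = request.split(' ')[1]    # '/2014/05/some-post/'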

NewImage

Going over to the Physical part of the mapping, I set the KM to IKM Hive Control Append, turn on the feature to enable table truncation prior to load, and I’m ready to go.

NewImage

Then, when I execute the mapping, I can see its progress in the Operator navigator, and the code view shows me the HiveQL commands that the KM has generated to move data around the BDA.

NewImage

So far so good, and as I said, this was the easy bit. For the next transformation I want to bring in some additional reference data from an Oracle database, so the question for me is whether I need to land that data into the BDA before doing the transformation, or whether I can create a join between the Oracle table and my Hive table? Check back tomorrow for the next installment…

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.1 : Flume to Initial Hive Table

Rittman Mead Consulting - Mon, 2014-06-09 07:18

A few months ago I posted an article on the ODI12c examples in the new Oracle Big Data Appliance, and over the past few weeks I’ve been writing about the various components within the Cloudera CDH Hadoop stack, including Hive, Pig, Spark and Flume. Since then I’ve built out a virtualized example of an Oracle Big Data Appliance using the Oracle Big Data 3.0 software set, and I thought it’d be interesting to create an example ETL flow through the system showing how ODI could be used to initiate and control the process. As with any situation where you actually build a demo to do something specific, as opposed to just play around with the technology, you end up uncovering a few quirks and techniques that you wouldn’t have otherwise been aware of, so I’ll spend this week going through the various steps and calling out things others in the same situation might find useful – the steps I’ll go through are below, and I’ll add the links as the articles get published during the week;

As an overview, what I’ll be working with is a six-node Hadoop cluster running Cloudera CDH5, Oracle Big Data Connectors 3.0 and Oracle Data Integrator 12c, as you’d get with the current version of Oracle Big Data Appliance. Obviously BDA goes up to eighteen nodes as a maximum, and under the covers there’s lots more memory and much better networking than I was able to set up on VMWare ESXi, but what I’ve got is enough to prove the process. The diagram below shows the six nodes, and where the software is installed.

NewImage

I took a couple of short-cuts with the setup; obviously each node has a lot less RAM than BDA’s 64GB per node, but the main node in the cluster (bdanode1), running the HDFS NameNode and all the ODI software, got bumped-up to 28GB, with the next two allocated 12GB and the others 8GB – enough to work through a demo at least. I also ran ODI Studio on bdanode1 as well, instead of setting it up on a separate client VM, mainly to avoid having to set up all the Hadoop and Hive libraries on another machine. Other than that though, it’s the same CDH distribution you get on BDA, the same version of ODI and the Big Data Connectors, and so on, with the following software versions and downloads being used:

Setup of the Hadoop cluster is really out of scope for these articles, except to say that with CDH5 I find it easier to select the (non-default, deprecated) Packages install type rather than the new Parcels type, as the newer Parcels method installs all of the Hadoop software on each node in a new place – /opt/cloudera – rather than the usual /usr/lib/hadoop, /usr/lib/hive and so forth, meaning that most configuration examples you’ll read point to the wrong place for your install. Parcels are Cloudera’s way forward for delivering software components (there are advantages in terms of patching across the cluster), but if you’re getting started with the Hadoop platform, installing in the old locations usually makes things easier to follow. Other than that, there are two bits of configuration you need to do. The first is to tell ODI Studio where to find the various Hadoop libraries and configuration files; because I’ve installed Studio directly on the main Hadoop node, I could just add that node’s Hadoop and Hive library and configuration file locations to the Oracle user’s $HOME/.odi/oracledi/userlib/additional_path.txt file.

The second, to make use of Oracle Loader for Hadoop – one of the Oracle Big Data Connectors, and something we’ll use at the end to bulk-unload data from Hadoop to an Oracle database – is to set a couple of environment variables in the “oracle” user’s .bashrc profile file, pointing to where OLH is installed and where the Hadoop and Hive libraries and configuration files are:

[oracle@bdanode1 userlib]$ cat $HOME/.bashrc
# .bashrc
 
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
 
# User specific aliases and functions
export HADOOP_CLASSPATH=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/*:/etc/hive/conf
export OLH_HOME=/home/oracle/oracle/product/oraloader-3.0.0-h2

The scenario I’ll be working with is similar to the ones I covered on the blog recently, where I landed data from the Rittman Mead blog webserver into HDFS files using Flume, and then processed the files using Hive, Pig, Spark and so on. In this example though, I’ll use ODI to do the data manipulation where possible, but still use Hive and so forth under-the-covers to do the work. The diagram below shows the data flow that I’ll be looking to set up in the example:

NewImage

So in this scenario, incoming data is being landed in the Hadoop cluster by Flume, using the process outlined in this previous blog post. Flume is just a transport mechanism; it doesn’t in itself have any processing ability (making it analogous to GoldenGate), and all it does is transport log file entries from one place to another, via Flume agents on either end. All we as developers need to be aware of is (a) where the log files will be landed, and (b) that Flume will keep continuously writing to these files until the source log file gets rotated – therefore I can’t assume a log file is completely written-to when I read from it.

What this means in practice is that if I want to do incremental loads, I need to consider the fact that a file I’m reading from might have more data in it later on. There are various solutions to this – principally having Flume write to HBase, rather than raw HDFS files, and then reading from the HBase database, noting the last extraction point at the time – but to keep things simple I’ll just do a full load each time the ETL run takes place, meaning that I don’t need to think about incremental loads throughout the system.

So the first thing I need to do is have ODI take the incoming files and load them into a Hive table. To do this I set up Topology entries for the incoming HDFS files, and here’s the first “gotcha” – to create a connection to HDFS files, you use the regular File technology, but you leave the JDBC driver type blank and put the address of the HDFS NameNode into the JDBC URL – which of course is technically invalid and won’t allow you either to test the connection or to reverse-engineer the file names in, but is a “hack” used by the IKM File to Hive KM to get the address of your HDFS NameNode (if you choose to source the incoming file from HDFS rather than the local filesystem).

NewImage

Then, when you come to register the physical schemas to go with the new File technology data server, you’ll need to be aware that ODI just appends the final file name to the directory name when retrieving the file data – so if you want the connection to point to a directory, rather than just a file, you’ll need to set up the physical schema to be the directory “above” the one you’re interested in, and then set the file name later on to be that directory. In this example, I want the final file reference to point to hdfs://bdanode1.rittmandev.com/user/oracle/weblog_incoming_files/apache_access_combined, a whole directory (HDFS aggregates all the files in a directory if you reference just the directory in an operation) rather than just a single log file. You can see the directory and the log files it contains in the Hue screenshot below:

NewImage

I therefore set the physical schema to be hdfs://bdanode1.rittmandev.com/user/oracle/weblog_incoming_files, and the file reference in the datastore model is set to the final directory name, like this:

NewImage

If it seems a bit confusing, it’ll become clearer in a moment.

Now I need to go back to the Topology navigator and create a connection through to the Hive server on the Big Data Appliance VMs – in fact recent versions of CDH (CDH4, CDH5) swap out the old Hive Server for HiveServer2, so you’ll need to use the correct JDBC drivers (as supplied with CDH4/5) to connect to it, and also create the JDBC URL in the format jdbc:hive2://[machine name:port], as shown in the screenshot below:

NewImage

A quick note about security at this point: by default, most CDH4/5 clusters are set up as what’s called “unsecured”, which means that whilst you use a username and password to log in to Cloudera Manager, for example, Hive and Impala will request user credentials when you connect but don’t actually check the password against anything, such as your LDAP server. You can connect these tools to LDAP or Active Directory, and typically you’d combine this with Kerberos authentication between the various components and when you connect via Impala and Hive, and you’d also use Apache Sentry to provide role-based access control to the data within individual HDFS files. But by default, Hive will accept more or less anything as a password when you connect, and you may then hit issues later on when your HDFS files are owned by a different user to the one you connect as.

Where this manifests itself is when a Hive table has underlying HDFS files owned by, say, the “admin” user in CDH5 (because that’s how you connected to Hue to upload them), but then you connect as the “oracle” user through Hive to then manipulate the Hive table contents. In practice, what I try to do is create any Hive tables (using Hue) as the user I’m then going to connect in Hive to them as, which means you’ll most probably need to go into Hue and create a new user called “oracle” (if that’s what you’re running ODI as, and connecting through Hive as) before creating the Hive tables you’ll then import into the ODI topology.

So once you’ve done all that, you can go into the Designer navigator and reverse-engineer the definition of your Hive tables into datastore models. In my case, I’ve got a bunch of tables that I’ll be using throughout the whole ETL process.

NewImage

Now it’s time for the first ODI mapping, to take the raw log file directory and load it into a Hive table. As it stands though, these raw log files are rows of just a single “column” of data – the log file entry in Apache CombinedLogFormat format. To make them useful to the rest of the ETL process I’ll somehow need to parse them into the individual log file elements, so I create a target Hive table that contains an entry for the raw log entry, and then columns for the various log file elements:

NewImage

The way that I parse the log file is to use a feature within the IKM File to Hive (LOAD DATA) KM that allows me to specify a regular expression SerDe (Serializer-Deserializer) to parse the log file entry into its individual columns, like this (note that you’ll need to ensure the hive-contrib-* JAR file is available to all of your Hadoop nodes before using this SerDe):

NewImage
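If you want a feel for what that SerDe expression is doing, here’s the same idea in plain Python – a rough sketch with a made-up sample line, just to show how a CombinedLogFormat entry breaks down into columns (the real pattern lives in the SerDe properties shown in the screenshot above):

import re

# one capture group per column: host, identd, user, timestamp, request,
# status, bytes, referer and user agent
combined_re = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[([^\]]*)\] "([^"]*)" '
    r'(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)"')

sample = ('137.254.4.12 - - [02/Jun/2014:21:20:41 +0000] '
          '"GET /2013/05/some-post/ HTTP/1.1" 200 17435 '
          '"http://www.google.com/" "Mozilla/5.0"')
print combined_re.match(sample).groups()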

In this instance, I want the KM to leave the source files in place when doing the data load (Hive by default moves incoming source files into the /user/hive/warehouse directory area) as these files most probably haven’t finished being written-to by Flume yet, so I leave the EXTERNAL_TABLE value set to true (Hive external table, not Oracle external table) and make sure FILE_IS_LOCAL is set to FALSE, so that this KM knows to use the HDFS file location hack I set up in the topology earlier. Then, I just run the mapping and check that it’s worked OK:

NewImage

and I can check from the Model pane in the Designer navigator that I’ve now got a Hive table of individually split-up log entry columns to work with, for the rest of the downstream ETL process:

NewImage

So that’s the first stage done – next, I’ll be combining this Hive table with data from another one, using the IKM Hive Control Append KM.

Categories: BI & Warehousing

Rittman Mead at the Oracle Virtual Technology Summit 2014 – “The Architecture of Analytics: Big Time Big Data and Business Intelligence”

Rittman Mead Consulting - Mon, 2014-06-09 05:25

I’m pleased to announce that I’ll be presenting at the Oracle Technology Network (OTN) Virtual Technology Summit, an online event running over three days in early July. I’m speaking as part of the Middleware track, entitled “The Architecture of Analytics: Big Time Big Data and Business Intelligence”, along with fellow Oracle ACE Director Kevin McGinley, US BI Forum keynote speaker Richard Tomlinson, and Tom Plunkett, Lead Author of the Oracle Big Data Handbook. It’s free to attend, it’s running in three timezones over the three days, and here are the track details and session abstracts:

The Architecture of Analytics: Big Data and Business Intelligence

“More than just another hyped-up technological buzzword,  Big Data represents a dramatic shift not just in the amount of data to be dealt with, but also in how business intelligence is extracted from that data in order to inform critical business decisions. This OTN Virtual Technology Summit track will present a solution architect’s perspective on how business intelligence products in Oracle’s Fusion Middleware family and beyond fit into an effective big data architecture, and present insight and expertise from Oracle ACE Directors specializing in business Intelligence to help you meet your big data business intelligence challenges. Technologies covered in this track include Oracle Big Data Appliance, Oracle Data Integrator, Oracle Business Intelligence Enterprise Edition, Oracle Endeca, Oracle Data Warehouse, and Oracle Big Data Connectors, along with Hadoop, Hive, NoSQL, and MapReduce.”

Session Abstracts

1.Oracle Big Data Appliance Case Study: Using Big Data to Analyze Cancer-Genome Relationships [30:00]
by Tom Plunkett, Lead Author of the Oracle Big Data Handbook

This presentation takes a deep technical dive into the use of the Oracle Big Data Appliance in a project for the National Cancer Institute’s Frederick National Laboratory for Cancer Research.  The Frederick National Laboratory and the Oracle team won several awards for analyzing relationships between genomes and cancer subtypes with big data, including the 2012 Government Big Data Solutions Award, the 2013 Excellence.Gov Finalist for Innovation, and the 2013 ComputerWorld Honors Laureate for Innovation.

2.Getting Value from Big Data Variety [30:00]
by Richard Tomlinson, Director, Product Management, Oracle

Big data variety implies big data complexity. Performing analytics on diverse data typically involves mashing up structured, semi-structured and unstructured content. So how can we do this effectively to get real value? How do we relate diverse content so we can start to analyze it? This session looks at how we approach this tricky problem using Endeca Information Discovery.

3.How To Leverage Your Investment In Oracle Business Intelligence Enterprise Edition Within A Big Data Architecture [30:00]
By Oracle ACE Director Kevin McGinley

More and more organizations are realizing the value Big Data technologies contribute to the return on investment in Analytics.  But as an increasing variety of data types reside in different data stores, organizations are finding that a unified Analytics layer can help bridge the divide in modern data architectures.  This session will examine how you can enable Oracle Business Intelligence Enterprise Edition (OBIEE) to play a role in a unified Analytics layer and the benefits and use cases for doing so.

4.Oracle Data Integrator 12c As Your Big Data Data Integration Hub [90:00]
by Oracle ACE Director Mark Rittman

Oracle Data Integrator 12c (ODI12c), as well as being able to integrate and transform data from application and database data sources, also has the ability to load, transform and orchestrate data loads to and from Big Data sources. In this session, we’ll look at ODI12c’s ability to load data from Hadoop, Hive, NoSQL and file sources, transform that data using Hive and MapReduce processing across the Hadoop cluster, and then bulk-load that data into an Oracle Data Warehouse using Oracle Big Data Connectors.

We will also look at how ODI12c enables ETL-offloading to a Hadoop cluster, with some tips and techniques on real-time capture into a Hadoop data reservoir and techniques and limitations when performing ETL on big data sources.

Registration is free, and the form for all three events is online here. Look out for a preview of my big data ETL on BDA session over the next five days, and hopefully you’ll be able to make the VTS event and hear me talk about it in-person.

Categories: BI & Warehousing

Goodbye to Stewart…

Rittman Mead Consulting - Mon, 2014-06-09 05:00

We’re sad to announce that Stewart Bryson’s last day with Rittman Mead was last Friday, and he’s now moving on to a new challenge. Five years ago Stewart co-founded Rittman Mead America with us, and most recently he’s been our Chief Innovation Officer, handing over the day-to-day running of our US business to Charles Elliott, our US Practice Manager.

One of the main contributions Stewart has made over the last few years is helping us develop the Rittman Mead Delivery Framework and the Rittman Mead Way. The Delivery Framework is a set of software, accelerators and processes that determine how Rittman Mead consultants go about projects (the Rittman Mead Way). Stewart has contributed greatly to our work around Extreme BI, version control and release management. These processes are now embedded in our delivery management team and framework, and as such will live on through the DNA of Rittman Mead, but in the meantime we wish him well going into the future.

Categories: BI & Warehousing

Data Integration Tips: GoldenGate on RAC – Action Script for Windows

Rittman Mead Consulting - Fri, 2014-06-06 14:29

I want to briefly interrupt my blog series on GoldenGate and ODI – A Perfect Match in 12c… to provide another Data Integration Tip that was found during a recent client engagement. I was tasked with installing GoldenGate 11g into a 2-node RAC, which is a pretty straight-forward process and well documented by the Oracle Data Integration product team. The challenge is that the client’s Oracle RAC environment was installed on Windows Server 2003.

The Scenario

Working through the GoldenGate setup and configuration process for RAC / Clusterware, most of the same approach applies to either Unix/Linux or Windows (maybe a future blog post?). There is also an example action script towards the bottom of that documentation, but it’s written in shell script… and that just won’t work on Windows! For those unfamiliar with Clusterware, when an application is added as a resource, Clusterware will use the action script to perform specific functions, or actions, against the application. These actions – Check, Start, Stop, etc. – will often run commands through GoldenGate’s GGSCI application, ensuring the Manager process is running, starting the Manager, and so on.

We decided to convert the shell script example to a batch command script, as this is about as native as it gets on Windows. Everything was going well with the conversion until we reached the function below, and specifically its heredoc section.

#call_ggsci is a generic routine that executes a ggsci command
call_ggsci () {
  log "entering call_ggsci"
  ggsci_command=$1
  cd ${GGS_HOME}

  ggsci_output=`${GGS_HOME}/ggsci << EOF
${ggsci_command}
exit
EOF`

  log "got output of : $ggsci_output"
}

The script simply changes directory to the GoldenGate home, runs the GGSCI application, executes the command (passed in as an argument), and then exits GGSCI. The GGSCI commands are wrapped within two EOF “tags” – a construct known as a heredoc – which feeds them to GGSCI as standard input with the formatting, such as hard returns, preserved; and we found that it’s not really possible to do this in a batch script.

When attempting similar code in the batch script, without the heredoc, we ran into an issue. GGSCI attempted to run all of the commands on the same line, which failed miserably. We needed another approach to ensure the script would execute properly.

Here’s the Tip…

The goal was to find a way to run a GGSCI command on a single line. The solution: use GGSCI to execute GoldenGate obey files. Rather than attempt to place hard returns in the batch command script, we simply placed all of the necessary commands in an obey file and passed the obey command, along with the obey file location, into the function.

call :call_ggsci %GGS_HOME%\diroby\StartMgr.oby
 ...

:call_ggsci

 REM set ggsci_command variable with the passed-in value
 SET ggsci_command=%~1
 call :log "ggsci_command: %ggsci_command%"

 REM log whether or not the GGS_HOME path exists
 if exist %GGS_HOME% (
  call :log "%GGS_HOME% exists"
 ) else (
  call :log "%GGS_HOME% does not exist"
 )
 ...

The obey file could be as complex as you like, but in this case it simply starts the manager process (a single START MANAGER command).

start-mgr-obey

This approach works very well, but we did end up with the additional obey files to maintain. Next time around, I’ll probably use a scripting language such as Groovy or Python, as either should work just fine on Linux or Windows.
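For what it’s worth, here’s a minimal sketch of what that Python version might look like – GGS_HOME and the obey file path are illustrative – relying on the fact that GGSCI reads its commands from standard input, which works the same way on Linux or Windows:

import os
import subprocess

GGS_HOME = r'C:\gg'    # illustrative GoldenGate home

def call_ggsci(ggsci_command):
    # feed the command (plus an EXIT) to GGSCI over stdin -- no heredoc required
    p = subprocess.Popen([os.path.join(GGS_HOME, 'ggsci')],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         cwd=GGS_HOME)
    out, err = p.communicate(ggsci_command + '\nEXIT\n')
    return out

print call_ggsci('OBEY diroby\\StartMgr.oby')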

Look for more Data Integration Tips on the blog, as Rittman Mead is always coming up with innovative solutions to interesting challenges!

Categories: BI & Warehousing

Rittman Mead Announce New Partnerships with Cloudera and Amazon Web Services

Rittman Mead Consulting - Thu, 2014-06-05 06:00

To complement our strategic partnership with Oracle Corporation, Rittman Mead are very pleased to announce two new technology partnerships alongside our Oracle-based services: with Cloudera, the industry-leading Hadoop software vendor, and with Amazon Web Services, the infrastructure-as-a-service provider that’s become the standard for deploying highly-available, secure cloud applications. As business intelligence, analytics and data warehousing systems extend onto the cloud and encompass big data, these two new partnerships will allow us to offer best-of-breed services in these areas whilst complementing our main technology partnership with Oracle. So why Cloudera, why Amazon Web Services, and why now?

Extending BI from the EDW to the “Enterprise Data Hub”

If you’re a regular reader of this blog you’ll probably have read several articles by myself and other authors about cloud, Hadoop and big data. Whilst the vast majority of BI & DW systems we put together for clients today are based around regular database and application data sources, increasingly customers are asking us to help them bring non-traditional, NoSQL and Hadoop-based data sources into their BI platform, adding what are becoming termed “enterprise data hubs” alongside the more traditional enterprise data warehouses. We’re also seeing more interest in deploying these systems into the cloud – either driven by the economics of the cloud vs. on-premise systems, or because other key line-of-business systems are moving into the cloud and it makes sense to deploy your BI there too. And Oracle, of course, have noticed this trend, with support being added to OBIEE and ODI for reporting against, and loading data out of and into, Hadoop data sources, and Oracle themselves due to launch Oracle Business Intelligence Cloud Service in the next couple of months.

NewImage

Oracle of course already have some Hadoop-based products out in the market, including the very-impressive Oracle Big Data Appliance and their Oracle Big Data Connectors, and we think these products have the potential to make a massive impact on the types of project and solutions we can offer customers today. But at Rittman Mead we like to get ahead of the technology curve, and so we decided to in addition partner directly with Cloudera, makers of the Hadoop software within Big Data Appliance but a significant software vendor in themselves, to ensure our team is at the forefront of delivering cutting-edge big data solutions based around Cloudera CDH, Apache Spark and Cloudera Impala.

Over the past few months we’ve started to deliver projects and PoCs for customers around Cloudera Hadoop technology and Oracle Engineered Systems, Database and BI tools, and what’s emerged as our core competence is the ability to bring together Oracle’s chosen Hadoop distribution and tools with the core Oracle technologies our customers use. We’ve developed systems using Oracle’s Big Data Connectors, for example to analyse huge datasets across multiple Hadoop nodes using Oracle R Advanced Analytics for Hadoop, and we’re currently working with a couple of customers’ Oracle DBA teams who are in the process of adopting Oracle Big Data Appliance. Most importantly, as official Cloudera Partners we’ve got access to their technical and solutions architecture resources, giving us a similar level of technical backup to the one we have access to on our core Oracle projects.

BI Moves into the Cloud, for Improved Business Agility and TCO

The other big trend that’s driving a lot of innovation in the BI industry is “cloud”. If you were at one of our BI Forum events in Brighton and Atlanta this year, you’ll know that cloud is front-and-centre in Oracle’s product strategy at the moment, both in terms of making BI available as part of the wider Oracle Public Cloud, and as a way of accelerating innovation and making more of the product self-service. What’s now been officially named “Oracle BI Cloud Service” (BICS) was officially announced at last year’s Openworld, we’re about to help beta-test the product for Oracle prior to its launch later in the year, and we’re expecting BICS to be particularly attractive to existing OBIEE customers looking to quickly spin-up departmental BI environments without the need for IT to get involved.

But as I covered in a couple of blog posts earlier in the year, BICS functionality is likely to be limited in its initial incarnation and many customers are going to be looking to run “full” OBIEE in the cloud, along with a full Oracle database and an ETL infrastructure using tools such as ODI or Informatica, and for these types of customer a more complete cloud solution will be needed – which is why we’ve also partnered with Amazon Web Services, in our view by far the best cloud service provider on the market and the platform behind companies such as Netflix and Dropbox.

Logo amazon aws

We’ve been long-term users of Amazon AWS, starting around three or four years ago with our training servers on the platform and more recently undertaking internal and customer development work on their “virtual private cloud” platform as part of our new “Extreme BI in the Cloud” initiative. As techies, we appreciate the power and flexibility of the Amazon AWS API, which we use to script and automate much of our “DevOps” work, and we’re finding more and more of our customers are moving to AWS independently, due to their excellent uptime and levels of customer service, and the wide ecosystem of AWS-native and partner products such as Amazon Redshift, Amazon Elastic MapReduce, Amazon EC2 and S3, and Attunity Cloudbeam. Again, we’ve partnered officially with Amazon AWS so that we can train our team up and access technical and solutions architecture resources, and as with the Cloudera partnership, our particular specialisation is in deploying Oracle-based technologies onto the Amazon AWS platform.

So does this mean that Rittman Mead are de-emphasising our Oracle Partnership or otherwise taking our attention away from solutions built-on OBIEE, ODI, Essbase and the Oracle Database? Far from it; the core of the company will always be around Oracle technology, but by partnering with two best-of-breed complementary organisations – Cloudera, for Oracle-centric Hadoop and big data solutions, and Amazon AWS, for cloud and hybrid deployments – we can continue to offer customers the most innovative and industry-leading solutions as BI evolves from its database and on-premise roots to cover big data and cloud deployments.

Look out on our blog over the coming months as we cover integrating Oracle technology with Cloudera’s Hadoop platform and Oracle Big Data Appliance, and for products and services where we combine Oracle and Cloudera technology in innovative ways, creating cutting-edge solutions for our customers both on-premise, in the Amazon and Oracle clouds, and as hybrid on-premise/cloud deployments.

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 2: Journalizing Knowledge Module

Rittman Mead Consulting - Wed, 2014-06-04 16:35

This is the second post in a blog series on the integration between Oracle Data Integrator (ODI) 12c and GoldenGate 12c. The first post focused on the latest, yet-to-be-released, Oracle Information Management Reference Architecture and some high-level features new to the 12c versions of each product. We also began working through the details of the GoldenGate installation and configuration, specifically the JAgent setup, which is necessary for communication between ODI and GoldenGate during the setup of “online” journalizing. In this post, we’ll look at the new features of the Journalizing Knowledge Module “JKM Oracle to Oracle Consistent (OGG Online)” and get started on the modifications to the JKM that must occur for us to load the Raw Data Reservoir (RDR – acronym coined by Nick Hurt in the comments of my first post…I like it!) and Foundation Layer simultaneously.

ODI Journalizing

Before I get into editing the JKM, let me briefly go through an introduction to ODI Journalizing, aka Change Data Capture (CDC). In case you missed it earlier, ODI CDC is implemented using a Journalizing Knowledge Module (JKM). The JKM generates the infrastructure for CDC, creating journal tables that store the change rows and views that provide access to the change rows for use in ODI Mappings. ODI CDC can be implemented using various capture processes, such as triggers on source tables, timestamps on rows, or mining of the database logs via Oracle Streams or, in our case, Oracle GoldenGate. A great explanation of ODI JKMs and how they work, written by Christophe Dupupet of the Oracle Data Integration A-Team, can be found here.

Import ODI JKMs

The integration between GoldenGate and ODI begins with a Journalizing Knowledge Module; in our example we’ll be using the “JKM Oracle to Oracle Consistent (OGG Online)”. The JKM, delivered with ODI, will create the GoldenGate parameter files, configure GoldenGate process groups (and start them up), and generate the ODI CDC tables and views. Applied to the source Model in 12c, the JKM uses the ODI metadata to generate the GoldenGate parameter file mappings. This alone saves quite a bit of manual work and reduces possible typos caused by human error.

JKM Oracle to Oracle Consistent (OGG Online)

In the previous post, I mentioned the new capability of the JKM that allows for an “online” integration between ODI and GoldenGate. But, there are many other new features that need to be described, so we’ll walk through those here.

ODI Tool: OdiOggCommand

The JKM uses an undocumented ODI Tool called OdiOggCommand in the target command of some tasks that are executed only when in “online” mode. This tool has different values for a parameter called OPERATION.

EXECUTECMD: executes various commands, such as Add Extract, within GGSCI
EXECUTEOBEY: runs the OBEY command against an obey file in GGSCI
DEFGEN: generates the source definitions file by executing the DEFGEN command
SAVEFILE: uploads the parameter and obey files to the GoldenGate installation directory

OdiOggCommand

I imagine the code behind this ODI Tool is simply executing command line calls to GGSCI, DEFGEN, etc. It would be great to see some Oracle documentation on this one!

GoldenGate Topology

The “online” aspect of the JKM requires that a Data Server, Physical Schema and Logical Schema all be set up under the GoldenGate technology in ODI. The Data Server contains the location and connection information for communicating with the JAgent on either the source or target GoldenGate server. Under that, a Physical Schema must be set up for each extract (the pump is included) and replicat process group that will be implemented on that server.

GoldenGate Physical Schema

In this screenshot, we have a Physical Schema with the process type set to Capture, also called the extract. Here we set up the directory path on the source to the location where transactions captured from the source database logs will be stored in GoldenGate’s own log files, called trail files. The remote trail file directory, set up on the target server and accessed by the pump process in order to move transactions from the source trail to the target trail, must also be added. Additional options, such as trail file size (how large a trail file can get before rolling over to the next file), are also available to be set. Just below the Capture Process Properties are the Additional Options, parameters that can be added to the extract or replicat parameter files based on the specific needs of the solution.

GoldenGate Physical Schema Additional Options

Here we can add options to handle different aspects of the extract or replicat, such as TRANLOGOPTIONS on the extract. Once I add this option, I get a template of the additional parameters I can configure. For example, I may want to access my transaction logs stored in ASM via the database rather than directly from ASM. The DBLOGREADER option lets me utilize the ASM API in the database, simplifying my access to the logs.

TRANLOGOPTIONS DBLOGREADER, DBLOGREADERBUFSIZE 2597152

When I add the additional option to the capture physical schema, it will be generated as a part of the extract parameter file. This helps to drastically reduce the amount of manual editing that had to occur after the parameter files were generated in the 11g version, and is a great addition to the JKM.

JKM Setup and Options

The JKM is actually applied to the source ODI Model, which is a grouping of logical table structures called Datastores. On the Journalizing tab in the Model, we first set the type (Consistent Set) and choose the appropriate JKM.

GoldenGate Process Selection

In the GoldenGate Process Selection section, we choose the Capture and Delivery Logical Schemas that were setup in the ODI Topology. If none have been created, but the Data Server for the source and target GoldenGate installations exist, we can choose to create the Physical and Logical Schemas directly from the Model by clicking the Create button. This is a nifty way to separate the system administration role (setting up the Data Server) from the metadata management or developer role.

The JKM also has a set of Options with configurable values, some being required and others optional. The number of Options has been reduced by quite a bit in the 12c JKM. This makes sense, as we’ve seen that additional metadata such as the trail file location and trail file size are set elsewhere. We’ll go through these in more detail later on when setting the values in our example.

JKM Options

That covers the updated features in the JKM; now let’s talk Knowledge Module customization.

JKM Customization

In the first post, I described how we want to extract the source data once and replicate it into both the Foundation Layer and Raw Data Reservoir in parallel. The JKM is set up to load the ODI CDC Framework (the fully replicated table and J$ table) out of the box, but not the Foundation table. In case you’re unfamiliar with the purpose of the Foundation layer, the idea is to store all transactional history from the source tables by converting every transaction into an insert, and tracking the type of change (insert / update / delete), commit date, and commit SCN. With this information stored from the beginning of the data warehouse, it can be used for historical drill-throughs from the dimensional model, or to completely reload a star schema – including all history. With the stage set, let’s look at what we’re going to change in the JKM.

1. Add New Option “APPLY_FOUNDATION”
This option, when true, will allow the Start Journal process to generate the source-to-foundation mapping statement in the Replicat (apply) process.

2. Add New Option “FND_LSCHEMA”
The Logical Schema name for the Foundation layer schema.

3. Add New Task “Create apply prm (4) RM”
This task will create the source-to-foundation mapping code and add it to the replicat parameter file (see the sketch just after this list for roughly what that mapping looks like).

4. Set Option on “Create apply prm (4) RM” Task
Set the execution Options to have APPLY_FOUNDATION as the only checked option. This will determine whether or not that task will execute when Start Journal is run.

5. Edit Task “Execute apply commands online RM”
We’ll comment out the “start replicat …” command, as we’ll first need to complete an initial load of the source data to the target.
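To make the foundation load more concrete, the source-to-foundation mapping that the new task in step 3 generates would come out looking something like the snippet below in the replicat parameter file – a hand-written sketch with made-up schema and table names, not the exact code the customised JKM will produce:

-- illustrative source-to-foundation mapping in the replicat parameter file
INSERTALLRECORDS
MAP OLTP.CUSTOMERS, TARGET EDW_FND.CUSTOMERS,
  COLMAP (USEDEFAULTS,
          OP_TYPE     = @GETENV ('GGHEADER', 'OPTYPE'),
          COMMIT_DATE = @GETENV ('GGHEADER', 'COMMITTIMESTAMP'),
          COMMIT_SCN  = @GETENV ('TRANSACTION', 'CSN'));

INSERTALLRECORDS is what turns every source insert, update and delete into an insert on the foundation table, giving us the full transaction history described above.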

Now, I did say I would get into the editing of the JKM in this post, but after describing the new features of the Knowledge Module, I’ll save the actual details for part 3 of the series! Up next, editing the JKM and building the necessary ODI objects so we can start journalizing.

Categories: BI & Warehousing

Fun with SQL - Silver Pockets Full

Chet Justice - Wed, 2014-06-04 16:13
Silver Pockets Full, send this message to your friends and in four days the money will surprise you. If you don't, well, a pox on your house. Or something like that. I didn't know what it was, I just saw this in my FB feed:



Back in November, I checked to see the frequency of having incremental numbers in the date, like 11/12/13 (my birthday) and 12/13/14 (kate's birthday). I don't want to hear how the rest of the world does their dates either, I know (I now write my dates like YYYY/MM/DD on everything, just so you know, that way I can sort it...or something).

Anyway, here’s some SQL to test out the claim that a month with five Fridays, five Saturdays and five Sundays only comes around once every 823 years. Yay SQL.

OK, I'm not going to go into the steps necessary because I'm lazy (and I'm just lucky to be writing here), so here it is:
select *
from
(
  select
    to_char( d, 'yyyymm' ) year_month,
    count( case
             when to_char( d, 'fmDay' ) = 'Saturday' then 1
             else null
           end ) sats,
    count( case
             when to_char( d, 'fmDay' ) = 'Sunday' then 1
             else null
           end ) suns,
    count( case
             when to_char( d, 'fmDay' ) = 'Friday' then 1
             else null
           end ) fris
  from
  (
    select to_date( 20131231, 'yyyymmdd' ) + rownum d
    from   dual
    connect by level <= 50000
  )
  group by
    to_char( d, 'yyyymm' )
)
where fris = 5
and   sats = 5
and   suns = 5
So over the next 50,000 days, this happens 138 times. I'm fairly certain that doesn't rise to the once every 823 years claim. But it's cool, maybe.
YEAR_MONTH       SATS       SUNS       FRIS
---------- ---------- ---------- ----------
201408              5          5          5
201505              5          5          5
201601              5          5          5
201607              5          5          5
201712              5          5          5
128 more occurrences...
214607              5          5          5
214712              5          5          5
214803              5          5          5
214908              5          5          5
215005              5          5          5

138 rows selected
I'm not the only dork that does this either, here's one in perl. I'm sure there are others, but again, I'm lazy.
Categories: BI & Warehousing

How We Deliver Agile OBIEE Projects – Introducing ExtremeBI

Rittman Mead Consulting - Wed, 2014-06-04 04:56

Most OBIEE projects that we see are delivered through some sort of “waterfall” method, where requirements are defined up-front, there are several stages of development, one or more major releases at the end, and any revision to requirements takes the form of a change request. These work well where requirements can be defined upfront, and can be reassuring to customers who want to agree a fixed price up-front with every subsequent change clearly costed. But, as with the development world in general, some customers are starting to look at “agile” methods for delivering BI projects, where requirements emerge over the course of the project and there isn’t so much of a fixed design or specification at the start; instead the project adds features or capabilities in response to what are called “user stories” – making it more likely that what ends up getting delivered is in line with what users actually want, and where changes and additions to requirements are welcomed rather than treated as extra-cost change requests.

OBIEE naturally lends itself to working in an agile manner, through the three-layer nature of the repository (RPD); by separating the physical representation of the source data from how it is then presented to the end-users, you can start from the off with the dimensional model that’s your end goal, and then over time evolve the back-end physical layer from pointing directly at the source system to instead point at a data warehouse or OLAP cube. In fact, I covered this approach back in 2008 in a blog post called “A Future Oracle OBIEE Architecture” where I positioned OBIEE’s BI Server as a “business logic layer”, and speculated that at some point in the future, OBIEE might be able to turn the logical > physical mappings in the RPD into actual ODI mappings and transformations.

NewImage

In the end, although OBIEE’s aggregate persistence feature gave us the ability to spin off aggregate tables and cubes from the RPD logical model, full ETL “push-down” never came, although you can see traces of it if you have a good poke around the DLLs and directories under the BI Server component. What did happen, though, was Exadata; features such as SmartScan, and Exadata’s ability to do joins between normalised tables much faster than regular databases, meant that it became possible to report directly against an OLTP schema, or an ODS-like foundation layer, only adding ETL to build a performance star-schema layer if it was absolutely necessary. We covered this in a series of posts on Agile Data Warehousing with Exadata, and the focus of this method was performance – by adding Exadata, and the metadata flexibility in OBIEE’s RPD, we could deliver agile projects where Exadata gave us the performance even when we reported directly against a third-normal-form data source.

NewImage

And this approach worked well for our customers; if they’d invested in Exadata, and were open to the idea of agile, iterative development, we could typically deliver a working system in just a few months, and at all times what the users got was what they’d requested in their user story backlog. But there were still ways in which we could improve this method; not everyone has access to Exadata, for example, and reporting directly against a source system makes it tricky to add DW features like history and surrogate keys. So recently we introduced the successor to this approach, in the form of an OBIEE development method we call “ExtremeBI”. Building on our previous agile work, ExtremeBI introduced an integration element, using GoldenGate and ODI to replicate in real time any source systems we were interested in to the DW foundation layer, add the table metadata that DW systems expect, and then provide a means to transform the logical-to-physical RPD mappings into ODI ETL specifications.

NewImage

But in a way, all the technical stuff is by-the-by; what this means in practice for customers is that we deliver working systems from the first iteration – initially by reporting directly against a replicated copy of their source system (with replication and metadata enhancement by GoldenGate, and optionally ODI), and then over subsequent iterations adding more end-user functionality, or hardened ODI ETL code, all the while driven by end-user stories and not some technical design signed off months ago which no longer reflects what users actually want.

NewImage

What we’ve found from several ExtremeBI customer engagements, though, is that it’s not just down to the technology and how well ODI, OBIEE and GoldenGate work; the major factors in successful projects are, firstly, having the project properly pre-qualified at the start; not every project, and not every client, suits agile working, and agile works best if you’re “all in”, as opposed to just agreeing to work in sprints but still having a set-in-stone set of requirements which have to be met at a certain time. The second important success factor is proper project organisation; we’ve grown from just a couple of guys with laptops back in 2007 to a fully-fledged, end-to-end development organisation, with full-time delivery managers, a managed services desk and tools such as JIRA, and you need to have this sort of thing in place, particularly a project management structure that’s agile-friendly and a good relationship with the customer where they’re fully signed up to the agile approach. As such, we’ve found the most success where we’ve used ExtremeBI for fairly technically-savvy customers, for example an MIS department, who’ve been tasked with delivering something for a reasonable price over a short number of months, who understand that not every requirement can be delivered, but really want their system to get adopted, delight their customer and focus its features on what’s important to end-users.

As well as processes and a method, we’ve also developed utilities and accelerators to help speed-up the initial setup, and ensure the initial foundation and staging layers are built consistently, with GoldenGate mappings already put in place, and ready for our developers to start delivering reports against the foundation layer, or use these foundation-layer tables as the basis of a data mart or warehouse build-out. The screenshot below shows this particular tool, built using Groovy and run from within the ODI Studio user interface, where the developer selects a set of source tables from an ODI model, and then the utility builds out the staging and foundation layers automatically, typically saving days over the manual method.

NewImage

We’ve also built custom KMs for ExtremeBI, including one that uses Oracle Database’s flashback query feature to pull historical transactions from the UNDO log, as an alternative to Oracle Streams or Oracle GoldenGate when these aren’t available on the project.
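For context, the flashback version query that sort of KM is built around looks roughly like the example below – illustrative only, with a made-up table name and bind variable rather than the actual KM code:

SELECT versions_operation,
       versions_starttime,
       versions_startscn,
       src.*
FROM   customers
       VERSIONS BETWEEN SCN MINVALUE AND MAXVALUE src
WHERE  versions_startscn > :last_captured_scn
ORDER  BY versions_startscn;

Each row returned carries the DML operation type (I/U/D) and the SCN at which it happened, which is enough to drive an insert-only journal or foundation-layer load in much the same way as the GoldenGate-based approach.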

All together, using Rittman Mead’s ExtremeBI method along with OBIEE, ODI and optionally GoldenGate has meant we’ve been able to deliver working OBIEE systems for customers in just a few months, typically for a budget of less than £50k. Coupled with cloud hosting, where we can get the customer up and running immediately rather than having to wait for their IT department to provision servers, we think this is the best way for most OBIEE 11g projects to be delivered in the future. If you’re interested, we’ve got more details on our “ExtremeBI in the Cloud” web page, or you can contact me via email – mark.rittman@rittmanmead.com – if you’d like to discuss it further.

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 1: Getting Started

Rittman Mead Consulting - Fri, 2014-05-30 14:13

Over the years, I’ve blogged quite a bit about integration between Oracle Data Integrator and GoldenGate, and how to make it all work with the Oracle Reference Architecture. With the release of the 12c versions of ODI and GoldenGate last October, and a soon-to-be-updated reference architecture, it’s time to write a few posts on the subject again.

Getting Started with 12c

First, let me describe the new Journalizing Knowledge Module (JKM) that has been introduced in ODI 12c for integration with GoldenGate: JKM Oracle to Oracle Consistent (OGG Online). This JKM now allows GoldenGate to be set up in an “online” mode, meaning the GoldenGate parameter files and process groups will be configured and installed on the source and target GoldenGate servers. ODI communicates with the GoldenGate JAgent to perform the installation. The “offline” mode still exists as it did in the 11g version of the JKM, in which the parameter files, etc. are created in a temporary location and then manually moved to the source and target. I’ll use the “online” JKM throughout this series of posts.

Another change to how Journalizing is implemented for GoldenGate in ODI is the Model to which the JKM is applied. In ODI 11g, the GoldenGate JKM was always applied to the Model containing the target tables, leaving the source table metadata completely out of the picture. This made sense, as GoldenGate handled everything on the source side. Now in ODI 12c, the source tables are reverse engineered and the JKM applied to the source Model. This allows the source table to be used in a single mapping for both the initial load and incremental load of the performance layer through the use of ODI 12c deployment specifications. The target, or fully replicated table, is no longer necessary in the metadata. We’ll talk through this concept in more detail later on.

Reference Architecture Update

Finally, before we get into the details, let’s go over the latest (yet to be released, I might add) version of the Oracle Information Management Reference Architecture. It was first presented by Stewart Bryson and Andrew Bond (Oracle) at the Rittman Mead BI Forum in Brighton (which is why I was given the OK to mention it pre-release!).

Information Management - Logical View

This latest reference architecture is not much different from previous versions. The main difference that pertains to this blog post is that the Staging Layer has been renamed the Raw Data Reservoir. If you look through the presentation by Stewart and Andrew, you’ll see that many of the principles remain the same. They also describe more of an Agile approach to implementing the reference architecture, using GoldenGate, OBIEE against transactional schemas, and eventually ETL development using ODI – a methodology we here at Rittman Mead use with our clients, and call ExtremeBI. Look for the official release of the latest Information Management Reference Architecture in the next couple of months.

In this blog series, we’ll look at how to load the Raw Data Reservoir and Foundation Layer using GoldenGate, and subsequently load the Access and Performance Layer with Oracle Data Integrator Mappings. If you recall from my 11g posts on the subject, we don’t need to load these layers in sequence; GoldenGate allows us to extract the source data once and replicate it to multiple targets in parallel.

load-raw-data-fnd

Now that we’ve gone through some of the updated concepts for 12c and the Reference Architecture, let’s look at the high-level steps that must be taken in order to implement GoldenGate and ODI integration.

  • Install GoldenGate on the source and target servers – including JAgent configuration
  • Edit the “JKM Oracle to Oracle Consistent OGG (Online)” Knowledge Module (enable Foundation Layer load)
  • Setup ODI Topology (database schema and GoldenGate connections)
  • Setup and start Journalizing on the source Model
  • Develop Mappings for initial load and incremental load
  • Perform initial load and start replication

There is quite a lot of detail to add to these steps, so let’s get right into it.

GoldenGate 12c Installation and JAgent Configuration

The install of GoldenGate 12c is pretty straightforward, so I don’t plan on going into much detail here. A step-by-step guide can be found on the DBASolved blog, with the installation setting up and starting the manager process and creating the necessary subdirectories. We then need to configure the JAgent on both the source and target GoldenGate installations, enabling ODI to communicate with GoldenGate during the “online” JKM Start Journalizing process, which will automatically configure and start the GoldenGate process groups. Setting up the JAgent for ODI integration is essentially the same as if you were setting up Oracle Enterprise Manager integration with GoldenGate.

First, you’ll notice that the file jagent.prm exists in the dirprm directory after installation completes. This parameter file will be used by the jagent process once started in GGSCI.

JAgent.prm

Next, we need to enable monitoring of GoldenGate by adding an entry, ENABLEMONITORING, to the GLOBALS file. Create the GLOBALS file (with no extension) in the GoldenGate home directory and open it in your favorite text editor. Simply add the line to enable monitoring, close and save the file.

Edit GLOBALS file

To allow secure communication between ODI and GoldenGate, we must create an Oracle Wallet with a password for the JAgent. From the GoldenGate install directory, run the password agent utility. This will create the cwallet.sso file in the dirwlt subdirectory.

Create Oracle wallet

We’re almost there! Now we need to make a slight change to the Config.properties file in the cfg directory under the GoldenGate home. Edit this file and make the following changes:

Set the agent type to Oracle Enterprise Manager. If left as the default OGGMON, the JAgent will attempt to register with the Monitor server, which most likely is not installed.

agent.type.enabled=OEM

Under the JMX Username, add the line to signify that SSL is not being used by JAgent (unless SSL is actually being used!).

monitor.jmx.username=cmroot
jagent.ssl=false

Finally, ensure the JAgent port is unique across all installations, and on the server in general. In my example, I’m using a single Virtual Machine to host both the source and target GoldenGate installations, so I need to be careful about which ports are in use.

jagent.rmi.port=5571

Before starting the JAgent, go ahead and stop, then start the Manager process to ensure all changes have been initialized. Then, start the JAgent and check to ensure both Manager and JAgent are running. That completes the GoldenGate installation and JAgent configuration.
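In GGSCI that sequence looks something like the following (indicative only – the prompts and process names will vary with your install):

GGSCI> STOP MGR
GGSCI> START MGR
GGSCI> START JAGENT
GGSCI> INFO ALL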

Start JAgent

If you want to just skip all of this installation work and get right to it, you can always download the Prebuilt Machine for Oracle Data Integrator 12c. It’s a VirtualBox VM with ODI 12c and GoldenGate 12c already installed and configured to work with the Getting Started Guide. The JAgent is already configured on the source and target GoldenGate installations, making it easy to get up and running. This is a great resource that the Oracle Data Integration product team has provided, and it sounds like they plan to continue adding to it in the future.

In Part 2 of the blog post series, we’ll edit the JKM to enable parallel load of the Raw Data Reservoir and Foundation Layer, as well as begin setup of the ODI Topology and metadata.

Categories: BI & Warehousing

Visual Regression Testing of OBIEE with PhantomCSS

Rittman Mead Consulting - Fri, 2014-05-23 09:20

Earlier this year I wrote a couple of blog posts (here and here) discussing the topic of automated Regression Testing and OBIEE. One of the points that I was keen to make was that OBIEE is a stack of elements and, depending on the change being tested, it may be sensible to focus on certain elements in the stack instead of all of it. For example, if you are changing the RPD, there is little value in doing a web-based test when you can actually test for the vast majority of regressions using the nqcmd tool alone.

I also argued that testing the front end of OBIEE using tools such as Selenium is difficult to do comprehensively; it can be inflexible, time-consuming and in some cases just not a sensible use of effort. These tools work around the idea of parsing the web page that is served up and checking for the presence (or absence) of a particular piece of text or an element on the page. So for example, you could run a test and tell it to fail if it finds the text “Error” on the page, or you could say only pass the test if some known content is present, such as a report title or data figure. This type of testing is prone to a great deal of false negatives, because to efficiently build any kind of test case you must focus on something to check for in the page, but you cannot code for every possible error or failure. It is also usually based heavily on the internal IDs of elements on the page when locating the ‘something’ to check for. As the OBIEE Document Object Model (DOM) is undocumented code, Oracle are presumably at liberty to change it whenever they feel like it, and thus any tests written based on it may fail. Finally, OBIEE 11g still defaults to serving up graphs as Flash objects, which Selenium et al just cannot handle, and so these cannot be tested.

So, what do we do about regression testing the OBIEE front end?

What do we need to test in the front end?

There is still a strong case for regression testing the OBIEE front end. Analyses get changed, Dashboards break, permissions are updated – all these things can cause errors or problems for the end user, and they are things that testing further down the OBIEE stack (using something like nqcmd) will not cover.

Consider a simple dashboard:

If one of the dashboard pages that are linked to in the central section gets moved in the Presentation Catalog, then this happens:

OK, so “Invalid Link Path:” is pretty easy to code in as an error check in Selenium. But what about if the permissions on an analysis used in the dashboard get changed and the user can no longer access it when running the dashboard?

This is a different problem altogether. We need to check for the absence of something. There’s no error; the analysis that ought to be present just isn’t there. One way around this would be to code for the presence of the analysis title text or content – but that is not going to scale, nor be maintainable, for every dashboard being tested.

Another thing that is important to check in the front end is that authorisations are enforced as they should be. That is, a user can see the dashboards that they should be able to, and cannot see the ones they’re not. Changes made in the LDAP directory holding users and their groups, or a configuration change in the Application Roles, could easily mean that a user can no longer see the dashboards they should be able to. We could code for this specific issue using something like Web Services to programmatically check each and every actual permission – but that could well be overkill.

What I would like to introduce here is the idea of testing OBIEE for regressions visually - but automated, of course.

Visual Regression Testing

Driven by the huge number of applications that are accessed solely on the web (sorry, “Cloud”), a new set of tools has been developed to support the idea of testing web pages for regressions visually. Instead of ‘explaining’ to the computer specifically what to look for in a page (no error text, etc), visual regression testing uses a process to compare images of a web page, comparing a baseline to a sample taken afterwards. This means that the number of false negatives (missing genuine errors because the test didn’t detect them) drops drastically, because instead of relying on a test program to parse the Document Object Model (DOM) of an OBIEE web page (which is extremely complex), it simply considers whether two snapshots of the resulting rendered page look the same.

The second real advantage of this method is that typically the tools (including the one I have been working with and will demonstrate below, PhantomCSS) are based on the actual engine that drives the web browsers in use by real end-users. So it’s not a case of parsing the HTML and CSS that the web server sends us and trying to determine if there’s a problem or not – it is actually rendering it the same as Chrome etc and taking a snapshot of it. PhantomCSS uses PhantomJS, which uses the engine that Safari is built on, WebKit.

Let’s Pretend…

So, we’ve got a tool – that I’ll demonstrate shortly – that can programmatically fetch and snapshot OBIEE pages, and compare the snapshots to check for any changes. But what about graphs rendered in Flash? These are usually a blindspot. Well here we can be a bit cheeky. If you pretend (in the User-Agent HTTP request header) to be an iPhone or iPad (devices that don’t support Flash) then OBIEE obligingly serves up PNG graphs plus some javascript to do the hover tooltips. Because it’s a PNG image, it will be rendered correctly in our “browser”, and so included in the snapshot for comparison.
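In CasperJS this takes one line – something along these lines should work, with the User-Agent string itself just an example of an iPad signature rather than anything magic:

// Pretend to be an iPad so OBIEE serves up PNG charts instead of Flash
casper.userAgent('Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53');

The same setting can also be passed at instantiation time via the pageSettings.userAgent option of require('casper').create().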

CasperJS

Let’s see this scripting in action. Some clarification of the programs we’re going to use first:

  • PhantomJS is the core functionality we’re using, a headless browser sporting Javascript (JS) APIs
  • CasperJS provides a set of APIs on top of PhantomJS that make working with web page forms, navigation etc much easier
  • PhantomCSS provides the regression testing bit, taking snapshots and running code to compare them and report differences.

We’ll consider a simple CasperJS example first, and come on to PhantomCSS after. Because PhantomCSS uses CasperJS for its core interactions, it makes sense to start with the basics.

Here is a bare-bones script. It loads the login page for OBIEE, echoes the page title to the console, takes a snapshot, and exits:

var casper = require('casper').create();

casper.start('http://rnm-ol6-2:9704/analytics', function() {
  this.echo(this.getTitle());
  this.capture('casper_screenshots/login.png');
});

casper.run();

I run it from the command line:

$ casperjs casper_example_01.js
Oracle Business Intelligence Sign In
$

As you can see, it outputs the title of the page, and then in the screenshots folder I have this:

I want to emphasise again why this is so useful: I ran this from the command line only. I didn’t run a web browser, and I didn’t take any snapshots by hand – it was all automatic.

Now, let’s build a bit of a bigger example, where we login to OBIEE and see what dashboards are available to us:

// Set the size of the browser window as part of the 
// Casper instantiation
var casper = require('casper').create({viewportSize: {
        width: 800,
        height: 600
    }});

// Load the login page
casper.start('http://rnm-ol6-2:9704/analytics', function() {
  this.echo(this.getTitle());
  this.capture('casper_screenshots/login.png');
});

// Do login
casper.then(function(){
  this.fill('form#logonForm', { NQUser: 'weblogic' ,
                                NQPassword: 'Password01'
                              }, true);
}).
waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?bieehome',function(){
  this.echo('Logged into OBIEE','INFO')
  this.capture('casper_screenshots/afterlogin.png');
  });

// Now "click" the Dashboards menu
casper.then(function() {
  this.echo('Clicking Dashboard menu','INFO')
  casper.click('#dashboard');
  this.waitUntilVisible('div.HeaderPopupWindow', function() {
    this.capture('casper_screenshots/dashboards.png');
  });
});

casper.run();

So I now get a screenshot of after logging in:

and after “clicking” the Dashboard menu:

The only bit of the script above that isn’t self-explanatory is where I am referencing elements. The references are CSS3 selectors and are easily found using something like Chrome Developer Tools. Where the click on Dashboards is simulated, there is a waitUntilVisible function, which is crucial for making sure that the page has rendered fully. A user clicking the menu would obviously wait until it appears, but computers work much faster, so functions like this are important for reining them back.

To round off the CasperJS script, let’s extend the above to navigate to a Dashboard, snapshot it (with graphs!), and then log out.

[...]
casper.then(function(){
  this.echo('Navigating to GCBC Dashboard','INFO')
  casper.clickLabel('GCBC Dashboard');
})

casper.waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?dashboard', function() {
  casper.waitWhileVisible('div.AjaxLoadingOpacity', function() {
    casper.waitWhileVisible('div.ProgressIndicatorDiv', function() {
      this.capture('casper_screenshots/dashboard.png');
    })
  })
});

casper.then(function() {
  this.echo('Signing out','INFO')
  casper.clickLabel('Sign Out');
});

Again, there are a couple of waitWhileVisible functions in there, necessary to get CasperJS to wait until the dashboard has rendered properly. The rendered dashboard is captured thus:

PhantomCSS

So now let’s see how we can use the above CasperJS code in conjunction with PhantomCSS to generate a viable regression test scenario for OBIEE.

The script remains pretty much the same, except CasperJS’s capture gets replaced with a phantomcss.screenshot based on an element (html for the whole page), and there’s some extra code “footer” to include that executes the actual test.
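For reference, the skeleton header and footer look something like the snippet below – a sketch based on the PhantomCSS examples of the time, so treat the exact paths and option names as assumptions to check against your own install:

// Pull in PhantomCSS and tell it where to keep baseline and failure images
var phantomcss = require('./phantomcss.js');
phantomcss.init({
    screenshotRoot: './screenshots',
    failedComparisonsRoot: './failures'
});

// ...CasperJS navigation code as before, taking snapshots with
// phantomcss.screenshot('<css selector>', '<snapshot name>');

// Footer: compare everything captured in this run against the baseline images
casper.then(function() {
    phantomcss.compareAll();
});

// Exit with a status code reflecting pass or fail, so the test can drive a CI job
casper.run(function() {
    phantom.exit(phantomcss.getExitStatus());
});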

So let’s see how the proposed test method holds up to the examples above – broken links and disappearing reports.

First, we run the baseline capture, the “known good”. The console output shows that this is the first time it’s been run, because there are no existing images against which to compare:

In the screenshots folder is the ‘baseline’ image for each of the defined snapshots:

Now let’s break something! First off I’ll rename the target page for one of the links in the central pane of the dashboard, which will cause the ‘Invalid Link Path’ message to display.

Now I run the same PhantomCSS test again, and this time it tells me there’s a problem:

When an image is found to differ, a composite of the two highlighting the differences is created:

OK, so first test passed (or rather, failed), but arguably this could have been picked up simply by parsing the page returned from the OBIEE server for known error strings. But what about a disappearing analysis – that’s more difficult to ascertain from the page source alone.

Again, PhantomCSS picks up the difference, and highlights it nice and clearly in the generated image:

The baseline image that you capture should be against a “gold” version of a dashboard – there’s no point including ad-hoc reports or dashboards under development. You’d also want to work with data that is unchanging, so where available use a time filter fixed at a point in the past, rather than ‘current day’, which will be changing frequently.

Belts and Braces?

So visual regression testing is a great thing, but I think a hybrid approach, of parsing the page contents for text too, is worthwhile. CasperJS provides its own test APIs (which PhantomCSS uses), and we can write simple tests such as the following:

this.test.assertTextDoesntExist('Invalid Link Path', 'Check for error text on page');
this.test.assertTextDoesntExist('View Display Error', 'Check for error text on page');
phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

So check for a couple of well-known errors, and then snapshot the page too for subsequent automatic comparison. If an assertion fails, it shows in the console:

This means that what is already being done in Selenium (or for which Selenium is an assumed default tool) could even be brought into the same single test rig based around CasperJS/PhantomCSS.

Frame of Reference

The eagle-eyed of you will have noticed that the snapshots generated by PhantomCSS above are not the entire OBIEE webpage, whereas the ones from CasperJS earlier in this article are. That is because PhantomCSS deliberately wants to focus on an area of the page to test, identified using a CSS3 selector. So if you are testing a dashboard, then considering the toolbar is irrelevant and can only lead to false-positives.

phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

Similarly, considering the available dashboard list (to validate enforced authorisations) just needs to look at the list itself, not the rest of the page.  (and yes, that does say “Protals” – even developers have fat fingers sometimes ;-) )

phantomcss.screenshot('div.HeaderSharedProtals','Dashboard list');

Using this functionality means that the generated snapshots used for comparison can be done to exclude things like the alerts bar (which may appear or disappear between tests).

The Devil’s in the Detail

I am in no doubt that the method described above definitely has its place in the regression testing arsenal for OBIEE. What I am yet to be fully convinced of is quite to what extent. My beef with Selenium et al is the level of detail one has to get into when writing tests – identifying strings to test for, their location in the DOM, and so on. Yet above in my CasperJS/PhantomCSS examples, I have DOM selectors too, so is this just the same problem? At the moment, I don’t think so. For Selenium, to build a comprehensive test, you have to dissect the DOM for every single test you want to build. Whereas with CasperJS/PhantomCSS I think there is just the need to write a basic framework for OBIEE (the basics of which are provided in this post; you’re welcome), which can then be parameterised based on dashboard name and page only. Sure, additional types of tests may need new code, but it would be more reusable.

Given that OBIEE doesn’t come with an out-of-the-box test rig, whatever we build to test it is going to be bespoke, whether it’s nqcmd, Selenium, JMeter, LoadRunner, OATS, QTP, etc – the smart money is on picking the option that will be the most flexible, most scalable, easiest to maintain, and take the least effort to develop. There is no one “program to rule them all” – an accurate, comprehensive, and flexible test suite is invariably going to utilise multiple components focussing on different areas.

In the case of regression testing – what is the aim of the testing? What are you looking to validate hasn’t broken after what kind of change?  If all that’s changed in the system is the DBAs adding some indexes or partitioning to the data, I really would not be going anywhere near the front end of OBIEE. However, more complex changes affecting the Presentation Catalog and the RPD can be well covered by this technique in conjunction with nqcmd. Visual regression testing will give you a pass/fail, but then it’s up to you to decipher the images, whereas nqcmd will give you a pass/fail but also an actual set of data to show what has changed.

Don’t forget that other great tool – you! Or rather, you and your minions, who can sit at OBIEE for 5 minutes and spot certain regressions that would take orders of magnitude more time to build a test to locate. Things like testing for UI/UX changes between OBIEE versions are realistically handled manually. The testing of dashboards can be automated, but some checks can be done faster than I can even type the requirement, let alone build a test to validate it – does clicking on the save icon bring up the save box? Well go click for yourself – done? Next test.

Summary

I have just scratched the surface of what is possible with headless browser scripting for testing OBIEE. Being able to automate and capture the results of browser interactions as we’ve seen above is hugely powerful. You can find the CasperJS API reference here if you want to find out more about how it is possible to interact with the web page as a “user”.

I’ve put the complete PhantomCSS script online here. Let me know in the comments section or via twitter if you do try it out!

Thanks to Christian Berg and Gianni Ceresa for reading drafts of this article and providing valuable feedback. 

Categories: BI & Warehousing

Trickle-Feeding Log Data into the HBase NoSQL Database using Flume

Rittman Mead Consulting - Wed, 2014-05-21 16:02

The other day I posted an article on the blog about using Flume to transport Apache web log entries from our website into Hadoop, with the final destination for the entries being an HDFS file that essentially mirrors the contents of the webserver log file. Once you’ve set this transport mechanism up, you could create a Hive table over the HDFS files, for example, or further transform the data using Pig, Spark or some other mechanism.

When you load data into HDFS files though, there are a couple of things you need to be aware of; HDFS is optimised for large, streaming reads of files stored in very large disk blocks, with the classic use-case being MapReduce transformations that crunch large sets of incoming data and hand-off the results to another process. What it’s not good at is random retrievals of single file records, something you’ll notice if you try and return a single row from a Hive table request. Moreover, HDFS files are write-once, no updates or overwrites, which is why Hive only supports SELECTS and not UPDATES or DELETES. Altogether, whilst HDFS is great for landing and then processing large chunks of data, if you’re looking for more granular, database-type storage on Hadoop, you’ll need to think of something else.

And within the context of Cloudera Hadoop, that other thing is HBase, a “NoSQL” database that’s also open-source and runs on the Hadoop framework. Whilst you can work with HBase in similar ways to how you work with relational databases – you can create columns, load data into it, insert and update data and so forth – HBase and NoSQL are in lots of ways the complete opposite of relational databases like Oracle Database, as they trade off things we normally take for granted but that have performance and scalability impacts – ACID transactions, the ability to support complex table relationships, very rich query languages and application support – for extreme scalability and flexibility. If you’re scared of losing your data then HBase is one of the better NoSQL databases, with strong (rather than “eventual”) consistency, automatic sharding and lots of high-availability features, but it’s not designed for running your payroll (yet).

One reason we might want to land data in HBase or another NoSQL database, rather than in regular HDFS files, is if we then want to do fast individual record lookups within the landed data. Another reason would be HBase’s support for complex record types, making it easy to store for example nested XML datasets, and its ability – like the Endeca Server – to hold completely different sets of “columns” for each row in the database, and even version those rows giving us almost a “multi-dimensional” database. Internally, HBase stores data as key-value pairs giving it the ability to hold completely different data in each database row, and under the covers HBase data is in turn stored in indexed “StoreFiles” within HDFS, giving it HDFS’s scalability and access to the Hadoop framework, but adding fast random access to individual records.

NewImage

Where HBase (and most NoSQL databases) gets complicated though is that there’s no SQL*Developer or TOAD to create tables, and no SQL or PL/SQL to load and manipulate them – it’s all done through Java and custom code. This article by Lars George, who gave the Hadoop Masterclass at last week’s BI Forum, goes into a bit more detail, along with his HBase slides and his book, “HBase: The Definitive Guide”.

So let’s look at a simple example of loading Apache CombinedLogFormat log file entries into HBase, using Flume to transport and ingest the data from our webserver into Hadoop – an example put together by Nelio Guimaraes from the RM team. We’ll start by defining the HBase table, which like regular relational tables has rows, but which has the concept of column families and column qualifiers rather than just columns. In practice, a column family + qualifier name makes up what we’d normally think of as a column, but crucially, under the covers, columns within a family are stored together on disk, like column-store relational databases, making them fast to query and randomly access. Like a spreadsheet or OLAP database, each combination of row and column family/qualifier is called a “cell”, and moreover only populated cells are stored on disk, with the added bonus of cell entries being timestamped, giving us the ability to retrieve previous versions of cell entries, like the temporal query feature in Oracle Database 12c.

NewImage

For more details on how HBase stores data, and how HBase schemas are defined, the white paper “Introduction to HBase Schema Design” by Cloudera’s Amandeep Khurana is a good reference point and introduction. So let’s go into the HBase shell and create a table to contain our log data; we’ll define it as containing three column families (“common”, “http” and “misc”), with the actual column qualifiers defined at the point we load data into the table. One of the key features of HBase, and most NoSQL databases, is that you can introduce new columns into a store at the point of loading, just by declaring them, with each row potentially containing its own unique selection of columns – which is where Endeca Server gets its ability to store “jagged datasets” with potentially different attribute sets held for groups of rows.

[root@cdh5-node1 ~]# hbase shell
14/05/21 06:00:07 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell

hbase(main):001:0> list
TABLE                                                                           
0 row(s) in 2.8030 seconds

=> []
hbase(main):002:0> create 'apache_access_log', 
hbase(main):003:0* {NAME => 'common'},
hbase(main):004:0* {NAME => 'http'},
hbase(main):005:0* {NAME => 'misc'}
0 row(s) in 0.5460 seconds

In this example, the way we’re going to populate the HBase table is to use Flume; like the Flume and HDFS example the other day, we’ll use a “sink”, in this case an HBase sink, to take the incoming Flume activity off the channel and load it into the HBase table. Flume actually has two HBase sinks; one called HBaseSink, which writes synchronously (more straightforward but slower), and another called AsyncHBaseSink, which writes asynchronously, potentially with higher overall throughput than synchronous writes and with full consistency even if there’s a failure (based on replaying the channel data), but with a slightly more complex serialisation approach. We’ll use the asynchronous sink in this example, and assuming you’ve already got the source configuration file set up (see the previous blog post on Flume and HDFS for an example), the target Flume conf file in our case looked like this:

## TARGET AGENT ##
## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent-hbase.conf -n collector

#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2 mc3

## Channels ##
## Source writes to 3 channels, one for each sink
collector.channels = mc1 mc2 mc3

#http://flume.apache.org/FlumeUserGuide.html#memory-channel

collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 1000

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 1000

collector.channels.mc3.type = memory
collector.channels.mc3.capacity = 1000

## Sinks ##
collector.sinks = LocalOut HadoopOut HbaseOut

## Write copy to Local Filesystem 
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /user/root/flume-channel/%{log_type}/%d%m%Y
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

###############################################################
# HBase sink config 
###############################################################
collector.sinks.HbaseOut.type = org.apache.flume.sink.hbase.AsyncHBaseSink
collector.sinks.HbaseOut.channel = mc3
collector.sinks.HbaseOut.table = apache_access_log
collector.sinks.HbaseOut.columnFamily = common
collector.sinks.HbaseOut.batchSize = 5000
collector.sinks.HbaseOut.serializer = com.hbase.log.util.AsyncHbaseLogEventSerializer
collector.sinks.HbaseOut.serializer.columns = common:rowKey,common:hostname,common:remotehost,common:remoteuser,common:eventtimestamp,http:requestmethod,http:requeststatus,http:responsebytes,misc:referrer,misc:agent

A few points to note:

  • The collector.sinks.HbaseOut.type setting determines the sink type we’ll use, in this case org.apache.flume.sink.hbase.AsyncHBaseSink
  • collector.sinks.HbaseOut.table sets the HBase table name we’ll load, “apache_access_log”
  • collector.sinks.HbaseOut.serializer.columns actually defines the column qualifiers, in this case mapping incoming serialised log file rows into a set of HBase column families and qualifiers
  • collector.sinks.HbaseOut.serializer is the most important bit – it tells the sink how to turn the incoming Flume data into HBase loads, through a Java program called the “serializer”.

And it’s this serializer, the Java program that does the actual loading of the HBase table, that’s the final piece of the jigsaw. There are standard templates to use when writing this piece of code, and in our case the serializer looked like this:

package com.hbase.log.util;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.PutRequest;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.SimpleHbaseEventSerializer.KeyType;
import org.apache.flume.sink.hbase.AsyncHbaseEventSerializer;

import com.google.common.base.Charsets;
/**
 * A serializer for the AsyncHBaseSink, which splits the event body into
 * multiple columns and inserts them into a row whose key is available in
 * the headers
 *
 * Originally from https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
 */
public class AsyncHbaseLogEventSerializer implements AsyncHbaseEventSerializer 
{
    private byte[] table;
    private byte[] colFam;
    private Event currentEvent;
    private byte[][] columnNames;
    private final List<PutRequest> puts = new ArrayList<PutRequest>();
    private final List<AtomicIncrementRequest> incs = new ArrayList<AtomicIncrementRequest>();
    private byte[] currentRowKey;
    private final byte[] eventCountCol = "eventCount".getBytes();
    // private String delim;

    @Override
    public void initialize(byte[] table, byte[] cf) 
    {
        this.table = table;
        this.colFam = cf;
    }

    @Override
    public void setEvent(Event event) 
    {
        // Set the event and verify that the rowKey is not present
        this.currentEvent = event;
        String rowKeyStr = currentEvent.getHeaders().get("rowKey");
        //if (rowKeyStr == null) {
        //  throw new FlumeException("No row key found in headers!");
        //}
        //currentRowKey = rowKeyStr.getBytes();
    }

    public String[] logTokenize(String event)
    {

        String logEntryPattern = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"";
        Pattern p = Pattern.compile(logEntryPattern);
        Matcher matcher = p.matcher(event);

        if (!matcher.matches()) 
        {
            System.err.println("Bad log entry (or problem with RE?):");
            System.err.println(event);
            return null;
        }

        String[] columns = new String[matcher.groupCount()+1];
        
        columns[0]= Long.toString(System.currentTimeMillis());
        
        for (int i = 1; i <= matcher.groupCount(); i++) 
        {
            columns[i] = matcher.group(i);
        }

        return columns;

    }

    @Override
    public List<PutRequest> getActions() 
    {
        // Split the event body and get the values for the columns
        String eventStr = new String(currentEvent.getBody());
        long unixTime = System.currentTimeMillis();
        //String[] cols = eventStr.split(",");
        //String[] cols = eventStr.split(regEx);
        //String[] cols = eventStr.split("\\s+");
        //String[] cols = eventStr.split("\\t");
        //String[] cols = eventStr.split(delim);
        String[] cols = logTokenize(eventStr);
        puts.clear();
        String[] columnFamilyName;
        byte[] bCol;
        byte[] bFam;
        for (int i = 0; i < cols.length; i++) 
        {
            //Generate a PutRequest for each column.
            columnFamilyName = new String(columnNames[i]).split(":");
            bFam = columnFamilyName[0].getBytes();
            bCol = columnFamilyName[1].getBytes();

            if (i == 0) 
            {
                currentRowKey = cols[i].getBytes();
            }
            //PutRequest req = new PutRequest(table, currentRowKey, colFam,
            //columnNames[i], cols[i].getBytes());
            PutRequest req = new PutRequest(table, currentRowKey, bFam,
            bCol, cols[i].getBytes());
            puts.add(req);
        }
        return puts;
    }

    @Override
    public List<AtomicIncrementRequest> getIncrements() 
    {
        incs.clear();
        //Increment the number of events received
        incs.add(new AtomicIncrementRequest(table, "totalEvents".getBytes(), colFam, eventCountCol));
        return incs;
    }

    @Override
    public void cleanUp() 
    {
        table = null;
        colFam = null;
        currentEvent = null;
        columnNames = null;
        currentRowKey = null;
    }

    @Override
    public void configure(Context context) 
    {
        //Get the column names from the configuration
        String cols = new String(context.getString("columns"));
        String[] names = cols.split(",");
        columnNames = new byte[names.length][];
        int i = 0;
        
        for(String name : names) 
        {
            columnNames[i++] = name.getBytes();
        }
        
        //delim = new String(context.getString("delimiter"));
    }

    @Override
    public void configure(ComponentConfiguration conf) {}
}

HBase, rather than supporting the regular SELECTs and INSERTs we’re used to with Oracle, instead uses “get” and “put” methods to retrieve and store data – along with “delete” and “scan”. The regular synchronous HBase sink uses these methods directly, taking data off the Flume channel and inserting it into the HBase table (or indeed, updating existing rows based on the row key), whilst the asynchronous method uses a layer in-between the incoming data and the write, allowing data (or “events”) to continue streaming in even if all the downstream data hasn’t yet been committed. It’s this code though that maps each incoming bit of data – in this case, a parsed log file entry – to column families and qualifiers in the HBase table, and you’d need to write new code like this, or amend the existing one, if you wanted to load other HBase tables in your Hadoop cluster – a long way from the point-and-click ETL approach we get with ODI, but a lot more flexible too (if that’s what you want).
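If you’ve not used the HBase client API before, the puts and gets the sink is issuing on our behalf boil down to calls like the ones below – a minimal, standalone sketch using the HBase 0.98-era client API, with the row key and values hard-coded purely for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "apache_access_log");

        // "put" a single cell: row key + column family + qualifier + value
        Put put = new Put(Bytes.toBytes("1400628560331"));
        put.add(Bytes.toBytes("common"), Bytes.toBytes("hostname"), Bytes.toBytes("89.154.89.101"));
        table.put(put);

        // "get" the row back and read the cell value out again
        Get get = new Get(Bytes.toBytes("1400628560331"));
        Result result = table.get(get);
        String hostname = Bytes.toString(result.getValue(Bytes.toBytes("common"), Bytes.toBytes("hostname")));
        System.out.println("hostname = " + hostname);

        table.close();
    }
}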

Then it’s a case of compiling the serializer code, like this:

mkdir com; mkdir com/hbase; mkdir com/hbase/log; mkdir com/hbase/log/util
vi com/hbase/log/util/AsyncHbaseLogEventSerializer.java
export CLASSPATH=/usr/lib/flume-ng/lib/*
javac com/hbase/log/util/AsyncHbaseLogEventSerializer.java
jar cf LogEventUtil.jar com
jar tf LogEventUtil.jar com
chmod 775 LogEventUtil.jar
cp LogEventUtil.jar /usr/lib/flume-ng/lib

Next, we had to run the following command before enabling Flume, because of an issue we found where ZooKeeper stopped Flume from working in this setup:

mv /etc/zookeeper/conf/zoo.cfg /etc/zookeeper/conf/zoo.cfg-unused

and finally, we start up the Flume target server agent, followed by the source one (again see the previous article for setting up the source Flume agent):

flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent-hbase.conf -n collector

Then, after a while, log data starts getting loaded into the HBase table. You can check on it using Hue, and the HBase Browser:

NewImage

Or you can go back into the HBase shell and run the scan command to view the data, with each row representing a cell in the overall table storage:

hbase(main):001:0> scan 'apache_access_log'
ROW                   COLUMN+CELL                                               
 1400628560331        column=common:eventtimestamp, timestamp=1400628560350, val
                      ue=20/May/2014:15:28:06 +0000                             
 1400628560331        column=common:hostname, timestamp=1400628560335, value=89.
                      154.89.101                                                
 1400628560331        column=common:remotehost, timestamp=1400628560336, value=-
 1400628560331        column=common:remoteuser, timestamp=1400628560338, value=-
 1400628560331        column=common:rowKey, timestamp=1400628560333, value=14006
                      28560331                                                  
 1400628560331        column=http:requestmethod, timestamp=1400628560352, value=
                      GET / HTTP/1.1                                            
 1400628560331        column=http:requeststatus, timestamp=1400628560356, value=
                      200                                                       
 1400628560331        column=http:responsebytes, timestamp=1400628560358, value=
                      9054                                                      
 1400628560331        column=misc:agent, timestamp=1400628560377, value=Mozilla/
                      5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.
                      14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14     
 1400628560331        column=misc:referrer, timestamp=1400628560359, value=-    
 1400628560344        column=common:eventtimestamp, timestamp=1400628560383, val
                      ue=20/May/2014:15:28:06 +0000

This is all great, and a good starting point if you plan to process your data with other Java programs as the next step. But what if you want to view the data in a more convenient way, perhaps as a regular table? To do that you can use Hive again, this time using Hive’s HBase integration features to tell it the data is stored in HBase, and to let it know how to map the various HBase column families and qualifiers onto regular Hive columns. In our case, the DDL to create the corresponding Hive table looks like this:

DROP TABLE IF EXISTS hive_apache_access_log;
CREATE EXTERNAL TABLE hive_apache_access_log
(
unixtimestamp string,
eventtimestamp string,
hostname string,
remotehost string,
remoteuser string,
requestmethod string,
requeststatus string,
responsebytes string,
agent string,
referrer string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,common:eventtimestamp,common:hostname,common:remotehost,common:remoteuser,http:requestmethod,http:requeststatus,http:responsebytes,misc:agent,misc:referrer')
TBLPROPERTIES ('hbase.table.name' = 'apache_access_log');

giving us the ability, either from the Hive shell like this, or from tools such as OBIEE and ODI, to query the NoSQL database and bring its data into more regular, relational data stores.

hive> select * from hive_apache_access_log;
OK
1400628560331   20/May/2014:15:28:06 +0000  89.154.89.101   -   -   GET / HTTP/1.1  200 9054    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14 -
1400628560344   20/May/2014:15:28:06 +0000  89.154.89.101   -   -   GET /wp-content/plugins/crayon-syntax-highlighter/css/min/crayon.min.css?ver=2.5.0 HTTP/1.1 304 -   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14 http://www.rittmanmead.com/
1400628560345   20/May/2014:15:28:06 +0000  89.154.89.101   -   -   GET /wp-content/plugins/jetpack/modules/widgets/widgets.css?ver=20121003 HTTP/1.1 304 -   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/537.75.14 http://www.rittmanmead.com/
...

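The same Hive-over-HBase table can be queried programmatically as well; as a rough sketch – assuming HiveServer2 is enabled on the cluster and the PyHive library is installed, neither of which is shown here, and with the hostname, port and username as examples only – something like this would pull the data into a Python script:

from pyhive import hive

# Point these connection details at your own HiveServer2 instance
conn = hive.Connection(host='cdh4-node1.rittmandev.com', port=10000, username='root')
cursor = conn.cursor()

# From here the HBase-backed table behaves like any other Hive table
cursor.execute('SELECT hostname, requeststatus FROM hive_apache_access_log LIMIT 10')
for hostname, status in cursor.fetchall():
    print hostname, status
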
We’ll be covering more on HBase, and Oracle’s NoSQL Database, in future articles on the blog.

Categories: BI & Warehousing

Photos and Presentation Downloads from the Rittman Mead BI Forum 2014

Rittman Mead Consulting - Mon, 2014-05-19 14:59

Well, we’re back in Brighton, UK after the second successful week of the Rittman Mead BI Forum 2014. In Week 1 we went to the Seattle Hotel in Brighton (read the recap here), and straight after we flew over to Atlanta, GA, to run the second week – and it was possibly even better than Brighton ;-)

Congratulations to Omri Traub, winner of the Atlanta Best Speaker award – the first person from Oracle to win, in fact – and to all of the other presenters who helped put together an excellent event. If you’re interested, I’ve uploaded a bunch of photos from both Brighton and Atlanta to Flickr, and you can view the BI Forum 2014 photo set here.

Biforumusa

As usual, where we’ve got permission (or the PDF) from the presenter, we’re making all of the presentations available for download to both attendees and non-attendees – not everyone can make it to the event, but we don’t want you to miss out. We’re also very grateful to Lars George and Cloudera for making their Hadoop Masterclass slides available too – thanks everyone.

Other than that – thanks again to everyone who attended, and hopefully we’ll see you all again next year!

Categories: BI & Warehousing

List of our Recent “Getting Started” Hadoop Articles

Rittman Mead Consulting - Sun, 2014-05-18 04:54

We’ve published a number of “getting started with Hadoop” articles over the past few months, but these aren’t always easy to find on the blog. I’ve therefore compiled a list of the more recent ones, which you’ll find below:

Categories: BI & Warehousing

Trickle-Feeding Log Files to HDFS using Apache Flume

Rittman Mead Consulting - Sun, 2014-05-18 03:20

In some previous articles on the blog I’ve analysed Apache webserver log files sitting on a Hadoop cluster using Hive, Pig and, most recently, Apache Spark. In all cases the log files were already sitting on the Hadoop cluster, having been SFTP’d to my local workstation and then uploaded to HDFS, the Hadoop distributed filesystem, using Hue; the only way to add to them is to repeat the process and manually copy them across from our webserver. But what if I want these log files to be copied across automatically, in a kind of “trickle-feed” process similar to how Oracle GoldenGate trickle-feeds database transactions to a data warehouse? Enter Apache Flume, a component of the Hadoop ecosystem that ships with the Cloudera CDH4/5 distribution, which does exactly this.

Flume is an Apache project within the overall Hadoop ecosystem that provides a reliable, distributed mechanism for collecting, aggregating and moving large amounts of log data. Similar to GoldenGate, it has transaction collectors, mechanisms to reliably transmit data from source to target, and mechanisms to write those log events to a centralised data store, for example HDFS. It’s free, it comes with Cloudera CDH and, coupled with something at the target end to process and work with the incoming log entries, it’s a pretty powerful and flexible way to transmit log-type entries from (potentially) multiple source providers to a central Hadoop cluster.

To take our example, we’ve got a webserver that’s generating Apache CombinedLogFormat log entries as users generate activity on the website. We then set up Flume agents on both the source webserver and the Hadoop client node that’s going to receive the log entries, with the latter writing those entries to HDFS just like any other file activity. The Flume agent on the webserver source “tail”s the Apache access.log file, copying across entries as they’re made (more or less), so that the target HDFS copies are kept up to date with individual log entries, not just whole log files as they’re closed off. The diagram below shows the overall schematic, and there’s a short Python sketch of the tailing behaviour just after it:

Flume Topology

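As a rough Python illustration of what that “tail and forward” trickle-feed behaviour amounts to – purely illustrative, since Flume’s exec source runs the real tail -f for us, as the configuration files below show:

import time

# Rough illustration only - Flume's exec source runs "tail -f" itself; this just shows
# what following a growing log file looks like in Python terms
def follow(path):
    with open(path) as f:
        f.seek(0, 2)                      # jump to the end of the file
        while True:
            line = f.readline()
            if not line:                  # nothing new yet - wait and poll again
                time.sleep(0.5)
                continue
            yield line.rstrip('\n')

for entry in follow('/etc/httpd/logs/access_log'):
    print entry                           # Flume would forward each new line as an event
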
Down at the component level, Flume consists of agents, Java processes that sit on the source, target and any intermediate servers; sources, which bring log entries into an agent; channels, intermediate staging points that can persist those entries to disk, database or memory; and sinks, processes that take log transactions out of a channel and write them to their destination. Flume is designed to be distributed and resilient, and won’t take the source down if the target Hadoop environment isn’t available; if this situation occurs, transactions slowly fill up the channel used by the source agent until it runs out of space, and then further log transactions are lost until the target comes back up and the source agent’s channel regains some spare space. The diagram below, from the Cloudera blog about the latest generation of Flume (Flume NG, for “Next Generation”), shows the Flume product topology:

NewImage

whilst the next diagram shows how Flume can collect and aggregate log entries from multiple servers, and then combine them into one log stream sent to a single target.

NewImage

In our example, that’s all there is to it; in more complex examples, perhaps where the source is sending XML log entries, you’d need a downstream processor on the target platform to decode, deserialise or parse the incoming log files – Flume is just a transport mechanism and doesn’t do any transformation itself. You can also choose how the log entries are held by each agent’s channel; in the example we’re going to use, channel data is just held in-memory, which is fast to run and set up, but you’d lose any data still in the channel if the server went down. More production-ready setups would persist the channel entries to file, or even a MySQL database.

For our setup we need two agents, one on the source and one on the target server, each of which has its own configuration file. The source agent configuration file looks like this, with key entries called out underneath it:

## SOURCE AGENT ##
## Local installation: /home/ec2-user/apache-flume
## configuration file location:  /home/ec2-user/apache-flume/conf
## bin file location: /home/ec2-user/apache-flume/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent

# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /etc/httpd/logs/access_log
source_agent.sources.apache_server.batchSize = 1
source_agent.sources.apache_server.channels = memoryChannel
source_agent.sources.apache_server.interceptors = itime ihost itype

# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
source_agent.sources.apache_server.interceptors.itime.type = timestamp

# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
source_agent.sources.apache_server.interceptors.ihost.type = host
source_agent.sources.apache_server.interceptors.ihost.useIP = false
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host

# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
source_agent.sources.apache_server.interceptors.itype.type = static
source_agent.sources.apache_server.interceptors.itype.key = log_type
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on Hadoop Node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = 81.155.163.172
source_agent.sinks.avro_sink.port = 4545

  • Source is set to “apache_server”, i.e. an Apache HTTP server
  • The capture mechanism is the Linux “tail” command
  • Log entries are held by the channel mechanism in memory, rather than being persisted to file or a database
  • A timestamp interceptor adds the collection time to each log entry as a header (used, for example, to derive the dated directory names when the entries are later written to HDFS)
  • The agent then sends the log entries to a corresponding Flume agent on the Hadoop cluster, in this case via my network’s external IP address, with the Flume network traffic then NATted by my router to cdh4-node1.rittmandev.com, the client node in my CDH4.6 Hadoop cluster running on VMware.

The target server in my Hadoop cluster then has a corresponding configuration file set up, looking like this:

## TARGET AGENT ##
## configuration file location:  /etc/flume-ng/conf
## START Agent: flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector

#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2

#http://flume.apache.org/FlumeUserGuide.html#memory-channel

collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100

collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100

## Sinks ##
collector.sinks = LocalOut HadoopOut

## Write copy to Local Filesystem 
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /var/log/flume-ng
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /user/root/flume-channel/%{log_type}/%y%m%d
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 0
collector.sinks.HadoopOut.hdfs.rollCount = 10000
collector.sinks.HadoopOut.hdfs.rollInterval = 600

Key entries in this configuration file are:

  • Apache Avro is the serialisation format used to transmit the data between the two Flume agents, with the Avro source listening on port 4545
  • There are two channels defined, each feeding its own sink – “mc1” for the sink that writes entries to the local server filesystem, and “mc2” for the sink that writes to HDFS
  • The maximum number of events (log entries) Flume will store in each channel is 100, meaning that if the target platform goes down and more than 100 log transactions back up, further ones will get lost until we can clear the channel down. Of course this limit can be increased, assuming there’s memory or disk to spare.

I then SSH into the target Hadoop node and start the Flume agent, like this:

[root@cdh4-node1 ~]# flume-ng agent -c conf -f /etc/flume-ng/conf/flume-trg-agent.conf -n collector
Info: Including Hadoop libraries found via (/usr/bin/hadoop) for HDFS access
Info: Excluding /usr/lib/hadoop/lib/slf4j-api-1.6.1.jar from class path
...
14/05/18 18:15:29 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
14/05/18 18:15:29 INFO hdfs.BucketWriter: Creating /user/root/flume-channel/apache_access_combined/18052014/FlumeData.1400433329254.tmp

and then repeat the step for the source webserver, like this:

[ec2-user@ip-10-35-143-131 apache-flume]$ sudo bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent
Warning: JAVA_HOME is not set!
+ exec /usr/bin/java -Xmx20m -cp '/home/ec2-user/apache-flume/conf:/home/ec2-user/apache-flume/lib/*' -Djava.library.path= org.apache.flume.node.Application -f conf/flume-src-agent.conf -n source_agent

Finally, moving across to Hue I can see new log entries being written to the HDFS file system:

NewImage

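If you’d rather check from the command line than from Hue, the files the HDFS sink writes can be listed and sampled with the standard hadoop fs commands – here’s a quick Python sketch of doing just that (the path matches the hdfs.path setting in the target agent configuration above; wrapping the CLI calls in subprocess is purely for illustration):

import subprocess

# List the dated directories the Flume HDFS sink has created under its configured path
subprocess.call(['hadoop', 'fs', '-ls',
                 '/user/root/flume-channel/apache_access_combined'])

# Dump the FlumeData files themselves; the wildcard covers whichever dated
# subdirectories have been written so far
subprocess.call(['hadoop', 'fs', '-cat',
                 '/user/root/flume-channel/apache_access_combined/*/FlumeData.*'])
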
So there you go – simple transport of webserver log entries from a remote server to my Hadoop cluster, via Apache Flume – thanks again to Nelio Guimaraes from the RM team for setting the example up.

Categories: BI & Warehousing

Mobile App Designer mis-configuration error

Rittman Mead Consulting - Wed, 2014-05-14 09:21

I’ve been doing some work recently with OBIEE’s new Mobile App Designer (MAD). It’s a great bit of software that I’m genuinely impressed with, but it’s got its little v1 quirks, and helpful error messages are not its forte. I hit a MADdening (sorry) problem with it that Google and My Oracle Support both drew blanks on, so I’m posting it here in case it helps out others with the same problem.

Setting up MAD is a bit of a fiddly process, involving patching OBIEE (regardless of the base version you’re on – hopefully in the future it will get rolled into the patchsets) and performing other bits of setup detailed in the documentation. The problem that I hit manifested itself in two ways:

  1. Publishing an App to the Apps Library worked fine, but updating an existing App threw an error in the browser:
    Failed to publish /~weblogic/GCBC Mobile - Phone.xma:oracle.xdo.webservice.exception.AccessDeniedException: PublicReportService::executeUpdateTemplateForReport Failure: user has no access to report[/Apps Library//GCBC Mobile - Phone.xma] due to [Ljava.lang.StackTraceElement;@4e6106df
  2. Trying to subscribe to any App threw a generic error in the browser: “Error occurred while accessing server. Please contect administrator.” with the corresponding bipublisher.log showing: 
    [2014-05-13T16:49:53.449+01:00] [bi_server1] [WARNING] [] [oracle.xdo] [tid: 24] [userId: <anonymous>] [ecid: 3f3d2d8955322f32:2f756afc:145f4d10b2f:-8000-0000000000003eea,0] [APP: bimad#11.1.1] User (weblogic) with session id: q2fq8fkh66f85ghamsq164u9qs98itvnk0c826i is looking for object in biee path: /shared/Apps Library//GCBC.xma/_mreport.xma[[
    Object Error [Context: 0, code: QM3V3HLV, message: Invalid path (/shared/Apps Library//GCBC.xma/_mreport.xma) -- ]
    ]]
    [2014-05-13T16:49:53.450+01:00] [bi_server1] [WARNING] [] [oracle.xdo] [tid: 24] [userId: <anonymous>] [ecid: 3f3d2d8955322f32:2f756afc:145f4d10b2f:-8000-0000000000003eea,0] [APP: bimad#11.1.1] oracle.xdo.XDOException: Target app not found in the repository :/Apps Library//GCBC.xma[[
        at oracle.xdo.online.AppStoreIO.doPost_subscribeApp(AppStoreIO.java:311)
        at oracle.xdo.online.AppStoreIO.doPost(AppStoreIO.java:120)
    [...]

One of my esteemed Rittman Mead colleagues, Francesco Tisiot, pointed out that the path referenced in the errors has a double slash in it. On checking my configuration, I had indeed fat-fingered one of the settings. APPS_LIBRARY_FOLDER_LOCAL is defined in the <DOMAIN_HOME>/config/bipublisher/repository/Admin/Configuration/xmlp-server-config.xml file, and mine looked like this:

<property name="APPS_LIBRARY_FOLDER_LOCAL" value="/Apps Library/"/>

All I needed to do was to remove the trailing slash after Library:

<property name="APPS_LIBRARY_FOLDER_LOCAL" value="/Apps Library"/>

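Before restarting anything it’s worth double-checking the value that’s actually in the file; as a rough sketch – nothing Oracle-specific here, and the <DOMAIN_HOME> placeholder needs substituting for your own domain – the property can be read back with a few lines of Python:

import xml.etree.ElementTree as ET

# Substitute your own DOMAIN_HOME in the path below
config_file = '<DOMAIN_HOME>/config/bipublisher/repository/Admin/Configuration/xmlp-server-config.xml'

tree = ET.parse(config_file)
for prop in tree.getroot().iter():
    # tag checked with endswith() in case the file declares an XML namespace
    if prop.tag.endswith('property') and prop.get('name') == 'APPS_LIBRARY_FOLDER_LOCAL':
        value = prop.get('value')
        print 'APPS_LIBRARY_FOLDER_LOCAL =', value
        if value.endswith('/'):
            print 'Warning: trailing slash - this produces the double-slash paths seen in the errors above'
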
After restarting the bimad application deployment all was good again with the MAD world and I could republish and subscribe to Apps happily.


Categories: BI & Warehousing