
Rittman Mead Consulting

Delivering Oracle Business Intelligence

OBIEE How-To: A View Selector for your Dashboard

Wed, 2014-10-29 20:00

A common problem report developers face is that user groups have different needs and preferences, and as a consequence want to see their data presented in different ways. A classic example is some users preferring a graph while others want a table. So, how do we do this? It’s a no-brainer… we use a view selector. View selectors give us a great deal of flexibility by allowing us to swap out one analysis view for another. You might even take it a step further and use a view selector to swap out an entire compound layout for another one, giving the user an entirely different set of views to look at. Truly powerful stuff, right?

But view selectors do have one limitation… they’re only available at the analysis level. What if you wanted this selector functionality at the dashboard level so that you could swap out an analysis from one subject area for one from a different subject area? Or what if you wanted to be able to switch one dashboard prompt for another one? You’re out of luck, it’s just not possible…

Just kidding… of course it’s possible. As it turns out, it’s fairly straightforward to build your own dashboard level view selector using other objects already provided by OBIEE out-of-the-box.

Create a dashboard variable prompt to drive the content. We need a way for the users to select the view they want to see. View selectors have a built in dropdown prompt to accomplish this at the analysis level. To do this at the dashboard level we’re going to use a dashboard prompt.

So, the first step is to create a new dashboard prompt object and add a variable prompt. You can name the variable whatever you wish; for this example we’re just going to call it P_SECTION. You can set the User Input to whatever you want, but it’s important that only one option is selected at a time… multiple values should not be allowed. Let’s set the user input to “Choice List” and add some custom values.

What you name these custom values isn’t important, but the labels should be descriptive enough that the users understand the different options. Just keep in mind that the values you use here will need to exactly match the strings used in the analysis we create in the next step. For this example, let’s use ‘Section1’, ‘Section2’, and ‘Section3’ to keep things simple.

JFB - View Selector - P_SECTION Prompt

Create an analysis to drive the conditional logic. We need to create an analysis that will return a set number of rows for each of the options in the prompt we just created. The number of rows returned then drives which section we see on the dashboard.

Ultimately, the logic of this analysis doesn’t matter, and there are a dozen ways to accomplish this. To keep things simple, we’re just going to use CASE statements. While not an elegant solution, it’ll work just fine for our basic example.

Add three columns to the criteria. We’ll use a Time dimension and modify each column formula with one of the following CASE statements. Make sure that the text strings match the Custom Values used in the prompt.

CASE WHEN "Time"."T05 Per Name Year" IN ('2006') THEN 'Section1' END

CASE WHEN "Time"."T05 Per Name Year" IN ('2006', '2007') THEN 'Section2' END

CASE WHEN "Time"."T05 Per Name Year" IN ('2006', '2007', '2008') THEN 'Section3' END

JFB - View Selector - Table

Now we need to update the filter so that the appropriate rows are shown based upon what the user selects. Basically, we need the request to return 1, 2, or 3 rows based upon our P_SECTION presentation variable.

For our example we’re going to create a filter for each of the options and set them equal to the presentation variable we created earlier in our dashboard prompt. Only one filter will be true at a time so the operator between these filters has been set to OR. Also you’ll notice that the default value for the presentation variable has been set to ‘Section1’ across the board. If, for whatever reason, the P_SECTION variable isn’t set, we want the dashboard to default to the first section.

JFB - View Selector - Filter

CASE WHEN "Time"."T05 Per Name Year" IN ('2006') THEN 'Section1' END is equal to / is in @{P_SECTION}{Section1}
OR CASE WHEN "Time"."T05 Per Name Year" IN ('2006', '2007') THEN 'Section2' END is equal to / is in @{P_SECTION}{Section1}
OR CASE WHEN "Time"."T05 Per Name Year" IN ('2006', '2007', '2008') THEN 'Section3' END is equal to / is in @{P_SECTION}{Section1}

So, let’s quickly walk through how this works. The end user selects ‘Section1’ from the dashboard prompt. That selection is stored in our P_SECTION presentation variable, which is then passed to and used by our filter. With ‘Section1’ selected, only the first line of the filter holds true, which results in a single row being returned. When ‘Section2’ is chosen, the second line of the filter is true, which returns two rows, and so on.
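The filter logic can also be reasoned about outside OBIEE. The Python below is only an illustrative simulation, not how the BI Server evaluates anything: it assumes the analysis returns one row per year, and the `YEARS` list is a made-up stand-in for "Time"."T05 Per Name Year". It mimics the three CASE columns and the OR'ed filter, then counts the rows that survive for each prompt value.

```python
# Illustrative simulation only: OBIEE evaluates this in the BI Server, not in Python.
# YEARS is a hypothetical stand-in for "Time"."T05 Per Name Year".
YEARS = ["2006", "2007", "2008", "2009", "2010"]

# Each CASE column returns its label for the listed years, otherwise NULL (None).
CASE_COLUMNS = {
    "Section1": {"2006"},
    "Section2": {"2006", "2007"},
    "Section3": {"2006", "2007", "2008"},
}


def case_value(label, year):
    """Mimic CASE WHEN year IN (...) THEN label END."""
    return label if year in CASE_COLUMNS[label] else None


def row_count(p_section="Section1"):
    """Count rows passing the OR'ed filter; the default mirrors @{P_SECTION}{Section1}."""
    rows = [
        year
        for year in YEARS
        if any(case_value(label, year) == p_section for label in CASE_COLUMNS)
    ]
    return len(rows)


for choice in ("Section1", "Section2", "Section3"):
    print(choice, "->", row_count(choice), "row(s)")
```

Each prompt value yields a distinct row count (1, 2 or 3), which is what makes the row count usable as a switch.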

We’re almost done. In the next step we’ll create some conditions on the individual dashboard sections and put it all together.

Create sections and set some conditions. We just need to create our sections and set some conditions so that they are shown/hidden appropriately. Create a new dashboard page. Edit the dashboard page and drag three empty sections on to the page. Create a condition on the first section using the Analysis created in the last step. The first condition we need to create should be True If Row Count is equal to 1.

JFB - View Selector - Condition

Are you beginning to see how this is going to work? The only time we’ll get a single row back is when the presentation variable is set to ‘Section1’. When P_SECTION is set to ‘Section2’ we’ll get two rows back from our analysis. Go ahead and create a second condition that is True If Row Count is equal to 2 for section 2. For section 3 create a condition that’s True If Row Count is equal to 3.
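As a rough sketch of how the three conditions behave together (the section names are hypothetical placeholders, and the real check is an OBIEE condition on the analysis row count, not Python), each section is shown only when the row count equals its expected value:

```python
# Illustrative only: the real check is an OBIEE condition on the analysis row count.
# Section names are hypothetical placeholders for whatever content you add later.
CONDITIONS = {
    "Section 1": 1,  # True If Row Count is equal to 1
    "Section 2": 2,  # True If Row Count is equal to 2
    "Section 3": 3,  # True If Row Count is equal to 3
}


def visible_sections(row_count):
    """Return the sections whose condition evaluates to true for this row count."""
    return [name for name, expected in CONDITIONS.items() if row_count == expected]


print(visible_sections(2))
```

Because the expected counts are all different, at most one section can be visible at a time.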

JFB - View Selector - Dashboard Editor

Since we aren’t adding content to these sections, you’ll want to make sure to enable the option to “Show Section Title” or add a couple text fields so that you can easily identify which section is rendered on the page. Lastly, drag the dashboard prompt onto the page. Save the dashboard page and let’s take a look.

When the page first renders, you should see something similar to the following screenshot. The prompt is set to ‘Section1’ and sure enough, Section 1 appears below it. If you change the selection to ‘Section2’ or ‘Section3’ and hit apply, Section 1 will be hidden and the corresponding content will appear. All that’s left now would be to go back and add content to the sections.

JFB - View Selector - Result

So, using only out-of-the-box features, we were able to create an extremely versatile and dynamic bit of functionality… and all it took was a dashboard prompt, an analysis to hold our conditional logic, and some sections and conditions.

This approach is just another tool that you can use to help deliver the dynamic content your users are looking for. It provides flexibility within the context of a single dashboard page and also limits the need to navigate (and maintain) multiple pages. Admittedly, the example we just walked through isn’t all that exciting, but hopefully you can see the potential.

Some of your users want a minimalist view allowing them to filter on just the basics, while others want to slice and dice by everything under the sun? Create two prompts, a basic and an advanced, and allow the users to switch between the two.

JFB - View Selector - BasicAdv

Want to pack a large number of charts into a page while still minimizing scrolling for those poor souls working with 1024×768? No problem, have a low-res option of the dashboard.

JFB - View Selector - LowRes

The finance department wants to see a dashboard full of bar charts, but the payroll department is being totally unreasonable and only wants to see line graphs? Well, you get the idea…

Categories: BI & Warehousing

Monitoring OBIEE with the ELK stack

Tue, 2014-10-21 09:37

Monitoring the health of an OBIEE system and diagnosing problems that may occur is a vital task for the system’s administrator and support staff. It’s one that at Rittman Mead we help customers with implementing themselves, and also provide as a managed service. In this article I am going to discuss the ELK stack, which fills a specific gap between the high-level monitoring and configuration functionality of Enterprise Manager 11g Fusion Middleware Control, and the Enterprise-grade monitoring, alerting and configuration management of Enterprise Manager 12c Cloud Control.

The ELK stack enables you to rapidly access both summary and detail information across the stack, supporting swift identification and diagnosis of any issues that may occur. The responsive interface lets you drill into time periods or any ad-hoc field or filter as you wish, to analyse and diagnose problems. Data can be summarised and grouped arbitrarily, displaying relative error rates and ensuring that genuine problems are not lost in the ‘noise’ of usual operation.

Out of the box, OBIEE ships with Enterprise Manager 11g Fusion Middleware Control (FMC), which as the name says is part of the Enterprise Manager line of tools from Oracle for managing systems. It is more of a configuration and deployment tool than it is really a monitoring and diagnostics one. The next step up is FMC’s (very) big brother, Enterprise Manager 12c Cloud Control (EM12c). This is very much its own product, requiring its own infrastructure and geared up to monitoring an organisation’s entire fleet of [Oracle] hardware and software. With this greatly enhanced functionality in EM12c also comes a license cost. The ELK stack conceptually fits perfectly alongside your existing EM FMC, providing a most excellent OBIEE monitoring dashboard and analysis tool, and allowing you to explore the kind of diagnostics and historical data that you could have access to in EM 12c.

In ELK we can see at a glance what kind of relative activity there has been on the system over the past few days:

There have been some errors, and the top three nQS and ORA error codes and messages are shown. This is an important differentiator to EM, where you can search for errors but cannot see straightaway whether an error is a one-off or one of multiple occurrences. By grouping by error message it’s possible to quickly see what the biggest problem on a system may currently be:

At this point we might want to drill down into what was being run when the errors were being thrown. For example, from the error summary alone we can see the biggest problem was a locked database account – but which database was being queried? Lower down the dashboard page is a list of log details, and by clicking on the search icon against an error message we can filter the results shown:

We can use the search icon again to restrict results by ECID

And from there see all the related log entries, including which connection pool the request was against (and thus which database account is locked)

Another way of diagnosing a sudden rash of errors would be to instead drill down on time alone to take a more holistic view of the logs (useful also given that ECIDs don’t always give the full picture). Using the system activity timeline along with the events log view it is a piece of cake to do this – simply click and drag a time window on the chart to instantly zoom into it.

Taking a step back up, we can see at a glance which areas of the OBIEE metadata model (RPD) are being used, as well as where we are pulling logs from – and all of these are clickable in order to filter the results further. So it’s easy to see, for a given subject area, what’s the current error rate? Or to quickly access all the log files for a specific set of components alone (for example, BI Server and OPMN). Any field that is displayed, whether in a chart or a detailed log view, can be clicked and used as the input for an ad-hoc filter.

It’s not just errors and logs we can monitor – the current and trending performance of the system (or a part of it; note the filtering by subject area and database described above) can be observed and of course, drilled into:

The ELK stack

The ELK stack is a suite of free software made up of three tools, the first letter of each giving it its name:

  • ElasticSearch
  • Logstash
  • Kibana

At a very high level, we collect and enrich diagnostic data from log files using logstash, store it in ElasticSearch, and present and analyse it through Kibana.

  • ElasticSearch is a document store, in which data with no predefined structure can be stored. Its origins and core strength are in full text search of any of the data held within it, and it is this that differentiates it from pure document stores such as MongoDB that Mark Rittman wrote about recently. Data is loaded and retrieved from ElasticSearch through messages sent over the HTTP protocol, and one of the applications that can send data this way and works extremely well is Logstash.
  • Logstash is an innocuous looking tool that at first glance one could mistakenly write off as “just” a log parser. It does a lot more than that and a healthy ecosystem of input, filter, codec and output plugins means that it can interface between a great variety of applications, shifting data from one to another and optionally processing and enriching it along the way.
  • The final piece of the stack is Kibana, a web application that enables one to build very flexible and interactive time-based dashboards, sourcing data from ElasticSearch. Interestingly, another of my favourite tools that I have written about before – and will write about again in this article – is Grafana which is forked from Kibana (and modified to source its data from time-series databases like graphite/carbon/whisper or InfluxDB) – thus if you’re at home with one you will be with the other.
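To make the "messages sent over the HTTP protocol" point concrete, here is a minimal sketch in Python of the kind of request a client might build to store one document. It is an illustration under assumptions, not what Logstash does internally: the host and port, the index name, the document type `logs` and the document fields are all assumed values, and the index/type URL layout reflects the ElasticSearch 1.x releases used in this article. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Assumptions for illustration: ElasticSearch on localhost:9200, a made-up
# index name and the document type "logs".
ES_URL = "http://localhost:9200"


def build_index_request(index, doc, doc_type="logs"):
    """Build (but do not send) an HTTP POST that would store `doc` in `index`."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/{doc_type}/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request(
    "logstash-2014.10.21",
    {"@timestamp": "2014-10-21T09:37:00.000Z", "message": "hello from a sketch"},
)
print(request.get_method(), request.full_url)
# To actually send it against a running node: urllib.request.urlopen(request)
```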

In this article I’m going to show how to set up your own ELK stack to monitor OBIEE, based on SampleApp v406.

Who is this for?

As you will see below, setting up and configuring the ELK stack does involve rolling up one’s sleeves and diving right in. If you’re looking for an off-the-shelf monitoring solution then you should look elsewhere (such as EM12c). But if you want to have a crack at it I think you’ll be pleasantly surprised at what is possible once you get past the initial (bumpy) learning curve. The capabilities are great, and there’s an active support community as is the case with lots of open-source tools. With a bit of work it is possible to create a monitoring environment tailored pretty much entirely to your design.

Installing the stack

ELK runs on all common Linux distributions (including Oracle Linux), as well as Mac OS. The only prerequisites are a JDK for ElasticSearch and Logstash, and a web server for Kibana; here I am using Apache.

First up, let’s install JDK 1.7 (SampleApp has 1.6, which isn’t enough):

sudo yum install -y java-1.7.0-openjdk.x86_64

Apache is already installed on SampleApp, which we can verify thus:

[oracle@demo ~]$ sudo yum install -y httpd
Loaded plugins: refresh-packagekit
Setting up Install Process
Package httpd-2.2.15-29.0.1.el6_4.x86_64 already installed and latest version
Nothing to do
[oracle@demo ~]$ sudo service httpd status
httpd is stopped

It’s shut down by default, and that’s fine because we need to update its configuration anyway.

ElasticSearch

The easiest way to install ElasticSearch is using the yum repository:

sudo rpm --import http://packages.elasticsearch.org/GPG-KEY-elasticsearch

cat > /tmp/elasticsearch.repo<<EOF
[elasticsearch-1.3]
name=Elasticsearch repository for 1.3.x packages
baseurl=http://packages.elasticsearch.org/elasticsearch/1.3/centos
gpgcheck=1
gpgkey=http://packages.elasticsearch.org/GPG-KEY-elasticsearch
enabled=1
EOF

sudo mv /tmp/elasticsearch.repo /etc/yum.repos.d/
sudo yum install -y elasticsearch

I’d then set it to start at boot automagically:

sudo chkconfig elasticsearch on

and then start it up:

sudo service elasticsearch start

One final, optional, step in the installation is a plugin called kopf which gives a nice web dashboard for looking at the status of ElasticSearch:

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/
cd /usr/share/elasticsearch/bin
sudo ./plugin -install lmenezes/elasticsearch-kopf

logstash

There’s no repository for logstash, but it’s no biggie because there’s no install as such, just a download and unpack. Grab the download archive for logstash from the ELK download page, and then unpack it:

cd ~/Downloads
wget https://download.elasticsearch.org/logstash/logstash/logstash-1.4.2.tar.gz
# Need to use sudo because /opt is owned by root
sudo tar -xf logstash-1.4.2.tar.gz --directory /opt/
sudo mv /opt/logstash-1.4.2/ /opt/logstash/
sudo chown -R oracle. /opt/logstash/

Kibana

As with logstash, Kibana just needs downloading and unpacking. There’s also a wee bit of configuration to do, so that the web server (Apache, in our case) knows to talk to it, and so that Kibana knows how to find ElasticSearch.

cd ~/Downloads/
wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.0.tar.gz
sudo tar -xf kibana-3.1.0.tar.gz --directory /opt
sudo mv /opt/kibana-3.1.0/ /opt/kibana/
sudo chown -R oracle. /opt/kibana/

Now to configure Apache, telling it where to find Kibana. If you have existing sites configured, you’ll need to sort this bit out yourself, but on a vanilla SampleApp v406 you can use the following sed command to set up the needful:

sudo sed -i'.bak' -e 's/DocumentRoot.*$/DocumentRoot "\/opt\/kibana\/"/g' /etc/httpd/conf/httpd.conf

Lastly, Kibana needs to know where to find ElasticSearch, which is where it is going to pull its data from. An important point here is that the URL of ElasticSearch must be resolvable and accessible from the web browser you run Kibana in, so if you are using a DNS name it must resolve, etc. You can update the configuration file config.js by hand (it’s the elasticsearch: definition that needs updating), or use this sed command:

sed -i'.bak' -e 's/^\s*elasticsearch:.*$/elasticsearch: "http:\/\/demo.us.oracle.com:9200",/g' /opt/kibana/config.js

Finally, [re]start Apache so that it uses the new configuration:

sudo service httpd restart

You should be able to now point your web browser at the server and see the default Kibana dashboard. So for sampleapp, if you’re running Firefox locally on it, the URL would simply be http://localhost/ (port 80, so no need to specify it in the URL). Note that if you’re doing anything funky with network, your local web browser needs to be able to hit both Apache (port 80 by default), and ElasticSearch (port 9200 by default).

Configuring ELK end-to-end

Now that we’ve got the software installed, let’s see how it hangs together and create our first end-to-end example. There’s a good logstash tutorial here that covers a lot of the functionality. Here, I’ll just look at some of the very basics, creating a very simple logstash configuration which will prompt for input (i.e. stdin) and send it straight to ElasticSearch. The kopf plugin that we installed above can show that the data made it to ElasticSearch, and finally we will create a very simple Kibana dashboard to demonstrate its use.

Logstash works by reading a configuration file and then running continually waiting for the configured input. As well as the input we configure an output, and optionally in between we can have a set of filters. For now we will keep it simple with just an input and output. Create the following file in /opt/logstash and call it logstash-basic.conf:

input {
        stdin {}
        }
output {
        elasticsearch {}
        }

It’s pretty obvious what it’s saying – for the input, use stdin, and send it as output to elasticsearch (which will default to the localhost).
Run this with logstash:

export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/
cd /opt/logstash
bin/logstash -f logstash-basic.conf

After a few moments you should get a prompt. Enter some text, and nothing happens… apparently. What’s actually happened is that the text, plus some information such as the current timestamp, has been sent to ElasticSearch.

Let’s see where it went. In a web browser, go to http://localhost:9200/_plugin/kopf/. 9200 is the port on which ElasticSearch listens by default, and kopf is a plugin we can use to inspect ElasticSearch’s state and data. ElasticSearch has a concept of an index, in which documents, maybe of the same repeating structure but not necessarily, can be stored. Crudely put, this can be seen as roughly analogous to tables and rows of data respectively. When logstash sends data to ElasticSearch it creates one index per day, and in kopf now you should see an index with the current date, with a “document” for each line you entered after running logstash:
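The one-index-per-day behaviour can be sketched as a small function. This is only an illustration of the logstash-YYYY.MM.DD naming convention described later in this article; the function name is hypothetical, and it assumes the day is taken from the event time expressed in UTC.

```python
from datetime import datetime, timezone


def daily_index_name(event_time, prefix="logstash"):
    """Return the daily index name, e.g. logstash-2014.10.21, for an event time.

    Assumes the day is taken from the event's timestamp expressed in UTC.
    """
    return f"{prefix}-{event_time.astimezone(timezone.utc):%Y.%m.%d}"


print(daily_index_name(datetime(2014, 10, 21, 9, 37, tzinfo=timezone.utc)))
```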

ElasticSearch can be queried using HTTP requests, and kopf gives a nice way to construct these and see the results which are in JSON format. Click on the rest (as in, REST API) menu option, leave the request as default http://localhost:9200/_search, and click send. You’ll see in the response pane a chunk of JSON in amongst which are the strings that you’ve entered to logstash:

Enter a few more lines into the logstash prompt, and then head over to http://localhost/ where you should find the default dashboard, and click on the Logstash Dashboard option:

It’s fairly bare, because there’s very little data. Notice how you have a histogram of event rates over the past day at the top, and then details of each event at the bottom. There are two things to explore here. First up, go and enter a bit more data into logstash, so that the created events are spread out over time. Click the refresh icon on the Kibana dashboard, and then click-drag to select just the period on the chart that has data. This will zoom in on it and you’ll see in greater definition when the events were created. Go and click on one of the event messages in the lower pane and see how it expands, showing the value of each field – including message which is what logstash sent through from its input to output.

Now let’s get some proper data in, by pointing logstash at the BI Server log (nqserver.log). Create a new configuration file, logstash-obi.conf, and build it up as follows. First we’ll use the file input to get data from …wait for it….a file! The syntax is fairly obvious:

input {
        file {
                path => "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log"
                }
        }

Now we need to tell Logstash how to interpret the file. By default it’ll chuck every line of the log to ElasticSearch, with the current timestamp – rather than the timestamp of the actual event.

Now is time to introduce the wonderful world of the grok. A grok is one of the most important of the numerous filter plugins that are available in logstash. It defines expected patterns of content in the input, and maps it to fields in the output. So everything in a log message, such as the timestamp, user, ecid, and so on – all can be extracted from the input and stored as distinct items. They can also be used for further processing – such as amending the timestamp output from the logstash event to that of the log file line, rather than the system time at which it was processed.

So, let us see how to extract the timestamp from the log line. An important part of grok’ing is patterns. Grok statements are written as regular expressions, or regex (obXKCD), so to avoid continual wheel-reinventing of regex statements for common objects (time, IP addresses, etc) logstash ships with a bunch of these predefined, and you can build your own too. Taking a line from nqserver.log we can see the timestamp matches the ISO 8601 standard:

So our grok will use the pre-defined pattern TIMESTAMP_ISO8601, and then everything else (“GREEDYDATA”) after the timestamp, map to the log message field. The timestamp is in square brackets, which I’ve escaped with the backslash character. To indicate that it’s a grok pattern we want to match, it’s enclosed in %{ } markers.

\[%{TIMESTAMP_ISO8601:timestamp}\] %{GREEDYDATA:log_message}

This can be broken down as follows:

\[                                  The opening square bracket, escaped by \
%{TIMESTAMP_ISO8601:timestamp}      Capture an ISO 8601 timestamp, store it in a field called 'timestamp'
\]                                  The closing square bracket, escaped by \
%{GREEDYDATA:log_message}           Capture everything else ('GREEDYDATA' is also a grok pattern) and store it in the 'log_message' field
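For readers more at home with plain regex, the breakdown above can be approximated in Python with named capture groups. This is a sketch for illustration only: the timestamp regex is a simplification of grok's TIMESTAMP_ISO8601 pattern rather than its actual definition, and `SAMPLE_LINE` is a made-up line, not real nqserver.log output.

```python
import re

# Rough Python equivalent of the grok expression, for illustration only.
# The timestamp regex is a simplification of grok's TIMESTAMP_ISO8601 pattern,
# and SAMPLE_LINE is a made-up log line, not real nqserver.log output.
PATTERN = re.compile(
    r"\[(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?)\]"
    r" (?P<log_message>.*)"
)

SAMPLE_LINE = "[2014-06-11T10:14:52.000+01:00] [OracleBIServerComponent] example message text"

match = PATTERN.match(SAMPLE_LINE)
fields = match.groupdict() if match else {}
print(fields["timestamp"])
print(fields["log_message"])
```

The named groups play the same role as the `:timestamp` and `:log_message` suffixes in the grok syntax: they decide which field each captured piece ends up in.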

A grok operator in logstash is part of the filter processing, so we need a new stanza in the configuration file, after input and before output. Note that the grok operator is matching our pattern we built above against the message field, which is pre-populated by default by the input stream. You can grok against any field though.

input {
        file {
                path => "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log"
                }
        }
filter {
        grok {
                match => ["message","\[%{TIMESTAMP_ISO8601:timestamp}\] %{GREEDYDATA:log_message}"]
            }
        }
output {
        elasticsearch {}
        }

Now we can see in the resulting capture we’ve extracted the timestamp to a field called “timestamp”, with the remainder of the field in “log_message”

But – the actual timestamp of the log entry that we have attached to the event stored in ElasticSearch, a special field called @timestamp, is still reflecting the timestamp at which logstash read the logfile entry (30th September), rather than when the logfile entry was created (11th June). To fix this, we use a new filter option (grok being the first), the date filter:

date {
        match => ["timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSZZ"]
}

This matches the timestamp field that we captured with the grok, and converts it into the special @timestamp field of the event, using the mask specified. Now if we go back to kopf, the ElasticSearch admin tool, we can see that a new index has been created, with the date of log entry that we just parsed and correctly extracted the actual date from:

And over in Kibana if you set the timeframe long enough in the filter at the top you’ll be able to find the log entries showing up, now with the correct timestamp. Note that ElasticSearch accounts for the timezone of the message, storing it in UTC:
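What the date filter and the UTC normalisation amount to can be sketched in Python. This is an approximation for illustration: logstash uses Joda-Time masks whereas the sketch uses Python's strptime codes, the function name is hypothetical, and the timestamp value is a made-up example rather than a line from a real log. The `%z` handling of an offset written with a colon needs Python 3.7 or later.

```python
from datetime import datetime, timezone

# Illustrative only: logstash's date filter uses Joda-Time masks, whereas this
# sketch uses Python's strptime codes. The timestamp value is a made-up example.
JODA_MASK = "yyyy-MM-dd'T'HH:mm:ss.SSSZZ"  # mask from the date filter above
PYTHON_MASK = "%Y-%m-%dT%H:%M:%S.%f%z"  # approximate Python counterpart


def to_event_timestamp(raw):
    """Parse the captured 'timestamp' field and normalise it to UTC."""
    return datetime.strptime(raw, PYTHON_MASK).astimezone(timezone.utc)


parsed = to_event_timestamp("2014-06-11T10:14:52.000+01:00")
print(parsed.isoformat())
```

An entry logged at 10:14 in a +01:00 timezone is therefore stored as 09:14 UTC, which is worth remembering when comparing Kibana with the raw log file.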

A large part of setting up your own ELK stack is this configuration of the filters in order to imbue and extract as much information from the log as you want. Grok filters are just the beginning – you can use conditional paragraphs to grok certain fields or logs for certain strings (think, error messages and ORA codes), you can mutate output to add in your own fields and tags based on what has been read. The point of doing this is that you enrich what is in the logs by giving the different pieces of data meaning that you can then drive the Kibana dashboard and filtering through.

Tips for using logstash

To think of the logstash .conf file as merely “configuration” in the same way a simple .ini is would be to underestimate it. In effect you are writing a bit of data transformation code, so brace yourself – but the silver lining is that whatever you want to do, you probably can. Logstash is a most flexible and powerful piece of software, and one in which the model of input, filter, codec and output works very well.

In the spirit of code development, here are some tips you may find useful:

  • Grok patterns can inherit. Study the provided patterns (in the logstash patterns folder), and build your own. Look for commonality across files and look to reuse as much as possible. If you have multiple unrelated patterns for every type of log file in FMW then you’ve done it wrong.
  • When you’re building grok statements and patterns, use the superb http://grokdebug.herokuapp.com. The advantage of this over a standard regex debugger is that it supports the grok syntax of capture groups and even custom patterns.
  • For building up and testing regex from scratch, there are some useful sites:

    On the Mac there is a useful tool called Oyster which does the same as the above sites but doesn’t require an Internet connection.

  • By default a failed grok will add the tag _grokparsefailure to the event. If you have multiple grok clauses, give each its own unique tag_on_failure value so that you can see from the output message which grok statement failed.

    grok {
        match => [  "message", "%{WLSLOG}"
        ]
        tag_on_failure => ["_grokparsefailure","grok03"]
    }

  • You can use # as a comment character, so comment your code liberally, particularly whilst you’re finding your way with the configuration language so you can find your way back from the gingerbread house.
  • grok’ing takes resources, so target your grok statements by using conditionals. For example, to only parse nqquery logs for an expected string, you could write :

    if [path] =~ /nqquery/ {
        # Do your grok here
        }

    NB the =~ denotes a regex match, and the / delineate the regex string.

  • Use a conditional to split the output, writing any grok failures to stdout and/or a file, so that it is easier to see what’s failing and when. Note the use of the rubydebug codec here to prettify output sent to stdout.

    output {
        if "_grokparsefailure" not in [tags] {
            elasticsearch { }
        } else {
            file { path => "unprocessed.out" }
            stdout { codec => rubydebug }
        }
    }

  • Use the multiline codec to work with log messages that span multiple lines, and watch out for newline characters – grok does not like them, so use a mutate gsub to strip them out.
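Several of these tips (conditionals, failure tags and split outputs) work together, and the flow can be sketched in Python. This only mimics the control flow and is not logstash itself: the regex, the file paths and the sample events are hypothetical, and the tag grok03 is simply reused from the tag_on_failure example above.

```python
import re

# Illustration of the flow only, not logstash itself. The regex and the sample
# events are hypothetical; "grok03" matches the tag used in the example above.
NQQUERY_PATTERN = re.compile(r"\[(?P<timestamp>[^\]]+)\] (?P<log_message>.*)")


def process(event):
    """Apply a 'grok' only to nqquery logs and tag the event when it fails."""
    event.setdefault("tags", [])
    if re.search(r"nqquery", event["path"]):  # like: if [path] =~ /nqquery/
        match = NQQUERY_PATTERN.match(event["message"])
        if match:
            event.update(match.groupdict())
        else:
            event["tags"] += ["_grokparsefailure", "grok03"]
    return event


def route(event):
    """Pick the outputs: ElasticSearch for parsed events, file and stdout otherwise."""
    if "_grokparsefailure" not in event["tags"]:
        return ["elasticsearch"]
    return ["file", "stdout"]


good = process({"path": "/logs/nqquery.log", "message": "[2014-06-11T10:14:52] ok"})
bad = process({"path": "/logs/nqquery.log", "message": "no timestamp here"})
print(route(good), route(bad))
```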

Building a Kibana dashboard

Now that we’ve got the data streaming through nicely from logstash into ElasticSearch, let us take a look at building dashboards against it in Kibana. Kibana is a web application that runs within an existing web server such as Apache, and it builds dashboards from data stored in ElasticSearch.

The building blocks of a Kibana dashboard are rows, which contain panels of a given pane width, up to twelve per row. On a panel goes one of the types of object, such as a graph. We’ll see now how to build up a dashboard and the interactions that we can use for displaying and analysing data.

When you first load Kibana you get a default dashboard that links to a few other start dashboards. Here we’re going to properly start from scratch so as to build up a picture of how a dashboard is created. Click on Blank Dashboard, and then in the top-right corner click on Configure Dashboard. Click on Index and from the Timestamping option select day. What this does is tell Kibana which ElasticSearch indices it is to pull data from, in this case using the standard Logstash index naming pattern – which we observed in kopf earlier – logstash-YYYY.MM.DD. Click Save, and then select Add a row. Give the row a title such as System Activity, click on Create Row and then Save. A new row appears on the dashboard.

Now we’ll add a graph to the row. Click on Add panel to empty row, and select Histogram as the Panel Type. Note that by default the width is 4 – change this to 12. You’ll note that there are plenty of options to explore, but to start with we’ll just keep it simple, so go ahead and click Save. By default the chart will show all data, so use the Time filter dropdown option at the top of the screen to select a recent time period. Assuming your data has loaded from logstash into ElasticSearch you should see a graph similar to this:

This is a graph of the number of events (log file entries) per time period. The graph will amend the resolution according to the zoom so that a reasonable resolution of data is shown, or you can force it through the Resolution option in the graph properties. In the legend of the chart you can see the resolution currently used.
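Conceptually, the histogram is a count of events per time bucket, with the resolution setting the bucket width. The Python below is a sketch of that idea only (Kibana asks ElasticSearch to do the aggregation); the event times and the 10-minute resolution are made-up values.

```python
from collections import Counter
from datetime import datetime, timedelta

# Sketch of what the histogram panel plots: a count of events per time bucket.
# The event times and the 10-minute resolution are made-up values.
EVENT_TIMES = [
    datetime(2014, 10, 21, 9, 1),
    datetime(2014, 10, 21, 9, 4),
    datetime(2014, 10, 21, 9, 17),
    datetime(2014, 10, 21, 9, 42),
]


def bucket_counts(times, resolution):
    """Count events per bucket, where each bucket is `resolution` wide."""
    counts = Counter()
    for t in times:
        # Round each timestamp down to the start of its bucket.
        offset = (t - datetime.min) % resolution
        counts[t - offset] += 1
    return dict(sorted(counts.items()))


for start, count in bucket_counts(EVENT_TIMES, timedelta(minutes=10)).items():
    print(start.strftime("%H:%M"), count)
```

Widening the resolution merges neighbouring buckets, which is why the same data looks smoother when zoomed out.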

Assuming you’ve selected a broad time interval, such as the last week, you’ll presumably want to drill into the data shown. This is very intuitive in Kibana – simply click and drag horizontally over the time period you want to examine.

There are two important concepts for selecting and grouping data in Kibana, called filters and queries. A query groups data based on conditions, and we’ll explore those later. A filter is a predicate applied to all data returned. Think of it just like a WHERE clause on a SQL query. You can see the current filter(s) applied at the top of the dashboard.

You may well want to hide these, and they can be collapsed – as can the query row and all of the dashboard rows – by clicking on the little triangle.

From the Filter area you can also add, amend, disable and remove filters.
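The difference between a filter and a query is easier to see in miniature. The following is only a conceptual Python sketch, not how Kibana or ElasticSearch implement it; the sample events, the level field and the function names are all made up for illustration.

```python
# Hypothetical sample events; "level" is an assumed field name
events = [
    {"Component": "OracleBIServerComponent", "level": "ERROR"},
    {"Component": "OracleBIServerComponent", "level": "NOTIFICATION"},
    {"Component": "OracleBIPresentationServicesComponent", "level": "ERROR"},
    {"Component": "OracleBIClusterControllerComponent", "level": "NOTIFICATION"},
]

def apply_filters(events, filters):
    """A filter restricts the data for every panel, like a WHERE clause."""
    return [e for e in events if all(f(e) for f in filters)]

def apply_queries(events, queries):
    """Each named query marks out a subset of whatever the filters let through."""
    return {name: [e for e in events if q(e)] for name, q in queries.items()}

filters = [lambda e: e["level"] == "ERROR"]
queries = {
    "*": lambda e: True,
    "Component:OracleBIServerComponent":
        lambda e: e["Component"] == "OracleBIServerComponent",
}

visible = apply_filters(events, filters)
counts = {name: len(rows) for name, rows in apply_queries(visible, queries).items()}
print(counts)
# {'*': 2, 'Component:OracleBIServerComponent': 1}
```

The filter cuts the data down for everything on the dashboard, whereas each query simply labels a group within what is left, and a panel can then show any or all of those groups.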

So far all we’ve got is a graph showing system activity over time, based on events recorded in a log file. But, we’ve no way of seeing what those logs are, and this is where the Table panel comes in. Add a new row, give it a title of Log messages and add the Table panel to it specifying the span as 12. You should now see a list of messages with timestamps corresponding to the time period shown in the graph. You can customise the Table panel, for example specifying which fields to show; by default it shows _source which is the raw row returned by ElasticSearch. More useful to us is the log_message field that we parsed out using the grok in logstash earlier. You can do this by selecting the relevant field from the Fields list on the left (which can be collapsed for convenience), or editing the Table panel and specifying it in the Columns area.

From the Table panel it is possible to select the data shown even more precisely by adding additional filters based on data in the table. Clicking on a particular row will expand it and show all of the associated fields, and each field has a set of Action options. You can filter only for that value, or specifically exclude it, and you can also add each field into the table shown (just like we did above for log_message). So here I can opt to only display messages that I’ve tagged in logstash as coming from the BI Server component itself:

You’ll note in the second screenshot, once the filter has been applied, that the graph has changed and is showing less data. That is because a filter is global to a dashboard. But what if we want to show on the graph counts for all logs, but in the data table just those for BI Server? Here is where queries come into play. A query also looks like a predicate, but rather than restricting the data returned it just identifies a set of data within what is returned. To illustrate this I’m first going to remove all existing filters except the time period one:

And now in the Query area click on the + (to the right of the line). Now there are two queries, both with a wildcard as their value meaning they’ll each match everything. In the second query box I add the query Component: OracleBIServerComponent – note for this to work your logstash must be sending messages to ElasticSearch with the necessary Component field. Once updated, the second query’s impact can be seen in the graph, which is showing both the “all” query and the BI Server component query. Use the View > option in the top left of the graph as a quick way of getting to the graph settings, including disabling cumulative/stack view:

Each panel in Kibana can be configured to show all or some of the query groups that have been defined. This is most useful when your queries break the data down in several different ways and you wouldn’t want every one of them displayed in its entirety on every panel. You might want to group out the components, and the types of error, and then show a breakdown of system activity by one or the other – but not necessarily both. To configure which queries a panel is to show, use the Configure option in the top-right of the panel and go to the Queries tab. If it’s set to all then each and every query set will be shown individually on the panel.

If it’s set to selected then you can choose one or more of the defined query sets to display.

There are about a dozen types of panel in Kibana, and I’m not going to cover them all here. The other ones particularly of interest for building this kind of OBIEE monitoring dashboard include:

  • Terms is basically a SELECT FIELD, COUNT(*) ... GROUP BY FIELD. It shows the top x number of terms for a given field, and how frequently they occurred. Results can be shown as a pie or bar chart, or just a table. From a Terms panel you can add filters by clicking on a term. For example, clicking on the pie segment, or table row icon, for ERROR would add a filter to show just ERROR log entries.
  • Trends shows the trend of event occurrences in a given time frame. Combined with an appropriate query you can show things like error rates.
  • Stats shows a set of statistics, so you can identify mean response times, maximum users logged on, and so on – assuming you have this data coming through from the logstash parsing.
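As a rough illustration of what the Terms panel computes, here is a small Python sketch of the same “top x terms” count. It is just the GROUP BY idea written out, not Kibana’s implementation; the sample events, the level field and the top_terms function are assumptions for the example.

```python
from collections import Counter

def top_terms(events, field, size=3):
    """Count how often each value of `field` occurs and return the most
    frequent `size` of them, much like SELECT field, COUNT(*) ... GROUP BY field."""
    return Counter(e[field] for e in events if field in e).most_common(size)

# Hypothetical log events with an assumed "level" field
events = [
    {"level": "ERROR"},
    {"level": "NOTIFICATION"},
    {"level": "NOTIFICATION"},
    {"level": "WARNING"},
    {"level": "NOTIFICATION"},
    {"level": "ERROR"},
]

print(top_terms(events, "level", size=2))
# [('NOTIFICATION', 3), ('ERROR', 2)]
```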

Once you’ve built your dashboard, save it (Kibana stores it in ElasticSearch). The dashboard is defined in JSON and you can opt to download this too.

The complete OBIEE monitoring view

Parsing logs is a great way to extract valuable information from the text stored and build visualisations and metrics on top of it. However, for pure metrics alone (such as machine CPU, OBIEE DMS metrics, and so on) a close relation to Kibana, Grafana, is better suited to the task. Thus we have the text-based data going into ElasticSearch and reported through Kibana, and the pure metrics into a time-based store such as whisper (graphite’s database) and reported through Grafana. Because Grafana is a fork of Kibana, the look and feel is very similar.

Using obi-metrics-agent the DMS metrics from OBIEE can be collected and stored in whisper, and so also graphed out in Grafana alongside the system metrics. This gives us an overall architecture like this:

Obviously, it would be nice if we could integrate the fundamental time-based nature of both Kibana and Grafana together, so that drilling into a particular time period of interest maybe from an error rate point of view in the logs would also show the system and DMS metrics for the same period. There has been discussion about this (1, 2, 3) but I don’t get the impression that it will happen soon, if ever. One other item of interest here is Marvel, which is a commercial offering for monitoring ElasticSearch – through Kibana. It makes use of stock Kibana panel types, along with some new ones including the Nodes panel type, which suits the requirements we have of monitoring OBIEE/system metrics within a Kibana view, but unfortunately it looks like currently it is going to remain within Marvel only.

One other path to consider is trying to get the metrics currently sent to graphite/whisper instead into ElasticSearch so that Kibana can then report on them. The problem with this is twofold. Firstly, ElasticSearch is fundamentally a text-based store, whereas whisper fits much better for time/metric data (as would another DB such as influxDB). So trying to crowbar the two together may not be the best solution, and it may instead be better for this to be resolved at the front end as discussed above. Secondly, Kibana’s graphing capabilities do not conceptually extend to multiple metrics in the same graph – only multiple queries – which means that graphing something that would be simple in Grafana (such as CPU wait/user/sys) would be overly complex in Kibana.

Architecting an ELK deployment

So far I’ve shown how to configure ELK on a single server, reading logs from that same server. But there are two extra things we should consider. First, Logstash in particular can be quite a ‘heavy’ process depending on how much work you’re doing with it. If you are processing all the logs that FMW writes, and have lots of grok filters (which isn’t a bad thing; it means you’re extracting lots of good information), then you will see logstash using a lot of CPU and lots of I/O, possibly to the detriment of other processes on the system – a tad ironic if the purpose of using logstash is to monitor for any system problems that occur. Secondly, ELK works very well with multiple servers. You might have a scaled out OBIEE stack, or want to monitor multiple environments. Rather than replicating the ELK stack on each server, it’s better for each server to push its log messages to a central ELK server for processing. And since the processing takes place on the ELK server and not the server being monitored we reduce the local resource footprint too.

To implement this kind of deployment, you need something like logstash-forwarder on the OBI server which is a light-touch program, sending the messages to logstash itself on the ELK server, over a custom protocol called lumberjack. Logstash then processes the messages as before, except it is reading the input from the logstash-forwarder rather than from file. An alternative approach to this is using redis as a message broker, with logstash running on both the source (sending output to redis) and ELK server (using redis as the input). This approach is documented very well here / here, and the former of using logstash-forwarder here. Logstash-forwarder worked very well for me in my tests, and seems to fit the purpose nicely.

Conclusion

Responsive monitoring tools are crucial for successful and timely support of an OBIEE system, and the ELK stack provides an excellent basis on which to build beyond the capabilities of Enterprise Manager Fusion Middleware Control. The learning curve is a bit steep at first, and you have to be comfortable with installing unpackaged tools, but the payoff makes it worth it! If you are interested in finding out about how Rittman Mead can help with your OBIEE implementation or other areas, please contact us.

Categories: BI & Warehousing

Using rlwrap with Apache Hive beeline for improved readline functionality

Fri, 2014-10-17 06:18

rlwrap is a nice little wrapper in which you can invoke commandline utilities and get them to behave with full readline functionality just like you’d get at the bash prompt. For example, up/down arrow keys to move between commands, but also home/end to go to the start/finish of a line, and even ctrl-R to search through command history to rapidly find a command. It’s one of the standard config changes I’ll make to any system with Oracle’s sqlplus on, and it works just as nicely with Apache Hive’s commandline interface, beeline.

beeline comes with some of this functionality (up/down arrow) but not all (for me, it was ‘home’ and ‘end’ not working and printing 1~ and 5~ respectively instead that prompted me to set up rlwrap with it).

Installing rlwrap

To install rlwrap, first add the EPEL repository to your yum configuration:

sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/`uname -p`/epel-release-6-8.noarch.rpm

and then install rlwrap from yum:

sudo yum install -y rlwrap

Use

Once rlwrap is installed you can invoke beeline through it manually, specifying all the standard beeline options as you would normally: (I’ve used the \ line continuation character here just to make the example nice and clear)

rlwrap -a beeline \
-u jdbc:hive2://bdanode1:10000 \
-n rmoffatt -p password \
-d org.apache.hive.jdbc.HiveDriver

Now I can connect to beeline, and as before I press up arrow to access commands from when I previously used the tool, but I can also hit ctrl-R to start typing part of a command to recall it, just like I would in bash. Some other useful shortcuts:

  • Ctrl-l clears the screen but with the current line still shown
  • Ctrl-k deletes to the end of the line from the current cursor position
  • Ctrl-u deletes to the beginning of the line from the current cursor position
  • Esc-f moves forward one word
  • Esc-b moves backward one word
    (more here)

And most importantly, Home and End work just fine! (or, ctrl-a/ctrl-e if you prefer).

NB the -a argument for rlwrap is necessary because beeline already does some readline-esque functions, and we want rlwrap to forcibly override them (otherwise neither works very well). Or more formally (from man rlwrap):

Always remain in “readline mode”, regardless of command’s terminal settings. Use this option if you want to use rlwrap with commands that already use readline.

Alias

A useful thing to do is to add an alias directly in your profile so that it is always available to launch beeline under rlwrap, in this case as the rlbeeline command:

# this sets up "rlbeeline" as the command to run beeline
# under rlwrap, you can call it what you want though.
cat >> ~/.bashrc <<EOF
alias rlbeeline='rlwrap -a beeline'
EOF
# example usage (note the \ line continuation character):
# rlbeeline \
# -u jdbc:hive2://bdanode1:10000 \
# -n rmoffatt -p password \
# -d org.apache.hive.jdbc.HiveDriver

If you want this alias available for all users on a machine create the above as a standalone .sh file in /etc/profile.d/.

Autocomplete

One possible downside of using rlwrap with beeline is that you lose the native auto-complete option within beeline for the HiveQL statements. But never fear – we can have the best of both worlds, with the -f argument for rlwrap, specifying a list of custom auto-completes. So this is even a level-up for beeline, because we could populate it with our own schema objects and so on that we want auto-completed.

As a quick-start, run beeline without rlwrap, hit tab-twice and then ‘y’ to show all options and paste the resulting list into a text file (eg beeline_autocomplete.txt). Now call beeline, via rlwrap, passing that file as an argument to rlwrap:

rlwrap -a -f beeline_autocomplete.txt beeline

Once connected, use auto-complete just as you would normally (hit tab after typing a character or two of the word you’re going to match):

Connecting to jdbc:hive2://bdanode1:10000
Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
[...]
Beeline version 0.12.0-cdh5.0.1 by Apache Hive
0: jdbc:hive2://bdanode1:10000> SE
SECOND        SECTION       SELECT        SERIALIZABLE  SERVER_NAME   SESSION       SESSION_USER  SET
0: jdbc:hive2://bdanode1:10000> SELECT
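If you do want to add your own schema objects to the completion list, a small script can merge them with the keywords you pasted from beeline. This is only a sketch under assumptions: the function names are hypothetical, the table names are illustrative, and it simply writes one word per line for rlwrap’s -f option to read.

```python
def build_completion_words(keywords, schema_objects):
    """Merge keywords and schema object names, dropping case-insensitive
    duplicates (first occurrence wins) and sorting the result."""
    seen = {}
    for word in list(keywords) + list(schema_objects):
        seen.setdefault(word.lower(), word)
    return sorted(seen.values(), key=str.lower)

def write_completion_file(path, words):
    """Write one completion word per line."""
    with open(path, "w") as f:
        f.write("\n".join(words) + "\n")

if __name__ == "__main__":
    # A few keywords as pasted from beeline, plus illustrative table names
    keywords = ["SELECT", "SET", "SESSION"]
    tables = ["access_per_post", "access_per_post_categories", "select"]
    words = build_completion_words(keywords, tables)
    write_completion_file("beeline_autocomplete.txt", words)
    print(words)
```

The resulting file can then be passed to rlwrap with -f exactly as in the command shown earlier.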

Conclusion

rlwrap is the tool that keeps on giving; just as I was writing this article, I noticed that it also auto-highlights opening parentheses when typing the closing one. Nice!

Categories: BI & Warehousing

First-timer tips for Oracle Open World

Wed, 2014-10-08 07:16

Last week I had the great pleasure to attend Oracle Open World (OOW) for the first time, presenting No Silver Bullets – OBIEE Performance in the Real World at one of the ODTUG user group sessions on the Sunday. It was a blast, as the saying goes, but the week before OOW I was more nervous about the event itself than my presentation. Despite having been to smaller conferences before, OOW is vast in its scale and it felt like the week before going to university for the first time, full of uncertainty about what lay ahead and worrying that everyone would know everyone else except me! So during the week I jotted down a few things that I’d have found useful to know ahead of going and hopefully will help others going to OOW take it all in their stride from the very beginning.

Coming and going

I arrived on the Friday at midday SF time, and it worked perfectly for me. I was jetlagged so walked around like a zombie for the remainder of the day. Saturday I had the chance to walk around SF and get my bearings geographically, culturally and climate-wise. Sunday is “day zero” when all the user group sessions are held, along with the opening OOW keynote in the evening. I think if I’d arrived Saturday afternoon instead I’d have felt a bit thrust into it all straight away on the Sunday.

In terms of leaving, the last formal day is Thursday and it’s a full day of sessions too. I left straight after breakfast on Thursday and I felt I was leaving too early. But, OOW is a long few days & nights so chances are by Thursday you’ll be beat anyway, so check the schedule and plan your escape around it.

Accommodation

Book in advance! Like, at least two months in advance. There are 60,000 people descending on San Francisco, all wanting some place to stay.

Get an Airbnb; you get a lot more for your money than with a hotel. Wifi is generally going to be a lot better, and having a living space in which to exist is nicer than just a hotel room. Don’t fret about the “perfect” location – anywhere walkable to Moscone (where OOW is held) is good because it means you can drop your rucksack off at the end of the day etc, but other than that the events are spread around so you’ll end up walking further to at least some of them. Or, get an Uber like the locals do!

Sessions

Go to Oak Table World (OTW), it’s great, and free. Non-marketing presentations from some of the most respected speakers in the industry. Cuts through the BS. It’s also basically on the same site as the rest of OOW, so easy to switch back and forth between OOW/OTW sessions.

Go and say hi to the speakers. In general they’re going to want to know that you liked it. Ask questions — hopefully they like what they talk about so they’ll love to speak some more about it. You’ll get more out of a five minute chat than two hours of keynote. And on that subject, don’t fret about dropping sessions — people tweet them, the slides are usually available, and in fact you could be sat at your desk instead of OOW and have missed the whole lot so just be grateful for what you do see. Chance encounters and chats aren’t available for download afterwards; most presentations are. Be strict in your selection of “must see” sessions, lest you drop one you really really did want to see.

Use the schedule builder in advance, but download it to your calendar (watch out for line-breaks in the exported file that will break the import) and sync it to your mobile phone so you can see rapidly where you need to head next. Conference mobile apps are rarely that useful and frequently bloated and/or unstable.

Don’t feel you need to book every waking moment of every day to sessions. It’s not slacking if you go to half as many but are twice as effective from not being worn out!

Dress

Dress wise, jeans and polo is fine, company polo or a shirt for delivering presentations. Day wear is fine for evenings too, no need to dress up. Some people do wear shorts but they’re in the great minority. There are lots of suits around, given it is a customer/sales conference too.

Socialising

The sessions and random conversations with people during the day are only part of OOW — the geek chat over a beer (or soda) is a big part too. Look out for the Pythian blogger meetup, meetups from your country’s user groups, companies you work with, and so on.

Register for the evening events that you get invited to (ODTUG, Pythian, etc) because often if you haven’t pre-registered you can’t get in if you change your mind, whereas if you do register but then don’t go that’s fine as they’ll bank on no-shows. The evening events are great for getting to chat to people (dare I say, networking), as are the other events that are organised like the swim in the bay, run across the bridge, etc.

Sign up for stuff like the swim in the bay; it’s good fun – and I can’t even swim really. The run and bike ride across the bridge are two other events that get organised. Hang around on twitter for details, people like Yury Velikanov and Jeff Smith are usually in the know if not doing the actual organising.

General

When the busy days and long evenings start to take their toll don’t be afraid to duck out and go and decompress. Grab a shower, get a coffee, do some sight seeing. Don’t forget to drink water as well as the copious quantities of coffee and soda.

Get a data package for your mobile phone in advance of going eg £5 per day unlimited data. Conference wifi is just about OK at best, often flaky. Trying to organise short-notice meetups with other people by IM/twitter/email gets frustrating if you only get online half an hour after the time they suggested to meet!

Don’t pack extra clothes ‘just in case’. Pack minimally because (1) you are just around the corner from Market Street with Gap, Old Navy etc so can pick up more clothes cheaply if you need to and (2) you’ll get t-shirts from exhibitors, events (eg swim in the bay) and you’ll need the suitcase space to bring them all home. Bring a suitcase with space in or that expands, don’t arrive with a suitcase that’s already at capacity.

Food

So much good food and beer. Watch out for some of the American beers; they seem to start at about 5% ABV and go upwards, compared to around 3.6% ABV here in the UK. Knocking back the former at the same rate as the latter will get messy.

In terms of food you really are spoilt, some of my favourites were:

  • Lori’s diner (map) : As a Brit, I loved this American diner, and great food - yum yum. 5-10 minutes walk from Moscone.
  • Mel’s drive-in (map) : Just round the corner from Moscone, very busy but lots of seats. Great American breakfast experience! yum
  • Grove (map) : Good place for breakfast if you want somewhere a bit less greasy than a diner (WAT!)

 

Categories: BI & Warehousing

Adding Oracle Big Data SQL to ODI12c to Enhance Hive Data Transformations

Sun, 2014-10-05 15:29

An updated version of the Oracle BigDataLite VM came out a couple of weeks ago, and as well as updating the core Cloudera CDH software to the latest release it also included Oracle Big Data SQL, the SQL access layer over Hadoop that I covered on the blog a few months ago (here and here). Big Data SQL takes the SmartScan technology from Exadata and extends it to Hadoop, presenting Hive tables and HDFS files as Oracle external tables and pushing down the filtering and column-selection of data to individual Hadoop nodes. Any table registered in the Hive metastore can be exposed as an external table in Oracle, and a BigDataSQL agent installed on each Hadoop node gives them the ability to understand full Oracle SQL syntax rather than the cut-down SQL dialect that you get with Hive.

NewImage

There are two immediate use-cases that come to mind when you think about Big Data SQL in the context of BI and data warehousing; you can use Big Data SQL to include Hive tables in regular Oracle set-based ETL transformations, giving you the ability to reference Hive data during part of your data load; and you can also use Big Data SQL as a way to access Hive tables from OBIEE, rather than having to go through Hive or Impala ODBC drivers. Let’s start off in this post by looking at the ETL scenario using ODI12c as the data integration environment, and I’ll come back to the BI example later in the week.

You may recall that in a couple of posts earlier in the year on ETL and data integration on Hadoop, I looked at a scenario where I wanted to geo-code web server log transactions using an IP address range lookup file from a company called MaxMind. To determine the country for a given IP address you need to locate the IP address of interest within ranges listed in the lookup file, something that’s easy to do with a full SQL dialect such as that provided by Oracle:

NewImage

In my case, I’d want to join my Hive table of server log entries with a Hive table containing the IP address ranges, using the BETWEEN operator – except that Hive doesn’t support any type of join other than an equi-join. You can use Impala and a BETWEEN clause there, but in my testing anything other than a relatively small Hive table of log file entries took massive amounts of memory to join, because Impala works in-memory, which effectively ruled out doing the geo-lookup set-based. I then went on to do the lookup using Pig and a Python API into the geocoding database, but then you’ve got to learn Pig. I finally came up with my best solution using Hive streaming and a Python script that called that same API, but each of these approaches is fairly involved and requires a bit of skill and experience from the developer.
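To show why this is a range lookup rather than an equi-join, here is a minimal Python sketch of the BETWEEN logic on integer IP addresses. It is not the MaxMind API or anything ODI generates; the ranges use reserved documentation addresses and made-up country labels, and the function names are hypothetical.

```python
import bisect
import ipaddress

def ip_to_int(ip):
    """Convert a dotted IPv4 address to its integer form."""
    return int(ipaddress.IPv4Address(ip))

# Made-up lookup ranges (start, end, country), sorted by start address
RANGES = sorted([
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "Country A"),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "Country B"),
    (ip_to_int("203.0.113.0"), ip_to_int("203.0.113.255"), "Country C"),
])
STARTS = [r[0] for r in RANGES]

def lookup_country(ip):
    """Find the range where start <= ip <= end, i.e. ip BETWEEN start AND end."""
    n = ip_to_int(ip)
    i = bisect.bisect_right(STARTS, n) - 1  # last range starting at or before n
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i][2]
    return None

print(lookup_country("198.51.100.42"))  # Country B
print(lookup_country("8.8.8.8"))        # None
```

There is no single key value to match on, only a pair of bounds to fall between, which is exactly what an equi-join can’t express.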

But this of course is where Big Data SQL could be useful. If I could expose the Hive table containing my log file entries as an Oracle external table and then join that within Oracle to an Oracle-native lookup table, I could do my join using the BETWEEN operator and then output the join results to a temporary Oracle table; once that’s done I could then use ODI12c’s sqoop functionality to copy the results back down to Hive for the rest of the ETL process. Looking at my Hive database using SQL*Developer 4.0.3’s new ability to work with Hive tables I can see the table I’m interested in listed there:

NewImage

and I can also see it listed in the DBA_HIVE_TABLES static view that comes with Big Data SQL on Oracle Database 12c:

SQL> select database_name, table_name, location
  2  from dba_hive_tables
  3  where table_name like 'access_per_post%';

DATABASE_N TABLE_NAME                     LOCATION
---------- ------------------------------ --------------------------------------------------
default    access_per_post                hdfs://bigdatalite.localdomain:8020/user/hive/ware
                                          house/access_per_post

default    access_per_post_categories     hdfs://bigdatalite.localdomain:8020/user/hive/ware
                                          house/access_per_post_categories

default    access_per_post_full           hdfs://bigdatalite.localdomain:8020/user/hive/ware
                                          house/access_per_post_full

There are various ways to create the Oracle external tables over Hive tables in the linked Hadoop cluster, including using the new DBMS_HADOOP package to create the Oracle DDL from the Hive metastore table definitions or using SQL*Developer Data Modeler to generate the DDL from modelled Hive tables, but if you know the Hive table definition and it’s not too complicated, you might as well just write the DDL statement yourself using the new ORACLE_HIVE external table access driver. In my case, to create the corresponding external table for the Hive table I want to geo-code, it looks like this:

CREATE TABLE access_per_post_categories(
  hostname varchar2(100), 
  request_date varchar2(100), 
  post_id varchar2(10), 
  title varchar2(200), 
  author varchar2(100), 
  category varchar2(100),
  ip_integer number)
organization external
(type oracle_hive
 default directory default_dir
 access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));

Then it’s just a case of importing into ODI’s repository the metadata for the external table over Hive, along with the tables I’m going to join to and load the results into, and then creating a mapping to bring them all together.

NewImage

Importantly, I can create the join between the tables using the BETWEEN clause, something I just couldn’t do when working with Hive tables on their own.

NewImage

Running the mapping then joins the webserver log lookup table to the geocoding IP address range lookup table through the Oracle SQL engine, removing all the complexity of using Hive streaming, Pig or the other workaround solutions I used before. What I can then do is add a further step to the mapping to take the output of my join and use that to load the results back into Hive, like this:

NewImage

I’ll then use the IKM SQL to Hive-HBase-File (SQOOP) knowledge module to set up the export from Oracle into Hive.

NewImage

Now, when I run the mapping I can see the initial table join taking place between the Oracle native table and the Hive-sourced external table, and the results then being exported back into Hadoop at the end using the Sqoop KM.

NewImage

Finally, I can view the contents of the downstream Hive table loaded via Sqoop, and see that it does in fact contain the country name for each of the page accesses.

NewImage

Oracle Big Data SQL isn’t a solution suitable for everyone; it only runs on the BDA and requires Exadata for the database access, and it’s an additional license cost on top of the base BDA software bundle. But if you’ve got it available it’s an excellent way to blend Hive and Oracle data, and a great way around some of the restrictions around HiveQL and the Hive JDBC/ODBC drivers. More on this topic later next week, when I’ll look at using Big Data SQL in conjunction with OBIEE 11g.

Categories: BI & Warehousing

News and Updates from Oracle Openworld 2014

Sat, 2014-10-04 08:48

It’s the Saturday after Oracle Openworld 2014, and I’m now home from San Francisco and back in the UK. It’s been a great week as usual, with lots of product announcements and updates to the BI, DW and Big Data products we use on current projects. Here’s my take on what was announced this last week.

New Products Announced

From a BI and DW perspective, the most significant product announcements were around Hadoop and Big Data. Up to this point most parts of an analytics-focused big data project required you to code the solution yourself, with the diagram below showing the typical three steps in a big data project – data ingestion, analysis and sharing the results.

NewImage

At the moment, all of these steps are typically performed from the command-line using languages such as Python, R, Pig, Hive and so on, with tools like Apache Flume and Apache Sqoop used to bring data into and out of the Hadoop cluster. Under the covers, these tools use technologies such as MapReduce or Spark to do their work, automatically running jobs in parallel across the cluster and making use of the easy scalability of Hadoop and NoSQL databases.

You can also neatly divide the work up on a big data project into two phases; the “discovery” phase typically performed by a data scientist where data is loaded, analysed, correlated and otherwise “understood” to provide the initial insights, and then an “exploitation” phase where we apply governance, provide the output data in a format usable by BI tools and otherwise share the results with the wider corporate audience. The updated Information Management Reference Architecture we collaborated on with Oracle and launched in June this year had distinct discovery and exploitation phases, and the architecture itself made a clear distinction between the Innovation part that enabled the discovery phase of a project and the Execution part that delivered the insights and data in a more governed, production setting.

NewImage

This was the theme of the product announcements around analytics, BI, data warehousing and big data during Openworld 2014, with Oracle’s Omri Traub in the photo below taking us through Oracle’s big data product strategy. What Oracle are doing here is productising and “democratising” big data, putting it clearly in context of their existing database, engineered systems and BI products and linking them all together into an overall information management architecture and delivery process.

NewImage

So working through from ingestion through to data analysis, these steps have typically been performed by data scientists using scripting tools and rudimentary data visualisation engines, making them labour-intensive and reliant on a small set of people conversant with these tools and processes. Oracle Big Data Discovery is aimed squarely at these steps, and combines Apache Spark-based data preparation and transformation capabilities with an analysis and visualisation engine based on Endeca Server.

NewImage

Key features of Big Data Discovery include:

  • Ability to analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based transformation engine
  • Create a catalog of the data on your Hadoop cluster, and then search that catalog using Endeca Server search technologies
  • Create recommendations of other datasets that might interest you, based on what you’re looking at now
  • Visualize your datasets to help understand what they contain, and discover new insights

Under the covers it comprises two parts; the data loading, transformation and profiling part that uses Apache Spark to do its work in parallel across all the nodes in the cluster, and the analysis part, which takes data prepared by Apache Spark and loads it into the Endeca Server in-memory engine to perform the analysis, aggregation and data visualisation. Unlike the Spark part, the Endeca Server element runs just on one node and limits the size of the analysis dataset to what can run in-memory in the Endeca Server engine, but in practice you’re going to work with a sample of the data rather than the entire dataset at that stage (in time the assumption is that the Endeca Server engine will be unbundled and run natively on YARN, giving it the same scalability as the Spark-based data ingestion and transformation part). Initially Big Data Discovery will run on-premise with a cloud version later on, and it’s not dependent on Big Data Appliance – expect to see something later this year / early next year.

Another new product that addresses the discovery phase and discovery lab part of a big data project is Oracle Data Enrichment Cloud Service, from the Oracle Data Integration team and designed to complement ODI and Oracle EDQ. Whilst Oracle positioned ODECS as something you’d use as well as Big Data Discovery and typically upstream from BDD, to me there seemed to be a fair bit of overlap between the products, with both tools doing data profiling and transformation but BDD being more focused on the exploration and discovery part, and ODECS being more focused on early-stage data profiling and transformation.

NewImage

ODECS is clearly more of an ETL tool complement and runs natively in the cloud, right from the start. It’s most probably aimed at customers with their Hadoop dataset already in the cloud, maybe using Amazon Elastic MapReduce or Oracle’s new Hadoop-as-a-Service and has more in common with the old Data Quality Option for Oracle Warehouse Builder than Endeca’s search-first analytic interface. It’s got a very nice interface including a mobile-enabled website and the ability to include and merge in external datasets, including Oracle’s own Data as a Service platform offering. Along with the new Metadata Management tool Oracle also launched at Openworld it’s a great addition to the Oracle Data Integration product suite, but I can’t help thinking that its initial availability only on Oracle’s public cloud platform is going to limit its use with Oracle’s typical customers – we’ll have to just wait and see.

The other major product that addresses big data projects was Oracle Big Data SQL. Partly addressing the discovery phase of big data projects but mostly (to my mind) addressing the exploitation phase, and the execution part of the information management architecture, Big Data SQL gives Oracle Exadata the ability to return data from Hive and NoSQL on the Big Data Appliance as well as data from its normal relational store. I covered Big Data SQL on the blog a few weeks ago and I’ll be posting some more in-depth articles on it next week, but the other main technical innovation with the product is its bringing of Exadata’s SmartScan feature to Hadoop, projecting and filtering data at the Hadoop storage node level and also giving Hadoop the ability to understand regular Oracle SQL, rather than the cut-down version you get with HiveQL.
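
To make the SmartScan idea a little more concrete, here's a minimal Python sketch of projecting and filtering at the storage node before anything is sent back to the database tier. It isn't Big Data SQL code: the function names, the sample rows and the column names are all made up for illustration, and the only point is to show how much less data travels when the filter and the column list are applied where the data lives.

```python
from typing import Callable, Iterable, Iterator

Row = dict  # one record held on a (hypothetical) Hadoop storage node


def storage_node_scan(
    rows: Iterable[Row],
    columns: list,
    predicate: Callable[[Row], bool],
) -> Iterator[Row]:
    """Filter rows and keep only the requested columns, at the storage node."""
    for row in rows:
        if predicate(row):                      # filtering
            yield {c: row[c] for c in columns}  # projection


def naive_scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Ship every row and every column; the caller filters afterwards."""
    yield from rows


# Illustrative data only.
node_data = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0, "notes": "x" * 50},
    {"order_id": 2, "region": "APAC", "amount": 300.0, "notes": "y" * 50},
    {"order_id": 3, "region": "EMEA", "amount": 75.5, "notes": "z" * 50},
]

pushed_down = list(
    storage_node_scan(
        node_data,
        columns=["order_id", "amount"],
        predicate=lambda r: r["region"] == "EMEA",
    )
)
shipped_naively = list(naive_scan(node_data))

print(len(pushed_down), "rows returned with pushdown")
print(len(shipped_naively), "rows returned without it")
```

With the filter and projection applied at the node, only the two matching rows and two columns come back, rather than every row with its wide "notes" column.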

NewImage

Where this then leaves us is with the ability to do most of a big data project using (Oracle) tools, bringing big data analysis within reach of organisations with Oracle-style budgets but without access to rare data scientist-type resources. Going back to my diagram earlier, a post-OOW big data project using the new products launched in this last week could look something like this:

NewImage

Big Data SQL is out now and depends on BDA and Exadata for its use; Big Data Discovery should be out in a few months’ time, runs on-premise but doesn’t require BDA, whilst ODECS is cloud-only and runs on a BDA in the background. Expect more news and more integration/alignment from the products as 2014 ends and 2015 starts, and we’re looking forward to using them on Oracle-centric Hadoop projects in the near future.

Product Updates for BI, Data Integration, Exalytics, BI Applications and OBIEE

Other news announced over the week for products we more commonly use on projects include:

Finally, something that we were particularly pleased to see was the updated Oracle Information Management Architecture I mentioned earlier referenced in most of the analytics sessions, with Oracle’s Balaji Yelamanchili for example introducing it in his big data and business analytics general session mid-way through the week. 

NewImage  

We love the way this brings together the big data components and puts them in the context of the wider data warehouse and analytic processes, and compared to a few years ago when Hadoop and big data were considered completely separate to data warehousing and BI and done by staff completely different to the core business analytics team, this new reference architecture puts it squarely within the world of BI and analytics we work in. It also emphasises the new abilities Hadoop, NoSQL databases and big data can bring us – support for wider sets of data sources with dynamic schemas, the ability to economically work with and analyse much larger datasets, and support for discovery-type upfront analysis work. Finally, it recognises that to get true value out of analysis you start on Hadoop, you eventually need to add proper data governance, make the results more widely available using full SQL tools, and use the right tools – relational databases, OLAP servers and the like – to analyse the data once it’s in a more structured form.

If you missed our write-up on the updated Information Management Reference Architecture you can read our two-part blog post here and here, read the Oracle white paper, or listen to the podcast with OTN Archbeat’s Bob Rhubart. For now though I’m looking forward to seeing the family after a week and a half away in San Francisco – thanks to OTN and the Oracle ACE Director Program for sponsoring my visit over to SF for Openworld, and we’ll post our conference presentation slides later next week when we’re back in the UK and US offices.

Categories: BI & Warehousing

EPM and BI Meetup at Next Week’s Openworld (and details of our Oracle DI Speakeasy)

Fri, 2014-09-26 10:08

Just a short note to help publicise the Oracle Openworld 2014 EPM and BI Meetup that’s running next week, organised by Cameron Lackpour and Tim Tow from the ODTUG board.

This is an excellent opportunity for EPM and BI developers and customers to get together and network over drinks and food, and chat with members of the ODTUG board and maybe some of the EPM and BI product management team. It’s running at Piattini, located at 2331 Mission St. (between 19th St & 20th St), San Francisco, CA 94110 from 7pm to late and there are more details at this blog post by Cameron. The turnout should be pretty good, and if you’re an EPM or BI developer looking to meet up with others in your area this is a great opportunity to do so. Attendance is free and you just need to register using this form.

Similarly, if you’re into data warehousing and data integration you might be interested in our Rittman Mead / Oracle Data Integration’s Speakeasy event, running on the same evening (Tuesday September 30th 2014) from 7pm – 9pm at Local Edition, 691 Market St, San Francisco, CA. Aimed at ODI, OWB and data integration developers and customers and featuring members of the Rittman Mead team and Oracle’s Data Integration product team, again this is a great opportunity to meet with your peers and share stories and experiences. Registration is free and done through this registration form, with spaces still open at the time of posting.

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Service Administration

Thu, 2014-09-25 20:17

Earlier in the week we looked at the developer features within Oracle BI Cloud Service (BICS), aimed at departmental users who want the power of OBIEE 11g without the need to stand-up their own infrastructure. We looked at the process of uploading spreadsheets and other data to the Oracle Database Schema Service that accompanies BICS, how you create the BI Repository that translates the tables and columns you upload into measures, attributes and hierarchies, and then took a brief look at how dashboards and reports are created and then shared with other users in your department. If you’re coming in late, here’s the links to the previous posts in the series:

One of the design goals for BICS was to reduce the amount of administration work an end-user has to perform, and to simplify and consolidate any tasks that they do have to do. Behind the scenes BICS actually comprises a BI environment, and a database environment, with most of the administration work being concerned with the BI one. Let’s start by looking at the service administration page that you see when you first log into the BICS environment as an administrator, with the screenshot below showing the overview page for the overall service.

NewImage

Oracle BI Cloud Service is part of Oracle’s overall Oracle Platform-as-a-Service (PaaS) offering, with BICS being made up of a database service and a BI service. The screenshot above shows the overall availability of these two services over the past two weeks, and you click on either the database service or the BI service to drill into more detail. Let’s click on the BI service first.

NewImage

The BI service dashboard page shows the same availability statuses again, along with a few graphs to show usage over that period. Also on this page are details of the start and end date for the service contract, details of the SFTP user account you’ll need for some import/archive operations, and a link to Presentation Services for this instance, to launch the OBIEE Home Page.

The OBIEE home page, as we saw in previous posts in this series, has menu items for model editing, data uploading and creating reports and dashboards. What it also has though is a Manage menu item, as shown in the screenshot below, that takes you through to an administration function that lets you set up application roles and backup/restore the system.

NewImage

Application roles are the way that OBIEE groups permissions and privileges and then assigns them to sets of users. With on-premise OBIEE the only way to manage application roles is through Enterprise Manager Fusion Middleware Control, but with BICS this functionality has been moved into OBIEE proper so that non-system administrators can perform this task. The list of users you work with are the ones defined for your service (tenancy) and using this tool you can assign them to existing application roles, create new ones, or group one set of roles within another. Users themselves are created as part of the instance creation process, with the minimum (license) number of users for an instance being 10.

NewImage

The Snapshots tab on this same Service Console page provides access to a new, system-wide snapshot and restore function that provides the means to version your system, restore it from a backup and transport a dev/test environment to your production instance. As I mentioned in previous postings in the series, each tenant for BICS comes with two instances, one for dev/test and one for prod, and the snapshot facility gives you a means to copy everything from one environment into another, for when you’ve completed development and testing and want to put your dashboards into production.

NewImage

Taking a snapshot, as shown in the screenshot above, creates an archive file containing your RPD, the catalog and all the security settings, and you can store a number of snapshots within each environment, giving you a (very coarse-grained) versioning ability. What you can also do is download these snapshots as what are called “BI Archive” files as shown in the screenshot below, and it’s these archive files that you can then upload into your other instance to give you your code promotion process – note however that applying an archive file overwrites everything that was there before, so you’ll need to be careful doing this when users start creating reports in your production environment – really, it’s just a once-only code promotion facility followed then by a way of backing up and restoring your environments.

NewImage

Note also that you’ll separately need to backup and restore any database elements, as these aren’t automatically included in the BI archive process. Backup and restoration of database elements is done via the separate database instance service page shown below, where you can export the whole schema or just parts of it, and then retrieve the export file via an SFTP transfer.

NewImage

So that’s it in terms of BICS administration, and for our initial look at the BI Cloud Service platform. Rittman Mead are of course offering services around BICS and cloud BI in general so contact us if you’d like to give BICS a spin, and keep an eye on the blog over the next few weeks where we’ll take you through the example BICS application we built, reporting against Salesforce.com data using their REST API.

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Building Dashboards & Reports

Thu, 2014-09-25 04:00

This week we’ve been looking at the new Oracle BI Cloud Service (BICS), the cloud version of OBIEE11g that went GA at the start of this week. Rittman Mead were part of the beta program for BICS and spent a couple of weeks building a sample BICS application to put the product through its paces, creating a reporting application for Salesforce.com that pulled in its data via the Salesforce REST API and staged it in the Oracle Database Schema Service that comes with BICS. Earlier in the week we looked at how data was uploaded or transferred into the accompanying database schema, and yesterday looked at how the repository was created using the new thin-client data modeller. Today, we’ll look at how you create the dashboards and reports that your users will use, using the Analysis and Dashboard Editors that are part of the service. If you’re arriving mid-way through the series, here’s the links to the other posts in the series:

In fact creating analyses and dashboards is the part of BICS that has least changed compared to the on-premise version. In keeping with the “self-service” theme for BICS there’s an introductory set of guidance notes when you first connect to BICS, like this:

NewImage

and the dashboard and analysis editors are available as menu options on the Home page, along with a link to the Catalog view, like this:

NewImage

From that point on though it’s standard Answers and Dashboards, with the normal four-tab editor view within Answers (the Analysis Editor) and the ability to create views, calculations, filters and so on. Anyone familiar with Answers will be at home within the cloud version, and there’s a new visualisation – the heat map view, as shown in the final screenshot later in this article – that hints at other visualisations that we’ll see featured first in the cloud version of OBIEE, expected to be updated more frequently than the on-premise version (one of the major selling points for customers looking to adopt new features as soon as possible with OBIEE).

NewImage

What’s missing from this environment though are features like Agents and alerts, scorecards and BI Publisher, or the ability to create actions other than links to other web pages or catalog content.

NewImage

These are features that Oracle are saying they’ll add-back in time though as the underlying infrastructure for BICS builds-out, and of course the whole UI is likely to go through a rev with the 12c release of OBIEE due sometime in 2015. Dashboards are also created in the same way as with on-premise OBIEE, with the same Dashboard Editor and access to features like conditional display of sections and support for presentation variables.

NewImage

So, that wraps-up our quick tour around the analysis and dashboard creation parts of Oracle BI Cloud Service; tomorrow, to finish-up the series we’ll look at the administration elements of BICS including new self-service application role provisioning, tools for administering and monitoring the instance and for backing-up and migrating content from one instance to another.

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Creating the Repository

Wed, 2014-09-24 04:00

Earlier in this series we’ve looked at the overall product proposition for Oracle BI Cloud Service (BICS), and how you upload data to the Database Schema Service that comes with it. Today, we’re going to look at what’s involved in creating the BI Repository that holds the metadata about your logical tables, calculations and dimension hierarchies, using the new thin-client data modeller that like the rest of BICS runs entirely within your web browser. For anyone coming into the series mid-way, here’s the links to the other posts in the series:

So anyone familiar with OBIEE will know that a central part of the product, and the part of it that makes it easy for users to work with their data, is the business-orientated semantic model that you create over your source data. Held within what’s called the “BI Repository” and made-up of physical, logical and presentation layers, the semantic model turns what can be a complex set of source tables, joins and cross-application links into a simple to understand set of subject areas made up of fact tables and dimensions. Regular on-premise OBIEE semantic models can get pretty complex, with joins across different database types, logical tables with several different ways you can provide their data – for example, at detail-level from an Oracle data warehouse whilst at summary level, from an Essbase cube, and to edit them you use a dedicated Windows development tool called BI Administration.

Allowing these complex data models, and having a dependency on a Windows-based development tool, poses two main issues for any consumer-style version of OBIEE; first, if the aim of the service is to attract customers who want to create their systems “self-service”, you’ve got to make the repository development process a lot simpler than it currently is – you can’t expect customers to go on a course or buy my excellent book when they just want to get a dashboard up and running with the minimum fuss. You also can’t realistically expect them to install a Windows-only development tool back at the office as most of their target customers won’t have admin privileges on their workstations, or they might even be using Macs or work out of a browser; and then, even if they get it installed you’ll need to ensure there’s a network connection available to the BI Server in the cloud through their corporate firewall. Clearly, a browser-based repository creation tool was needed, ideally one that did some of the basic work automatically for the user and didn’t need hours or days of training to understand. Of course, the risk to this is that you create a repository editing tool that’s too “dumbed-down” for most developers to find useful, and we’ll consider that possibility later in the article.

So following the data upload process that we covered in yesterday’s post, we’re now in a position where we’ve got a number of tables sitting in Oracle Database Schema Service, and we’re ready to build a repository to report against them. To access the thin-client data modeller you click on the Model menu item on the BICS homepage, as shown in the screenshot below.

NewImage

The modeller itself supports a simplified subset of what you can create with the full BI Administration tool. You’ve got a single source, the Oracle Database Schema Service, and a single business model. Business model tables have a logical table source as you’d normally expect, but just the one LTS is currently supported. Calculations within logical tables are supported, but they’re logical-level only (i.e. post-aggregation) with no current support for physical-level (pre-aggregation) at this point.
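
If the difference between logical-level (post-aggregation) and physical-level (pre-aggregation) calculations isn't obvious, a small worked example might help. This is just plain Python arithmetic rather than anything BICS generates, and the revenue and quantity figures are invented: the same "revenue divided by quantity" formula gives different answers depending on whether the division happens after or before the rows are summed.

```python
# Two detail rows: (revenue, quantity). Figures are illustrative only.
rows = [(100.0, 10.0), (90.0, 30.0)]

# Logical-level calculation: aggregate each measure first, then divide.
# This is the post-aggregation behaviour the thin-client modeller supports.
post_aggregation = sum(rev for rev, _ in rows) / sum(qty for _, qty in rows)

# Physical-level calculation: divide row by row, then aggregate the results.
# This pre-aggregation style is the one the post says is not yet supported.
pre_aggregation = sum(rev / qty for rev, qty in rows)

print("post-aggregation:", post_aggregation)  # 190 / 40
print("pre-aggregation:", pre_aggregation)    # 10 + 3
```

The first gives an average price per unit across all rows (4.75), whilst the second adds up the per-row prices (13.0), so which level a calculation runs at changes what the measure actually means.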

NewImage

Level-based hierarchies within the business model are supported, including skip-level and ragged ones, and there’s support for time-series dimensions including their own editor.

NewImage

Where possible, introspection is used when creating the business model components, with table joins and matching column names used to create candidate logical joins. Static and dynamic repository variables, along with session variables are supported, with the front-end also supporting presentation and request variables – so all good there.
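
As a rough idea of what name-based introspection involves, here's a small Python sketch that proposes candidate joins wherever two tables share a column name. It's an assumption about the general technique rather than a description of how the BICS modeller is implemented, and the table and column names are hypothetical.

```python
from itertools import combinations


def candidate_joins(tables: dict) -> list:
    """Return (table_a, table_b, column) for every column name two tables share."""
    joins = []
    for (name_a, cols_a), (name_b, cols_b) in combinations(sorted(tables.items()), 2):
        for column in sorted(cols_a & cols_b):
            joins.append((name_a, name_b, column))
    return joins


# Hypothetical uploaded tables and their column names.
tables = {
    "ORDERS": {"ORDER_ID", "CUSTOMER_ID", "AMOUNT"},
    "CUSTOMERS": {"CUSTOMER_ID", "NAME"},
    "PRODUCTS": {"PRODUCT_ID", "NAME"},
}

for join in candidate_joins(tables):
    print(join)
```

Note that CUSTOMERS and PRODUCTS get matched on NAME here, which is almost certainly not a real join. That's why suggestions like these are only candidates, to be reviewed before they go into the business model.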

NewImage

Under the covers, each tenant within BICS has their own RPD and their own catalog, and any edits to the repository that you perform are effectively “online” edits. To make edits to an existing model the developer therefore has to first “lock” the model, make their changes and add their new entries and then validate them, and then either revert the model or publish the changes. 
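
The lock, edit, validate and then publish-or-revert cycle can be sketched as a small state machine. The Python below is purely illustrative: the class, its method names and the single validation rule are my own inventions and not a BICS API, but it shows why a lock is needed when every edit is effectively an online one.

```python
import copy


class ModelSession:
    """Toy model of a lock / edit / validate / publish-or-revert workflow."""

    def __init__(self, published: dict):
        self.published = published  # the model that users currently query
        self.draft = None           # working copy, only present while locked
        self.locked_by = None

    def lock(self, user: str) -> None:
        if self.locked_by is not None:
            raise RuntimeError(f"model is already locked by {self.locked_by}")
        self.locked_by = user
        self.draft = copy.deepcopy(self.published)

    def validate(self) -> list:
        # Hypothetical rule: every logical table needs at least one column.
        return [name for name, columns in self.draft.items() if not columns]

    def publish(self) -> None:
        errors = self.validate()
        if errors:
            raise ValueError(f"cannot publish, invalid tables: {errors}")
        self.published = self.draft
        self._unlock()

    def revert(self) -> None:
        # Throw the draft away; the published model is untouched.
        self._unlock()

    def _unlock(self) -> None:
        self.draft = None
        self.locked_by = None


session = ModelSession({"Orders": ["Amount"]})
session.lock("developer_1")
session.draft["Customers"] = ["Name"]
session.publish()
print(sorted(session.published))
```

Because changes are only made to the draft, users carry on querying the published model until the developer explicitly publishes, and a revert simply discards the draft.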

NewImage

In the background BICS updates the RPD using the metadata web service API for the BI Server, with the RPD it creates the same format as the ones we create on-premise, just with a smaller set of features supported through the thin-client admin tool.

As I mentioned in the first post in the series, each tenant install of BICS comes with two instances; one for development or pre-prod and one for production. To move a completed repository out of one environment into another a new feature called a “BI Archive” is used, a snapshot of your BICS system that includes the repository, the catalog and any security objects you create. In this first version of BICS each import is total and overwrites everything that was in the instance beforehand, so there’s no incremental import or ability to selectively import just certain objects or certain reports into a new environment, meaning that you’ll lose any reports or dashboards created in production if you subsequently refresh it from dev/pre-prod – something to bear in mind.
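
To illustrate what a "total" import means in practice, here's a tiny Python sketch that treats each instance as a dictionary holding a repository, a catalog and security objects. The structure and the names in it are assumptions made for the example and say nothing about the real BI Archive file format.

```python
import copy


def import_archive(instance: dict, archive: dict) -> None:
    """Replace the instance's entire content with the archive; nothing is merged."""
    instance.clear()
    instance.update(copy.deepcopy(archive))


# Hypothetical dev and prod instances.
dev = {
    "rpd": "sales_model_v2",
    "catalog": ["Sales Dashboard"],
    "security": ["Sales Analyst"],
}
prod = {
    "rpd": "sales_model_v1",
    "catalog": ["Sales Dashboard", "Ad-hoc Q3 Report"],  # second report created in prod
    "security": ["Sales Analyst"],
}

archive = copy.deepcopy(dev)  # snapshot taken in dev
import_archive(prod, archive)
print(prod["catalog"])
```

After the import the report that only existed in production has gone, which is the behaviour to watch out for once users start creating their own content there.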

One other thing to be aware of is that there’s no ability to create alias tables or opaque views in the thin-client modeller, so if you want to create additional copies of a dimension table for more than one dimension role, or you want to create a table using an arbitrary SELECT statement you’ll need to go into ApEx and create a database view instead – not a huge imposition as ApEx comes with tools for creating these pretty easily, but something that will lead to a more complex database model in-time. The screenshot below shows one such database view then exposed through the thin-client modeller, where you can see the SELECT statement behind it (but not alter or amend it except through ApEx).

NewImage
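
As a sketch of the database-view workaround for role-playing dimensions, the Python below uses the standard library's sqlite3 module as a stand-in for the Oracle schema (in BICS you'd create the views through ApEx instead). The table, view and column names are hypothetical; the idea is simply one view per role over the same date dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite standing in for the cloud schema
conn.executescript(
    """
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, cal_date TEXT);
    CREATE TABLE fact_orders (
        order_id INTEGER,
        order_date_key INTEGER,
        ship_date_key INTEGER
    );
    INSERT INTO dim_date VALUES (20140922, '2014-09-22'), (20140924, '2014-09-24');
    INSERT INTO fact_orders VALUES (1, 20140922, 20140924);

    -- One view per dimension role, in place of alias tables in the modeller.
    CREATE VIEW dim_order_date AS
        SELECT date_key AS order_date_key, cal_date AS order_date FROM dim_date;
    CREATE VIEW dim_ship_date AS
        SELECT date_key AS ship_date_key, cal_date AS ship_date FROM dim_date;
    """
)

rows = conn.execute(
    """
    SELECT f.order_id, o.order_date, s.ship_date
    FROM fact_orders f
    JOIN dim_order_date o ON o.order_date_key = f.order_date_key
    JOIN dim_ship_date s ON s.ship_date_key = f.ship_date_key
    ORDER BY f.order_id
    """
).fetchall()
print(rows)
```

Each view can then be brought into the model as if it were its own table, at the cost of a few more objects to look after in the database.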

Finally, the thin-client modeller supports row-level and subject area security, using filters or object permissions that you either set up manually or create by reference to application roles granted to your users. We’ll look at what’s involved in setting up security and application roles in the final post in this series, where we look at administering your BICS instance.
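
Row-level security driven by application roles boils down to attaching a filter to each role and applying it to whatever the user queries. The Python sketch below shows that idea only; the role names, the filters and the choice to combine multiple roles with OR are assumptions for the example rather than documented BICS behaviour.

```python
# Hypothetical application roles, each mapped to a row-level filter.
ROLE_FILTERS = {
    "EMEA Sales": lambda row: row["region"] == "EMEA",
    "APAC Sales": lambda row: row["region"] == "APAC",
}


def visible_rows(rows: list, user_roles: list) -> list:
    """Return the rows allowed by at least one of the user's role filters."""
    predicates = [ROLE_FILTERS[role] for role in user_roles if role in ROLE_FILTERS]
    # A user with no matching role has no predicates and so sees no rows.
    return [row for row in rows if any(p(row) for p in predicates)]


sales = [
    {"region": "EMEA", "amount": 120},
    {"region": "APAC", "amount": 300},
    {"region": "AMER", "amount": 50},
]

print(visible_rows(sales, ["EMEA Sales"]))
print(visible_rows(sales, ["EMEA Sales", "APAC Sales"]))
```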

So, that’s a high-level view of the repository creation process; in tomorrow’s post, we’ll look at what’s involved in creating reports and dashboards.

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Provisioning Data

Tue, 2014-09-23 04:00

In the first post in this series I looked at the new Oracle BI Cloud Service, which went GA over the weekend and which Rittman Mead have been using these past few weeks as part of a beta release. In the first post I looked at what BICS is and who it’s aimed at in this initial release, and went through the features at a high-level; over the rest of the week I’ll be looking at the features in-detail, starting today with the data upload and provisioning process. Here’s the links to the rest of the series, with the items getting updated over the week as I post each entry in the series:

As I mentioned in that first post, “Introduction to Oracle BI Cloud Service : Product Overview”, BICS in this initial release to my mind is aimed at departmental use-cases where someone wants to quickly upload and analyse an offline dataset and share the results with other members of their team. BICS comes bundled with Oracle Database Schema Service and 50GB of storage, and OBIEE in this setup reports just against this data source with no ability to reach-out dynamically to other data sources or blend those sources with the main one in Oracle’s cloud database. It’s aimed really at users with a single source of data to work with, who’ve probably obtained it as an export from some other system and just want to be able to report against it, though as we’ll see later in this post it is possible to link to other SaaS sources with a bit of PL/SQL wizardry.

So the first task you’re likely to perform when working with BICS is to upload some data to report on. There are three main options for uploading data to BICS, two of which are browser-based and aimed at end-users, and one that uses SQL Developer and is aimed more at devs. BICS itself comes with a menu item on the home page for uploading data, and this is what we think users will use most as it’s built into the tool and fairly prominent.

NewImage

Clicking on this menu item launches an ApEx application hosted in the Database Schema Service that comes with BICS, and which allows you to upload and parse XLS and delimited file-types to the database cloud instance and then store the contents in database tables.
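
For anyone wondering what "upload and parse a delimited file, then store it in a table" amounts to, here's a minimal Python sketch using the standard csv and sqlite3 modules. SQLite is standing in for the Database Schema Service, every column is loaded as text, and the function and table names are made up; the real ApEx loader will do considerably more, such as data type detection.

```python
import csv
import io
import sqlite3


def load_delimited(conn, table: str, text: str, delimiter: str = ",") -> int:
    """Create a table from the header row and insert the remaining rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    column_defs = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
sample = "region,amount\nEMEA,120\nAPAC,75\n"  # illustrative file contents
loaded = load_delimited(conn, "SALES_UPLOAD", sample)
print(loaded, "rows loaded")
print(conn.execute('SELECT * FROM "SALES_UPLOAD"').fetchall())
```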

NewImage

Oracle Database Schema Service also comes with Application Express (ApEx) as a front-end, and ApEx has similar tools for uploading datasets into the service, with additional features for creating views and PL/SQL packages to process and manipulate the data, something we used in our beta program example to connect to Salesforce.com and download data using their REST API. In-theory you shouldn’t need to use these features much, but SIs and partners such as ourselves will no doubt use ApEx a lot to build out the loading infrastructure, data cleansing and other features that you might want for a packaged cloud app – so get your PL/SQL books out and brush-up on ApEx development.

NewImage

The other way to get data into BICS is to use Oracle SQL Developer, which has a special Oracle Cloud connector type that allows you to view and work with database objects as if they were regular database ones, and upload data to the cloud in the form of “carts”. I’d imagine these options will get extended over time, either by tools or utilities Oracle release for this v1.0 BICS release, or by BICS eventually supporting the full Oracle Database Instance Service that’ll support regular SQL*Net connections from ETL tools.

NewImage

So once you’ve got some data uploaded into Database Schema Services, you’ll end up with a set of source tables from which you can create your BI Repository. Check back tomorrow for more details on how BICS’s new thin-client data modeller works and how you create your business model against this cloud data source, including how the repository editing and checkout process works in this new potentially multi-user development environment.

 

Categories: BI & Warehousing

Introduction to Oracle BI Cloud Service : Product Overview

Mon, 2014-09-22 05:02

Long-term readers of this blog will probably know that I’m enthusiastic about the possibilities around running OBIEE in the cloud, and over the past few weeks Rittman Mead have been participating in the beta program for release one of Oracle’s Business Intelligence Cloud Service (BICS). BICS went GA over the weekend and is now live on Oracle’s public cloud site, so all of this week we’ll be running a special five-part series on what BI Cloud Service is, how it works and how you go about building a simple application. I’m also presenting on BICS and our beta program experiences at Oracle Openworld this week (Oracle BI in the Cloud: Getting Started, Deployment Scenarios, and Best Practices [CON2659], Monday Sep 29 10:15 AM – 11:00 AM Moscone West 3014), so if you’re at the event and want to hear our thoughts, come along.

Over the next five days I’ll be covering the following topics, and I’ll update the list with hyperlinks once the articles are published:

So what is Oracle BI Cloud Service, and how does it relate to regular, on-premise OBIEE11g?

On the Oracle BI Cloud Service homepage, Oracle position the product as “Agile Business Intelligence in the Cloud for Everyone”, and there’s a couple of key points in this positioning that describe the product well.

NewImage

The “agile” part is referring to the point that being cloud-based, there’s no on-premise infrastructure to stand-up, so you can get started a lot quicker than if you needed to procure servers, get the infrastructure installed, configure the software and get it accepted by the IT department. Agile also refers to the fact that you don’t need to purchase perpetual or one/two-year term licenses for the software, so you can use OBIEE for more tactical projects without having to worry about expensive long-term license deals. The final way that BICS is “agile” is in the simplified, user-focused tools that you use to build your cloud-based dashboards, with BICS adopting a more consumer-like user interface that in-theory should mean you don’t have to attend a course to use it.

BICS is built around standard OBIEE 11g, with an updated user interface that’ll roll-out across on-premise OBIEE in the next release and the standard Analysis Editor, Dashboard Editor and repository (RPD) under the covers. Your initial OBIEE homepage is a modified version of the standard OBIEE homepage that lists standard developer functions down the left-hand side as a series of menu items, and the BI Administration tool is replaced with an online, thin-client repository editor that provides a subset of the full BI Administration tool functionality.

NewImage

Customers who license BICS in this initial release get two environments (or instances) to work with; a pre-prod or development environment to create their applications in initially, and a production environment into which they deploy each release of their work. BICS is also bundled with Oracle Database Schema Service, a single-schema Oracle Database service with an ApEx front-end into which you store the data that BICS reports on, and with ApEx and BICS itself having tools to upload data into it; this is, however, the only data source that BICS in version 1 supports, so any data that your cloud-based dashboards report on has to be loaded into Database Schema Service before you can use it, and you have to use Oracle’s provided tools to do this as regular ETL tools won’t connect. We’ll get onto the data provisioning process in the next article in this five-part series.

BICS dashboards and reports currently support a subset of what’s available in the on-premise version. The Analysis Editor (“Answers”) is the same as on-premise OBIEE with the catalog view on the left-hand side, tabs for Results and so on, and the same set of view types (and in fact a new one, for heat maps). There’s currently no access to Agents, Scorecards, BI Publisher or any other Presentation Services features that require a database back-end though, or any Essbase database in the background as you get with on-premise OBIEE 11.1.1.7+.

NewImage

What does become easier to deploy though is Oracle BI Mobile HD as every BICS instance is, by definition, accessible over the internet. Last time I checked the current version of BI Mobile HD on Apple’s App Store couldn’t yet connect, but I’m presuming an update will be out shortly to deal with BICS’s login process, which gets you to enter a BICS username and password along with an “identity domain” that specifies the particular company tenant ID that you use.

NewImage

I’ll cover the thin-client data modeller later in this series in more detail, but at a high-level what this does is remove the need for you to download and install Oracle BI Administration to set up your BI Repository, something that would have been untenable for Oracle if they were serious about selling a cloud-based BI tool. The thin-client data modeller takes the most important (to casual users) features of BI Administration and makes them available in a browser-based environment, so that you can create simple repository models against a single data source and add features like dimension hierarchies, calculations, row-based and subject-area security using a point-and-click environment.

NewImage

Features that are excluded in this initial release include the ability to define multiple logical table sources for a logical table, creating multiple business areas, creating calculations using physical (vs. logical) tables and so on, and there’s no way to upload on-premise RPDs to BICS, or download BICS ones to use on-premise, at this stage. What you do get with BICS is a new import and export format called a “BI Archive” which bundles up the RPD, the catalog and the security settings into a single archive file, and which you use to move applications between your two instances and to store backups of what you’ve created.

So what market is BICS aimed at in this initial release, and what can it be used for? I think it’s fair to say that in this initial release, it’s not a drop-in replacement for on-premise OBIEE 11g, with only a subset of the on-premise features initially supported and some fairly major limitations such as only being able to report against a single database source, no access to Agents, BI Publisher, Essbase and so on. But like the first iteration of the iPhone or any consumer version of a previously enterprise-only tool, it’s trying to do a few things well and aiming at a particular market – in this case, departmental users who want to stand up an OBIEE environment quickly, maybe only for a limited amount of time, and who are familiar with OBIEE and would like to carry on using it. In some ways its target market is those OBIEE customers who might otherwise have used Qlikview, Tableau or one of the new SaaS BI services such as Good Data, who most probably have some data exports in the form of Excel spreadsheets or CSV documents, want to upload them to a BI service without getting all of IT involved and then share the results in the form of dashboards and reports with their team. Pricing-wise this appears to be who Oracle are aiming the service at (minimum 10 users, $3500/month including 50GB of database storage) and with the product being so close to standard OBIEE functionality in terms of how you use it, it’s most likely to appeal to customers who already use OBIEE 11g in their organisation.

That said, I can see partners and ISVs adopting BICS to deliver cloud-based SaaS BI applications to their customers, either as stand-alone analysis apps or as add-ons to other SaaS apps that need reporting functionality. Oracle BI Cloud Service is part of the wider Oracle Platform-as-a-Service (PaaS) that includes Java (WebLogic), Database, Documents, Compute and Storage, so I can see companies such as ourselves developing reporting applications for the likes of Salesforce, Oracle Sales Cloud and other SaaS apps and then selling them, hosting included, through Oracle’s cloud platform; I’ll cover our initial work in this area, developing a reporting application for Salesforce.com data, later in this series.


Of course it’s been possible to deploy OBIEE in the cloud for some while, with this presentation of mine from BIWA 2014 covering the main options; indeed, Rittman Mead host OBIEE instances for customers in Amazon AWS and do most of our development and training in the cloud including our exclusive “ExtremeBI in the Cloud” agile BI service; but BICS has two major advantages for customers looking to cloud-deploy OBIEE:

  • It’s entirely thin-client, with no need for local installs of BI Administration and so forth. There’s also no need to get involved with Enterprise Manager Fusion Middleware Control for adding users to application roles, defining application role mappings and so on
  • You can license it monthly, including data storage. No other on-premise license option lets you do this, with the shortest term license being one year

such that we’ll be offering it as an alternative to AWS hosting for our ExtremeBI product, for customers who in particular want the monthly license option.

So, an interesting start. As I said, I’ll be covering the detail of how BICS works over the next five days, starting with the data upload and provisioning process in tomorrow’s post – check back tomorrow for the next instalment.

Categories: BI & Warehousing

Getting The Users’ Trust – Part 2

Thu, 2014-09-18 04:35

Last time I wrote about the performance aspects of a BI system and how they could affect a user’s confidence. I concluded by mentioning that incorrect data might be generated by poorly coded ETL routines causing data loss or duplication. This time I am looking more at the quality of the data we load (or don’t load).

Back in the 1990s I worked with a 4.5 TB DWH that had a single source for fact and reference data, that is the data loaded was self-consistent. Less and less these days we find a single source DWH to be the case; we are adding multiple data sources (both internal and external). Customers can now appear on CRM, ERP, social media, credit referencing, loyalty, and a whole host of other systems. This proliferation of data sources gives rise to a variety of issues we need to be at least aware of, and in reality, should be actively managing. Some of these issues require us to work out processing rules within our data warehouse such as what do we do with fact data that arrives before its supporting reference data; I once had a system where our customer source could only be extracted once a week but purchases made by new customers would appear in our fact feed immediately after customer registration. Obviously, it is a business call on whether we publish facts that involve yet to be loaded customers straight away or defer those loads until the customer has been processed in the DWH. In the case of my example we needed to auto-create new customers in the data warehouse with just the minimum of data, the surrogate key and the business key, and then do a SCD type 1 update when the full customer data profile is loaded the following week. Technical issues such as these are trivial: we formulate and agree a business rule to define our actions and we implement it in our ETL or, possibly, the reporting code. In my opinion the bigger issues to resolve are in Data Governance and Data Quality.
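
As a rough illustration of the late-arriving customer rule described above, here is a minimal Python sketch. It is not the original ETL code: the class and method names (CustomerDimension, key_for_fact, load_customer) and the "inferred" flag are hypothetical, and an in-memory dictionary stands in for the dimension table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CustomerRow:
    surrogate_key: int
    business_key: str
    name: Optional[str] = None  # unknown until the weekly customer feed arrives
    inferred: bool = True       # True while the row is only a placeholder


class CustomerDimension:
    """Toy in-memory customer dimension keyed on the business key."""

    def __init__(self) -> None:
        self._rows: Dict[str, CustomerRow] = {}
        self._next_key = 1

    def _get_or_create(self, business_key: str) -> CustomerRow:
        row = self._rows.get(business_key)
        if row is None:
            # Minimal placeholder: surrogate key and business key only.
            row = CustomerRow(self._next_key, business_key)
            self._rows[business_key] = row
            self._next_key += 1
        return row

    def key_for_fact(self, business_key: str) -> int:
        """Called by the fact load; auto-creates a placeholder if needed."""
        return self._get_or_create(business_key).surrogate_key

    def load_customer(self, business_key: str, name: str) -> int:
        """Weekly customer load: SCD type 1, so attributes are overwritten
        in place and the surrogate key never changes."""
        row = self._get_or_create(business_key)
        row.name = name
        row.inferred = False
        return row.surrogate_key

    def get(self, business_key: str) -> CustomerRow:
        return self._rows[business_key]
```

Because the surrogate key is assigned when the first fact arrives and is never replaced, facts loaded during the week still join to the completed customer row after the weekly load.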

Some people combine Data Quality and Governance together as a single topic and believe that a single solution will put all right. However, to my mind, they are completely separate issues. Data quality is about the content of the data and governance is about ownership, provenance and business management of the data. Today, Data Governance is increasingly becoming a regulatory requirement, especially in finance.

Governance is much more than the data lineage tools we might access in ETL tools such as ODI and even OWB. ETL lineage is about source to target mappings; our ability to say that ‘bank branch name’ comes from this source attribute, travels through these multiple ODI mappings and finally updates that column in our BANK_BRANCH dimension table. In true Data Governance we probably do some or all of these:

  • Create a dictionary of approved business terms. This will define every attribute in business terms and also provide translations between geographic and business-unit centric ways of viewing data. In finance one division may talk about “customer”, another division will say “investor”, a third says “borrower”; in all three cases we are really talking about the same kind of object, a person. This dictionary should go down to the level of individual attribute and measures and include the type of data being held such as text, currency, date-time, these data types are logical types and not physical types as seen on the actual sources. It is important that this dictionary is shared throughout the organisation and is “the true definition” of what is reported.
  • Define ownership (or stewardship) for the approved business data item.
  • Map business data sources and targets to our approved list of terms (at attribute level). It is very possible that some attributes will have multiple potential sources, in such cases we must specify which source will be the master source.
  • Define processes to keep our business data aligned.  
  • Define ownership for the sources for design (and for static data such as ISO country codes, content) change accountability. Possibly integrate this into the change notification mechanism of the change process.
  • Define data release processes for approved external reference data.
  • Define data access and redaction rules for compliance purposes.
  • Build in audit and control.

As you can see we are not, in the main, talking data content; instead we are improving our description of the business data over what is already held in database data dictionaries and XSD files. This is still metadata and is almost certainly best managed in some kind of Data Governance application. One tool we might consider for this is Oracle Data Relationship Manager from the Hyperion family of products. If we want to go more DIY it may be possible to leverage some of the data responsibility features of Oracle SQL Developer Data Modeller.

Whereas governance is about using the right data and having processes and people to guarantee it is correctly sourced, Data Quality is much finer in grain and looks at the actual content. Here a tool such as Oracle Enterprise Data Quality is invaluable. By the way I have noticed that OEDQ version 12 has recently been released, I have a blog on this in the pipeline.

I tend to divide Data Quality into three disciplines:

  • Data Profiling is always going to be our first step. Before we fix things we need to know what to fix! Generally, we try to profile a sample of the data and assess it column by column, row by row to build a picture of the actual content. Typically we look at data range, nulls, number of distinct values and in the case of text data: character types used (alpha, letter case, numeric, accents, punctuation etc), regular expressions. From this we develop a plan to tackle quality, for example on a data entry web-page we may want to tighten processing rules to prevent certain “anticipated” errors; more usually we come up with business rules to apply in our next stage. 
  • Data Assessment. Here we test the full dataset against the developed rules to identify data that conforms or needs remedy. This remedy could be referring the data back to the source system owner for correction, providing a set of data fixes to apply to the source which can be validated and applied as a batch, creating processes to “fix” data on the source at initial data entry, or (and I would strongly advise against this for governance reasons) dynamically fix in an ETL process. The reason I am against fixing data downstream in ETL is that the data we report on in our Data Warehouse is not going to match the source and this will be problematic when we try to validate if our data warehouse fits reality.
  • Data de-duplication. This final discipline of our DQ process is the most difficult: identifying data that is potentially duplicated in our data feed. In data quality terms a duplicate is where two or more rows refer to what is probably (statistically) the same item, which is a lot more fuzzy than an exact match in database terms; people miskey data, call centre staff mis-hear names, companies merge and combine data sets, and I have even seen customers registering a new email address because they cannot be bothered to reset their password on an e-selling website. De-duplication is important to improve the accuracy of BI in general, and it is nigh-on mandatory for organisations that need to manage risk and prevent fraud.
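
To make the idea of a "probable" duplicate a little more concrete, here is a small Python sketch using only the standard library's difflib. It is not how Oracle Enterprise Data Quality works; the function names and the 0.85 threshold are assumptions chosen purely for illustration, and a real matching process would compare several attributes, not just a name.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity score between 0.0 and 1.0 for two normalised strings."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def candidate_duplicates(
    names: List[str], threshold: float = 0.85
) -> List[Tuple[int, int, float]]:
    """Return (index_a, index_b, score) for every pair scoring at or above
    the threshold. These are candidates for review, not confirmed matches."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs
```

For example, candidate_duplicates(["Jon Smith", "John Smith", "Maria Garcia"]) flags only the first two names as a pair worth reviewing. Comparing every pair is quadratic, so on real volumes some form of blocking (comparing only rows that share, say, a postcode) would be needed.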

Data Quality is so important to trusted BI; without it we run the risk that our dimensions do not roll up correctly and that we under-report by separating our duplicates. However, being correct at the data warehouse is only part of the story; these corrections also need to be made on the sources, and to do that we have to implement processes and disciplines throughout the organisation.

For BI that users can trust we need to combine both data management disciplines. From governance we need to be sure that we are using the correct business terms for all attributes and that the data displayed in those attributes has made the correct journey from the original source. From quality we gain confidence that we are correctly aggregating data in our reporting.

At the end of the day we need to be right to be trusted.

 

 

Categories: BI & Warehousing

Getting The Users’ Trust – Part 1

Wed, 2014-09-17 03:02

Looking back over some of my truly ancient Rittman Mead blogs (so old in fact that they came with me when I joined the company soon after Rittman Mead was launched), I see recurrent themes on why people “do” BI and what makes for successful implementations. After all, why would an organisation wish to invest serious money in a project if it does not give value either in terms of cost reduction or increasing profitability through smart decisions? This requires technology to provide answers and a workforce that is both able to use this technology and has faith that the answers returned allow them to do their jobs better. Giving users this trust in the BI platform generally boils down to resolving these three issues: ease of use of the reporting tool, quickness of data return and “accuracy” or validity of the response. These last two issues are a fundamental part of my work here at Rittman Mead and underpin all that I do in terms of BI architecture, performance, and data quality. Even today as we adapt our BI systems to include Big Data and Advanced Analytics I follow the same sound approaches to ensure usable, reliable data and the ability to analyse it in a reasonable time.

Storage is cheap so don’t aggregate away your knowledge. If my raw data feed is sales by item by store by customer by day and I only store it in my data warehouse as sales by month by state I can’t go back to do any analysis on my customers, my stores, my products. Remember that the UNGROUP BY only existed in my April Fools’ post. Where you choose to store your ‘unaggregated’ data may well be different these days; Hadoop and schema on read paradigms often being a sensible approach. Mark Rittman has been looking at architectures where both the traditional DWH and Big Data happily co-exist.

When improving performance I tend to avoid tuning specific queries, instead I aim to make frequent access patterns work well. Tuning individual queries is almost always not a sustainable approach in BI; this week’s hot, ‘we need the answer immediately’ query may have no business focus next week. Indexes that we create to make a specific query fly may have no positive effect on other queries; indeed, indexes may degrade other aspects of BI performance such as increased data load times and have subtle effects such as changing a query plan cost so that groups of materialized views are no longer candidates in query re-write (this is especially true when you use nested views and the base view is no longer accessed).

My favoured performance improvement techniques are: correctly placing the data, be it clustering, partitioning, compressing, table pinning, in-memory or whatever, and making sure that the query optimiser knows all about the nature of the data; again and again “right” optimiser information is key to good performance. Right is not just about running DBMS_STATS.gather_XXX over tables or schemas every now and then; it is also about telling the optimiser about relationships between data items. Constraints describe the data, for example which columns allow NULL values and which columns are part of parent-child relationships (foreign keys). Extended table statistics can help describe relationships between columns in a single table; for example, in a product dimension table the product sub-category and the product category columns will have an interdependence. Without that knowledge cardinality estimates can be very wrong and favour nested loop style plans that could perform very poorly on large data sets.
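
The effect of that missing knowledge can be shown with some simple arithmetic. The Python sketch below uses a deliberately simplified model of cardinality estimation (rows divided by the number of distinct values), not Oracle's actual costing, and the 1,000-row product dimension with 10 categories and 100 sub-categories is made-up data.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (category, sub_category)

# 100 sub-categories with 10 products each; every sub-category belongs to
# exactly one of 10 categories, so the two columns are strongly correlated.
ROWS: List[Row] = [
    (f"cat{sub // 10}", f"sub{sub}") for sub in range(100) for _ in range(10)
]


def estimate_independent(rows: List[Row]) -> float:
    """Estimate for 'category = X AND sub_category = Y' when the two
    predicates are assumed to be independent of each other."""
    n = len(rows)
    ndv_category = len({category for category, _ in rows})
    ndv_sub_category = len({sub for _, sub in rows})
    return n / ndv_category / ndv_sub_category


def estimate_column_group(rows: List[Row]) -> float:
    """Estimate when the number of distinct (category, sub_category)
    combinations is known, as a column-group statistic would provide."""
    return len(rows) / len(set(rows))


def actual(rows: List[Row], category: str, sub_category: str) -> int:
    """True number of rows matching both predicates."""
    return sum(1 for row in rows if row == (category, sub_category))
```

With this data the independence assumption estimates 1 row, while the column-group figure and the true count for a matching pair such as ("cat3", "sub35") are both 10. A tenfold underestimate on a small dimension is harmless, but the same error multiplied through joins to a large fact table is the kind of thing that tips a plan towards nested loops.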

Sometimes we will need to create aggregates to answer queries quickly; I tend to build ‘generic’ aggregates, those that can be used by many queries. Often I find that although data is loaded frequently, even near-real-time, many business users wish to look at larger time windows such as week, month, or quarter; in practice I see little need for day level aggregates over the whole data warehouse timespan, however, there will always be specific cases that might require day-level summaries. If I build summary tables or use Materialized Views I would aim to make tables that are at least 80% smaller than the base table and to avoid aggregates that partially roll up many dimensional hierarchies; customer category by product category by store region by month would probably not be the ideal aggregate for most real-user queries. That said, Oracle does allow us to use fancy grouping semantics in the building of aggregates (grouping sets, group by rollup and group by cube). The in-database Oracle OLAP cube functionality is still alive and well (and was given a performance boost in Oracle 12c); it may be more appropriate to aggregate in a cube (or relational-look-alike) rather than individual summaries.
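
The 80% rule of thumb can be checked before an aggregate is built by counting the distinct grouping keys in the base data. The following Python sketch is only an illustration; the function names and the tiny (day, store) fact set in the usage note are assumptions, and in a real warehouse the same count would come from a COUNT(DISTINCT ...) style query.

```python
from typing import Callable, Hashable, Iterable, Tuple

Fact = Tuple[int, str]  # (day_of_month, store)


def aggregate_row_count(
    facts: Iterable[Fact], key: Callable[[Fact], Hashable]
) -> int:
    """Number of rows the aggregate would hold: one per distinct grouping key."""
    return len({key(fact) for fact in facts})


def reduction(base_rows: int, aggregate_rows: int) -> float:
    """Fraction by which the aggregate is smaller than the base table."""
    if base_rows <= 0:
        raise ValueError("base_rows must be positive")
    return 1 - aggregate_rows / base_rows


def worth_building(
    base_rows: int, aggregate_rows: int, min_reduction: float = 0.80
) -> bool:
    """Apply the rule of thumb: build only if at least 80% smaller."""
    return reduction(base_rows, aggregate_rows) >= min_reduction
```

For 30 days of data across two stores (60 base rows), grouping by store alone gives 2 rows and passes the test easily, whereas grouping by day and store gives 60 rows, no reduction at all, and so is not worth materialising.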

Getting the wrong results quickly is no good; we must be sure that the results we display are correct. As professional developers we test to prove that we are not losing or gaining data through incorrect joins and filters, but ETL coding is often the smallest factor in “incorrect results” and this brings me to part 2, Data Quality.

Categories: BI & Warehousing

Rittman Mead/Oracle Data Integration Speakeasy @ Oracle Open World

Thu, 2014-09-11 10:59

If you are attending Oracle Open World this year and fancy a bit of a different experience, come and join Rittman Mead and Oracle’s Data Integration teams for drinks and networking at 7pm on Tuesday 30th September at the Local Edition speakeasy on Market Street.

We will be providing a couple of hours of free drinks with the opportunity to quiz our leading data integration experts and Oracle’s data integration team about any aspect of the data integration toolset, architecture and our innovative implementation approaches, and to relax and kick back at the end of a long day. So whether you want to know about how ODI can facilitate your big data strategy, or implement data quality and data governance across your enterprise data architecture, please come along.

The Local Edition is located at 691 Market St, San Francisco, CA and the event runs from 7pm to 9pm. Please register here.

For further information on this event and the sessions we are presenting at Oracle Open World contact us at info@rittmanmead.com.

Categories: BI & Warehousing

Using Oracle GoldenGate for Trickle-Feeding RDBMS Transactions into Hive and HDFS

Wed, 2014-09-10 15:13

A few months ago I wrote a post on the blog around using Apache Flume to trickle-feed log data into HDFS and Hive, using the Rittman Mead website as the source for the log entries. Flume is a good technology to use for this type of capture requirement as it captures log entries, HTTP calls, JMS queue entries and other “event” sources easily, has a resilient architecture and integrates well with HDFS and Hive. But what if the source you want to capture activity for is a relational database, for example Oracle Database 12c? With Flume you’d need to spool the database transactions to file, whereas what you really want is a way to directly connect to the database engine and capture the changes from source.

Which is exactly what Oracle GoldenGate does, and what most people don’t realise is that GoldenGate can also load data into HDFS and Hive, as well as the usual database targets. Although Hive and HDFS aren’t fully-supported targets yet, you can use the Oracle GoldenGate for Java adapter to act as the handler process and then land the data in HDFS files or Hive tables on your target Hadoop platform. My Oracle Support has two tech notes, “Integrating OGG Adapter with Hive (Doc ID 1586188.1)” and “Integrating OGG Adapter with HDFS (Doc ID 1586210.1)” that give example implementations of the Java adapters you’d need for these two target types, with the overall end-to-end process for landing Hive data looking like the diagram below (and the HDFS one just swapping out HDFS for Hive at the handler adapter stage).

NewImage

This is also a good example of the sorts of technology we’d use to implement the “data factory” concept within the new Oracle Information Management Reference Architecture, the part of the architecture that moves data between the Hadoop and NoSQL-based Data Reservoir, and the relationally-stored enterprise information store; in this case, trickle-feeding transactional data from the Oracle database into Hadoop, perhaps to archive it at lower-cost than we could do in an Oracle database, or to add transaction activity data to a Hadoop-based application

NewImage

So I asked my colleague Nelio Guimaraes to set up a GoldenGate capture process on our Cloudera CDH5.1 Hadoop cluster, using GoldenGate 12.1.2.0.0 for our source Oracle 11gR2 database and Oracle GoldenGate for Java, downloadable separately on edelivery.oracle.com under Oracle Fusion Middleware > Oracle GoldenGate Application Adapters 11.2.1.0.0 for JMS and Flat File Media Pack. In our example, we’re going to capture activity on the SCOTT.DEPT table in the Oracle database, and then perform the following steps to set up replication from it into a replica Hive table:

  1. Create a table in Hive that corresponds to the table in Oracle database.
  2. Create a table in the Oracle database and prepare the table for replication.
  3. Configure the Oracle GoldenGate Capture to extract transactions from the Oracle database and create the trail file.
  4. Configure the Oracle GoldenGate Pump to read the trail and invoke the custom adapter
  5. Configure the property file for the Hive handler
  6. Code, Compile and package the custom Hive handler
  7. Execute a test.

Setting up the Oracle Database Source Capture

Let’s go into the Oracle database first, check the table definition, and then connect to Hadoop to create a Hive table of the same column definition.

[oracle@centraldb11gr2 ~]$ sqlplus scott/tiger
SQL*Plus: Release 11.2.0.3.0 Production on Thu Sep 11 01:08:49 2014
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL> describe DEPT
 Name Null? Type
 ----------------------------------------- -------- ----------------------------
 DEPTNO NOT NULL NUMBER(2)
 DNAME VARCHAR2(14)
 LOC VARCHAR2(13)
SQL> exit
...
[oracle@centraldb11gr2 ~]$ ssh oracle@cdh51-node1
Last login: Sun Sep 7 16:11:36 2014 from officeimac.rittmandev.com
[oracle@cdh51-node1 ~]$ hive
...
create external table dept
(
 DEPTNO string, 
 DNAME string, 
 LOC string
) row format delimited fields terminated by '\;' stored as textfile
location '/user/hive/warehouse/department'; 
exit
...
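
Since the Hive table is declared over semicolon-delimited text files with every column typed as a string, each replicated row ends up as one line of text in HDFS. The Python sketch below shows that record format as a round trip. It is not the Java handler from the My Oracle Support note; the function names are hypothetical, and it simply rejects values containing the delimiter rather than escaping them.

```python
from typing import Dict, Mapping

DELIMITER = ";"                       # matches the Hive table's field terminator
COLUMNS = ("deptno", "dname", "loc")  # column order of the Hive table


def to_record(row: Mapping[str, object]) -> str:
    """Format one source row as a delimited line, in Hive column order."""
    values = [str(row[column]) for column in COLUMNS]
    for value in values:
        if DELIMITER in value or "\n" in value:
            # A delimiter inside a value would shift the columns when read back.
            raise ValueError(f"value {value!r} contains a delimiter or newline")
    return DELIMITER.join(values)


def from_record(line: str) -> Dict[str, str]:
    """Split a delimited line back into named columns, all as strings."""
    values = line.rstrip("\n").split(DELIMITER)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))
```

A row such as deptno 75, dname EXEC, loc BRIGHTON becomes the line 75;EXEC;BRIGHTON, and reading it back gives the same three values as strings, which is why the Hive columns are all declared as string here.
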

Then I install Oracle Golden Gate 12.1.2 on the source Oracle database, just as you’d do for any Golden Gate install, and make sure supplemental logging is enabled for the table I’m looking to capture. Then I go into the ggsci Golden Gate command-line utility, to first register the user it’ll be connecting as, and what table it needs to capture activity for.

[oracle@centraldb11gr2 12.1.2]$ cd /u01/app/oracle/product/ggs/12.1.2/
[oracle@centraldb11gr2 12.1.2]$ ./ggsci
$ggsci> DBLOGIN USERID sys@ctrl11g, PASSWORD password sysdba
$ggsci> ADD TRANDATA SCOTT.DEPT COLS(DEPTNO), NOKEY

GoldenGate uses a number of components to replicate data from source to targets, as shown in the diagram below.

NewImage

For our purposes, though, there are just three that we need to configure: the Extract component, which captures table activity on the source; the Pump process that moves data (or the “trail”) from source database to the Hadoop cluster; and the Replicat component that takes that activity and applies it to the target tables. In our example, the extract and pump processes will be as normal, but we need to create a custom “handler” for the target Hive table that uses the Golden Gate Java API and the Hadoop FS Java API.

The tool we use to set up the extract and capture process is ggsci, the command-line Golden Gate Software Command Interface. I’ll use it first to set up the Manager process that runs on both source and target servers, giving it a port number and connection details into the source Oracle database.

$ggsci> edit params mgr
PORT 7809
USERID sys@ctrl11g, PASSWORD password sysdba
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/12.1.2/dirdat/*, USECHECKPOINTS

Then I create two configuration files, one for the extract process and one for the pump process, and then use those to start those two processes.

$ggsci> edit params ehive
EXTRACT ehive
USERID sys@ctrl11g, PASSWORD password sysdba
EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, FORMAT RELEASE 11.2
TABLE SCOTT.DEPT;
$ggsci> edit params phive
EXTRACT phive
RMTHOST cdh51-node1.rittmandev.com, MGRPORT 7809
RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, FORMAT RELEASE 11.2
PASSTHRU
TABLE SCOTT.DEPT;
$ggsci> ADD EXTRACT ehive, TRANLOG, BEGIN NOW
$ggsci> ADD EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, EXTRACT ehive
$ggsci> ADD EXTRACT phive, EXTTRAILSOURCE /u01/app/oracle/product/ggs/12.1.2/dirdat/et
$ggsci> ADD RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, EXTRACT phive

As the Java event handler on the target Hadoop platform won’t ordinarily be able to get table metadata for the source Oracle database, we’ll use the defgen utility on the source platform to create the source definitions file that the process on the target will need.

$ggsci> edit params dept
defsfile ./dirsql/DEPT.sql
USERID ggsrc@ctrl11g, PASSWORD ggsrc
TABLE SCOTT.DEPT;
./defgen paramfile ./dirprm/dept.prm NOEXTATTR

Note that NOEXTATTR means no extra attributes; because the version on target is a generic and minimal version, the definition file with extra attributes won’t be interpreted. Then, this DEPT.sql file will need to be copied across to the target Hadoop platform where you’ve installed Oracle GoldenGate for Java, to the /dirsql folder within the GoldenGate install. 

[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password: 
Last login: Wed Sep 10 17:05:49 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]$ pwd
/u01/app/oracle/product/ggs/11.2.1
[oracle@cdh51-node1 11.2.1]$ ls dirsql/
DEPT.sql

Then, going back to the source Oracle database platform, we’ll start the Golden Gate Manager process, and then the extract and pump processes.

[oracle@cdh51-node1 11.2.1]$ ssh oracle@centraldb11gr2
oracle@centraldb11gr2's password: 
Last login: Thu Sep 11 01:08:18 2014 from bdanode1.rittmandev.com
GGSCI (centraldb11gr2.rittmandev.com) 7> start mgr
Manager started.
 
GGSCI (centraldb11gr2.rittmandev.com) 8> start ehive
 
Sending START request to MANAGER ...
EXTRACT EHIVE starting
 
GGSCI (centraldb11gr2.rittmandev.com) 9> start phive
 
Sending START request to MANAGER ...
EXTRACT PHIVE starting

Setting up the Hadoop / Hive Replicat Process

Setting up the Hadoop side involves a couple of similar steps to the source capture side; first we configure the parameters for the Manager process, then configure the extract process that will pull table activity off of the trail file, sent over by the pump process on the source Oracle database.

[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password: 
Last login: Wed Sep 10 21:09:38 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]$ ./ggsci
$ggsci> edit params mgr
PORT 7809
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/11.2.1/dirdat/*, usecheckpoints, minkeepdays 3
$ggsci> add extract tphive, exttrailsource /u01/app/oracle/product/ggs/11.2.1/dirdat/rt
$ggsci> edit params tphive
EXTRACT tphive
SOURCEDEFS ./dirsql/DEPT.sql
CUserExit ./libggjava_ue.so CUSEREXIT PassThru IncludeUpdateBefores
GETUPDATEBEFORES
TABLE SCOTT.DEPT;

Now it’s time to create the Java handler that will write the trail data to the HDFS files and Hive table. The My Oracle Support Doc.ID 1586188.1 I mentioned at the start of the article has a sample Java program called SampleHandlerHive.java that writes incoming transactions into an HDFS file within the Hive directory, and also writes it to a file on the local filesystem. To get this working on our Hadoop system, we created a new Java source code file from the content in SampleHandlerHive.java, updated the path from hadoopConf.addResource to point to the correct location of core-site.xml, hdfs-site.xml and mapred-site.xml, and then compiled it as follows:

export CLASSPATH=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/*
javac -d . SampleHandlerHive.java

Successfully executing the above command created the SampleHandlerHive.class under /u01/app/oracle/product/ggs/11.2.1/dirprm/com/mycompany/bigdata. To create the JAR file that the GoldenGate for Java adapter will need, I then need to change directory to the /dirprm directory under the Golden Gate install, and then run the following commands:

jar cvf myhivehandler.jar com
chmod 755 myhivehandler.jar

I also need to create a properties file for this JAR to use, in the same /dirprm directory. This properties file amongst other things tells the Golden Gate for Java adapter where in HDFS to write the data to (the location where the Hive table keeps its data files), and also references any other JAR files from the Hadoop distribution that it’ll need to get access to.

[oracle@cdh51-node1 dirprm]$ cat tphive.properties 
#Adapter Logging parameters. 
gg.log=log4j
gg.log.level=info
 
#Adapter Check pointing  parameters
goldengate.userexit.chkptprefix=HIVECHKP_
goldengate.userexit.nochkpt=true
 
# Java User Exit Property
goldengate.userexit.writers=jvm
jvm.bootoptions=-Xms64m -Xmx512M -Djava.class.path=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/u01/app/oracle/product/ggs/11.2.1/dirprm:/u01/app/oracle/product/ggs/11.2.1/dirprm/myhivehandler.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-common-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-configuration-1.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-logging-1.1.3.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-lang-2.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop/conf.dist:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/guava-11.0.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/hadoop-auth-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-hdfs-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/commons-cli-1.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/protobuf-java-2.5.0.jar
 
#Properties for reporting statistics
# Minimum number of {records, seconds} before generating a report
jvm.stats.time=3600
jvm.stats.numrecs=5000
jvm.stats.display=TRUE
jvm.stats.full=TRUE
 
#Hive Handler.  
gg.handlerlist=hivehandler
gg.handler.hivehandler.type=com.mycompany.bigdata.SampleHandlerHive
gg.handler.hivehandler.HDFSFileName=/user/hive/warehouse/department/dep_data
gg.handler.hivehandler.RegularFileName=cinfo_hive.txt
gg.handler.hivehandler.RecordDelimiter=;
gg.handler.hivehandler.mode=tx
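
With a classpath this long, a single mistyped JAR path is an easy mistake to make, so it can be worth checking the file before starting the process. The Python sketch below is a hypothetical helper, not part of GoldenGate: it parses a properties file like the one above, pulls the -Djava.class.path value out of jvm.bootoptions, and lists any entries that do not exist on the local filesystem.

```python
import os
from typing import Dict, List

CLASSPATH_FLAG = "-Djava.class.path="


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only; values may contain '=' themselves.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def classpath_entries(props: Dict[str, str]) -> List[str]:
    """Return the colon-separated classpath entries from jvm.bootoptions."""
    for token in props.get("jvm.bootoptions", "").split():
        if token.startswith(CLASSPATH_FLAG):
            return token[len(CLASSPATH_FLAG):].split(":")
    return []


def missing_entries(entries: List[str]) -> List[str]:
    """Entries that are neither an existing file nor an existing directory."""
    return [entry for entry in entries if not os.path.exists(entry)]
```

Run on the Hadoop node, something like missing_entries(classpath_entries(parse_properties(open("tphive.properties").read()))) returns an empty list when every JAR and directory on the classpath is present.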

Now, the final step on the Hadoop side is to start its Golden Gate Manager process, and then start the Replicat and apply process.

GGSCI (cdh51-node1.rittmandev.com) 5> start mgr
 
Manager started. 
 
GGSCI (cdh51-node1.rittmandev.com) 6> start tphive
 
Sending START request to MANAGER ...
EXTRACT TPHIVE starting

Testing it All Out

So now I’ve got the extract and pump processes running on the Oracle Database side, and the apply process running on the Hadoop side, let’s do a quick test and see if it’s working. I’ll start by looking at what data is in each table at the beginning.

SQL> select * from dept;     

    DEPTNO DNAME          LOC
---------- -------------- -------------
        10 ACCOUNTING     NEW YORK
        20 RESEARCH       DALLAS
        30 SALES          CHICAGO
        40 OPERATIONS     BOSTON
        50 TESTE          PORTO
        60 NELIO          STS
        70 RAQUEL         AVES
 
7 rows selected.

Over on the Hadoop side, there’s just one row in the Hive table:

hive> select * from customer;

OK 80MARCIA   ST

Now I’ll go back to Oracle and insert a new row in the DEPT table:

SQL> insert into dept (deptno, dname, loc)
  2  values (75, 'EXEC','BRIGHTON'); 

1 row created. 
SQL> commit; 

Commit complete.

And, going back over to Hadoop, I can see that Golden Gate has added that record to the Hive table, with the Golden Gate for Java adapter writing the transaction to the underlying HDFS file.

hive> select * from customer;

OK 80MARCIA   ST
75 EXEC       BRIGHTON

So there you have it; Golden Gate replicating Oracle RDBMS transactions into HDFS and Hive, to complement Apache Flume’s ability to replicate log and event data into Hadoop. Moreover, as Michael Rainey explained in this three part blog series, Golden Gate is closely integrated into the new 12c release of Oracle Data Integrator, making it even easier to manage Golden Gate replication processes as part of your overall data loading project, and giving Hadoop developers and Golden Gate users access to the full set of load orchestration and data quality features in that product rather than having to rely on home-grown scripting, or Oozie.

Categories: BI & Warehousing

OBIEE SampleApp in The Cloud: Importing VirtualBox Machines to AWS EC2

Wed, 2014-09-10 01:40

Virtualisation has revolutionised how we work as developers. A decade ago, using new software would mean trying to find room on a real tin server to install it, hoping it worked, and if it didn’t, picking apart the pieces probably leaving the server in a worse state than it was to begin with. Nowadays, we can just launch a virtual machine to give a clean environment and if it doesn’t work – trash it and start again.
The sting in the tail of virtualisation is that full-blown VMs are heavy – for disk we need several GB just for a blank OS, and dozens of GB if you’re talking about a software stack such as Fusion MiddleWare (FMW), and the host machine needs to have the RAM and CPU to support it all too. Technologies such as Linux Containers go some way to making things lighter by abstracting out a chunk of the OS, but this isn’t something that’s reached the common desktop yet.

So whilst VMs are awesome, it’s not always practical to maintain a library of all of them on your local laptop (even 1TB drives fill up pretty quickly), nor will your laptop have the grunt to run more than one or two VMs at most. VMs like this are also local to your laptop or server – but wouldn’t it be neat if you could duplicate that VM and make a server based on it instantly available to anyone in the world with an internet connection? And that’s where The Cloud comes in, because it enables us to store as much data as we can eat (and pay for), and provision “hardware” at the click of a button for just as long as we need it, accessible from anywhere.

Here at Rittman Mead we make extensive use of Amazon Web Services (AWS) and their Elastic Compute Cloud (EC2) offering. Our website runs on it, our training servers run on it, and it scales just as we need it to. A class of 3 students is as easy to provision for as a class of 24 – no hunting around for spare servers or laptops, no hardware sat idle in a cupboard as spare capacity “just in case”.

One of the challenges that we’ve faced up until now is that all servers have had to be built from scratch in the cloud. Obviously we work with development VMs on local machines too, so wouldn’t it be nice if we could build VMs locally and then push them to the cloud? Well, now we can. Amazon offer a route to import virtual machines, and in this article I’m going to show how that works. I’ll use the superb SampleApp v406 VM that Oracle provide, because this is a great real-life example of a VM that is so useful, but many developers can find too memory-intensive to be able to run on their local machines all the time.

This tutorial is based on exporting a Linux guest VM from a Linux host server. A Windows guest probably behaves differently, but a Mac or Windows host should work fine since VirtualBox is supported on both. The specifics are based on SampleApp, but the process should be broadly the same for all VMs. 

Obtain the VM

We’re going to use SampleApp, which can be downloaded from Oracle.

  1. Download the six-part archive from http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples–167534.html
  2. Verify the md5 checksums against those published on the download page:
    [oracle@asgard sampleapp406]$ ll
    total 30490752
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 01:33 SampleAppv406.zip.001
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 01:30 SampleAppv406.zip.002
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:03 SampleAppv406.zip.003
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:34 SampleAppv406.zip.004
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:19 SampleAppv406.zip.005
    -rw-r--r-- 1 oracle oinstall 4977591522 Sep  9 02:53 SampleAppv406.zip.006
    [oracle@asgard sampleapp406]$ md5sum *
    2b9e11f69ada5f889088dd74b5229322  SampleAppv406.zip.001
    f8a1a5ae6162b20b3e9c6c888698c071  SampleAppv406.zip.002
    68438cfea87e8d3a2e2f15ff00dadf12  SampleAppv406.zip.003
    b71d9ace4f75951198fc8197da1cfe62  SampleAppv406.zip.004
    4f1a5389c9e0addc19dce6bbc759ec20  SampleAppv406.zip.005
    2c430f87e22ff9718d5528247eff2da4  SampleAppv406.zip.006
  3. Unpack the archive using 7zip — the instructions for SampleApp are very clear that you must use 7zip, and not another archive tool such as winzip.
    [oracle@asgard sampleapp406]$ time 7za x SampleAppv406.zip.001
    7-Zip (A) [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
    p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,80 CPUs)
    
    Processing archive: SampleAppv406.zip.001
    
    Extracting SampleAppv406Appliance
    Extracting SampleAppv406Appliance/SampleAppv406ga-disk1.vmdk
    Extracting SampleAppv406Appliance/SampleAppv406ga.ovf
    
    Everything is Ok
    
    Folders: 1
    Files: 2
    Size: 31191990916
    Compressed: 5242880000
    
    real 1m53.685s
    user 0m16.562s
    sys 1m15.578s
  4. Because we need to change a couple of things on the VM first (see below), we’ll have to import the VM to VirtualBox so that we can boot it up and make these changes. You can import using the VirtualBox GUI, or as I prefer, the VBoxManage command line interface. I like to time all these things (just because, numbers), so stick a time command on the front:
    time VBoxManage import --vsys 0 --eula accept SampleAppv406Appliance/SampleAppv406ga.ovf

    This took 12 minutes or so, but that was on a high-spec system, so YMMV.
    [...]
    0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
    Successfully imported the appliance.
    
    real    12m15.434s
    user    0m1.674s
    sys     0m2.807s
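
If you’d rather not compare the hashes in step 2 by eye, md5sum can do the comparison for you. Here’s a minimal sketch, assuming you’ve saved the checksums from the download page into a file in md5sum’s own format (hash, two spaces, file name). The function name and the SampleAppv406.md5 file name are my own placeholders, not something Oracle ship.

```shell
# verify_checksums <checksum-file>
# Checks every file named in <checksum-file> (md5sum's "<hash>  <name>" format),
# resolving the names relative to the directory that holds the list.
verify_checksums() (
    list="$1"
    cd "$(dirname "$list")" || exit 1
    # --quiet prints only the files that fail; the exit status is
    # non-zero if any file is missing or does not match
    md5sum --quiet -c "$(basename "$list")"
)

# Example (hypothetical file name):
#   verify_checksums SampleAppv406.md5 && echo "all six parts OK"
```

The function runs in a subshell, so the cd doesn’t change the directory of the shell you call it from.
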
Preparing the VM

Importing Linux VMs to Amazon EC2 will only work if the kernel is supported, which according to an AWS blog post includes Red Hat Enterprise Linux 5.1 – 6.5. Whilst SampleApp v406 is built on Oracle Linux 6.5 (which isn’t listed by AWS as supported), we have the option of telling the VM to use a kernel that is Red Hat Enterprise Linux compatible (instead of the default Unbreakable Enterprise Kernel – UEK). There are some other pre-requisites that you need to check if you’re trying this with your own VM, including a network adaptor configured to use DHCP. The aforementioned blog post has details.

  1. Boot the VirtualBox VM, which should land you straight in the desktop environment, logged in as the oracle user.
  2. We need to modify a file as root (superuser). Here’s how to do it graphically, or use vi if you’re a real programmer:
    1. Open a Terminal window from the toolbar at the top of the screen
    2. Enter
      sudo gedit /etc/grub.conf

      The sudo bit is important, because it tells Linux to run the command as root. (I’m on an xkcd-roll here: 1, 2)

    3. In the text editor that opens, you will see a header to the file and then a set of repeating sections beginning with title. These are the available kernels that the machine can run under. The default is 3, which is zero-based, so it’s the fourth title section. Note that the kernel version details include uek which stands for Unbreakable Enterprise Kernel – and is not going to work on EC2.
    4. Change the default to 0, so that we’ll instead boot to a Red Hat Compatible Kernel, which will work on EC2
    5. Save the file
  3. Optional steps:
    1. Whilst you’ve got the server running, add your SSH key to the image so that you can connect to it easily once it is up on EC2. For more information about SSH keys, see my previous blog post here, and a step-by-step for doing it on SampleApp here.
    2. Disable non-SSH key logins (in /etc/ssh/sshd_config, set PasswordAuthentication no and PubkeyAuthentication yes), so that your server once on EC2 is less vulnerable to attack. Particularly important if you’re using the stock image with Admin123 as the root password.
    3. Set up screen, and OBIEE and the database as a Linux service, both covered in my article here.
  4. Shut down the instance by entering this at a Terminal window:

    sudo shutdown -h now
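
The grub.conf edit in step 2 can also be scripted, which is handy if you prepare VMs like this regularly. This is a sketch rather than a recipe for every image: it assumes a GRUB legacy file with a single default=N line and boot entries that start with title, as on SampleApp, and a sed that supports -i (GNU sed does). The function names are my own.

```shell
# list_grub_titles <grub.conf>
# Print each boot entry with its zero-based index, so you can see
# which number to use as the default.
list_grub_titles() {
    awk '/^title/ { printf "%d: %s\n", n++, $0 }' "$1"
}

# set_grub_default <grub.conf> <index>
# Keep a copy of the file as <grub.conf>.bak, then rewrite the default= line.
set_grub_default() {
    case "$2" in
        ''|*[!0-9]*) echo "index must be a non-negative integer" >&2; return 1 ;;
    esac
    cp -p "$1" "$1.bak" &&
    sed -i "s/^default=.*/default=$2/" "$1"
}

# Example, from a root shell because /etc/grub.conf is owned by root:
#   list_grub_titles /etc/grub.conf
#   set_grub_default /etc/grub.conf 0
```

Check the listing before changing anything; on a different image the Red Hat Compatible Kernel may not be entry 0.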

Export the VirtualBox VM to Amazon EC2

Now we’re ready to really get going. The first step is to export the VirtualBox VM to a format that Amazon EC2 can work with. Whilst they don’t explicitly support VMs from VirtualBox, they do support the VMDK format – which VirtualBox can create. You can do the export from the graphical interface, or as before, from the command line:

time VBoxManage export "OBIEE SampleApp v406" --output OBIEE-SampleApp-v406.ovf

0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Successfully exported 1 machine(s).

real    56m51.426s
user    0m6.971s
sys     0m12.162s

If you compare the result of this to what we downloaded from Oracle it looks pretty similar – an OVF file and a VMDK file. The only difference is that the VMDK file is updated with the changes we made above, including the modified kernel settings which are crucial for the success of the next step.

[oracle@asgard sampleapp406]$ ls -lh
total 59G
-rw------- 1 oracle oinstall  30G Sep  9 10:55 OBIEE-SampleApp-v406-disk1.vmdk
-rw------- 1 oracle oinstall  15K Sep  9 09:58 OBIEE-SampleApp-v406.ovf

We’re ready now to get all cloudy. For this, you’ll need:

  1. An AWS account
    1. You’ll also need your AWS account’s Access Key and Secret Key
  2. AWS EC2 commandline tools installed, along with a Java Runtime Environment (JRE) 1.7 or greater:

    wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
    sudo mkdir /usr/local/ec2
    sudo unzip ec2-api-tools.zip -d /usr/local/ec2
    # You might need to fiddle with the following paths and version numbers: 
    sudo yum install -y java-1.7.0-openjdk.x86_64
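    # Quote the EOF marker so that $PATH and $EC2_HOME are written literally
    # and only expanded when the profile is read at login
    cat >> ~/.bash_profile <<'EOF'
    export JAVA_HOME="/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre"
    export EC2_HOME=/usr/local/ec2/ec2-api-tools-1.7.1.1/
    export PATH=$PATH:$EC2_HOME/bin
    EOF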

  3. Set your credentials as environment variables:
    export AWS_ACCESS_KEY=xxxxxxxxxxxxxx
    export AWS_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxx
  4. Ideally a nice fat pipe to upload the VM file over, because at 30GB it is not trivial (not in 2014, anyway)
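
Before kicking off a 30GB upload it’s worth a quick check that everything in the list above is in place. Here’s a small sketch of one way to do it. The function names are made up for this example, and the tools it looks for are simply the ones used in this article.

```shell
# missing_vars NAME...
# Print the name of every listed variable that is unset or empty.
missing_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# missing_tools CMD...
# Print every listed command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# preflight
# Succeed only if the variables and tools used in this article are present.
preflight() {
    problems=$(
        missing_vars JAVA_HOME EC2_HOME AWS_ACCESS_KEY AWS_SECRET_KEY
        missing_tools java ec2-import-instance ec2-describe-conversion-tasks
    )
    if [ -n "$problems" ]; then
        printf 'Not ready, missing:\n%s\n' "$problems" >&2
        return 1
    fi
    echo "Environment looks ready"
}

# Example:
#   preflight && echo "safe to start the import"
```

This only tells you that the variables are set and the commands are on the PATH, not that the keys themselves are valid.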

What’s going to happen now is we use an EC2 command line tool to upload our VMDK (virtual disk) file to Amazon S3 (a storage platform), from where it gets converted into an EBS volume (Elastic Block Store, i.e. an EC2 virtual disk), and from there attached to a new EC2 instance (a “server”/“VM”).

Before we can do the upload we need an S3 “bucket” to put the disk image in that we’re uploading. You can create one from https://console.aws.amazon.com/s3/. In this example, I’ve got one called rmc-vms – but you’ll need your own.

Once the bucket has been created, we build the command line upload statement using ec2-import-instance:

time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY

Points to note:

  • m3.large is the spec for the VM. You can see the available list here. In the AWS blog post it suggests only a subset will work with the import method, but I’ve not hit this limitation yet.
  • region is the AWS Region in which the EBS volume and EC2 instance will be built. I’m using eu-west-1 (Ireland), and it makes sense to use the one geographically closest to where you or your users are located. Still waiting for uk-yorks-1.
  • architecture and platform relate to the type of VM you’re importing.

The upload process took just over 45 minutes for me, and that’s from a data centre with a decent upload:

[oracle@asgard sampleapp406]$ time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY
Requesting volume size: 200 GB
TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       0       Status       active  StatusMessage   Pending : Downloaded 0
Creating new manifest at rmc-vms/d77672aa-0e0b-4555-b368-79d386842112/OBIEE-SampleApp-v406-disk1.vmdkmanifest.xml
Uploading the manifest file
Uploading 31191914496 bytes across 2975 parts
0% |--------------------------------------------------| 100%
   |==================================================|
Done
Average speed was 11.088 MBps
The disk image for import-i-fh08xcya has been uploaded to Amazon S3
where it is being converted into an EC2 instance.  You may monitor the
progress of this task by running ec2-describe-conversion-tasks.  When
the task is completed, you may use ec2-delete-disk-image to remove the
image from S3.

real    46m59.871s
user    10m31.996s
sys     3m2.560s
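
As a sanity check on those numbers, and a way to estimate how long your own upload might take, the arithmetic is just bytes divided by speed. The sketch below assumes the tool’s “MBps” means 1,000,000 bytes per second, which is my reading rather than anything documented here, although it does line up with the elapsed time in the log. The function name is my own.

```shell
# upload_minutes <bytes> <MBps>
# Minutes needed to move <bytes> at <MBps>, taking 1 MB as 1,000,000 bytes.
upload_minutes() {
    LC_ALL=C awk -v bytes="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", bytes / (rate * 1000000) / 60 }'
}

# Figures taken from the upload log above: 31191914496 bytes at 11.088 MBps
upload_minutes 31191914496 11.088   # prints 46.9
```

That is within a few seconds of the 46m59s wall-clock time reported by time. Swap in your own image size and a realistic upload speed to see whether you’re in for a coffee break or an overnight job.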

Once the upload has finished, Amazon automatically converts the VMDK (now residing on S3) into an EBS volume, and then attaches it to a new EC2 instance (i.e. a VM). You can monitor the status of this task using ec2-describe-conversion-tasks, optionally filtered on the TaskId returned by the import command above:

ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       3898992128
Status  active  StatusMessage   Pending : Downloaded 31149971456

This is now an ideal time to mention, as a side note, the Linux utility watch, which simply re-issues a command for you every x seconds (2 by default). This way you can leave a window open and keep an eye on the progress of what is going to be a long-running job.

watch ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

Every 2.0s: ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya                                                             Tue Sep  9 12:03:24 2014

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       5848511808
Status  active  StatusMessage   Pending : Downloaded 31149971456

And whilst we’re at it, if you’re using a remote server to do this (as I am, to take advantage of the large bandwidth), you will find screen invaluable for keeping tasks running and being able to reconnect at will. You can read more about screen and watch here.

So back to our EC2 import job. To start with, the task will be Pending: (NB unlike lots of CLI tools, you read the output of this one left-to-right, rather than as columns with headings)

$ ec2-describe-conversion-tasks --region eu-west-1
TaskType        IMPORTINSTANCE  TaskId  import-i-ffvx6z86       ExpirationTime  2014-09-12T15:32:01Z    Status  active  StatusMessage   Pending InstanceID      i-b2245ef2
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5021144064      VolumeSize      60      AvailabilityZone        eu-west-1a      ApproximateBytesConverted       4707330352      Status  active  StatusMessage   Pending : Downloaded 5010658304

After a few moments it gets underway, and you can see a Progress percentage indicator: (scroll right in the code snippet below to see)

TaskType        IMPORTINSTANCE  TaskId  import-i-fgr0djcc       ExpirationTime  2014-09-15T15:39:28Z    Status  active  StatusMessage   Progress: 53%   InstanceID      i-c7692e87
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5582545920      VolumeId        vol-f71368f0    VolumeSize      20      AvailabilityZone        eu-west-1a      ApproximateBytesConverted       5582536640      Status  completed

Note that at this point you’ll also see an Instance in the EC2 list, but it won’t launch (no attached disk – because it’s still being imported!)

If something goes wrong you’ll see the Status as cancelled, such as in this example here where the kernel in the VM was not a supported one (observe it is the UEK kernel, which isn’t supported by Amazon):

TaskType        IMPORTINSTANCE  TaskId  import-i-ffvx6z86       ExpirationTime  2014-09-12T15:32:01Z    Status  cancelled       StatusMessage   ClientError: Unsupported kernel version 2.6.32-300.32.1.el5uek       InstanceID      i-b2245ef2
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5021144064      VolumeId        vol-91b1c896    VolumeSize      60      AvailabilityZone        eu-west-1a      ApproximateBytesConverted    5021128688      Status  completed

After an hour or so, the task should complete:

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  completed       InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeId        vol-a383f8a4    VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted    31191855472     Status  completed

At this point you can remove the VMDK from S3 (and should do, else you’ll continue to be charged for it), following the instructions for ec2-delete-disk-image
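
If you’d rather be told when the conversion has finished than sit watching it, the status checks in this section can be wrapped in a small polling loop. This is a sketch under a few assumptions: the output is whitespace-separated key/value pairs read left to right, as in the samples above, and the only end states I handle are the two seen in this article, completed and cancelled. The function names are my own. Multi-word values such as StatusMessage will only return their first word.

```shell
# task_field <key>
# Read ec2-describe-conversion-tasks output on stdin and print the word that
# follows <key> on the first line that starts with TaskType.
task_field() {
    awk -v key="$1" '
        $1 == "TaskType" {
            for (i = 1; i < NF; i++)
                if ($i == key) { print $(i + 1); exit }
        }'
}

# wait_for_import <region> <task-id> [seconds-between-checks]
# Poll until the import task reports completed (returns 0) or cancelled (returns 1).
wait_for_import() {
    region="$1"; task="$2"; pause="${3:-60}"
    while :; do
        status=$(ec2-describe-conversion-tasks --region "$region" "$task" | task_field Status)
        echo "$(date '+%H:%M:%S') $task: ${status:-unknown}"
        case "$status" in
            completed) return 0 ;;
            cancelled) return 1 ;;
        esac
        sleep "$pause"
    done
}

# Example, using the task from this article:
#   wait_for_import eu-west-1 import-i-fh08xcya 120 && echo "ready to boot"
```

Run it inside screen and you can disconnect and come back to it later.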

Booting the new server on EC2

Go to your EC2 control panel, where you should see an instance (EC2 term for “server”) in Stopped state and with no name.

Select the instance, and click Start on the Actions menu. After a few moments a Public IP will be shown in the details pane. But, we’re not home free quite yet…read on.

Firewalls

So this is where it gets a bit tricky. By default, the instance will have launched with Amazon’s Firewall (known as a Security Group) in place which – unless you have an existing AWS account and have modified the default security group’s configuration – is only open on port 22, which is for ssh traffic.

You need to head over to the Security Group configuration page. There are several ways to get there, but the easiest is to click on the security group name in the instance details pane.

Click on the Inbound tab and then Edit, and add “Custom TCP Rule” for the following ports:

  • 7780 (OBIEE front end)
  • 7001 (WLS Console / EM)
  • 5902 (oracle VNC)

You can make things more secure by allowing access to the WLS admin (7001) and VNC port (5902) to a specific IP address or range only.
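
The same rules can be added from the command line with ec2-authorize, which comes in the same EC2 API tools bundle installed earlier. The sketch below only prints the commands so you can review them before running anything. The options (-P, -p, -s) are written as I understand them, so check ec2-authorize --help first. The group name and the admin address in the example are placeholders, and the function name is my own. If your instance is in a VPC you may need to pass the security group ID in place of its name.

```shell
# authorize_commands <region> <security-group> <admin-cidr>
# Print (not run) one ec2-authorize call per SampleApp port:
# 7780 is opened to everyone, 7001 and 5902 only to <admin-cidr>.
authorize_commands() {
    region="$1"; group="$2"; admin_cidr="$3"
    for rule in 7780:0.0.0.0/0 "7001:$admin_cidr" "5902:$admin_cidr"; do
        port=${rule%%:*}      # text before the first colon
        cidr=${rule#*:}       # text after the first colon
        printf 'ec2-authorize --region %s %s -P tcp -p %s -s %s\n' \
            "$region" "$group" "$port" "$cidr"
    done
}

# Example (placeholder group name and documentation-range IP address):
#   authorize_commands eu-west-1 my-sampleapp-group 203.0.113.10/32
```

Printing first and running second is deliberate; a typo in a firewall rule is not something you want to find out about from someone else.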

Whilst we’re talking about security, your server is now open to the internet and all the nefarious persons out there, so you’ll be wanting to harden your server not least by resetting all the passwords to ones which aren’t publicly documented in the SampleApp user documentation!

Once you’ve updated your Security Group, you can connect to your server! If you installed the OBIEE and database auto start scripts (and if not, why not??) you should find OBIEE running just nicely on http://[your ip]:7780/analytics – note that the port is 7780, not 9704.

If you didn’t install the script, you will need to start the services manually per the SampleApp documentation. To connect to the server you can ssh (using Terminal, PuTTY, etc) to the server or connect on VNC (Admin123 is the password). For VNC clients try Screen Share on Macs (installed by default), or RealVNC on Windows.

Caveats & Disclaimers

  • Running a server on AWS EC2 costs real money, so watch out. Once you’ve put your credit card details in, Amazon will continue to charge your card whilst there are chargeable items on your account (EBS volumes, instances whether running or not, and so on). You can get an idea of the scale of charges here.
  • As mentioned above, a server on the open internet is a lot more vulnerable than one virtualised on your local machine. You will get poked and probed, usually by automated scripts looking for open ports, weak passwords, and so on. SampleApp is designed to open the toybox of a pimped-out OBIEE deployment to you; it is not “hardened”, and you risk learning the tough way about the need for it if you’re not careful.

Cloning

Amazon EC2 supports taking a snapshot of a server, either for backup/rollback purposes or spinning up as a clone, using an Amazon Machine Image (AMI). From the Instances page, simply select “Create an Image” to build your AMI. You can then build another instance (or ten) from this AMI as needed, exact replicas of the server as it was at the point that you created the image.

Lather, Rinse, and Repeat

There’s a whole host of VirtualBox “appliances” out there, and some of them such as the developer-tools-focused ones only really make sense as local VMs. But there are plenty that would benefit from a bit of “Cloud-isation”, where they’re too big or heavy to keep on your laptop all the time, but are handy to be able to spin up at will. A prime example of this for me is the EBS Vision demo database that we use for our BI Apps training. Oracle used to provide a pre-built Amazon image (known as an AMI) of this, but has since withdrawn it. However, Oracle do publish Oracle VM VirtualBox templates for EBS 12.1.3 and 12.2.3 (related blog), so from this with a bit of leg-work and a big upload pipe, it’s a simple matter to brew your own AWS version of it, ready to run whenever you need it.

Categories: BI & Warehousing

Sunday Times Tech Track 100

Tue, 2014-09-09 14:35

Over the weekend, Rittman Mead was listed in the Sunday Times Tech Track 100. We are extremely proud to get recognition for the business as well as our technical capability and expertise.

A lot of the public face of Rittman Mead focuses on the tools and technologies we work with. Since day one we have had a core policy to share as much information as possible. Even before the advent of social media, we shared pretty much everything we knew through either our blog or by speaking at conferences, but we very rarely talk about the business itself. However, a lot of the journey we have gone through over the last 7 years has been about the growth and maintenance of a successful, sustainable, multi-national business. We have been able to talk about, educate and evangelise about the tools and technologies as a result of having the successful business to support this.

I remember during one interview we did several years ago the candidate asked (and I’m paraphrasing): “How do you guys make any money, all I see/read is people sitting in airports writing blog posts about leading edge technologies?”.

One massive benefit from this is we often face the same problems (albeit on a different scale) as those that we talk about with customers, so we have been able to better understand the underlying drivers and proposed solutions for our clients.

From a personal point of view, this has meant spending a lot more time looking at contracts as opposed to code and reading business books/blogs as opposed to technical ones. However, it has been well worth it and I would like to say thanks to all of those both inside and outside of the company who have helped contribute to this success.

Categories: BI & Warehousing

Analyzing Twitter Data using Datasift, MongoDB, Hive and ODI12c

Mon, 2014-09-08 14:39

Last week I posted an article on the blog around analysing Twitter data using Datasift, MongoDB and Pig, where I used the Datasift service to stream tweets about Rittman Mead into a MongoDB NoSQL database, and then queried the dataset using Pig. The context for this is the idea of a “data reservoir”, where we supplement the more traditional file and relational datasets we find in data warehouses with other data, typically machine generated, unstructured or very low-level, to add context to the numbers in our reporting system. In the example I quoted in the article, it’d be very interesting to take the activity we record against our blog and website and correlate that with the “conversation” that happens about it in the social media world; for example, were the hits for a particular article due to it being mentioned in a tweet, and did a spike in activity correspond to a particularly influential Twitter user retweeting something we’d tweeted?

In that previous article I’d used Pig to access and analyse the data, in part because I saw a match between the nested datasets in a typical DataSift Twitter message and the relations, tuples and bags you get in a Pig schema. For example, if you look at the Tweet from Borkur in the screenshot below from RoboMongo, a Mac OS X client for MongoDB that I’ve found useful, you can see the author details nested inside the interaction details, and the Type attribute having many values under the Trends parent attribute – these map well onto Pig tuples and bags respectively.

What I’d really like to do with this dataset, though, is to take certain elements of it and use that to supplement the data I’m loading using ODI12c. Whilst ODI can run arbitrary R, Pig and shell scripts using the ODI Procedure feature (as I did here to make use of Sqoop, before Oracle added Sqoop KMs to ODI12.1.3), it gets the best out of Hadoop when it can access data using Hive, the SQL layer over Hadoop that represents HDFS data as rows and columns, and allows us to SELECT and INSERT data using SQL commands – or to be precise, a dialect of SQL called HiveQL. But how will Hive cope with the nested and repeating data structures in a DataSift Twitter message, and allow us to get just the data out that we’re interested in?

In fact, the MongoDB connector for Hadoop that I used for Pig the other day also comes with Hive connectivity, in the form of a SerDe that lets Hive report against data in a MongoDB database (David Allen blogged about another MongoDB Hive storage handler a while ago, in an article about MongoDB and ODI). What’s more, this Hive connector for MongoDB is actually easier to work with than the Pig connector, as instead of worrying about Tuples and Bags you can just pick out the nested attributes that you’re interested in using a dot notation. For example, if I’m only interested in the InteractionID, username, tweet content and number of followers within a particular Twitter dataset, I can create a table that looks like this in Hive:

CREATE TABLE tweet_data(
  interactionId string,
  username string,
  content string,
  author_followers int)
ROW FORMAT SERDE 
  'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
  'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
  'mongo.columns.mapping'='{"interactionId":"interactionId",
  "username":"interaction.interaction.author.username",
  "content":"interaction.interaction.content",
  "author_followers":"interaction.twitter.user.followers_count"}'
  )
TBLPROPERTIES (
  'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
  )

And at that point, it’s pretty easy to bring the dataset into ODI12c, through the IKM Hive to Hive Control Append knowledge module, and join up the Twitter dataset with the website log data that’s coming in via Flume. ODI can connect to Hive via JDBC drivers supplied with CDH4/5, and once you register the Hive connection and reverse-engineer the Hive metastore metadata into ODI’s repository, the complexity of the underlying Hive storage is hidden and you’re just presented with tables and columns, just like any other datastore type.

Starting with the Twitter data first, I create a Hive table outside of ODI that returns the precise set of tweet attributes that I’m interested in, and then filter that dataset down to just the tweets that link to content on our website, by filtering on the tweet link’s URL matching the start of our website address.

Then I load-up the hits from the Rittman Mead website, previously landed into Hadoop using Flume and exposed to ODI as another Hive table, filter out all the non-blog page accesses and keep just the URL part of the Apache Weblog request field, removing the transport mechanism and other bits around it.

Then, I use a final ODI mapping to join the two datasets together, using ODI’s ability to apply HiveQL expressions to the incoming datasets so that they’ve got the same format – trailing ‘/’ at the end of the URL, no ampersand and query text at the end of the URL, and so on. Both this and the previous transformation are great examples of where ODI can help with this sort of work, making it pretty easy to munge and correct your data so that you’re then able to match-up the two different sources.
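
To make the clean-up rules concrete, here’s the same idea sketched as a couple of shell functions. To be clear, this is not the HiveQL that the ODI mappings generate, just an illustration of the two transformations described: pulling the URL out of an Apache request field, and normalising a URL so that both sources end up in the same shape. The function names and sample URLs are my own, and the exact rules (cut at the first ? or &, force a single trailing /) are my reading of the description.

```shell
# request_path <request-field>
# Reduce an Apache request field such as "GET /a/b/ HTTP/1.1" to its middle
# part, dropping the method in front and the protocol at the end.
request_path() {
    printf '%s\n' "$1" | awk '{ print $2 }'
}

# normalize_url <url>
# Remove any query text (everything from the first ? or &) and make sure
# the result ends in exactly one "/".
normalize_url() {
    printf '%s\n' "$1" | sed -e 's/[?&].*$//' -e 's:/*$:/:'
}

# Example (hypothetical post URL):
#   normalize_url 'http://www.rittmanmead.com/2014/09/some-post?utm_source=twitter'
```

Once both the tweet links and the weblog URLs have been through the same normalisation, a plain equality join between the two is enough.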

Then it’s just a case of creating a package or load plan to sequence the mappings, and then run them using the local or standalone agent. You can see the individual KM steps running on the left-hand side, with ODI generating HiveQL queries which in turn are translated into MapReduce and run in parallel across the Hadoop cluster.

And then, at the end of the process, I’ve got a Hive table of all of our blog articles that have been mentioned on Twitter (since we started consuming the tweet feed, a day or so ago), with the number of page requests and the number of times that page got mentioned in tweets.

Obviously there’s a lot more we can do with this; we can access the number of followers each twitter user has, along with their location, gender and the sentiment (positive, negative, neutral) of the tweet. From that we can work out some impact from the twitter activity, and we can also add to it data from other sources such as Facebook, LinkedIn and so on to get a fuller picture of the activity around our site. Then, the data we’re gathering in can either be left in MongoDB, or I can use these ODI mappings to either archive it in Hive tables, or export the highlights out to Oracle Database using Sqoop or Oracle Loader for Hadoop.

Categories: BI & Warehousing

Analyzing Twitter Data using Datasift, MongoDB and Pig

Thu, 2014-09-04 17:08

If you followed our recent postings on the updated Oracle Information Management Reference Architecture, one of the key concepts we talk about is the “data reservoir”. This is a pool of additional data that you can add to your data warehouse, typically stored on Hadoop or NoSQL databases, where you store unstructured, semi-structured or unprocessed structured data in large volume and at low cost. Adding a data reservoir gives you the ability to leverage the types of data sources that previously were thought of as too “messy”, too high-volume to store for any serious amount of time, or that require processing or storing by tools that aren’t in the usual relational data warehouse toolset.

By formally including them in your overall information management architecture though, with common tools, security and data governance over the entire dataset, you give your users the ability to consider the whole “360-degree view” of their customers and their interactions with the market.

To take an example, a few weeks ago I posted a series of articles on the blog where I captured user activity on our website, http://www.rittmanmead.com, transported it to one of our Hadoop clusters using Apache Flume, and then analysed it using Hive, Pig and finally Spark. In one of the articles I used Pig and a geocoding API to determine the country that each website visitor came from, and then in a final five-part series I automated the whole process using ODI12c and then copied the final output tables to Oracle using Oracle Loader for Hadoop. This is quite a nice example of ETL-offloading into Hadoop, with an element of Hadoop-native event capture using Flume, but once the processing has finished the data moves out of Hadoop and into the Oracle database.

What would be interesting though would be to start adding data into Hadoop that’s permanent, not transitory as part of an ETL process, to start building out this concept of the “data reservoir”. Taking our website activity dataset, something that would really add context to the visits to our site would be corresponding activity on social networks, to see who’s linking to our posts, who’s discussing them, whether those discussions are positive or negative, and which wider networks those people belong to. Twitter is a good place to start with this as it’s the place we see our articles and activities most discussed, but it’d be good to build out this picture over time to add in activity on social networks such as Facebook, YouTube, LinkedIn and Google+; if we did this, we’d be able to consider a much broader and richer picture when looking at activity around Rittman Mead, potentially correlating activity and visits to our website with mentions of us in the press, comments made by our team and the wider picture of what’s going on in our world.

There are a number of ways you can bring Twitter data into your Hadoop cluster or data warehouse, but the most convenient way we’ve found is to use DataSift, a social media aggregation site and service that licenses raw feeds from the likes of Twitter, Facebook, WordPress and other social media platforms, enhances the data feeds with sentiment scores and other attributes, and then sells access to the feeds via a number of formats and APIs. Accessing Twitter data through DataSift costs money, particularly if you want to go back and look at historical activity vs. just filtering on a few keywords in new Twitter activity, but they’re very developer-friendly and able to provide greater volumes of firehose activity than the standard Twitter developer API allows.

So assuming you can get access to a stream of Twitter data on a particular topic – in our case, all mentions of our website, our team’s Twitter handles, retweets of our content etc – the question then becomes one of how to store the data. Looking at the DataSift Sample Output page, each of these streams delivers its payload as JSON documents, XML-like structures that nest categories of tweet metadata within parent structures that make up the total tweet data and metadata dataset.

And there’s a good reason for this; individual tweets might not use every bit of possible tweet metadata, for example not including entries under “mentions” or “retweets” if those aren’t used in a particular message. Certain bits of metadata might be repeated any number of times – @ mentions, for example – and the JSON document might have a different structure altogether if a different JSON schema version is used for a particular tweet. Altogether this is not an easy type of data structure for a relational database to hold – though Oracle 12.1.0.2.0 has just introduced native JSON support to the core Oracle database – but NoSQL databases in contrast find these sorts of data structures easy, and one of the most popular for this type of work is MongoDB.

MongoDB is an open-source “document” database that’s probably best known to the Oracle world through this internet cartoon; what the video is getting at is NoSQL advocates recommending databases such as MongoDB for large-scale web work when something much more mainstream like MySQL would do the job better, but where NoSQL and document-style databases come into their own is in storing just this type of semi-structured, schema-on-read dataset. In fact, DataSift supports MongoDB as an API end-point for their Twitter feed, so let’s go ahead and set up a MongoDB database, prepare it for the Twitter data, and then set up a DataSift feed into it.

MongoDB installation on Linux, for example to run alongside a Hadoop installation, is pretty straightforward and involves adding a YUM repository and then running “sudo yum install mongodb-org” (there’s an OS X installation too, but I wanted to run this server-side on my Hadoop cluster). Once you’ve installed the MongoDB software, you start the mongod service to enable the server element, and then log into the mongo command-shell to create a new database.

MongoDB, being a schema-on-read database, doesn’t require you to set up a database schema up-front; instead, the schema comes from the data you load into it, with MongoDB’s equivalent of tables called “collections”, and with those collections made up of documents, analogous to rows in Oracle. Where it gets interesting though is that collections and databases only get created when you first start using them, and individual documents can have slightly, or even completely, different schema structures to each other – which makes them ideal for holding the sorts of datasets generated by Twitter, Facebook and DataSift.

[root@cdh51-node1 ~]# mongo
MongoDB shell version: 2.6.4
connecting to: test
> use datasift2
switched to db datasift2

Let’s create a couple of simple documents, and then add those to a collection. Note that the document becomes available just by declaring it, as does the collection when I add documents to it. Note also that the query language we’re using to work with MongoDB is JavaScript, again making it particularly suited to JSON documents and web-type environments.

> a = { name : "mark" }
{ "name" : "mark" }
> b = { product : "chair", size : "L" }
{ "product" : "chair", "size" : "L" }
> db
datasift2
> db.testData.insert(a)
WriteResult({ "nInserted" : 1 })
> db.testData.insert(b)
WriteResult({ "nInserted" : 1 })
> db.testData.find()
{ "_id" : ObjectId("54094081b5b6021fe9bc8b10"), "name" : "mark" }
{ "_id" : ObjectId("54094088b5b6021fe9bc8b11"), "product" : "chair", "size" : "L" }

And note also how the second entry (document) in the collection has a different schema to the entry above it – perfect for our semi-structured Twitter data, and something we could store as-is in MongoDB in this loose data format and then apply more formal structures and schemas to when we come to access the data – as we’ll do in a moment using Pig, and more formally using ODI and Hive in the next article in this series.
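To make the schema-on-read idea concrete outside the mongo shell, here is a minimal Python sketch that treats a collection as a plain list of dictionaries. The names collection and read_with_schema are hypothetical and this is not MongoDB's API; it only illustrates how structure can be imposed when the data is read rather than when it is written, using the same two documents as the testData example.

```python
# Illustrative only: a "collection" modelled as a plain list of dicts,
# mirroring the two documents inserted into testData above.
collection = [
    {"name": "mark"},
    {"product": "chair", "size": "L"},
]


def read_with_schema(docs, fields):
    """Project each document onto the requested fields at read time.

    Fields a document does not carry come back as None, so documents
    with different shapes can share one collection.
    """
    return [tuple(doc.get(field) for field in fields) for doc in docs]


rows = read_with_schema(collection, ["name", "product", "size"])
for row in rows:
    print(row)
```

Nothing about the stored documents had to change to read them this way; a different list of fields would give a different view over the same data.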

Setting up the Twitter feed from DataSift is a two-stage process, once you’ve got an account with them and an API key; first you define your search terms against a nested document model for the data source, then you activate the feed, in this case into my MongoDB database, and wait for the tweets to roll in. For my feed I selected tweets written by myself and some of the Rittman Mead team, tweets mentioning us, and tweets that included links to our blog in the main tweet contents (there’s also a graphical query designer, but I prefer to write them by hand using what DataSift call their “curated stream definition language” (CSDL)).

You can then preview the feed, live, or go back and sample historic data if you’re interested in loading old tweets, rather than incoming new ones. Once you’re ready you then need to activate the feed, in my case by calling a URL using curl with a bunch of parameters (our API key and other sensitive data have been masked):

curl -X POST 'https://api.datasift.com/v1/push/create' \
-d 'name=connectormongodb' \
-d 'hash=65bd9dc4943ec426b04819exxxxxxxxx' \
-d 'output_type=mongodb' \
-d 'output_params.host=rittmandev.com' \
-d 'output_params.port=27017' \
-d 'output_params.use_ssl=no' \
-d 'output_params.verify_ssl=no' \
-d 'output_params.db_name=datasiftmongodb' \
-d 'output_params.collection_name=rm_tweets' \
-H 'Auth: rittmanmead:xxxxxxxxxxxxxxxxxxxxxxxxxxx'

The “hash” in the parameter list is the specific feed to activate, and the output type is MongoDB. The collection name is new, and will be created by MongoDB when the first tweet comes in; let’s run the curl command now and sit back for a while, and wait for some Twitter activity to arrive in MongoDB …

… and a couple of hours later, eight tweets have been captured by the DataSift filter, with the last of them being one from Michael Rainey about his trip tonight to the Seahawks game:

> db.rm_tweets.count()
8
> db.rm_tweets.findOne()
{
    "_id" : ObjectId("54089a879ad4ec99158b4d78"),
    "interactionId" : "1e43454b1a16a880e074e49c51369eac",
    "subscriptionId" : "f6cf211e03dca5da384786676c31fd3e",
    "hash" : "65bd9dc4943ec426b04819e6291ef1ce",
    "hashType" : "stream",
    "interaction" : {
        "demographic" : {
            "gender" : "male"
        },
        "interaction" : {
            "author" : {
                "avatar" : "http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "id" : 14551637,
                "language" : "en",
                "link" : "https://twitter.com/mRainey",
                "name" : "Michael Rainey",
                "username" : "mRainey"
            },
            "content" : "Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD",
            "created_at" : "Thu, 04 Sep 2014 16:58:29 +0000",
            "hashtags" : [
                "GoHawks",
                "kickoff2014",
                "GBvsSEA"
            ],
            "id" : "1e43454b1a16a880e074e49c51369eac",
            "link" : "https://twitter.com/mRainey/status/507573423334100992",
            "mention_ids" : [
                23642374
            ],
            "mentions" : [
                "Seahawks"
            ],
            "received_at" : 1409849909.2967,
            "schema" : {
                "version" : 3
            },
            "source" : "Instagram",
            "type" : "twitter"
        },
        "language" : {
            "tag" : "en",
            "tag_extended" : "en",
            "confidence" : 98
        },
        "links" : {
            "code" : [
                200
            ],
            "created_at" : [
                "Thu, 04 Sep 2014 16:58:29 +0000"
            ],
            "meta" : {
                "charset" : [
                    "CP1252"
                ],
                "lang" : [
                    "en"
                ],
                "opengraph" : [
                    {
                        "description" : "mrainey's photo on Instagram",
                        "image" : "http://photos-d.ak.instagram.com/hphotos-ak-xfa1/10655141_1470641446544147_1761180844_n.jpg",
                        "site_name" : "Instagram",
                        "type" : "instapp:photo",
                        "url" : "http://instagram.com/p/sh_h6sQBYT/"
                    }
                ]
            },
            "normalized_url" : [
                "http://instagram.com/p/sh_h6sQBYT"
            ],
            "title" : [
                "Instagram"
            ],
            "url" : [
                "http://instagram.com/p/sh_h6sQBYT/"
            ]
        },
        "salience" : {
            "content" : {
                "sentiment" : 0,
                "topics" : [
                    {
                        "name" : "Video Games",
                        "hits" : 0,
                        "score" : 0.5354745388031,
                        "additional" : "Greyson and I will be ready for the @Seahawks game tonight!"
                    }
                ]
            }
        },
        "trends" : {
            "type" : [
                "San Jose",
                "United States"
            ],
            "content" : [
                "seahawks"
            ],
            "source" : [
                "twitter"
            ]
        },
        "twitter" : {
            "created_at" : "Thu, 04 Sep 2014 16:58:29 +0000",
            "display_urls" : [
                "instagram.com/p/sh_h6sQBYT/"
            ],
            "domains" : [
                "instagram.com"
            ],
            "filter_level" : "medium",
            "hashtags" : [
                "GoHawks",
                "kickoff2014",
                "GBvsSEA"
            ],
            "id" : "507573423334100992",
            "lang" : "en",
            "links" : [
                "http://instagram.com/p/sh_h6sQBYT/"
            ],
            "mention_ids" : [
                23642374
            ],
            "mentions" : [
                "Seahawks"
            ],
            "source" : "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>",
            "text" : "Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD",
            "user" : {
                "created_at" : "Sat, 26 Apr 2008 21:18:01 +0000",
                "description" : "Data Integration (#ODI #GoldenGate #OBIA) consultant / blogger / speaker @RittmanMead.\nOracle ACE.\n#cycling #Seahawks #travel w/ @XiomaraRainey\n#GoCougs!",
                "favourites_count" : 746,
                "followers_count" : 486,
                "friends_count" : 349,
                "geo_enabled" : true,
                "id" : 14551637,
                "id_str" : "14551637",
                "lang" : "en",
                "listed_count" : 28,
                "location" : "Pasco, WA",
                "name" : "Michael Rainey",
                "profile_image_url" : "http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "profile_image_url_https" : "https://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "screen_name" : "mRainey",
                "statuses_count" : 8549,
                "time_zone" : "Pacific Time (US & Canada)",
                "url" : "http://www.linkedin.com/in/rainey",
                "utc_offset" : -25200,
                "verified" : false
            }
        }
    }
}

If you’ve not looked at Twitter metadata before, it’s surprising how much metadata accompanies what’s ostensibly a 140-character tweet. As well as details on the author, where the tweet was sent from, what Twitter client sent the tweet and details of the tweet itself, there are details and statistics on the sender, the number of followers they’ve got and where they’re located, a list of all other Twitter users mentioned in the tweet and any URLs and images referenced.

Not every tweet will use every element of metadata, and some tweets will repeat certain attributes – other Twitter users you’ve mentioned in the tweet, for example – as many times as there are mentions. This makes Twitter data a prime candidate for analysis using Pig and Spark, which easily handle nested data structures, tuples (ordered lists of data, such as attribute sets for an entity such as “author”), and bags (unordered collections of values, such as the list of @ mentions in a tweet).
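As a rough illustration of the tuple and bag ideas, the Python sketch below pulls an author "tuple" and a mentions "bag" out of two trimmed-down tweet documents. The field names come from the DataSift sample shown earlier, but the second tweet, the helper names author_tuple and mentions_bag, and the user example_user are made up for the example; this is not how Pig represents these types internally.

```python
# Hypothetical, trimmed-down tweet documents using field names from the
# DataSift sample shown earlier; the second tweet omits optional fields.
tweets = [
    {"interaction": {"interaction": {
        "author": {"id": 14551637, "username": "mRainey"},
        "mentions": ["Seahawks"],
    }}},
    {"interaction": {"interaction": {
        "author": {"id": 1, "username": "example_user"},
        # no "mentions" key at all: not every tweet carries every field
    }}},
]


def author_tuple(tweet):
    """Return the author attributes as an ordered (id, username) tuple."""
    author = tweet["interaction"]["interaction"]["author"]
    return (author["id"], author["username"])


def mentions_bag(tweet):
    """Return the @ mentions as a list, empty when the field is absent."""
    return tweet["interaction"]["interaction"].get("mentions", [])


for tweet in tweets:
    print(author_tuple(tweet), mentions_bag(tweet))
```

The author attributes always appear once per tweet and in a fixed order, which is why they suit a tuple, whereas the mentions can occur zero or many times, which is why they suit a bag.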

There’s a MongoDB connector for Hadoop on GitHub which allows MapReduce to connect to MongoDB databases, running MapReduce jobs on MongoDB storage rather than HDFS (or S3, or whatever). This gives us the ability to use languages such as Pig and Hive to filter and aggregate our MongoDB data rather than MongoDB’s JavaScript API, which isn’t as fully-featured or scalable as MapReduce and has limitations in terms of the number of documents you can include in aggregations. Let’s start then by connecting Pig to our MongoDB database, and reading in the documents with no Pig schema applied:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' using com.mongodb.hadoop.pig.MongoLoader;
2014-09-05 06:40:51,773 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:40:51,838 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> tweets_count = FOREACH (GROUP tweets ALL) GENERATE COUNT (tweets);
2014-09-05 06:41:07,772 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:41:07,817 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> dump tweets_count
...
(9)
grunt>

So there are nine tweets in the MongoDB database now. Let’s take a look at one of the documents by creating a Pig alias containing just a single record.

grunt> tweets_limit_1 = LIMIT tweets 1;
2014-09-05 06:43:12,351 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:43:12,443 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> dump tweets_limit_1
...
([interaction#{trends={source=(twitter), content=(seahawks), type=(San Jose,United States)}, twitter={filter_level=medium, text=Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD, mention_ids=(23642374), domains=(instagram.com), links=(http://instagram.com/p/sh_h6sQBYT/), lang=en, id=507573423334100992, source=<a href="http://instagram.com" rel="nofollow">Instagram</a>, created_at=Thu, 04 Sep 2014 16:58:29 +0000, hashtags=(GoHawks,kickoff2014,GBvsSEA), mentions=(Seahawks), user={profile_image_url_https=https://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, location=Pasco, WA, geo_enabled=true, statuses_count=8549, lang=en, url=http://www.linkedin.com/in/rainey, utc_offset=-25200, id=14551637, time_zone=Pacific Time (US & Canada), favourites_count=746, verified=false, friends_count=349, description=Data Integration (#ODI #GoldenGate #OBIA) consultant / blogger / speaker @RittmanMead.
Oracle ACE.
#cycling #Seahawks #travel w/ @XiomaraRainey
#GoCougs!, name=Michael Rainey, created_at=Sat, 26 Apr 2008 21:18:01 +0000, screen_name=mRainey, id_str=14551637, profile_image_url=http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, followers_count=486, listed_count=28}, display_urls=(instagram.com/p/sh_h6sQBYT/)}, salience={content={topics=([score#0.5354745388031,additional#Greyson and I will be ready for the @Seahawks game tonight!,hits#0,name#Video Games]), sentiment=0}}, links={created_at=(Thu, 04 Sep 2014 16:58:29 +0000), title=(Instagram), code=(200), normalized_url=(http://instagram.com/p/sh_h6sQBYT), url=(http://instagram.com/p/sh_h6sQBYT/), meta={lang=(en), charset=(CP1252), opengraph=([image#http://photos-d.ak.instagram.com/hphotos-ak-xfa1/10655141_1470641446544147_1761180844_n.jpg,type#instapp:photo,site_name#Instagram,url#http://instagram.com/p/sh_h6sQBYT/,description#mrainey's photo on Instagram])}}, interaction={schema={version=3}, id=1e43454b1a16a880e074e49c51369eac, content=Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD, author={id=14551637, username=mRainey, language=en, avatar=http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, name=Michael Rainey, link=https://twitter.com/mRainey}, received_at=1.4098499092967E9, source=Instagram, mention_ids=(23642374), link=https://twitter.com/mRainey/status/507573423334100992, created_at=Thu, 04 Sep 2014 16:58:29 +0000, hashtags=(GoHawks,kickoff2014,GBvsSEA), type=twitter, mentions=(Seahawks)}, language={tag=en, confidence=98, tag_extended=en}, demographic={gender=male}},interactionId#1e43454b1a16a880e074e49c51369eac,_id#54089a879ad4ec99158b4d78,hash#65bd9dc4943ec426b04819e6291ef1ce,subscriptionId#f6cf211e03dca5da384786676c31fd3e,hashType#stream])

And there’s Michael’s tweet again, with all the attributes from the MongoDB JSON document appended together into a single record. But in this format the data isn’t all that useful as we can’t easily access individual elements in the Twitter record; what would be better would be to apply a Pig schema definition to the LOAD statement, using the MongoDB document field listing that we saw when we displayed a single record from the MongoDB collection earlier.

I can start by referencing the document fields that become simple Pig datatypes; id and interactionId, for example:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.my_first_test' using com.mongodb.hadoop.pig.MongoLoader('id:chararray,interactionId:chararray','id');
2014-09-05 06:57:57,985 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:57:58,022 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> describe tweets
2014-09-05 06:58:11,611 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
tweets: {id: chararray,interactionId: chararray}
grunt> tweets_limit_1 = LIMIT tweets 1;
...
(53fae22e9ad4ec93658b513e,1e42c2747542a100e074fff55100414a)
grunt>
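The effect of supplying a schema can be sketched in Python as a projection over each document. This is only a loose analogy under stated assumptions: parse_schema, project and the PIG_CASTS mapping are hypothetical names, the sketch handles flat "name:type" schemas only (no nested tuples), and it says nothing about how MongoLoader itself is implemented.

```python
# Hypothetical mapping from Pig type names to Python casts; only the
# types used in this article are covered.
PIG_CASTS = {"chararray": str, "int": int}


def parse_schema(schema):
    """Turn 'name:type,name:type' into a list of (name, cast) pairs."""
    pairs = []
    for column in schema.split(","):
        name, pig_type = column.split(":")
        pairs.append((name.strip(), PIG_CASTS[pig_type.strip()]))
    return pairs


def project(doc, schema):
    """Pick the schema's fields out of one document, casting as we go."""
    return tuple(
        cast(doc[name]) if name in doc else None
        for name, cast in parse_schema(schema)
    )


# Trimmed-down document using top-level fields from the sample tweet.
doc = {
    "interactionId": "1e43454b1a16a880e074e49c51369eac",
    "hashType": "stream",
    "interaction": {"demographic": {"gender": "male"}},
}
print(project(doc, "interactionId:chararray,hashType:chararray"))
```

Fields not named in the schema, such as the nested interaction structure here, are simply left out of the resulting row.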

Where the MongoDB document has fields nested within other fields, you can reference these as a tuple if they’re a set of attributes under a common header, or a bag if they’re just a list of values for a single attribute. For example, the “username” field is contained within the author tuple, which in turn is contained within the interaction tuple. So to count tweets by author I’d first need to flatten the author tuple to turn its fields into scalar fields, then project out the username and other details; then I can group the relation in the normal way on those author details, and generate a count of tweets, like this:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' using com.mongodb.hadoop.pig.MongoLoader('id:chararray,interactionId:chararray,interaction:tuple(interaction:tuple(author:tuple(id:int,language:chararray,link:chararray,name:chararray,username:chararray)))','id');
grunt> tweets_author_tuple_flattened = FOREACH tweets GENERATE id, FLATTEN(interaction.$0);                                            
grunt> tweets_with_authors = FOREACH tweets_author_tuple_flattened GENERATE id, interaction::author.username, interaction::author.name;
grunt> tweets_author_group = GROUP tweets_with_authors by username; 
grunt> tweets_author_count = FOREACH tweets_author_group GENERATE group, COUNT(tweets_with_authors); 
...
(rmoff,1)
(dw_pete,1)
(mRainey,3)
(P_J_FLYNN,3)
(davidhuey,7)
(EdelweissK,1)
(JamesOickle,3)
(markrittman,3)
(rittmanmead,2)
(RedgraveChris,1)
grunt>
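For readers less familiar with Pig, the same flatten, group and count sequence can be sketched in plain Python. The three documents below are a hypothetical sample shaped like the DataSift output rather than the real contents of rm_tweets, so the counts will not match the Pig results above.

```python
from collections import Counter

# Hypothetical sample: three trimmed-down tweet documents shaped like the
# DataSift output, not the real contents of rm_tweets.
tweets = [
    {"interaction": {"interaction": {"author": {"username": "mRainey"}}}},
    {"interaction": {"interaction": {"author": {"username": "markrittman"}}}},
    {"interaction": {"interaction": {"author": {"username": "mRainey"}}}},
]

# "Flatten" step: pull the nested username up to a scalar per tweet.
usernames = [
    tweet["interaction"]["interaction"]["author"]["username"]
    for tweet in tweets
]

# "Group" and "count" steps in one go.
tweets_by_author = Counter(usernames)

for username, count in sorted(tweets_by_author.items()):
    print(username, count)
```

Unlike the Pig version, this runs in a single process over data already in memory, so it is only useful for understanding the logic, not for working at Hadoop scale.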

So there’s obviously a lot more we can do with the Twitter dataset as it stands, but where it’ll get really interesting is combining this with other social media interaction data – for example from Facebook, LinkedIn and so on – and then correlating that with our main site activity data. Check back in a few days when we’ll be covering this second stage in a further blog article, using ODI12c to orchestrate the process.

Categories: BI & Warehousing