
Rittman Mead Consulting

Delivering Oracle Business Intelligence

Action Links in OBIEE 12c – Part 1

Mon, 2015-11-16 13:29


With the release of OBIEE 12c, let’s take a look at Action Links and how things may be different compared to the previous release, 11g. Over this three part blog series, we’re going to cover the more popular link types: navigating to BI content, and navigating to a web page. To sweeten the deal, I’ll also include some tricks for your tool belt which will enable you to do the following:


  • Navigate to a target report, while filtering it on parameters chosen in the source
  • Pass filter parameters via the GoURL syntax from a source report to another, target report
  • Become familiar with the GoURL structure and how to apply it to your business case


In the first installment of this three part series, we’re going to look at how to navigate to other reports and dashboards in your catalog through the ‘Navigate to BI Content’ action. This will set you up for parts 2 and 3, wherein we show you some tricks using Action Links.


1. The Action Link UI

By now, there are likely lots of blogs talking about the new look and features of OBIEE 12c, so we can keep this bit short. Suffice it to say, it got a much needed face lift, with changes both in overall skinning and in its portfolio of icons. While this change in graphics may induce a bit of frustration on the part of the developer, I believe this approach to design will end up being a good long term strategy for handling later releases of the product, as trends in UX seem to have their feet firmly planted in the stripped down, the clean, and the subdued. Even with this shift, however, the basic processes and series of steps to implement most any of the features in Answers remain the same, Action Links being no different. Just follow the simple steps below to set up your Action Link! After you’ve got a hold of the basics, look to future posts in this series for some tips and tricks using Action Links.


On your chosen column, go to the Column Properties menu:


Next, click on the Interaction tab:


Select ‘Action Links’ as the Primary Interaction value and click on the ‘+’ icon. This will display another dialogue box where we will set up the actual properties of the Action Link. Click on the running man icon (this little guy seems to be more intuitive than the green gear):



2. Navigate to BI Content

For the first example, we’re going to select the ‘Navigate to BI Content’ option. This simply allows us to go to another report or dashboard, as though you were clicking on a link in a web page. To implement this on your report, simply follow the steps above and then refer to the steps below.

After clicking on the running man icon, select the ‘Navigate to BI Content’ option. This will be followed by a dialogue box allowing you to select the object to which you want to navigate.


Confirm your selection and then click ‘OK’, not once, not twice, but thrice, at which point you’re taken back to the Criteria tab. From now on, this column will take you to the selected report.

And that’s it! Take a look back here for part 2 on Action Links in OBIEE 12c, which will outline a neat technique on how to implement what’s called a ‘driving document’ to filter values between disparate reports using the navigate action.

The post Action Links in OBIEE 12c – Part 1 appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

OBIEE 11g and Essbase – Faking Federation Using the GoURL

Thu, 2015-11-12 14:56

This blog addresses what happens when we can’t take advantage of the Admin tool’s powerful vertical federation capabilities when integrating relational stars and Essbase cubes. In the Admin tool, synonymously referred to as the RPD, vertical federation is the process of integrating an aggregate data source, in this case Essbase, with a detail level source from a data mart. This technique not only has the ability to increase query efficiency and decrease query time, it also has the added benefit of bringing together two powerful and dynamic reporting tools. But like most things, there is a pretty big caveat to this approach. Before I jump into what that is, some housework.

To start, let’s make sure things don’t get lost in translation when going back and forth between Essbase and OBIEE jargon. In Essbase speak, dimensions can be thought of as tables in a relational structure, Essbase generations can be thought of as columns in each table, and members are the values in each column.

Housework done, now the caveat. Often, dimensions in Essbase cubes are built in such a way as to not neatly support federation; that is, they are arranged so as to have an uneven number of generations relative to their corresponding relational dimension. It should be noted that while federation is possible with a ragged hierarchical structure, it can get kind of messy, essentially ending up in a final product that an Essbase-centric user community wouldn’t readily and eagerly adopt. So what can we do when federation is out of the question?

Let’s frame the solution in the form of a not-atypical client scenario. Say we’ve got a requirement from a large financial institution client to bring together the Essbase cubes they’ve used thus far for their standardized reporting, i.e. balance sheets, income statements and the like, with their relational source, in order to drill to the account detail behind the numbers they’re seeing on said reports. They’ve got a pretty large user base that’s fairly entrenched and happy with Smart View and Excel for getting what they want from their cubes. And why shouldn’t they be? OBIEE simply can’t support this level of functionality when reporting on an Essbase source, in most cases. And in addition to these pretty big user adoption barriers to an OBIEE solution, we’ve got technology limitations to contend with. So what are our options when faced with this dilemma? How can we wow these skeptical users with near seamless functionality between sources?

The secret lies with URL Action Links! And while this solution is great for going from summary level data in Essbase to its relational counterpart, it is also a great way to simply pass values from one subject area to another. There are definitely some tricks to set this up, but more on those later. Read on.

The Scenario

In order to best demonstrate this solution, let’s set up a dashboard with two pages, one for each report, and a corresponding dashboard prompt. The primary, source report, out of Essbase, will be something that could easily resemble a typical financial report, at least in structure. From this high-level chart, or similar summary level analysis, we’ll be able to drill to a detail report, out of a relational source, to identify the drivers behind any figures present on the analysis. In this example, we’re going to use Sample App’s Sample Essbase subject area and its relational equivalent, Sample Sales. Yes, you could federate these two, as they’ve done in Sample App; however, they serve well to demonstrate how the following concept could work for financial reporting against ragged or parent-child structures. Values for Product Type, in the following instance, could just as well be the descendants or children of a specific account, for example. (There is also no equivalent relational subject area to use for the SampleApp Essbase GL subject area.) In the example below, we have a summary, month level pivot table giving us a monthly sales trend. The user can prompt on Year and Customer Segment through a dashboard prompt, but as you’ll see, this could easily be any number of prompts for your given scenario.

Monthly Trend Summary:

Solution 1:

In the sales trend example above, we are going to enable our user to click on a value for a revenue figure and then navigate to a detail report that shows products sold for the month by date. Again, this all must be done while passing any chosen parameters from both the dashboard prompt and analysis along to the detail analysis.

Proof of Concept

First, let’s start with the guts of the report example above. As you can see, there is quite a bit more under the hood than meets the eye. Let’s go over the approach piece by piece to help build a more thorough understanding of the method.

Step 1: Include the Columns!

So the idea here is that we want to pass any and all dimensional information associated with the revenue figure that we pick to a detail level report, which will be filtered on the set of parameters at the chosen intersection. We can hide these columns later, so your report won’t be a mess. I’ll add here that you might want to set any prompted values to be equal to the presentation variable on the respective dashboard prompt, with a default value set, as seen below. This will help to make the report digestible on the compound layout. The following picture shows the prompted values that drive our summary report on Year and Customer Segment. You can do this in the filters pane on the Criteria tab with the following syntax:
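As a sketch of that syntax, a prompted filter referencing a presentation variable with a default value might look like the following (pv_year and pv_segment are hypothetical names – substitute whatever presentation variable names you assigned in your dashboard prompt):

```
"Time"."T05 Per Name Year" = '@{pv_year}{2012}'
AND "Customers"."Customer Segment" = '@{pv_segment}{Active Singles}'
```

The `@{variable}{default}` form means the report still renders sensibly on the compound layout when no prompt value has been chosen yet.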


All column values we want to pass need to be represented on the report:


Values that will be passed to the detail report (in this case, BizTech, Communication, Active Singles, 2012, and 2012 / 11):

Step 2: More Columns!

In addition to the columns that comprise the report, we need to add a second iteration of every column on the report. In the pictures above, you can see that these are the columns titled with the ‘URL’ prefix. In the column editor, wrap the column values in quotes by concatenating the following string (a single quote, followed by a double quote, followed by another single quote, with NO spaces between them):

'"' || "Table"."Column Name" || '"'

While this step may seem superfluous, you’ll see a bit later that it is essential to successfully passing our column values through our URL Action Links. After you’ve created the custom columns, just group them with their counterparts in the report, as in the pics above.

Step 3: An Approach to Handling Hierarchies

In the previous pictures, you can see the Products hierarchy that comprises the rows of the report. In order to pass any value from the hierarchy, as well as its members, we are going to have to include its respective generations in the rows as well. For our example, we’re going to use Brand, LOB, and Product Type. In this way, a user can select any sales value and have all three of these values passed as filter parameters to the detail analysis through a URL. You’ll notice that we haven’t given these columns a quoted counterpart, as we did previously. This is quite on purpose, as we’ll see later. These columns will provide another example of how to pass values without having to implement a second column just to wrap the value in quotes.


When first placing the hierarchy on your analysis and expanding it to where you’d like it for the sake of the report, you can simply select all the column values, right click, and then select ‘Keep Only’. This will establish a selection step under the Products Hierarchy to ensure that the report always opens to the specified structure from now on. So that’s good for now – let’s get to the magic of this approach.


Step 4: Set Up the Action Link

In this case, we’re going to ‘drill’ off of the Sales column in our table, but we could really ‘drill’ off of anything, as you’ll see. So, pop open the Interaction tab for the column and select Action Links as our primary interaction. Edit that guy as follows (see the URL procedure below). It used to be that we could do this via the ‘P’ parameters; however, that method seems to be mostly deprecated in favor of the col/val method we’ll utilize below.

URL Procedure – Server URL*
&PortalPath=@{1} – path to dashboard
&Page=@{2} – dashboard page
&Action=@{3} – action to perform, in this case Navigate (there are others)
&col1=@{4} – column from the target analysis we wish to manipulate (our sales detail analysis)
&val1=@{5} – column from the source analysis with which we are going to pass a filter parameter to the target
&val4="@{11}" – we will discuss these quoted parameters later on

*Note that this value can be made into a variable in order to be moved to different environments (DEV/TEST, etc…) while maintaining link integrity
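Assembled, the parameters above hang off the standard GoURL entry point. Here is a sketch of the full template as it might be entered in the URL field, assuming a typical Presentation Services URL (the host, port, and variable names are placeholders):

```
http://<host>:<port>/analytics/saw.dll?Dashboard
    &PortalPath=@{1}
    &Page=@{2}
    &Action=@{3}
    &col1=@{4}
    &val1=@{5}
    &val4="@{11}"
```

In practice this is entered as a single unbroken line; it is wrapped here for readability only.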

The picture above details how to set up the URL link as described. The col1 value is the column from the target analysis we want to filter using the value (val1) from our source. Be sure to fully qualify this column with the subject area from which it originates, in this case “A – Sample Sales”.

Ex: “A – Sample Sales”.”Time”.”T05 Per Name Year”

Val1 (these parameters exist in ‘sets’) is the column from our source analysis we want to use to filter the target analysis. This is where our custom, quoted columns come into play. Instead of using the original column from our analysis, we’re going to use its quoted counterpart. This will ensure that any values passed through the URL will be enclosed in quotes, as is required by the URL. Note that we’re not using a value parameter in this case, but a column instead (the dropdown to the left of the text box).

Ex: '"' || "Time"."T05 Per Name Year" || '"'

You can proceed this way to pass as many values as you’d like to your detail analysis with this coln, valn method. Again, just be sure that your columns are included in the source analysis or the values won’t get ported over. Once you’ve got all your columns and values set up, go ahead and enter them into the URL field in the Edit Action dialogue box, as above. Make sure you reference your variables using the proper syntax (similar to a presentation variable, with an @ sign):

Ex: col1=@{4} – ‘4’ being the variable name (note that these can be named almost anything)
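To make that concrete, here is a hypothetical example of how the link might resolve at runtime for the Sample Sales scenario, with the year value carried in quotes by our custom column (the catalogue path and page name are made up for illustration):

```
http://<host>:<port>/analytics/saw.dll?Dashboard
    &PortalPath=/shared/Sales/_portal/Sales Detail
    &Page=Detail
    &Action=Navigate
    &col1="A - Sample Sales"."Time"."T05 Per Name Year"
    &val1="2012"
```

The browser will URL-encode the spaces and quotes automatically when the action fires.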

Quoting Parameters

As an alternative to including an extra iteration of each column for the sake of passing quoted column values, we can instead put quotes around the parameter in our URL, as in the example above. The limitation of this method, however, is that you can only pass a single value, as with Year, for example. In later posts, we’ll address how to handle passing multiple values, as you might through a dashboard prompt.

Step 5: Set Up the Detail Analysis

For our detail analysis, we’re going to set things up in much the same way as our summary. That is, we need to include the columns we want to filter on in the target report as well. Unfortunately, our target report won’t simply pick up the passed values as filters, as it would with filters placed on the filters pane, without those columns being included on the actual analysis. Again, any columns we don’t want visible to a user can be hidden. Below, we simply want to see the Calendar Date, Product, and Revenue, but filtered by all of our source analysis columns.

In the criteria view for our target, detail analysis, we need to make sure that we’re also setting any filtered columns to ‘is prompted’. This will ensure that our target analysis listens to any filter parameters passed through the URL from our source, summary analysis. As a last step, we must again fully qualify our filters, as in the picture below.
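In logical SQL terms, the filters on the target analysis would read along these lines (column names taken from the Sample Sales example; your list will mirror whichever columns you pass):

```
"Time"."T05 Per Name Year" is prompted
AND "Customers"."Customer Segment" is prompted
AND "Products"."P4 Brand" is prompted
```

Each ‘is prompted’ filter pairs with a coln/valn set in the source’s URL, so every passed column needs one.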

This picture shows our Year ‘is prompted’ filter on our target, detail analysis. Note that this column is also a column, albeit hidden, on this report as well. This will act as a filter on the analysis. It is being ‘prompted’ not by a dashboard prompt, in this instance, but by our source, summary analysis.

Step 6: Testing It All Out

Now that we’ve got all the pieces of the puzzle together, let’s see if it works! To QA this thing, let’s put a filter object on the target, detail analysis to make sure that the report is picking up on any values passed. So if we click on a sales value, we should be taken to the target analysis and see that all the parameters we set up were passed. The picture below confirms this!


Hopefully this can be one more trick to keep in the tool belt when faced with a similar scenario. If you have any hiccups in your implementation of this solution or other questions, please feel free to respond to this post. Stay tuned for additional articles related to this topic that go much more in depth. How do you handle passing multiple column values? How do I keep my report query time low with all those extra columns? How do I pass values using the presentation variable syntax? Can I use the Evaluate function to extract the descendants of a filtered column?



The post OBIEE 11g and Essbase – Faking Federation Using the GoURL appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Rittman Mead and Oracle Big Data Webcast Series – November 2015

Mon, 2015-11-02 12:45

We’re running a set of three webcasts together with Oracle on three popular use-cases for big data within an Oracle context – with the first one running tomorrow, November 3rd 2015 15:00 – 16:00 GMT / 16:00 – 17:00 CET on extending the data warehouse using Hadoop and NoSQL technologies.

The sessions are running over three weeks this month and look at ways we’re seeing Rittman Mead use big data technologies to extend the capabilities of their data warehouse, create analysis sandpits for analysing customer behaviour, and take data discovery into the Hadoop era using Oracle Big Data Discovery. All events are free to attend; we’re timing them to suit the UK, Europe and the US, with details of each webcast as follows:


Extending and Enhancing Your Data Warehouse to Address Big Data

Organizations with data warehouses are increasingly looking at big data technologies to extend the capacity of their platform, offload simple ETL and data processing tasks and add new capabilities to store and process unstructured data along with their existing relational datasets. In this presentation we’ll look at what’s involved in adding Hadoop and other big data technologies to your data warehouse platform, see how tools such as Oracle Data Integrator and Oracle Business Intelligence can be used to process and analyze new “big data” data sources, and look at what’s involved in creating a single query and metadata layer over both sources of data.

Audience: DBAs, DW managers, architects Tuesday 3rd November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register


What is Big Data Discovery and how does it complement traditional Business Analytics?

Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization. At the same time Oracle Big Data Discovery reduces the dependency on expensive and often difficult to find Data Scientists, opening up many Big Data tasks to “Citizen” Data Scientists. In this session we’ll look at Oracle Big Data Discovery and how it provides a “visual face” to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.

Audience : Data analysts, market analysts, & Big Data project team members Tuesday 10th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

Adding Big Data to your Organization to create true 360-Degree Customer Insight

Organisations are increasingly looking to “big data” to create a true, 360-degree view of their customer and market activity. Big data technologies such as Hadoop, NoSQL databases and predictive modelling make it possible now to bring highly granular data from all customer touch-points into a single repository and use that information to make better offers, create more relevant products and predict customer behaviour more accurately. In this session we’ll look at what’s involved in creating a customer 360-degree view using big data technologies on the Oracle platform, see how unstructured and social media sources can be added to more traditional transactional and customer attribute data, and how machine learning and predictive modelling techniques can then be used to classify, cluster and predict customer behaviour.

Audience : MI Managers, CX Managers, CIOs, BI / Analytics Managers Tuesday 24th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

The post Rittman Mead and Oracle Big Data Webcast Series – November 2015 appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Oracle OpenWorld 2015 Roundup Part 2 : Data Integration, and Big Data (in the Cloud…)

Mon, 2015-11-02 01:27

In yesterday’s part one of our three-part Oracle Openworld 2015 round-up, we looked at the launch of OBIEE12c just before Openworld itself, and the new Data Visualisation Cloud Service that Thomas Kurian demo’d in his mid-week keynote. In part two we’ll look at what happened around data integration both on-premise and in the cloud, along with big data – and as you’ll see, they’re two topics that are very much linked this year.

First off, data integration – and like OBIEE12c, ODI 12.2.1 got released a day or so before Openworld as part of the wider Oracle Fusion Middleware 12c Release 2 platform rollout. Some of what was coming in ODI12.2.1 got back-ported to ODI 12.1 earlier in the year in the form of the ODI Enterprise Edition Big Data Options, and we covered the new capabilities it gave ODI in terms of generating Pig and Spark mappings in a series of posts earlier in the year – adding Pig as an execution language gives ODI an ability to create dataflow-style mappings to go with Hive’s set-based transformations, whilst also opening-up access to the wide range of Pig-specific UDF libraries such as DataFu for log analysis. Spark, in the meantime, can be useful for smaller in-memory data transformation jobs and as we’ll see in a moment, lays the foundation for streaming and real-time ingestion capabilities.


The other key feature that ODI12.2.1 provides though is better integration with external source control systems. ODI already has some element of version control built in, but as it’s based around ODI’s own repository database tables it’s hard to integrate with more commonly-used enterprise source control tools such as Subversion or Git, and there’s no standard way to handle development concepts like branching, merging and so on. ODI 12.2.1 adds these concepts into core ODI and initially focuses on SVN as the external source control tool, with Git support planned in the near future.


Updates to GoldenGate, Enterprise Data Quality and Enterprise Metadata Management were also announced, whilst Oracle Big Data Preparation Cloud Service got its first proper outing since release earlier in the year. Big Data Preparation Cloud Service (BDP for short) to my mind suffers a bit from confusion over what it does and what market it serves – at some point it’s been positioned as a tool for the “citizen data scientist” as it enables data domain experts to wrangle and prepare data for loading into Hadoop, whilst at other times it’s labelled a tool for production data transformation jobs under the control of IT. What is misleading is the “big data” label – it runs on Hadoop and Spark but it’s not limited to big data use-cases, and as the slides below show it’s a great option for loading data into BI Cloud Service as an alternative to more IT-centric tools such as ODI.


It was another announcement though at Openworld that made Big Data Prep Service suddenly make a lot more sense – the announcement of a new initiative called Dataflow ML, something Oracle describe as “ETL 2.0” with an entirely cloud-based architecture and heavy use of machine learning (the “ML” in “Dataflow ML”) to automate much of the profiling and discovery process – the key innovation on Big Data Prep Service.


It’s early days for Dataflow ML but clearly this is the direction Oracle will want to take as applications and platforms move to the cloud – I called out ODI’s unsuitability for running in the cloud a couple of years ago and contrasted its architecture with that of cloud-native tools such as Snaplogic, and Dataflow ML is obviously Oracle’s bid to move data integration into the cloud. Couple that with innovations around Spark as the data processing platform and machine learning to automate routine tasks, and it sounds like it could be a winner – watch this space, as they say.

So the other area I wanted to cover in this second of three update pieces is big data. All of the key big data announcements from Oracle came at last year’s Openworld – Big Data Discovery, Big Data SQL, Big Data Prep Service (or Oracle Data Enrichment Cloud Service as it was called back then) – and this year saw updates to Big Data SQL (Storage Indexes) and Big Data Discovery (general fit-and-finish enhancements) announced at the event. What is probably more significant, though, is the imminent availability of all this – plus Oracle Big Data Appliance – in Oracle’s Public Cloud.


Most big data PoCs I see outside of Oracle start on Amazon AWS and build-out from there – starting at very low-cost and moving from Amazon Elastic MapReduce to Cloudera CDH (via Cloudera Director), for example, or going from cloud to on-premise as the project moves into production. Oracle’s Big Data Cloud Service takes a different approach – instead of using a shared cloud infrastructure and potentially missing the point of Hadoop (single user access to lots of machines, vs. cloud’s timeshared access to slices of machines) Oracle instead effectively lease you a Big Data Appliance along with a bundle of software; the benefits being around performance but with quite a high startup cost vs. starting small with AWS.

The market will tell over time which approach gets the most traction, but where Big Data Cloud Service does help tools like Big Data Discovery is that there are many more opportunities for integration, and customers will be much more open to an Oracle tool solution compared to those building on commodity hardware and community Hadoop distributions – to my mind every Big Data Cloud Service customer ought to buy BDD and most probably Big Data Prep Service, so as customers adopt cloud as a platform option for big data projects I’d expect an uptick in sales of Oracle’s big data tools.

On a related topic and looping back to Oracle Data Integration, the other announcement in this area that was interesting was around Spark Streaming support in Oracle Data Integrator 12c.


ODI12c has got some great batch-style capabilities around Hadoop, but as I talked about earlier in the year in an article on Flume, Morphlines and Cloudera Search, the market is all about real-time data ingestion now; batch is more for one-off historical data loads. Again, like Dataflow ML, this feature is in beta and probably won’t be out for many months, but when it comes out it’ll complete ODI’s capabilities around big data ingestion – we’re hoping to take part in the beta, so keep an eye on the blog for news as it comes out.

So that’s it for part 2 of our Oracle Openworld 2015 update – we’ll complete the series tomorrow with a look at Oracle BI Applications, Oracle Database 12cR2 “sharding” and something very interesting planned for a future Oracle 12c database release – “Analytic Views”.

The post Oracle OpenWorld 2015 Roundup Part 2 : Data Integration, and Big Data (in the Cloud…) appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Oracle OpenWorld 2015 Roundup Part 1 : OBIEE12c and Data Visualisation Cloud Service

Sun, 2015-11-01 16:51

Last week saw Oracle Openworld 2015 running in San Francisco, USA, with Rittman Mead delivering a number of sessions around BI, data integration, Big Data and cloud. Several of us took part in Partner Advisory Councils on the Friday before Openworld itself, and along with the ACE Director briefings earlier that week we went into Openworld with a pretty good idea of what was being announced – but as ever there were a few surprises, and some sessions hidden away that were actually very significant in terms of where Oracle might be going – so let’s go through what we thought were the key announcements first, then we’ll get onto the more interesting stuff at the end.

And of course the key announcement for us and our customers was the general availability of OBIEE12c 12.2.1, which we described in a blog post at the time as being focused primarily on business agility and self-service – the primary drivers of BI license spend today. OBIEE12c came out the Friday before Openworld with availability across all supported Unix platforms as well as Linux and Windows, with this initial release not seeming massively different to 11g for developers and end-users at least at first glance – RPD development through the BI Administration tool is largely the same as 11g, at least for now; Answers and Dashboards has had a face-lift and uses a new flatter UI style called “Oracle Alta” but otherwise is recognisably similar to 11g, and the installer lays down Essbase and BI Publisher alongside OBIEE.


Under the covers though there are some key differences and improvements that will only become apparent after a while, or are really a foundation for much wider changes and improvements coming later in the 12c product timeline. The way you upload RPDs gives some hint of what’s to come – with 11g we used Enterprise Manager to upload new RPDs to the BI Server, which then had to be restarted to pick up the new repository, whereas 12c has a separate utility for uploading RPDs and they’re not stored in quite the same way as before (more on this to come…). In addition there’s no longer any need to restart the BI Server (or cluster of BI Servers) to use the new repository, and the back-end has been simplified in lots of different ways, all designed to enable cloning, provisioning and portability between on-premise and cloud, based around two new concepts, “service instances” and “BI Modules” – expect to hear more about these over the next few years; the diagram below outlines 12c’s product architecture at a high level.


Of course there are two very obvious new front-end features in OBIEE12c, Visual Analyzer and data mashups, but they require an extra net-new license on top of BI Foundation Suite to use in production. Visual Analyzer is Oracle’s answer to Tableau and adds data analysis, managed data discovery and data visualisation to OBIEE’s existing capabilities, but crucially uses OBIEE’s RPD as the primary data source for users’ analysis – in other words providing Tableau-like functionality but with a trusted single source of data, managed and curated centrally. Visual Analyzer is all about self-service and exploring datasets, and it’s here that the new data-mashup feature is really aimed – users can upload spreadsheets of additional measures and attributes to the core dataset used in their Visual Analyzer project, and blend or “mash up” their data to create their own unique visualizations, as shown in the screenshot below:


Data mashups are also available for the core Answers product, but they’re primarily aimed at VA; and for more casual users where data visualisation is all they want and cloud is their ideal delivery platform, Oracle also released Data Visualisation Cloud Service (DVCS) – aka Visual-Analyzer-in-the-cloud.


To see DVCS in action, the Youtube video below shows just the business analytics part of Thomas Kurian’s session, where DVCS links to Oracle’s Social Network Cloud Service to provide instant data visualisation and mashup capabilities all from the browser – pretty compelling, if you ignore the Oracle Social Network part (is that ever used outside of Oracle?).

Think of DVCS as BICS with Answers, Dashboards and the RPD Model Builder stripped out, all data instead uploaded from spreadsheets, at half the price of BICS and first in line for new VA features as they become available. This “cloud first” strategy now goes across the board for Oracle – partly an incentive to move to the cloud, mostly a reflection of how much easier it is to ship new features when Oracle controls the installation. DVCS and BICS will see updates on a more or less monthly cycle now (see this MOS document that details new features added to BICS since initial availability, and this blog post from ourselves announcing VA and data mashups on BICS well before they became available on-premise). In fact we’re almost at the point where whole on-premise OBIEE systems can conceivably be moved into Oracle Cloud, with my main Openworld session on just this topic – the primary end-user benefit being first access to the usability, self-service and data viz capabilities Oracle are now adding to their BI platform.


Moreover, DVCS is probably just the start of a number of standalone, on-premise and cloud VA derivatives aimed at capturing the Tableau / Excel / PowerBI market. Pricing is more competitive than with BICS, but as Oracle move further downmarket with VA it’ll end up competing more head-to-head with Tableau on features, and PowerBI is just a tenth of the cost of DVCS. I see it more as a “land-and-expand” play, with the aim being to trade the customer up to full BICS, or at least to capture the segment of the market who’d otherwise go to Excel or Tableau Desktop – it’ll be interesting to see how this one plays out.

So that’s it for Part 1 of our Oracle Openworld 2015 roundup – tomorrow we’ll look at data integration and big data.

The post Oracle OpenWorld 2015 Roundup Part 1 : OBIEE12c and Data Visualisation Cloud Service appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Forays into Kafka – Enabling Flexible Data Pipelines

Tue, 2015-10-27 17:46

One of the defining features of “Big Data” from a technologist’s point of view is the sheer number of tools and permutations at one’s disposal. Do you go Flume or Logstash? Avro or Thrift? Pig or Spark? Foo or Bar? (I made that last one up). This wealth of choice is wonderful because it means we can choose the right tool for the right job each time.

Of course, we need to establish that we have indeed chosen the right tool for the right job. But here’s the paradox: how do we easily work out whether a tool is going to do what we want of it and be a good fit, without disturbing what we already have in place? Particularly if it’s going to be part of an existing productionised data pipeline, inserting a new tool partway through risks disrupting it. We potentially end up with a series of cloned environments, all diverging from each other and not necessarily comparable (not to mention the overhead of the resources to host it all).

The same issue arises when we want to change the code or configuration of an existing pipeline. Bugs creep in, ideas to enhance the processing that you’ve currently got present themselves. Wouldn’t it be great if we could test these changes reliably and with no risk to the existing system?

This is where Kafka comes in. Kafka is very useful for two reasons:

  1. You can use it as a buffer for data that can be consumed and re-consumed on demand
  2. Multiple consumers can all pull the data, independently and at their own rate.
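These two properties are easy to demonstrate in miniature. The sketch below is plain Python with no Kafka involved – the class and method names are my own invention, purely to show the offset-per-consumer-group mechanics that Kafka gives you:

```python
class MiniTopic:
    """A toy model of a Kafka topic: an append-only log plus a
    per-consumer-group offset, so each group reads independently."""

    def __init__(self):
        self.log = []      # the retained messages
        self.offsets = {}  # group id -> next position to read

    def produce(self, message):
        self.log.append(message)

    def consume(self, group):
        """Return every message this group hasn't seen yet, at its own pace."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:]
        self.offsets[group] = len(self.log)
        return batch

    def rewind(self, group):
        """Reset the group's offset: re-consume everything still retained."""
        self.offsets[group] = 0


topic = MiniTopic()
for line in ["log line 1", "log line 2", "log line 3"]:
    topic.produce(line)

# Two independent consumers each get the full stream...
hdfs_feed = topic.consume("flume-hdfs")
es_feed = topic.consume("logstash-es")

# ...and either one can replay it without touching the source.
topic.rewind("flume-hdfs")
replayed = topic.consume("flume-hdfs")
```

Real Kafka adds partitioning, replication and time-based retention on top, but this offset-per-group model is what makes it cheap to hang multiple independent consumers – or a replay – off the same data.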

So you take your existing pipeline, plumb in Kafka, and then as and when you want to try out additional tools (or configurations of existing ones) you simply take another ‘tap’ off the existing store. This is an idea that Gwen Shapira put forward in May 2015 and really resonated with me.

I see Kafka sitting right on that Execution/Innovation demarcation line of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

Kafka enables us to build a pipeline for our analytics that breaks down into two phases:

  1. Data ingest from source into Kafka, simple and reliable. As few moving parts as possible.
  2. Post-processing. Batch or realtime. Uses Kafka as source. Re-runnable. Multiple parallel consumers:
    • Productionised processing into Event Engine, Data Reservoir and beyond
    • Adhoc/loosely controlled Data Discovery processing and re-processing

These two steps align with the idea of “Obtain” and “Scrub” that Rittman Mead’s Jordan Meyer talked about in his BI Forum 2015 Masterclass on Data Discovery:

So that’s the theory – let’s now look at an example of how Kafka can enable us to build a more flexible and productive data pipeline and environment.

Flume or Logstash? HDFS or Elasticsearch? … All of them!

Mark Rittman wrote back in April 2014 about using Apache Flume to stream logs from the Rittman Mead web server over to HDFS, from where they could be analysed in Hive and Impala. The basic setup looked like this:

Another route for analysing data is through the ELK stack. It does a similar thing – streams logs (with Logstash) into a data store (Elasticsearch) from where they can be analysed – just with a different set of tools and a different emphasis. The input is the same – the web server log files. Let’s say I want to evaluate which is the better mechanism for analysing my log files, and compare the two side-by-side. Ultimately I might only want to go forward with one, but for now, I want to try both.

I could run them literally in parallel:

The disadvantage with this is that I have twice the ‘footprint’ on my data source, a Production server. A principle throughout all of this is that we want to remain light-touch on the sources of data. Whether a Production web server, a Production database, or whatever – upsetting the system owners of the data we want is never going to win friends.

An alternative to running in parallel would be to use one of the streaming tools to load data in place of the other, i.e.


The issue with this is I want to validate the end-to-end pipeline. Using a single source is better in terms of load/risk to the source system, but less so for validating my design. If I’m going to go with Elasticsearch as my target, Logstash would be the better fit source. Ditto HDFS/Flume. Both support connectors to the other, but using native capabilities always feels to me a safer option (particularly in the open-source world). And what if the particular modification I’m testing doesn’t support this kind of connectivity pattern?

Can you see where this is going? How about this:

The key points here are:

  1. One hit on the source system. In this case it’s Flume, but it could be Logstash, or another tool. This streams each line of the log file into Kafka in the exact order that it’s read.
  2. Kafka holds a copy of the log data, for a configurable time period. This could be days, or months – up to you and depending on purpose (and disk space!)
  3. Kafka is designed to be distributed and fault-tolerant. As with most of the boxes on this logical diagram it would be physically spread over multiple machines for capacity, performance, and resilience.
  4. The eventual targets, HDFS and Elasticsearch, are loaded by their respective tools pulling the web server entries exactly as they were on disk. In terms of validating end-to-end design we’re still doing that – we’re just pulling from a different source.
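On point 2, the retention window is a broker-side setting rather than anything the consumers manage. As a sketch (the values are illustrative; the property names are from the standard Kafka broker configuration, and per-topic overrides are also possible):

```properties
# broker-side retention settings, in the Kafka broker's = 168        # keep messages for 7 days...
log.retention.bytes = 10737418240  # ...or cap by size, whichever limit is hit first
```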

Another massively important benefit of Kafka is this:

Sooner or later (and if you’re new to the tool and code/configuration required, probably sooner) you’re going to get errors in your data pipeline. These may be fatal and cause it to fall in a heap, or they may be more subtle and you only realise after analysis that some of your data’s missing or not fully enriched. What to do? Obviously you need to re-run your ingest process. But how easy is that? Where is the source data? Maybe you’ll have a folder full of “.processed” source log files, or an HDFS folder of raw source data that you can reprocess. The issue here is the re-processing – you need to point your code at the alternative source, and work out the range of data to reprocess.

This is all eminently do-able of course – but wouldn’t it be easier just to rerun your existing ingest pipeline and just rewind the point at which it’s going to pull data from? Minimising the amount of ‘replumbing’ and reconfiguration to run a re-process job vs. new ingest makes it faster to do, and more reliable. Each additional configuration change is an opportunity to mis-configure. Each ‘shadow’ script clone for re-running vs normal processing is increasing the risk of code diverging and stale copies being run.

The final pipeline in this simple example looks like this:

  • The source server logs are streamed into Kafka, with a permanent copy up onto Amazon’s S3 for those real “uh oh” moments. Kafka, in a sandbox environment with a ham-fisted sysadmin, won’t be bullet-proof. Better to recover a copy from S3 than have to bother the Production server again. This is something I’ve put in for this specific use case, and wouldn’t be applicable in others.
  • From Kafka the web server logs are available to stream, as if natively from the web server disk itself, through Flume and Logstash.

There’s a variation on a theme of this, that looks like this:

Instead of Flume -> Kafka, and then a second Flume -> HDFS, we shortcut this and have the same Flume agent as is pulling from source writing to HDFS. Why have I not put this as the final pipeline? Because of this:

Let’s say that I want to do some kind of light-touch enrichment on the files, such as extracting the log timestamp in order to partition my web server logs in HDFS by the date of the log entry (not the time of processing, because I’m working with historical files too). I’m using a regex_extractor interceptor in Flume to determine the timestamp from the event data (log entry) being processed. That’s great, and it works well – when it works. If I get my regex wrong, or the log file changes date format, the house of cards comes tumbling down. Now I have a mess, because my nice clean ingest pipeline from the source system now needs fixing and re-running. As before, of course it is possible to write this cleanly so that it doesn’t break, etc etc, but from the point of view of decoupling operations for manageability and flexibility it makes sense to keep them separate (remember the Obtain vs Scrub point above?).

The final note on this is to point out that technically we can implement the pipeline using a Kafka Flume channel, which is a slightly neater way of doing things. The data still ends up in the S3 sink, and available in Kafka for streaming to all the consumers.

Kafka in Action

Let’s take a look at the configuration to put the above theory into practice. I’m running all of this on Oracle’s BigDataLite 4.2.1 VM which includes, amongst many other goodies, CDH 5.4.0. Alongside this I’ve installed into /opt:

  • apache-flume-1.6.0
  • elasticsearch-1.7.3
  • kafka_2.10-
  • kibana-4.1.2-linux-x64
  • logstash-1.5.4
The Starting Point – Flume -> HDFS

First, we’ve got the initial Logs -> Flume -> HDFS configuration, similar to what Mark wrote about originally:

source_agent.sources = apache_server  
source_agent.sources.apache_server.type = exec  
source_agent.sources.apache_server.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_server.batchSize = 1  
source_agent.sources.apache_server.channels = memoryChannel

source_agent.channels = memoryChannel  
source_agent.channels.memoryChannel.type = memory  
source_agent.channels.memoryChannel.capacity = 100

## Write to HDFS  
source_agent.sinks = hdfs_sink  
source_agent.sinks.hdfs_sink.type = hdfs = memoryChannel  
source_agent.sinks.hdfs_sink.hdfs.path = /user/oracle/incoming/rm_logs/apache_log  
source_agent.sinks.hdfs_sink.hdfs.fileType = DataStream  
source_agent.sinks.hdfs_sink.hdfs.writeFormat = Text  
source_agent.sinks.hdfs_sink.hdfs.rollSize = 0  
source_agent.sinks.hdfs_sink.hdfs.rollCount = 10000  
source_agent.sinks.hdfs_sink.hdfs.rollInterval = 600

After running this

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \
--conf-file flume_website_logs_02_tail_source_hdfs_sink.conf

we get the logs appearing in HDFS and can see them easily in Hue:

Adding Kafka to the Pipeline

Let’s now add Kafka to the mix. I’ve already set up and started Kafka (see here for how), and Zookeeper’s already running as part of the default BigDataLite build.

First we need to define a Kafka topic that is going to hold the log files. In this case it’s called apache_logs:

$ /opt/kafka_2.10- --zookeeper bigdatalite:2181 \
--create --topic apache_logs  --replication-factor 1 --partitions 1

Just to prove it’s there and we can send/receive message on it I’m going to use the Kafka console producer/consumer to test it. Run these in two separate windows:

$ /opt/kafka_2.10- \
--broker-list bigdatalite:9092 --topic apache_logs

$ /opt/kafka_2.10- \
--zookeeper bigdatalite:2181 --topic apache_logs

With the Consumer running enter some text, any text, in the Producer session and you should see it appear almost immediately in the Consumer window.

Now that we’ve validated the Kafka topic, let’s plumb it in. We’ll switch the existing Flume config to use a Kafka sink, and then add a second Flume agent to do the Kafka -> HDFS bit, giving us this:

The original flume agent configuration now looks like this:

source_agent.sources = apache_log_tail  
source_agent.channels = memoryChannel  
source_agent.sinks = kafka_sink

source_agent.sources.apache_log_tail.type = exec  
source_agent.sources.apache_log_tail.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_log_tail.batchSize = 1  
source_agent.sources.apache_log_tail.channels = memoryChannel

source_agent.channels.memoryChannel.type = memory  
source_agent.channels.memoryChannel.capacity = 100

## Write to Kafka = memoryChannel  
source_agent.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink  
source_agent.sinks.kafka_sink.batchSize = 5  
source_agent.sinks.kafka_sink.brokerList = bigdatalite:9092  
source_agent.sinks.kafka_sink.topic = apache_logs

Restart the console consumer from above so that you can see what’s going into Kafka, and then run the Flume agent. You should see the log entries appearing soon after. Remember that the console consumer is just one consumer of the logs – when we plug in the Flume consumer to write the logs to HDFS we can opt to pick up all of the entries in Kafka, completely independently of what we have or haven’t consumed in the console session.

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \ 
--conf-file flume_website_logs_03_tail_source_kafka_sink.conf

[oracle@bigdatalite ~]$ /opt/kafka_2.10- \
--zookeeper bigdatalite:2181 --topic apache_logs

- - [06/Sep/2015:08:08:30 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; - free monitoring service;"
- - [06/Sep/2015:08:08:35 +0000] "HEAD /blog HTTP/1.1" 301 - "" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
- - [06/Sep/2015:08:08:35 +0000] "GET /blog/ HTTP/1.0" 200 145999 "-" "Mozilla/5.0 (compatible; monitis - premium monitoring service;"
- - [06/Sep/2015:08:08:36 +0000] "HEAD /blog/ HTTP/1.1" 200 - "" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
- - [06/Sep/2015:08:08:44 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; - free monitoring service;"
- - [06/Sep/2015:08:08:58 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; monitis - premium monitoring service;"
- - [06/Sep/2015:08:08:58 +0000] "GET / HTTP/1.1" 200 36946 "-" "Echoping/6.0.2"

Set up the second Flume agent to use Kafka as a source, and HDFS as the target just as it was before we added Kafka into the pipeline:

target_agent.sources = kafkaSource  
target_agent.channels = memoryChannel  
target_agent.sinks = hdfsSink 

target_agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource  
target_agent.sources.kafkaSource.zookeeperConnect = bigdatalite:2181  
target_agent.sources.kafkaSource.topic = apache_logs  
target_agent.sources.kafkaSource.batchSize = 5  
target_agent.sources.kafkaSource.batchDurationMillis = 200  
target_agent.sources.kafkaSource.channels = memoryChannel

target_agent.channels.memoryChannel.type = memory  
target_agent.channels.memoryChannel.capacity = 100

## Write to HDFS  
target_agent.sinks.hdfsSink.type = hdfs = memoryChannel  
target_agent.sinks.hdfsSink.hdfs.path = /user/oracle/incoming/rm_logs/apache_log  
target_agent.sinks.hdfsSink.hdfs.fileType = DataStream  
target_agent.sinks.hdfsSink.hdfs.writeFormat = Text  
target_agent.sinks.hdfsSink.hdfs.rollSize = 0  
target_agent.sinks.hdfsSink.hdfs.rollCount = 10000  
target_agent.sinks.hdfsSink.hdfs.rollInterval = 600

Fire up the agent:

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent -n target_agent \
-f flume_website_logs_04_kafka_source_hdfs_sink.conf

and as the website log data streams in to Kafka (from the first Flume agent) you should see the second Flume agent sending it to HDFS and evidence of this in the console output from Flume:

15/10/27 13:53:53 INFO hdfs.BucketWriter: Creating /user/oracle/incoming/rm_logs/apache_log/FlumeData.1445954032932.tmp

and in HDFS itself:

Play it again, Sam?

All we’ve done to this point is add Kafka into the pipeline, ready for subsequent use. We’ve not changed the net output of the data pipeline. But we can now benefit from having Kafka there, by re-running some of our HDFS load without having to go back to the source files. Let’s say we want to partition the logs as we store them – but we don’t want to disrupt the existing processing. How? Easy! Just create another Flume agent with the additional configuration in it to do the partitioning.

target_agent.sources = kafkaSource  
target_agent.channels = memoryChannel  
target_agent.sinks = hdfsSink

target_agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource  
target_agent.sources.kafkaSource.zookeeperConnect = bigdatalite:2181  
target_agent.sources.kafkaSource.topic = apache_logs  
target_agent.sources.kafkaSource.batchSize = 5  
target_agent.sources.kafkaSource.batchDurationMillis = 200  
target_agent.sources.kafkaSource.channels = memoryChannel  
target_agent.sources.kafkaSource.groupId = new  
target_agent.sources.kafkaSource.kafka.auto.offset.reset = smallest  
target_agent.sources.kafkaSource.interceptors = i1

target_agent.channels.memoryChannel.type = memory  
target_agent.channels.memoryChannel.capacity = 1000

# Regex Interceptor to set timestamp so that HDFS can be written to partitioned  
target_agent.sources.kafkaSource.interceptors.i1.type = regex_extractor  
target_agent.sources.kafkaSource.interceptors.i1.serializers = s1  
target_agent.sources.kafkaSource.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer = timestamp  
# Match this format logfile to get timestamp from it:  
# - - [06/Apr/2014:03:38:07 +0000] "GET / HTTP/1.1" 200 38281 "-" "Pingdom.com_bot_version_1.4_("  
target_agent.sources.kafkaSource.interceptors.i1.regex = (\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s\\+\\d{4})  
target_agent.sources.kafkaSource.interceptors.i1.serializers.s1.pattern = dd/MMM/yyyy:HH:mm:ss Z  

## Write to HDFS  
target_agent.sinks.hdfsSink.type = hdfs = memoryChannel  
target_agent.sinks.hdfsSink.hdfs.path = /user/oracle/incoming/rm_logs/apache/%Y/%m/%d/access_log  
target_agent.sinks.hdfsSink.hdfs.fileType = DataStream  
target_agent.sinks.hdfsSink.hdfs.writeFormat = Text  
target_agent.sinks.hdfsSink.hdfs.rollSize = 0  
target_agent.sinks.hdfsSink.hdfs.rollCount = 0  
target_agent.sinks.hdfsSink.hdfs.rollInterval = 600

The important lines of note here (as highlighted above) are:

  • the regex_extractor interceptor which determines the timestamp of the log event, then used in the hdfs.path partitioning structure
  • the groupId and kafka.auto.offset.reset configuration items for the kafkaSource.
    • The groupId ensures that this Flume agent’s offset in the consumption of the data in the Kafka topic is maintained separately from that of the original agent that we had. By default it is flume, and here I’m overriding it to new. It’s a good idea to specify this explicitly in all Kafka Flume consumer configurations to avoid complications.
    • kafka.auto.offset.reset tells the consumer that if no existing offset is found (which it won’t be, if the groupId is a new one) it should start from the beginning of the data rather than the end (which is what it would do by default).
    • Thus if you want to get Flume to replay the contents of a Kafka topic, just set the groupId to an unused one (eg ‘foo01’, ‘foo02’, etc) and make sure kafka.auto.offset.reset is set to smallest
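The interceptor’s regex and date pattern are worth sanity-checking outside Flume before trusting them with the pipeline. A quick standalone sketch in Python (my own translation – the regex is the Flume config’s with the double backslashes unescaped, and the strptime format is the Python equivalent of the Java pattern dd/MMM/yyyy:HH:mm:ss Z):

```python
import re
from datetime import datetime

# The regex_extractor interceptor's pattern, unescaped from the Flume config
TS_REGEX = re.compile(r"\d{2}/[a-zA-Z]{3}/\d{4}:\d{2}:\d{2}:\d{2}\s\+\d{4}")

# Python equivalent of the Java date pattern dd/MMM/yyyy:HH:mm:ss Z
TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

log_line = '- - [06/Apr/2014:03:38:07 +0000] "GET / HTTP/1.1" 200 38281 "-" "Pingdom"'

match = TS_REGEX.search(log_line)
event_time = datetime.strptime(match.group(0), TS_FORMAT)

# What the MillisSerializer hands to the HDFS sink's %Y/%m/%d path tokens:
# the event timestamp in epoch milliseconds
epoch_millis = int(event_time.timestamp() * 1000)
```

Note that the regex only matches positive UTC offsets (`\+\d{4}`) – fine for these logs, but exactly the sort of format change that, as discussed above, would bring the house of cards down.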

Now run it (concurrently with the existing flume agents if you want):

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent -n target_agent \
-f flume_website_logs_07_kafka_source_partitioned_hdfs_sink.conf

You should see a flurry of activity (or not, depending on how much data you’ve already got in Kafka), and some nicely partitioned apache logs in HDFS:

Crucially, the existing Flume agent and non-partitioned HDFS pipeline stay in place and function exactly as they did – we’ve not had to touch them. We could then run the two side-by-side until we’re happy the partitioning is working correctly, and then decommission the first. Even at this point we have the benefit of Kafka: we just turn off the original HDFS-writing agent, and the new “live” one continues to run without needing reconfiguration. We’ve validated the actual configuration we’re going to use for real, without having to simulate it with mock data sources that would then need re-plumbing prior to real use.

Clouds and Channels

We’re going to evolve the pipeline a bit now. We’ll go back to a single Flume agent writing to HDFS, but add in Amazon’s S3 as the target for the unprocessed log files. The point here is not so much that S3 is the best place to store log files (although it is a good option), but as a way to demonstrate a secondary method of keeping your raw data available without impacting the source system. It also fits nicely with using the Kafka flume channel to tighten the pipeline up a tad:

Hadoop can address Amazon’s S3 service as a filesystem, which means Flume can use the S3N protocol to write directly to it. You need to have already set up your S3 ‘bucket’, and have the appropriate AWS Access Key ID and Secret Key. To get this to work I added these credentials to /etc/hadoop/conf.bigdatalite/core-site.xml (I tried specifying them inline with the Flume configuration, but with no success):
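The properties in question are Hadoop’s standard S3N credential settings – shown here as a sketch with placeholder values in place of real keys:

```xml
<!-- added to /etc/hadoop/conf.bigdatalite/core-site.xml -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
```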


Once you’ve set up the bucket and credentials, the original flume agent (the one pulling the actual web server logs) can be amended:

source_agent.sources = apache_log_tail  
source_agent.channels = kafkaChannel  
source_agent.sinks = s3Sink

source_agent.sources.apache_log_tail.type = exec  
source_agent.sources.apache_log_tail.command = tail -f /home/oracle/website_logs/access_log  
source_agent.sources.apache_log_tail.batchSize = 1  
source_agent.sources.apache_log_tail.channels = kafkaChannel

## Write to Kafka Channel  
source_agent.channels.kafkaChannel.type =  
source_agent.channels.kafkaChannel.topic = apache_logs  
source_agent.channels.kafkaChannel.brokerList = bigdatalite:9092  
source_agent.channels.kafkaChannel.zookeeperConnect = bigdatalite:2181

## Write to S3 = kafkaChannel  
source_agent.sinks.s3Sink.type = hdfs  
source_agent.sinks.s3Sink.hdfs.path = s3n://rmoff-test/apache  
source_agent.sinks.s3Sink.hdfs.fileType = DataStream  
source_agent.sinks.s3Sink.hdfs.filePrefix = access_log  
source_agent.sinks.s3Sink.hdfs.writeFormat = Text  
source_agent.sinks.s3Sink.hdfs.rollCount = 10000  
source_agent.sinks.s3Sink.hdfs.rollSize = 0  
source_agent.sinks.s3Sink.hdfs.batchSize = 10000  
source_agent.sinks.s3Sink.hdfs.rollInterval = 600

Here the source is the same as before (server logs), but the channel is now Kafka itself, and the sink is S3. Using Kafka as the channel has the nice benefit that the data ends up in Kafka anyway; we don’t need it as an explicit target in its own right.

Restart the source agent using this new configuration:

$ /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --name source_agent \
--conf-file flume_website_logs_09_tail_source_kafka_channel_s3_sink.conf

and you should get the data appearing on both HDFS as before, and now also in the S3 bucket:

Didn’t Someone Say Logstash?

The premise at the beginning of this exercise was that I could extend an existing data pipeline to pull data into a new set of tools, as if from the original source, but without touching that source or the existing configuration in place. So far we’ve got a pipeline that is pretty much as we started with, just with Kafka in there now and an additional feed to S3:

Now we’re going to extend (or maybe “broaden” is a better term) the data pipeline to add Elasticsearch into it:

Whilst Flume can write to Elasticsearch given the appropriate plugin, I’d rather use a tool much closer to Elasticsearch in origin and direction – Logstash. Logstash supports Kafka as an input (and as an output, if you want), making the configuration ridiculously simple. To smoke-test the configuration, just run Logstash with this configuration:

input {  
        kafka {  
                zk_connect => 'bigdatalite:2181'  
                topic_id => 'apache_logs'  
                codec => plain {  
                        charset => "ISO-8859-1"  
                }  
                # Use both the following two if you want to reset processing  
                reset_beginning => 'true'  
                auto_offset_reset => 'smallest'  
        }  
}

output {  
        stdout { codec => rubydebug }  
}

A few things to point out in the input configuration:

  • You need to specify the plain codec (assuming your input from Kafka is plain text). The default codec for the Kafka plugin is json, and Logstash does NOT like being asked to parse plain text as json, as I found out:

    - - [06/Sep/2015:08:08:30 +0000] "GET / HTTP/1.0" 301 235 "-" "Mozilla/5.0 (compatible; - free monitoring service;" {:exception=>#<NoMethodError: undefined method `[]' for 37.252:Float>, :backtrace=>["/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/event.rb:73:in `initialize'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-codec-json-1.0.1/lib/logstash/codecs/json.rb:46:in `decode'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-1.0.0/lib/logstash/inputs/kafka.rb:169:in `queue_event'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-1.0.0/lib/logstash/inputs/kafka.rb:139:in `run'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:177:in `inputworker'", "/opt/logstash-1.5.4/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.4-java/lib/logstash/pipeline.rb:171:in `start_input'"], :level=>:error}

  • As well as specifying the codec, I needed to specify the charset. Without this I got \\u0000\\xBA\\u0001 at the beginning of each message that Logstash pulled from Kafka

  • Specifying reset_beginning and auto_offset_reset tells Logstash to pull everything in from Kafka, rather than starting at the latest offset.

When you run the configuration file above you should see a stream of messages to your console of everything that is in the Kafka topic:

$ /opt/logstash-1.5.4/bin/logstash -f logstash-apache_10_kafka_source_console_output.conf

The output will look like this – note that Logstash has added its own special @version and @timestamp fields:

{  
       "message" => " - - [09/Oct/2015:04:13:23 +0000] \"GET /wp-content/uploads/2014/10/JFB-View-Selector-LowRes-300x218.png HTTP/1.1\" 200 53295 \"\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",  
      "@version" => "1",  
    "@timestamp" => "2015-10-27T17:29:06.596Z"  
}

Having proven the Kafka-Logstash integration, let’s do something useful – get all those lovely log entries streaming from source, through Kafka, enriched in Logstash with things like geoip, and finally stored in Elasticsearch:

input {  
        kafka {  
                zk_connect => 'bigdatalite:2181'  
                topic_id => 'apache_logs'  
                codec => plain {  
                        charset => "ISO-8859-1"  
                }  
                # Use both the following two if you want to reset processing  
                reset_beginning => 'true'  
                auto_offset_reset => 'smallest'  
        }  
}

filter {  
        # Parse the message using the pre-defined "COMBINEDAPACHELOG" grok pattern  
        grok { match => ["message","%{COMBINEDAPACHELOG}"] }

        # Ignore anything that's not a blog post hit, characterised by /yyyy/mm/post-slug form  
        if [request] !~ /^\/[0-9]{4}\/[0-9]{2}\/.*$/ { drop{} }

        # From the blog post URL, strip out the year/month and slug  
        #     year  => 2015  
        #     month =>   02  
        #     slug  => obiee-monitoring-and-diagnostics-with-influxdb-and-grafana  
        grok { match => [ "request","\/%{NUMBER:post-year}\/%{NUMBER:post-month}\/(%{NUMBER:post-day}\/)?%{DATA:post-slug}(\/.*)?$"] }

        # Combine year and month into one field  
        mutate { replace => [ "post-year-month" , "%{post-year}-%{post-month}" ] }

        # Use GeoIP lookup to locate the visitor's town/country  
        geoip { source => "clientip" }

        # Store the date of the log entry (rather than now) as the event's timestamp  
        date { match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"] }  
}

output {  
        elasticsearch { host => "bigdatalite"  index => "blog-apache-%{+YYYY.MM.dd}" }  
}

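The filter’s drop condition and grok extraction are easy to sanity-check outside Logstash too. A rough Python equivalent (my own approximation – grok’s NUMBER and DATA patterns are simplified to plain regex, so treat this as illustrative rather than exact):

```python
import re

# Approximates: if [request] !~ /^\/[0-9]{4}\/[0-9]{2}\/.*$/ { drop{} }
BLOG_POST = re.compile(r"^/[0-9]{4}/[0-9]{2}/.*$")

# Approximates the grok pattern extracting post-year, post-month,
# optional post-day and post-slug from the request path
SLUG = re.compile(
    r"/(?P<year>\d+)/(?P<month>\d+)/(?:(?P<day>\d+)/)?(?P<slug>[^/]+)(?:/.*)?$"
)

def parse_request(request):
    """Return (year-month, slug) for blog-post hits, or None to 'drop' the event."""
    if not BLOG_POST.match(request):
        return None
    m = SLUG.search(request)
    if m is None:
        return None
    # Mirrors the mutate filter combining year and month into one field
    return ("%s-%s" % (m.group("year"), m.group("month")), m.group("slug"))
```

So a request for /2015/02/obiee-monitoring-and-diagnostics-with-influxdb-and-grafana comes back as year-month "2015-02" plus the slug, while an asset hit like /wp-content/uploads/x.png is dropped.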
Make sure that Elasticsearch is running and then kick off Logstash:

$ /opt/logstash-1.5.4/bin/logstash -f logstash-apache_01_kafka_source_parsed_to_es.conf

Nothing will appear to happen on the console:

log4j, [2015-10-27T17:36:53.228]  WARN: org.elasticsearch.bootstrap: JNA not found. native methods will be disabled.  
Logstash startup completed

But in the background Elasticsearch will be filling up with lots of enriched log data. You can confirm this through the useful kopf plugin to see that the Elasticsearch indices are being created:

and directly through Elasticsearch’s RESTful API too:

$ curl -s -XGET http://bigdatalite:9200/_cat/indices?v|sort  
health status index                  pri rep docs.count docs.deleted store.size
yellow open   blog-apache-2015.09.30   5   1      11872            0       11mb           11mb  
yellow open   blog-apache-2015.10.01   5   1      13679            0     12.8mb         12.8mb  
yellow open   blog-apache-2015.10.02   5   1      10042            0      9.6mb          9.6mb  
yellow open   blog-apache-2015.10.03   5   1       8722            0      7.3mb          7.3mb

And of course, the whole point of streaming the data into Elasticsearch in the first place – easy analytics through Kibana:


Kafka is awesome :-D

We’ve seen in this article how Kafka enables the implementation of flexible data pipelines that can evolve organically without requiring system rebuilds to implement or test new methods. It allows the data discovery function to tap in to the same source of data as the more standard analytical reporting one, without risking impacting the source system at all.

The post Forays into Kafka – Enabling Flexible Data Pipelines appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Oracle Business Intelligence 12c Now Available – Improving Agility and Enabling Self-Service for BI Users

Mon, 2015-10-26 00:08

Oracle Business Intelligence 12c became available for download last Friday and is being officially launched at Oracle Openworld next week. Key new features in 12c include an updated and cleaner look-and-feel; Visual Analyser, which brings Tableau-style reporting to OBIEE users; and another new feature called “data mashups”, which enables users to upload spreadsheets of their own data to combine with their main curated datasets.


Behind the scenes the back-end of OBIEE has been overhauled and simplified, making it easier to clone, provision and back up BI systems, whilst other changes lay the foundation for the public and private cloud features we’ll see over the coming years – expect Oracle BI Cloud Service to become an increasingly important part of Oracle’s overall BI offering as innovation increasingly arrives “cloud-first”.

So what does Oracle Business Intelligence 12c offer customers currently on the 11g release, and why would you want to upgrade? In our view, the new features in 12c come down to two main areas – “agility”, and “self-service” – two major trends that having been driving spend and investment in BI over the past few years.

OBIEE 12c for Business Agility – Giving Users the Ability to complete the “Last Mile” in Reporting, and Moving towards “BI-as-a-Service” for IT

A common issue that all BI implementors have faced over many years is the time it takes to spin up new environments, create reports for users, and respond to new requirements and opportunities. OBIEE 12c features such as data mashups make it easier for end-users to complete the “last mile” in reporting by adding particular measures and attribute values to the reports and subject areas provided by IT, avoiding the situation where they instead export all the data to Excel or wait for IT to add the data they need into the centrally managed, curated dataset.


From an IT perspective, simplifications to the back-end of OBIEE such as bringing all configuration files into one place, deprecating the BI Systems Management API and returning to configuration files, simpler upgrades and faster installation make it quicker and easier to provision new 12c environments and move workloads between on-premise and in the cloud. The point of these changes is to enable organisations to respond to opportunities faster, and make sure IT isn’t the thing that’s slowing the reporting process down.

OBIEE 12c for Self-Service – Recognising the Shift in Ownership from IT to the End-Users

One of the biggest trends in BI, and in computing in general over the past few years, is the consumerization of IT and expectations around self-service. Big beneficiaries of that trend have been vendors such as Tableau and Qlikview, who’ve delivered BI tools that run on the desktop and make everything point-and-click; much like the PC vendors of the mainframe era, data and applications became a bit of a free-for-all, but users were able to get things done now rather than waiting for IT to provide the service. Similar to the data upload feature I mentioned in the context of agility, the new Visual Analyser feature in OBIEE 12c brings those same self-service, point-and-click data analysis features to OBIEE users – but crucially with a centrally managed, single-version-of-the-truth business semantic model at the centre of things.


Visual Analyser comes with the same data-mashup features as Answers, and new advanced analytics capabilities in Logical SQL and Answers’ query builder bring statistical functions like trend analysis and clustering into the hands of end-users, avoiding the need to involve DBAs or data scientists to provide complex SQL functions. If you do have a data scientist and want to re-use their work without learning another tool, OBIEE 12c makes it possible to call external R functions within Answers, separate to the Oracle R Enterprise integration in OBIEE 11g.

We’ll be covering more around the OBIEE12c launch over the coming weeks building on these themes of enabling business agility, and putting more self-service tools into the hands of users. We’ll also be launching our new OBIEE12c course over the next couple of days, with the first runs happening in Brighton and Atlanta in January 2016 – watch this space for more details.

The post Oracle Business Intelligence 12c Now Available – Improving Agility and Enabling Self-Service for BI Users appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Rittman Mead at Oracle Openworld 2015, San Francisco

Thu, 2015-10-22 13:21

Oracle Openworld 2015 is running next week in San Francisco, USA, and Rittman Mead are proud to be delivering a number of sessions over the week of the conference. We’ll also be taking part in a number of panel sessions, user group events and networking sessions, and running 1:1 sessions with anyone interested in talking to us about the solutions and services we’re talking about during the week.


Sessions at Oracle Openworld 2015 from Rittman Mead are as follows:

  • A Walk Through the Kimball ETL Subsystems with Oracle Data Integration Solutions [UGF6311] – Michael Rainey, Sunday, Oct 25, 12:00 p.m. | Moscone South—301
  • Oracle Business Intelligence Cloud Service—Moving Your Complete BI Platform to the Cloud [UGF4906] – Mark Rittman, Sunday, Oct 25, 2:30 p.m. | Moscone South—301
  • Developer Best Practices for Oracle Data Integrator Lifecycle Management [CON9611] – Jerome Francoisse + others, Thursday, Oct 29, 2:30 p.m. | Moscone West—2022
  • Oracle Data Integration Product Family: a Cornerstone for Big Data [CON9609] – Mark Rittman + others, Wednesday, Oct 28, 12:15 p.m. | Moscone West—2022
  • Empowering Users: Oracle Business Intelligence Enterprise Edition 12c Visual Analyzer [UGF5481] – Edel Kammermann, Sunday, Oct 25, 10:00 a.m. | Moscone West—3011
  • No Big Data Hacking—Time for a Complete ETL Solution with Oracle Data Integrator 12c [UGF5827] – Jerome Francoisse, Sunday, Oct 25, 8:00 a.m. | Moscone South—301

We’ll be at Openworld all week and available at various times to talk through topics we covered in our sessions, or any aspect of Oracle BI, DW and Big Data implementations you might be planning or currently running. Drop us an email to set something up during the week, or come along to any of our sessions and meet us in person.

The post Rittman Mead at Oracle Openworld 2015, San Francisco appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Introducing the Rittman Mead OBIEE Performance Analytics Service

Wed, 2015-10-21 04:30
Fix Your OBIEE Performance Problems Today

OBIEE is a powerful analytics tool that enables your users to make the most of the data in your organisation. Ensuring that expected response times are met is key to driving user uptake and successful user engagement with OBIEE.

Rittman Mead can help diagnose and resolve performance problems on your OBIEE system. Taking a holistic, full-stack view, we can help you deliver the best service to your users. Fast response times enable your users to do more with OBIEE, driving better engagement, higher satisfaction, and greater return on investment. We enable you to:

  • Create a positive user experience
  • Ensure OBIEE returns answers quickly
  • Empower your BI team to identify and resolve performance bottlenecks in real time
Rittman Mead Are The OBIEE Performance Experts

Rittman Mead have many years of experience in the full life cycle of data warehousing and analytical solutions, especially in the Oracle space. We know what it takes to design a good system, and to troubleshoot a problematic one.

We are firm believers in a practical and logical approach to performance analytics and optimisation. Eschewing the drunk man anti-method of ‘tuning’ configuration settings at random, we advocate making a clear diagnosis and baseline of performance problems before changing anything. Once a clear understanding of the situation is established, steps are taken in a controlled manner to implement and validate one change at a time.

Rittman Mead have spoken at conferences, produced videos, and written many blogs specifically on the subject of OBIEE Performance.

Performance Analytics is not a dark art. It is not the blind application of ‘best practices’ or ‘tuning’ configuration settings. It is the logical analysis of performance behaviour to accurately determine the issue(s) present, and the possible remedies for them.

Diagnose and Resolve OBIEE Performance Problems with Confidence

When you sign up for the Rittman Mead OBIEE Performance Analytics Service you get:

  1. On-site consultancy from one of our team of Performance experts, including Mark Rittman (Oracle ACE Director), and Robin Moffatt (Oracle ACE).
  2. A Performance Analysis Report to give you an assessment of the current performance and prioritised list of optimisation suggestions, which we can help you implement.
  3. Use of the Performance Diagnostics Toolkit to measure and analyse the behaviour of your system and correlate any poor response times with the metrics from the server and OBIEE itself.
  4. Training, which is vital for enabling your staff to deliver optimal OBIEE performance. We work with your staff to help them understand the good practices to look for in design and diagnostics. Training is based on formal courseware, along with workshops built around examples from your OBIEE system where appropriate.
Let Us Help You, Today!

Get in touch now to find out how we can help improve your OBIEE system’s performance. We offer a free, no-obligation sample of the Performance Analysis Report, built on YOUR data.

Don’t just call us when performance may already be problematic – we can help you assess your OBIEE system for optimal performance at all stages of the build process. Gaining a clear understanding of the performance profile of your system and any potential issues gives you the confidence and ability to understand any potential risks to the success of your project – before it gets too late.

The post Introducing the Rittman Mead OBIEE Performance Analytics Service appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

News on Three Big Data Webcasts with Oracle, and a Customer Case-Study at Cloudera Sessions

Tue, 2015-10-13 15:18


This week I’m presenting along with Liberty Global at the Cloudera Sessions event in Amsterdam on October 15th 2015, on their implementation of Cloudera Enterprise on Oracle Big Data Appliance for a number of big data and advanced analytics initiatives around their cable TV, mobile and internet business.

We’ve been working with Liberty Global for a number of years and helped them get started with their move into big data a year or so ago, and it’s great to see them speaking at this Cloudera event and the success they’ve had with this joint Oracle+Cloudera platform. Andre Lopes and Roberto Manfredini from Liberty Global will talk about the business drivers and initial PoC scenario that then paid for the first main stage of the project, and I’ll talk about how we worked with their implementation team and senior managers to implement Cloudera’s enterprise Hadoop platform on Oracle engineered systems. 

Rittman Mead and Oracle Big Data Webcast Series – November 2015

We’re also running a set of three webcasts together with Oracle on three use-cases for big data in an Oracle context. The sessions will run over three weeks in November 2015 and will look at three ways we’re seeing Rittman Mead big data customers use the platform: extending the storage and capabilities of their data warehouse, creating repositories and analysis sandpits for customer behaviour analysis, and taking data discovery into the Hadoop era using Big Data Discovery.

All events are free to attend, we’re timing them to suit the UK, Europe and the US, and details of each webcast are as follows:

Extending and enhancing your Data Warehouse to address Big Data

Organizations with data warehouses are increasingly looking at big data technologies to extend the capacity of their platform, offload simple ETL and data processing tasks and add new capabilities to store and process unstructured data along with their existing relational datasets. In this presentation we’ll look at what’s involved in adding Hadoop and other big data technologies to your data warehouse platform, see how tools such as Oracle Data Integrator and Oracle Business Intelligence can be used to process and analyze new “big data” data sources, and look at what’s involved in creating a single query and metadata layer over both sources of data.

Audience: DBAs, DW managers, architects 
Tuesday 3rd November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register


What is Big Data Discovery and how does it complement traditional Business Analytics?

Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization. At the same time Oracle Big Data Discovery reduces the dependency on expensive and often difficult to find Data Scientists, opening up many Big Data tasks to “Citizen” Data Scientists.

In this session we’ll look at Oracle Big Data Discovery and how it provides a “visual face” to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.

Audience : Data analysts, market analysts, & Big Data project team members 
Tuesday 10th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

Adding Big Data to your Organization to create true 360-Degree Customer Insight

Organisations are increasingly looking to “big data” to create a true, 360-degree view of their customer and market activity. Big data technologies such as Hadoop, NoSQL databases and predictive modelling make it possible now to bring highly granular data from all customer touch-points into a single repository and use that information to make better offers, create more relevant products and predict customer behaviour more accurately.

In this session we’ll look at what’s involved in creating a customer 360-degree view using big data technologies on the Oracle platform, see how unstructured and social media sources can be added to more traditional transactional and customer attribute data, and how machine learning and predictive modelling techniques can then be used to classify, cluster and predict customer behaviour.

Audience : MI Managers, CX Managers, CIOs, BI / Analytics Managers
Tuesday 24th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register


The post News on Three Big Data Webcasts with Oracle, and a Customer Case-Study at Cloudera Sessions appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Forays into Kafka – Logstash transport / centralisation

Tue, 2015-10-13 12:14

The holy trinity of Elasticsearch, Logstash, and Kibana (ELK) are a powerful trio of tools for data discovery and systems diagnostics. In a nutshell, they enable you to easily search through your log files, slice & dice them visually, drill into problem timeframes, and generally be the boss of knowing where your application’s at.

Getting application logs into ELK in the most basic configuration means doing the processing with Logstash local to the application server, but this has two overheads – the CPU required to do the processing, and (assuming you have more than one application server) the management of multiple configurations and deployments across your servers. A more flexible and maintainable architecture is to ship logs from the application server to a separate ELK server with something like Logstash-forwarder (aka Lumberjack), and do all your heavy ELK-lifting there.

In this article I’m going to demonstrate an alternative way of shipping and centralising your logs for Logstash processing, with Apache Kafka.

Kafka is a “publish-subscribe messaging rethought as a distributed commit log”. What does that mean in plainer English? My over-simplified description would be that it is a tool that:

  1. Enables one or more components, local or across many machines, to send messages (of any format) to …
  2. …a centralised store, which may be holding messages from other applications too…
  3. …from where one or more consumers can independently opt to pull these messages in exact order, either as they arrive, batched, or ‘rewinding’ to a previous point in time on demand.

Kafka has been designed from the outset to be distributed and fault-tolerant, and for very high performance (low latency) too. For a good introduction to Kafka and its concepts, the introduction section of the documentation is a good place to start, as is Gwen Shapira’s Kafka for DBAs presentation.

If you’re interested in reading more about Kafka, the article that really caught my imagination with its possibilities was by Martin Kleppmann in which he likens (broadly) Kafka to the unix Pipe concept, being the joiner between components that never had to be designed to talk to each other specifically.

Kafka gets a lot of press in the context of “Big Data”, Spark, and the like, but it also makes a lot of sense as a “pipe” between slightly more ‘mundane’ systems such as Logstash…


In this article we’re using Kafka at its very simplest – one Producer, one Topic, one Consumer. But hey, if it works and it is a good use of technology, who cares if it’s not a gazillion-message throughput per second to give us bragging rights on Hacker News.

We’re going to run Logstash twice; once on the application server to simply get the logfiles out and in to Kafka, and then again to pull the data from Kafka and process it at our leisure:

Once Logstash has processed the data we’ll load it into Elasticsearch, from where we can do some nice analysis against it in Kibana.


This article was written based on three servers:

  1. Application server (OBIEE)
  2. Kafka server
  3. ELK server

In practice, Kafka could run on the ELK server if you needed it to and throughput was low. If things got busier, splitting them out would make sense as would scaling out Kafka and ELK across multiple nodes each for capacity and resilience. Both Kafka and Elasticsearch are designed to be run distributed and are easy to do so.

The steps below show how to get the required software installed and running.

Networking and Host Names

Make sure that each host has a proper hostname (not ‘demo’) that can be resolved from all the other hosts being used. Liberally hardcoding IP/hostname pairs in /etc/hosts and copying the file to each host is one way around this in a sandbox environment. In the real world, use DNS CNAMEs to resolve the static IP of each host.
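For example, a hosts file for this article’s three machines might look like the following (the IP addresses here are purely illustrative; substitute your own):

```
# /etc/hosts – copied to all three machines in the sandbox
192.168.56.101   sampleappv406    # application server (OBIEE)
192.168.56.102   ubuntu-02        # Kafka host
192.168.56.103   ubuntu-03        # ELK host
```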

Make sure that the hostname is accessible from all other machines in use. That is, if you type hostname on one machine:

rmoff@ubuntu-03:~$ hostname  
ubuntu-03

Make sure that you can ping it from another machine:

rmoff@ubuntu-02:/opt$ ping ubuntu-03  
PING ubuntu-03 ( 56(84) bytes of data.  
64 bytes from ubuntu-03 ( icmp_seq=1 ttl=64 time=0.530 ms  
64 bytes from ubuntu-03 ( icmp_seq=2 ttl=64 time=0.287 ms  

and use netcat to hit a particular port (assuming that something’s listening on that port):

rmoff@ubuntu-02:/opt$ nc -vz ubuntu-03 9200  
Connection to ubuntu-03 9200 port [tcp/*] succeeded!

Application Server – log source (“sampleappv406”)

This is going to be the machine from which we’re collecting logs. In my example it’s OBIEE that’s generating the logs, but it could be any application. All we need to install is Logstash, which is going to ship the logs – unprocessed – over to Kafka. Because we’re working with Kafka, it’s also useful to have the console scripts (that ship with the Kafka distribution) available as well, but strictly speaking, we don’t need to install Kafka on this machine.

  • Download – kafka is optional, but useful to have the console scripts there for testing
  • Install
    unzip logstash*.zip  
    tar -xf kafka*
    sudo mv kafka* /opt  
    sudo mv logstash* /opt
Kafka host (“ubuntu-02”)

This is our kafka server, where Zookeeper and Kafka run. Messages are stored here before being passed to the consumer.

  • Download
  • Install
    tar -xf kafka*  
    sudo mv kafka* /opt
  • Configure – If there’s any funny business with your networking, such as a hostname on your kafka server that won’t resolve externally, make sure you set the advertised.host.name value in /opt/kafka*/config/server.properties to a hostname/IP for the kafka server that can be connected to externally.
  • Run – Use separate sessions, or even better, screen, to run both of these concurrently:
    • Zookeeper
      cd /opt/kafka*  
      bin/zookeeper-server-start.sh config/zookeeper.properties
    • Kafka Server
      cd /opt/kafka*  
      bin/kafka-server-start.sh config/server.properties
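As an illustrative sketch, the externally-resolvable broker address from the Configure step above might be set like this in the broker configuration file (the hostname shown is our example Kafka server; adjust to your own environment):

```
# /opt/kafka*/config/server.properties (extract)
# Hostname the broker advertises to producers and consumers;
# it must resolve from the other machines
advertised.host.name=ubuntu-02
# Port the broker listens on (Kafka's default)
port=9092
```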
ELK host (“ubuntu-03”)

All the logs from the application server (“sampleappv406” in our example) are destined for here. We’ll do post-processing on them in Logstash to extract lots of lovely data fields, store it in Elasticsearch, and produce some funky interactive dashboards with Kibana. If, for some bizarre reason, you didn’t want to use Elasticsearch and Kibana but had some other target for your logs after Logtash had parsed them you could use one of the many other output plugins for Logstash.

  • Download – kafka is optional, but useful to have the console scripts there for testing
  • Install
    tar -xf kibana*  
    unzip elastic*.zip  
    unzip logstash*.zip  
    tar -xf kafka*
    sudo mv kafka* /opt  
    sudo mv kibana* /opt  
    sudo mv elastic* /opt  
    sudo mv logstash* /opt
    # Kopf is an optional, but very useful, Elasticsearch admin web GUI  
    /opt/elastic*/bin/plugin --install lmenezes/elasticsearch-kopf
  • Run – Use separate sessions, or even better, screen, to run both of these concurrently:
    • Elasticsearch
      /opt/elastic*/bin/elasticsearch
    • Kibana
      /opt/kibana*/bin/kibana
Configuring Kafka

Create the topic. This can be run from any machine with the kafka console tools available. The important thing is that you specify the --zookeeper correctly, so that it can find the Zookeeper instance that Kafka registers with.

cd /opt/kafka*  
bin/kafka-topics.sh --create --zookeeper ubuntu-02:2181 --replication-factor 1 --partitions 1 --topic logstash

Smoke test
  1. Having created the topic, check that the other nodes can connect to zookeeper and see it. The point is less about viewing the topic as checking that the connectivity between the machines is working.
    $ cd /opt/kafka*  
    $ ./bin/kafka-topics.sh --list --zookeeper ubuntu-02:2181  

    If you get an error then check that the host resolves and the port is accessible:
    $ nc -vz ubuntu-02 2181
    found 0 associations  
    found 1 connections:  
          1: flags=82<CONNECTED,PREFERRED>  
         outif vboxnet0  
         src port 63919  
         dst port 2181  
         rank info not available  
         TCP aux info available
    Connection to ubuntu-02 port 2181 [tcp/eforward] succeeded!
  2. Set up a simple producer / consumer test
    1. On the application server node, run a script that will be the producer, sending anything you type to the kafka server:
      cd /opt/kafka*  
      bin/kafka-console-producer.sh \  
      --broker-list ubuntu-02:9092 \  
      --topic logstash

      (I always get the warning WARN Property topic is not valid (kafka.utils.VerifiableProperties); it seems to be harmless so ignore it…)

      This will sit waiting for input; you won’t get the command prompt back.

    2. On the ELK node, run a script that will be the consumer:

      cd /opt/kafka*  
      bin/kafka-console-consumer.sh \  
      --zookeeper ubuntu-02:2181 \  
      --topic logstash

    3. Now go back to the application server node and enter some text and press enter. You should see the same appear shortly afterwards on the ELK node. This is demonstrating Producer -> Kafka -> Consumer
    4. Optionally, run kafka-console-consumer.sh on a second machine (either the kafka host itself, or on a Mac where you’ve run brew install kafka). Now when you enter something on the Producer, you see both Consumers receive it.

If the two above tests work, then you’re good to go. If not, then you’ve got to sort this out first because the later stuff sure isn’t going to.

Configuring Logstash on the Application Server (Kafka Producer)

Logstash has a very simple role on the application server – to track the log files that we want to collect, and pass new content in the log file straight across to Kafka. We’re not doing any fancy parsing of the files this side – we want to be as light-touch as possible. This means that our Logstash configuration is dead simple:

input {  
    file {  
        path =>  ["/app/oracle/biee/instances/instance1/diagnostics/logs/*/*/*.log"]  
    }  
}

output {  
    kafka {  
        broker_list => 'ubuntu-02:9092'  
        topic_id => 'logstash'  
    }  
}

Notice the wildcards in the path variable – in this example we’re going to pick up everything related to the OBIEE system components here, so in practice you may want to restrict it down a bit at least during development. You can specify multiple path patterns by comma-separating them within the square brackets, and you can use the exclude parameter to (…drum roll…) exclude specific paths from the wildcard match.
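As a sketch of that, a more selective input stanza might look like the following (the narrower path patterns and the exclude pattern here are hypothetical; substitute your own):

```
input {  
    file {  
        # Multiple path patterns, comma-separated within the square brackets  
        path => ["/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/*/*.log",  
                 "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/*/*.log"]  
        # Exclude specific filename patterns from the wildcard match  
        exclude => "*.trc"  
    }  
}
```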

If you now run Logstash with the above configuration (assuming it’s saved as logstash-obi-kafka-producer.conf)

/opt/logstash*/bin/logstash -f logstash-obi-kafka-producer.conf

Logstash will now sit and monitor the file paths that you’ve given it. If they don’t exist, it will keep checking. If they did exist, and got deleted and recreated, or truncated – it’ll still pick up the differences. It’s a whole bunch more smart than your average bear^H^H^H^H tail -f.

If you happen to have left your Kafka console consumer running you might be in for a bit of a shock, depending on how much activity there is on your application server:

Talk about opening the floodgates!

Configuring Logstash on the ELK server (Kafka Consumer)

Let’s give all these lovely log messages somewhere to head. We’re going to use Logstash again, but on the ELK server this time, and with the Kafka input plugin:

input {  
    kafka {  
            zk_connect => 'ubuntu-02:2181'  
            topic_id => 'logstash'  
    }  
}

output {  
    stdout { codec => rubydebug }  
}

Save and run it:

/opt/logstash*/bin/logstash -f logstash-obi-kafka-consumer.conf

and assuming the application server is still writing new log content we’ll get it written out here:

So far we’re doing nothing fancy at all – simply dumping to the console whatever messages we receive from kafka. In effect, it’s the same as the script that we ran as part of the smoke test earlier. But now we’ve got the messages coming in to Logstash we can do some serious processing on them with grok and the like (something I discuss and demonstrate in an earlier article) to pull out meaningful data fields from each log message. The console is not the best place to write all this to – Elasticsearch is! So we specify that as the output plugin instead. An extract of our configuration looks something like this now:

input {  
    kafka {  
            zk_connect => 'ubuntu-02:2181'  
            topic_id => 'logstash'  
    }  
}

filter {  
    grok {  
        match => ["file", "%{WLSSERVER}"]  
    }  

    [...]

    geoip { source => "saw_http_RemoteIP"}  
}

output {  
    elasticsearch {  
        host => "ubuntu-03"  
        protocol => "http"  
    }  
}
Note the [...] bit in the filter section – this is all the really cool stuff where we wring every last bit of value from the log data and split it into lots of useful data fields…which is why you should get in touch with us so we can help YOU with your OBIEE and ODI monitoring and diagnostics solution!

Advert break over, back to the blog. We’ve set up the new hyper-cool config file, we’ve primed the blasters, we’ve set the “lasers” to 11 … we hit run … and …

…nothing happens. “Logstash startup completed” is the last sign of visible life we see from this console. Checking our Kafka console consumer, we can still see the messages are flowing through:

But Logstash remains silent? Well, no – it’s doing exactly what we told it to, which is to send all output to Elasticsearch (and nowhere else), which is exactly what it’s doing. Don’t believe me? Add back in to the output stanza of the configuration file the output to stdout (console in this case):

output {  
    elasticsearch {  
        host => "ubuntu-03"  
        protocol => "http"  
    }  
    stdout { codec => rubydebug }  
}
(Did I mention Logstash is mega-powerful yet? You can combine, split, and filter data streams however you want from and to multiple sources. Here we’re sending it to both elasticsearch and stdout, but it could easily be sending it to elasticsearch and then conditionally to email, or pagerduty, or enriched data back to Kafka, or … you get the idea)
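A sketch of that kind of conditional fan-out – the field name, condition, and email address here are entirely hypothetical, just to show the shape of it:

```
output {  
    elasticsearch {  
        host => "ubuntu-03"  
        protocol => "http"  
    }  
    stdout { codec => rubydebug }  
    # Hypothetical: messages parsed as errors also go to email  
    if [loglevel] == "ERROR" {  
        email {  
            to      => "oncall@example.com"  
            subject => "OBIEE error on %{host}"  
        }  
    }  
}
```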

Re-run Logstash with the updated configuration and sure enough, it’s mute no longer:

(this snippet gives you an idea of the kind of data fields that can be extracted from a log file, and this is one of the less interesting ones, difficult to imagine, I know).

Analysing OBIEE Log Data in Elasticsearch with Kibana

The kopf plugin provides a nice web frontend to some of the administrative functions of Elasticsearch, including a quick overview of the state of a cluster and number of documents. Using it we can confirm we’ve got some data that’s been loaded from our Logstash -> Kafka -> Logstash pipeline:

and now in Kibana:

You can read a lot more about Kibana, including the (minimal) setup required to get it to show data from Elasticsearch, in other articles that I’ve written here, here, and here.

Using Kibana we can get a very powerful but simple view over the data we extracted from the log files, showing things like response times, errors, hosts, data models used, and so on:

MOAR Application Servers

Let’s scale this thing out a bit, and add a second application server into the mix. All we need to do is replicate the Logstash install and configuration on the second application server – everything else remains the same. Doing this we start to see the benefit of centralising the log processing, and decoupling it from the application server.

Set the Logstash ‘producer’ running on the second application server, and the data starts passing through, straight into Elasticsearch and Kibana at the other end, no changes needed.

Reprocessing data

One of the appealing features of Kafka is that it stores data for a period of time. This means that consumers can stream or batch as they desire, and that they can also reprocess data. By acting as a durable ‘buffer’ for the data it means that recovering from a client crash, such as a Logstash failure like this:

Error: Your application used more memory than the safety cap of 500M.  
Specify -J-Xmx####m to increase it (#### = cap size in MB).  
Specify -w for full OutOfMemoryError stack trace

is really simple – you just restart Logstash and it picks up processing from where it left off. Because Kafka tracks the last message that a consumer (Logstash in this case) read, it can scroll back through its log to pass to the consumer just messages that have accumulated since that point.

Another benefit of the data being available in Kafka is the ability to reprocess data because the processing itself has changed. A pertinent example of this is with Logstash. The processing that Logstash can do on logs is incredibly powerful, but it may be that a bug is there in the processing, or maybe an additional enrichment (such as geoip) has been added. Instead of having to go back and bother the application server for all its logs (which may have since been housekept away) we can just rerun our Logstash processing as the Kafka consumer and re-pull the data from Kafka. All that needs doing is telling the Logstash consumer to reset its position in the Kafka log from which it reads:

input {  
    kafka {  
            zk_connect => 'ubuntu-02:2181'  
            topic_id => 'logstash'  
            # Use the following two if you want to reset processing  
            reset_beginning => 'true'  
            auto_offset_reset => 'smallest'  
    }  
}

Kafka will keep data for a length of time, or up to a volume of data, as defined in the log.retention.minutes and log.retention.bytes configuration settings respectively. This is set globally by default to 7 days (and no size limit), and can be changed globally or per topic.


Logstash with Kafka is a powerful and easy way to stream your application log files off the application server with minimal overhead and then process them on a dedicated host. Elasticsearch and Kibana are a great way to visualise, analyse, and diagnose issues within your application’s log files.

Kafka enables you to loosely couple your application server to your monitoring and diagnostics with minimal overhead, whilst adding the benefit of log replay if you want to reprocess them.

The post Forays into Kafka – Logstash transport / centralisation appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Once A Week Is Never Enough

Thu, 2015-10-01 04:48

Over the last few months we have been looking at User Engagement, and recently asked people to take part in a global usage survey so we could create a series of benchmark metrics detailing how organisations use OBIEE.

We had a great response and have started to process the data. One of the first things that struck us was that the average user logs into their OBIEE system about once per week.



We derive the once per week figure from the DAU/MAU metric we have talked about here and here. DAU/MAU shows us the % of your user community repeatedly using the tool – a kind of stickiness metric.

DAU and MAU stand for Daily Active Users and Monthly Active Users: the unique number of users who log in and use the system over the respective period. The definition of DAU/MAU we are using is (30 day moving average of DAU)/(MAU) – using a moving average levels out any usage spikes on particular days of the week.
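As an illustration of that definition, here is a sketch in Python with made-up login events – a single user logging in roughly weekly gives a DAU/MAU of about 13%, which is the "once per week" figure above:

```python
from collections import defaultdict
from datetime import date, timedelta

def dau_mau(logins, as_of, window=30):
    """(30-day moving average of DAU) / MAU, per the definition above.
    `logins` is an iterable of (user_id, login_date) events."""
    daily_users = defaultdict(set)
    for user, day in logins:
        daily_users[day].add(user)
    # Average the distinct-user counts over the trailing window
    window_days = [as_of - timedelta(days=i) for i in range(window)]
    avg_dau = sum(len(daily_users[d]) for d in window_days) / window
    # MAU: distinct users seen anywhere in the same window
    mau = len({u for u, day in logins
               if as_of - timedelta(days=window - 1) <= day <= as_of})
    return avg_dau / mau if mau else 0.0

# One user logging in weekly over the last month
weekly = [("alice", date(2015, 11, 16) - timedelta(days=7 * i)) for i in range(4)]
print(round(dau_mau(weekly, date(2015, 11, 16)), 3))   # ~0.133, i.e. about once a week
```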

The reason this measure is significant is that it tells us how integrated the system is into people’s (working) lives. If you believe that the information in your OBIEE system is invaluable to your employees carrying out their jobs, then you would expect your DAU/MAU to tend to 100%.

Note: DAU/MAU won’t identify users who don’t log into your system at all; it assumes they will log in at least once over the period of a month. If this is not happening then you have an adoption problem, not an engagement problem.

Is More Engagement Better?

My view is that the higher the engagement in the system, the more valuable the system is. So the questions become: is one visit per week enough; what does good look like; and what did you expect when you designed the system?

If you follow this reasoning, then we could assume that if we could get people using their OBIEE systems more, then their organisations would be better off.

Raising Engagement

We have previously discussed 5 levers that we think drive user engagement:

  • User interface and user experience
  • Quality, relevance, and confidence in data
  • Performance
  • Ability to use the system
  • Accessibility – is the system available in the right way, at the right time?

I want to talk about a couple of them here and how they can affect engagement.

User Interface And User Experience

We have been looking at report and dashboard design for a while now and believe that every dashboard should tell a story. When a user goes to a screen it must support the job they are doing. This means that dashboards have to be designed first and foremost around users’ business processes. We see too many dashboards that contain loosely coupled sets of data that developers thought should be in the same place.

If you need to understand what data a user or analyst needs to do their job, then observe them: look at which reports and spreadsheets they use, and follow their eyes to see which cells on the various tables they monitor. Are they looking at trends, comparisons or exceptions?

We cover this in depth through the Visual Redesign phase of our User Engagement Service.

The following quote is from one of our clients who has used the service:

The new designs provide a simple and at the same time powerful view of data.

Quality, relevance, and confidence in data

However good your dashboard and report design is, if the underlying data is incorrect or inconsistent – or if the user thinks it is – then the system will not be valuable. Users will go to other sources for the data, and they will stop using the untrusted source. Too often trust is purely subjective. This is one of the reasons why Excel starts to take root in organisations as a reporting tool. It is somewhere users can self-source data, transform and aggregate it as they please and, maybe misguidedly, trust the results.

This starts to open up a much wider conversation about Data Governance that I will address in future posts. We have spoken to a lot of organisations recently where the issue is not that their users don’t have access to data; it’s the fact that there are multiple systems they could get the answer from, and multiple routes around their information architecture along which the data flows. You need a clear business intelligence and analytics architecture and roadmap.

Business intelligence and analytics systems are typically judged by return on investment (ROI), however I wonder if organisations with mature information architectures supported by complex data flows could be better off by looking at the cost and complexity of their data architecture. By looking at this they could address both data governance and the total cost of ownership (TCO).

The first step should be to analyse the analytics driven data flows, find out where they apply business logic, where there is duplication, and where there are silos. If you can simplify this process then you start to build the foundation for building a better ROI business case.

So addressing quality, relevance, and confidence in data is a big undertaking. There are some simple things you can do, such as maintain a taxonomy of reporting attributes across the organisation, however in most organisations with mature information architectures the answer is likely to be more deep rooted.


Our goal at Rittman Mead is to help companies harness data in and around their organisations through a range of analytics strategies, processes and tools. We see User Engagement as a key part of this. We believe in our User Engagement Service we have the right tools to help you raise your DAU/MAU.

We have a fundamental belief that the more people embed business intelligence and analytics into their work lives, the better job they will do for their company.

The great thing is that user engagement in OBIEE is measurable. Usage Tracking provides unequivocal details of how your system is performing, who is using it and how often, both now and after any changes.

The post Once A Week Is Never Enough appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Taking a Look at Oracle Big Data Preparation Cloud Service – Spark-Based Data Transformation in the Cloud

Sun, 2015-09-27 04:54

One of the sessions I’m delivering at the upcoming Oracle Openworld 2015 in San Francisco is entitled “Oracle Business Intelligence Cloud Service—Moving Your Complete BI Platform to the Cloud [UGF4906]”, and looks at how you can now migrate your entire OBIEE11g platform into Oracle Public Cloud including data warehouse and data integration routines. Using Oracle BI Cloud Services’ new on-premise RPD upload feature you can upload an existing RPD into BICS and run it from there, with the accompanying data warehouse database moving into Oracle’s Database Cloud Service (and with the restriction that you can’t then edit the repository within BICS – you need to do that on-premise and upload again). For ETL and data integration you can carry on using ODI12c, which now has the ability to load data into Oracle Storage Cloud (for file sources) and BICS (via a REST API) as well as the full Oracle DBaaS, but another Oracle option for doing purely cloud-based data processing and enrichment has recently become available – Oracle Big Data Preparation Cloud Service. So what is it, how does it work, and how is it different to ODI12c?

Oracle Big Data Preparation Cloud Service (“BDP”) is a thin-client application within Oracle Cloud for ingesting, preparing and enriching datasets that don’t have a predefined schema and may well need certain fields obfuscated or cleansed. Being integrated with Oracle Storage Cloud and other infrastructure and platform services within Oracle cloud it’s obviously aimed mainly at data transformation tasks within the Oracle Cloud environment, but you can upload and download datasets from your browser for use with on-premise applications. Unlike the more general-purpose Oracle Data Integrator it’s aimed instead at a particular use-case – non-technical information analysts who need to get data transformed, wrangled and enriched before they can make use of it in an environment like Hadoop. In fact the product name is a bit misleading – it runs on a big data platform within Oracle Cloud and like Oracle Big Data Discovery uses Apache Spark for its data processing – but it could potentially be useful for a business analyst to prepare data for loading into Oracle BI Cloud Service, and I’ll cover this angle when I talk about data loading options in my Oracle Openworld session.

Within a logical architecture for a typical big data DW and BI system, BDP sits alongside ODI within the Data Factory and provides self-service, agile transformation capabilities to more business-orientated users. 


Oracle Big Data Preparation Cloud Service shares quite a bit of functionality and underlying technology with Oracle Big Data Discovery – both run on Hadoop, both use Apache Spark for data processing and transformation, and both offer data transformation and “wrangling” features aimed at non-technical users. Oracle are positioning Big Data Preparation Service as something you’d use in the execution layer of the Oracle Information Management Reference Architecture, whereas Big Data Discovery is associated more with the discovery layer – I’d mostly agree, but I can see a role for BDD even within the execution layer, as a front-end to the data reservoir that typically now runs alongside relationally-stored data warehouses.


Looking back at the slides from one of the recent Strata conferences, for example, we see Oracle positioning BDP as the “operational data preparation” tool for structured and unstructured data – with no defined schema – coming into your information platform, with the enriched output then being used by BI, enterprise reporting and data discovery tools.



Apart from the scalability benefits of running BDP on Apache Spark, the other interesting feature in BDP is how it uses Spark’s machine learning capabilities to try to automate as much of the data preparation process as possible, for example detecting credit card numbers in data fields and recommending you obfuscate that column. Similar to BICS and how Oracle have tried to simplify the process of creating reports and dashboards for a small team, BDP runs in the cloud and tries to automate and simplify as much of the data preparation and enrichment process as possible, with ODI12c still available for ETL developers to develop more complex transformations.

The development lifecycle for BDP (from the Oracle Big Data Preparation Cloud Service e-book on Oracle’s website) uses a cycle of ingesting, cleaning, enriching and then publishing data using scripts authored using the tool and run on the Apache Spark platform. The diagram below shows the BDP development lifecycle from Oracle’s Big Data Preparation Cloud Service Handbook, and shows how ingestion, enrichment, publishing and governance go in a cycle with the common foundation of the transformation scripts that you build using BDP’s web interface.


So let’s walk through an example data preparation exercise using a file of data stored initially in Oracle Storage Cloud Service. After logging into BDP via Oracle Cloud you’re first presented with the Catalog view, listing out all your previous transformations and showing you when they were last used to process some data.


To create a transformation you first give it a name, then select the data source and then the file you’re interested in. In my environment I’ve got Oracle Storage Cloud and HDFS available as my main data sources, or I could upload a file from my desktop and start from there.


BDP then ingests the file and uses its machine learning features to process and classify the data in each column, recommending column names such as “gender”, “city” and “cc_number” based (presumably) on some sort of classification model. In the screenshot below you can see a set of these recommendations on the left-hand side of the screen, with the columns themselves listed centre and a summary of the file profiling on the right.


Taking a closer look at the profile results panel you can see two of the columns have alerts raised, in red. Clicking on the alert shows that the two columns have credit card data stored in clear text, with the recommendation being to obfuscate or otherwise secure these fields. Clicking on a field then shows the various transformation options, with the obvious choice here being to automatically obfuscate the data in those fields.
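BDP’s classification models are Oracle’s own, but the idea can be illustrated with a toy sketch: flag a column value as clear-text card data when it matches a card-number pattern and passes the standard Luhn checksum, then obfuscate all but the last four digits. (The function names and thresholds here are mine, purely for illustration.)

```python
import re

def luhn_ok(number):
    """Luhn checksum -- the standard validity test for card numbers."""
    digits = [int(d) for d in number][::-1]
    total = sum(digits[0::2])
    total += sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def looks_like_card(value):
    # Strip separators, then require 13-19 digits that pass the Luhn check
    v = re.sub(r"[ -]", "", value)
    return bool(re.fullmatch(r"\d{13,19}", v)) and luhn_ok(v)

def obfuscate(value):
    # Mask everything except the last four digits
    v = re.sub(r"[ -]", "", value)
    return "*" * (len(v) - 4) + v[-4:]

sample = "4111-1111-1111-1111"          # well-known test card number
if looks_like_card(sample):
    print(obfuscate(sample))            # ************1111
```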


Once you’ve worked through all the recommendations and added any transformations you choose to add yourself, the final step is to publish your transformation to one of the available targets. In the example below we’ve got Oracle Storage Cloud and HDFS again as potential targets; I’d imagine Oracle will add a connector to BICS soon, for example, so that you can use BDP as a data prep tool for file data that will then be added to your dataset in BICS.


So … it’ll be interesting to see where this one goes. It’s notable that Oracle have split data preparation and data discovery into two tools whilst others are saying theirs can do both, and you’ll still need ODI for the more complex integration jobs. But I like the innovative use of machine learning to do away with much of the manual work required for classification of incoming data fields, and running the whole thing on Spark certainly gives it the potential to scale. A couple of years ago I was worried Oracle didn’t really have a strategy for data integration and ETL in the cloud, but we’re starting to see something happen now.

There’s a big push from the Oracle field at the moment to move customers into the cloud, and I can see BDP getting bundled in with Big Data Cloud Service and BICS as the accompanying cloud data preparation tool. The danger then of course is that Big Data Discovery starts to look less useful, especially with Visual Analyzer already available within BICS and coming soon on-premise with OBIEE12c. My guess is that what we’re seeing now with these initial releases of BDP and BDD is just the start, with BDP adding more automatic enrichment “smarts” and starting to cover integration use-cases too, whilst BDD will put more focus on data visualization and analytics on the data reservoir.

Categories: BI & Warehousing

Using the BI Server Metadata Web Service for Automated RPD Modifications

Fri, 2015-09-25 06:33

A little-known new feature of OBIEE 11g is a web service interface to the BI Server. Called the “BI Server Metadata Web Service” it gives a route into making calls into the BI Server using SOAP-based web services calls. Why is this useful? Because it means you can make any call to the BI Server (such as SAPurgeAllCache) from any machine without needing to install any OBIEE-related artefacts such as nqcmd, JDBC drivers, etc. The IT world has been evolving over the past decade or more towards a more service-based architecture (remember the fuss about SOA?) where we can piece together the functionality we need rather than having one monolithic black box trying to do everything. Being able to make use of this approach in our BI deployments is a really good thing. We can do simple things like BI Server cache management using a web service call, but we can also do more funky things, such as actually updating repository variable values in real time – and we can do it from within our ETL jobs, as part of an automated deployment script, and so on.

Calling the BI Server Metadata Web Service

First off, let’s get the web service configured and working. The documentation for the BI Server Metadata Web Service can be found here, and it’s important to read it if you’re planning on using this. What I describe here is the basic way to get it up and running.

Configuring Security

We need to configure security on the web service to define what kind of authentication it requires. If you don’t do this, you won’t be able to make calls to it. Setting up the security is a case of attaching a security policy in WebLogic Server to the web service (called AdminService) itself. I’ve used oracle/wss_http_token_service_policy, which means that credentials can be passed using standard HTTP Basic authentication.

You can do this through Enterprise Manager:

Or you can do it through WLST using the attachWebServicePolicy call.

You also need to configure WSM, adding some entries to the credential store as detailed here.

Testing the Web Service

The great thing about web services is that they can be used from pretty much anywhere. Many software languages will have libraries for making SOAP calls, and if they don’t, it’s just HTTP under the covers so you can brew your own. For my testing I’m using the free version of SoapUI. In anger, I’d use something like curl for a standalone script or hook it up to ODI for integration in the batch.

Let’s fire up SoapUI and create a new project. Web services provide a Web Service Definition Language (WSDL) that describes what and how they work, which is pretty handy. We can pass this WSDL to SoapUI for it to automatically build some sample requests for us. The WSDL for this web service is
http://biserver:port/AdminService/AdminService?wsdl


Where biserver is your biserver (duh) and port is the managed server port (usually 9704, or 7780).

We’ve now got a short but sweet list of the methods we can invoke:

Expand out callProcedureWithResults and double click on Request 1. This is a template SOAP message that SoapUI has created for you, based on the WSDL.

Edit the XML to remove the parameters section, and just to test things out in procedureName put GetOBISVersion(). Your XML request should look like this:

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  
   <soapenv:Header/>  
   <soapenv:Body>  
      <ws:callProcedureWithResults>  
         <procedureName>GetOBISVersion()</procedureName>  
      </ws:callProcedureWithResults>  
   </soapenv:Body>  
</soapenv:Envelope>

If you try and run this now (green arrow, or Cmd-Enter on the Mac) you’ll see the exact same XML appear in the right pane, which is strange… but click on Raw and you’ll see what the problem is:

Remember the security stuff we set up on the server previously? Well as the client we now need to keep our side of the bargain, and pass across our authentication with the SOAP call. Under the covers this is a case of sending HTTP Basic auth (since we’re using oracle/wss_http_token_service_policy), and how you do this depends on how you’re making the call to the web service. In SoapUI you click on the Auth button at the bottom of the screen, Add New Authentication, Type: Basic, and then put your OBIEE server username/password in.

Now you can click submit on the request again, and you should see in the response pane (on the right) in the XML view details of your BI Server version (you may have to scroll to the right to see it)

This simple test is just about validating the end-to-end calling of a BI Server procedure from a web service. Now we can get funky with it….
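Outside SoapUI the same call is easy to script. Here’s a minimal Python sketch using only the standard library – the host, port, endpoint path and credentials are placeholders to substitute with your own, and the envelope is simplified (in practice, copy the namespaces from the request SoapUI generates from the WSDL). We build the request without sending it; `urllib.request.urlopen(req)` would fire it off.

```python
import base64
import urllib.request

def build_admin_request(host, port, procedure, username, password):
    """Build a SOAP call to the AdminService callProcedureWithResults method,
    passing credentials via HTTP Basic auth as required by
    oracle/wss_http_token_service_policy. Host/port/path are placeholders."""
    envelope = (
        '<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">'
        "<soapenv:Body><callProcedureWithResults>"
        "<procedureName>{0}</procedureName>"
        "</callProcedureWithResults></soapenv:Body></soapenv:Envelope>"
    ).format(procedure)
    # HTTP Basic auth header: base64 of "user:password"
    token = base64.b64encode("{0}:{1}".format(username, password).encode()).decode()
    return urllib.request.Request(
        "http://{0}:{1}/AdminService/AdminService".format(host, port),
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml;charset=UTF-8",
                 "Authorization": "Basic " + token})

req = build_admin_request("biserver", 9704, "GetOBISVersion()", "weblogic", "Admin123")
# urllib.request.urlopen(req) would send it and return the SOAP response
```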

Updating Repository Variables Programmatically

Repository variables in OBIEE are “global”, in that every user session sees the same value. They’re often used for holding things like “what was the last close of business date”, or “when was the data last updated in the data warehouse”. Traditionally these have been defined as dynamic repository variables with an initialisation block that polled a database query on a predefined schedule to get the current value. This meant that to have a variable showing when your data was last loaded you’d need to (a) get your ETL to update a row in a database table with a timestamp and then (b) write an init block to poll that table to get the value. That polling of the table would have to be as frequent as you needed in order to show the correct value, so maybe every minute. It’s kinda messy, but it’s all we had. Here I’d like to show an alternative approach.

Let’s say we have a repository variable, LAST_DW_REFRESH. It’s a timestamp that we use in report predicates and titles so that when they are run we can show the correct data based on when the data was last loaded into the warehouse. Here’s a rather over-simplified example:

The Title view uses the code:

Data correct as of: @{biServer.variables['LAST_DW_REFRESH']}

Note that we’re referencing the variable here. We could also put it in the Filter clause of the analysis. Somewhat tenuously, let’s imagine we have a near-realtime system that we’re federating a DW across direct from OLTP, and we just want to show data from the last load into the DW:

For the purpose of this example, the report is less important than the diagnostics it gives us when run, in nqquery.log. First we see the inbound logical request:

-------------------- SQL Request, logical request hash:  
SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T05 Per Name Year" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  
WHERE  
("Time"."T00 Calendar Date" <  VALUEOF("LAST_DW_REFRESH"))  

Note the VALUEOF clause. When this is parsed out we get to see the actual value of the repository variable that OBIEE is going to execute the query with:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-24 23:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

We can see the value through the Administration Tool Manage Sessions page too, but it’s less convenient for tracking in testing:

If we update the RPD online with the Administration Tool (nothing fancy at this stage) to change the value of this static repository variable :


And then rerun the report, we can see the value has changed:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-25 00:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

Now let’s do this programmatically. First off, the easy stuff, that’s been written about plenty before. Using biserverxmlgen we can create a XUDML version of the repository. Searching through this we can pull out the variable definition:

<Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
<Expr><![CDATA[TIMESTAMP '2015-09-25 00:30:00']]></Expr>  
</Variable>

and then wrap it in the correct XML structure and update the timestamp we want to use:

<?xml version="1.0" encoding="UTF-8" ?>  
<Repository xmlns:xsi="">  
    <DECLARE>  
        <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
        <Expr><![CDATA[TIMESTAMP '2015-09-26 00:30:00']]></Expr>  
        </Variable>  
    </DECLARE>  
</Repository>

(you can also get this automagically by modifying the variable in the RPD, saving the RPD file, and then comparing it to the previous copy to generate a patch file with the Administration Tool or comparerpd)

Save this XML snippet, with the updated timestamp value, as last_dw_refresh.xml. Now to update the value on the BI Server in-flight, first using the OBIEE tool biserverxmlcli. This is available on all OBIEE servers and client installations – we’ll get to web services for making this update call remotely in a moment.

biserverxmlcli -D AnalyticsWeb -R Admin123 -U weblogic -P Admin123 -I last_dw_refresh.xml

Here -D is the BI Server DSN, -U / -P are the username/password credentials for the server, and the -R is the RPD password.

Running the analysis again shows that it is now working with the new value of the variable:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
    D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-26 00:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

Getting Funky – Updating RPD from Web Service

Let’s now bring these two things together – RPD updates (in this case to update a variable value, but could be anything), and BI Server calls.

In the above web service example I called the very simple GetOBISVersion. Now we’re going to use the slightly more complex NQSModifyMetadata. This is actually documented, and what we’re going to do is pass across the same XUDML that we sent to biserverxmlcli above, but through the web service. As a side note, you could also do this over JDBC if you wanted (under the covers, AdminService is just a web app with a JDBC connector to the BI Server).

I’m going to do this in SoapUI here for clarity but I actually used SUDS to prototype it and figure out the exact usage.

So as a quick recap, this is what we’re going to do:

  1. Update the value of a repository variable, using XUDML. We generated this XUDML from biserverxmlgen and wrapped it in the correct XML structure.
    We could also have got it through comparerpd or the Administration Tool ‘create patch’ function
  2. Use the BI Server’s NQSModifyMetadata call to push the XUDML to the BI Server online.
    We saw that biserverxmlcli can also be used as an alternative to this for making online updates through XUDML.
  3. Use the BI Server Metadata Web Service (AdminService) to invoke the NQSModifyMetadata call on the BI Server, using the callProcedureWithResults method

In SoapUI create a new request:

Set up the authentication:

Edit the XML SOAP message to remove parameters and specify the basic BI Server call:

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  
   <soapenv:Header/>  
   <soapenv:Body>  
      <ws:callProcedureWithResults>  
         <procedureName>NQSModifyMetadata()</procedureName>  
      </ws:callProcedureWithResults>  
   </soapenv:Body>  
</soapenv:Envelope>

Now the tricky bit – we need to cram the XUDML into our XML SOAP message. But, XUDML is its own kind of XML, and all sorts of grim things happen here if we’re not careful (because the XUDML gets swallowed up into the SOAP message, it all being XML). The solution I came up with (which may not be optimal…) is to encode all of the HTML entities, which a tool like this does (and if you’re using a client library like SUDS will happen automagically). So our XUDML, with another new timestamp for testing:

<?xml version="1.0" encoding="UTF-8" ?>  
<Repository xmlns:xsi="">  
    <DECLARE>  
        <Variable name="LAST_DW_REFRESH" id="3031:286125" uid="00000000-1604-1581-9cdc-7f0000010000">  
        <Expr><![CDATA[TIMESTAMP '2015-09-21 00:30:00']]></Expr>  
        </Variable>  
    </DECLARE>  
</Repository>
Entity-encoded, this becomes:


&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP '2015-09-21 00:30:00']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;

We’re not quite finished yet. Because this is actually a call (NQSModifyMetadata) nested in a call (callProcedureWithResults) we need to make sure NQSModifyMetadata gets the arguments (the XUDML chunk) through intact, so we wrap it in single quotes – which also need encoding (&apos;):

&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP '2015-09-21 23:30:00']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;

and then for final good measure, the single quotes around the timestamp need double-single quoting:

&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP ''2015-09-21 23:30:00'']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;
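Those steps – double up the embedded single quotes, entity-encode the XML, and wrap the whole argument in encoded single quotes – are fiddly to do by hand, so they’re worth scripting. A sketch in Python (the function name is mine; the stdlib escape handles &, < and >, and we add the quote entities ourselves):

```python
from xml.sax.saxutils import escape

def encode_for_nqsmodifymetadata(xudml):
    """Prepare a XUDML fragment for nesting inside the SOAP procedureName."""
    doubled = xudml.replace("'", "''")          # double-single-quote the literals
    encoded = escape(doubled, {'"': '&quot;'})  # entity-encode &, <, > and "
    return "&apos;" + encoded + "&apos;"        # wrap the argument in encoded quotes

fragment = "<Expr><![CDATA[TIMESTAMP '2015-09-21 23:30:00']]></Expr>"
print(encode_for_nqsmodifymetadata(fragment))
```

The output can then be pasted straight into the procedureName element of the SOAP message.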

Nice, huh? The WSDL suggests that parameters for these calls should be able to be placed within the XML message as additional entities, which I wondered might allow for proper encoding, but I couldn’t get it to work (I kept getting java.sql.SQLException: Parameter 1 is not bound).

So, stick this mess of encoding plus twiddles into your SOAP message and it should look like this: (watch out for line breaks; these can break things)

<soapenv:Envelope xmlns:soapenv="" xmlns:ws="">  
   <soapenv:Header/>  
   <soapenv:Body>  
      <ws:callProcedureWithResults>  
         <procedureName>NQSModifyMetadata(&apos;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; ?&gt; &lt;Repository xmlns:xsi=&quot;;&gt; &lt;DECLARE&gt; &lt;Variable name=&quot;LAST_DW_REFRESH&quot; id=&quot;3031:286125&quot; uid=&quot;00000000-1604-1581-9cdc-7f0000010000&quot;&gt; &lt;Expr&gt;&lt;![CDATA[TIMESTAMP ''2015-09-21 23:30:00'']]&gt;&lt;/Expr&gt; &lt;/Variable&gt; &lt;/DECLARE&gt; &lt;/Repository&gt;&apos;)</procedureName>  
      </ws:callProcedureWithResults>  
   </soapenv:Body>  
</soapenv:Envelope>

Hit run, and with a bit of luck you’ll get a “nothing to report” response:

If you look in nqquery.log you’ll see:

[...] NQSModifyMetadata started.  
[...] NQSModifyMetadata finished successfully.

and all-importantly when you run your report, the updated variable will be used:

-------------------- Logical Request (before navigation): [[
    0 as c1 GB,  
    D0 Time.T05 Per Name Year as c2 GB,  
    1- Revenue:[DAggr(F0 Sales Base Measures.1- Revenue by [ D0 Time.T05 Per Name Year] )] as c3 GB  
D0 Time.T00 Calendar Date < TIMESTAMP '2015-09-21 23:30:00.000'  
OrderBy: c1 asc, c2 asc NULLS LAST

If this doesn’t work … well, best of luck. Use nqquery.log, bi_server1.log (where the AdminService writes some diagnostics) to try and trace the issue. Also test calling NQSModifyMetadata from JDBC (or ODBC) directly, and then add in the additional layer of the web service call.

So, in the words of Mr Rittman, there you have it. Programmatically updating the value of repository variables, or anything else, done online, and for bonus points done through a web service call making it possible to use without any local OBIEE client/server tools.

Categories: BI & Warehousing

Managing the OBIEE BI Server Cache from ODI 12c

Thu, 2015-09-24 13:47

I wrote recently about the OBIEE BI Server Cache and how useful it can be, but how important it is to manage it properly, both in the purging of stale data and seeding of new. In this article I want to show how to walk-the-walk and not just talk-the-talk (WAT? But you’re a consultant?!). ODI is the premier data integration tool on the market and one that we are great fans of here at Rittman Mead. We see a great many analytics implementations built with ODI for the data load (ELT, strictly speaking, rather than ETL) and then OBIEE for the analytics on top. Managing the BI Server cache from within your ODI batch makes a huge amount of sense. By purging and reseeding the cache directly after the data has been loaded into the database we can achieve optimal cache usage with no risk of stale data.

There are two options for cleanly hooking into OBIEE from ODI 12c with minimal fuss: JDBC, and Web Services. JDBC requires the OBIEE JDBC driver to be present on the ODI Agent machine, whilst Web Services have zero requirement on the ODI side, but a bit of config on the OBIEE side.

Setting up the BI Server JDBC Driver and Topology

Here I’m going to demonstrate using JDBC to connect to OBIEE from ODI. It’s a principle that was originally written up by Julien Testut here. We take the OBIEE JDBC driver bijdbc.jar from $FMW_HOME/Oracle_BI1/bifoundation/jdbc and copy it to our ODI machine. I’m just using a local agent for my testing, so put it in ~/.odi/oracledi/userlib/. For a standalone agent it should go in $AGENT_HOME/odi/agent/lib.

[oracle@ip-10-103-196-207 ~]$ cd /home/oracle/.odi/oracledi/userlib/  
[oracle@ip-10-103-196-207 userlib]$ ls -l  
total 200  
-rw-r----- 1 oracle oinstall    332 Feb 17  2014 additional_path.txt  
-rwxr-xr-x 1 oracle oinstall 199941 Sep 22 14:50 bijdbc.jar

Now fire up ODI Studio, sign in to your repository, and head to the Topology pane. Under Physical Architecture -> Technologies you’ll see Oracle BI

Right click and select New Data Server. Give it a sensible name and put your standard OBIEE credentials (eg. weblogic) under the Connection section. Click the JDBC tab and click the search icon to the right of the JDBC Driver text box. Select the default, and then in the JDBC Url box put your server and port (9703, unless you’ve changed the listen port of the OBIEE BI Server)

Now click Test Connection (save the data server when prompted, and click OK at the message about creating a physical schema), and select the Local Agent with which to run it. If you get an error then click Details to find out the problem.

One common problem can be the connection through to the OBIEE server port, so to cut ODI out of the equation try this from the command prompt on your ODI machine (assuming it’s *nix):

nc -vz 9703

If the host resolves correctly and the port is open then you should get:

Connection to 9703 port [tcp/*] succeeded!

If not you’ll get something like:

nc: port 9703 (tcp) failed: Connection refused

Check the usual suspects – firewall (eg iptables) on the OBIEE server, firewalls on the network between the ODI and OBIEE servers, etc.
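
If nc isn’t available on the ODI machine, the same reachability check can be scripted; a minimal Python sketch:

```python
import socket

def port_open(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds (like `nc -vz`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("my-obiee-server", 9703)
```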

Assuming you’ve got a working connection you now need to create a Physical Schema. Right click on the new data server and select New Physical Schema.

OBIEE’s BI Server acts as a “database” to clients, within which there are “schemas” (Subject Areas) and “tables” (Presentation Tables). On the New Physical Schema dialog you just need to set Catalog (Catalog), and when you click the drop-down you should see a list of the Subject Areas within your RPD. Pick one – it doesn’t matter which.

Save the physical schema (ignore the context message). At this point your Physical Architecture for Oracle BI should look like this:

Now under Logical Architecture locate the Oracle BI technology, right click on it and select New Logical Schema. From the Physical Schemas dropdown select the one that you’ve just created. Give a name to the Logical Schema.

Your Logical Architecture for Oracle BI should look like this:

Building the Cache Management Routine Full Cache Purge

Over in the Designer tab go to your ODI project into which you want to integrate the OBIEE cache management functions. Right click on Procedures and select Create New Procedure. Give it a name such as OBIEE Cache – Purge All and set the Target Technology to Oracle BI

Switch to the Tasks tab and add a new Task. Give it a name, and set the Schema to the logical schema that you defined above. Under Target Command enter the call you want to make to the BI Server, which in this case is

call SAPurgeAllCache();

Save the procedure and then from the toolbar menu click on Run. Over in the Operator tab you should see the session appear and soon after complete – all being well – successfully.

You can go and check your BI Server Cache from the OBIEE Administration Tool to confirm that it is now empty:

And confirm it through Usage Tracking:

From what I can see at the default log levels, nothing gets written to either nqquery.log or nqserver.log for this action unless there is an error in your syntax, in which case it is logged in nqserver.log:

(For more information on that particular error see here)

Partial Cache Purge

This is the same pattern as above – create an ODI Procedure to call the relevant OBIEE command, which for purging by table is SAPurgeCacheByTable. We’re going to get a step more fancy now, and add a variable that we can pass in so that the Procedure is reusable multiple times over throughout the ODI execution for different tables.

First off create a new ODI Variable that will hold the name of the table to purge. If you’re working with multiple RPD Physical Database/Catalog/Schema objects you’ll want variables for those too:

Now create a Procedure as before, with the same settings as above but a different Target Command, based on SAPurgeCacheByTable and passing in the four parameters as single quoted, comma separated values. Note that these are the Database/Catalog/Schema/Table as defined in the RPD. So “Database” is not your TNS or anything like that, it’s whatever it’s called in the RPD Physical layer. Same for the other three identifiers. If there’s no Catalog (and often there isn’t) just leave it blank.

When including ODI Variable(s) make sure you still single-quote them. The command should look something like this:
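
As a sketch of how that command string is assembled (the Database/Schema values below are made up, and #TABLE_NAME stands in for the ODI variable):

```python
def purge_by_table_command(database, catalog, schema, table):
    """Build a SAPurgeCacheByTable call: four single-quoted, comma-separated
    identifiers as defined in the RPD Physical layer. Catalog is often empty."""
    args = ",".join("'{}'".format(x) for x in (database, catalog, schema, table))
    return "call SAPurgeCacheByTable({});".format(args)

# With an ODI variable supplying the table name, the Target Command might read:
# call SAPurgeCacheByTable('orcl','','BISAMPLE','#TABLE_NAME');
```
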

Now let’s seed the OBIEE cache with a couple of queries, one of which uses the physical table and one of which doesn’t. When we run our ODI Procedure we should see one cache entry go and the other remain. Here’s the seeded cache:

And now after executing the procedure:

And confirmation through Usage Tracking of the command run:

Cache Seeding

As before, we use an ODI Procedure to call the relevant OBIEE command. To seed the cache we can use SASeedQuery, which strictly speaking isn’t documented, but a quick perusal of the nqquery.log when you run a cache-seeding OBIEE Agent shows that it is what is called in the background, so we’re going to use it here (and it’s mentioned in support documents on My Oracle Support, so it’s not a state secret). The documentation here gives some useful advice on what you should be seeding the cache with — not necessarily only exact copies of the dashboard queries that you want to get a cache hit for.

Since this is a cookie-cutter of what we just did previously you can use the Duplicate Selection option in ODI Designer to clone one of the other OBIEE Cache procedures that you’ve already created. Amend the Target Command to:
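
The Target Command wraps the Logical SQL you want to seed in a SASeedQuery call. A sketch of templating that (the doubling of embedded single quotes is my assumption; verify against what a cache-seeding Agent writes to nqquery.log):

```python
def seed_query_command(logical_sql):
    """Wrap a Logical SQL statement in a SASeedQuery call. Single quotes
    inside the Logical SQL are doubled so the outer quoting survives."""
    return "call SASeedQuery('{}');".format(logical_sql.replace("'", "''"))

# e.g. seed_query_command('SELECT 0 s_0, "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"')
```
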

When you run this you should see a positive confirmation in the nqserver.log of the cache seed:

[2015-09-23T23:23:10.000+01:00] [OracleBIServerComponent] [TRACE:3]  
[USER-42] [] [ecid: 005874imI9nFw000jzwkno0007q700008K,0] [tid: 9057d700]  
[requestid: 477a0002] [sessionid: 477a0000] [username: weblogic]  
Query Result Cache: [59124]  
The query for user 'weblogic' was inserted into the query result cache.  
The filename is '/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache/NQS__735866_84190_2.TBL'. [[

A very valid alternative to calling SASeedQuery would be to call the OBIEE SOA Web Service to trigger an OBIEE Agent that populated the cache (by setting ‘Destination’ to ‘Oracle BI Server Cache (For seeding cache)’). OBIEE Agents can also be ‘daisy chained’ so that one Agent calls another on completion, meaning that ODI could kick off a single ‘master’ OBIEE Agent which then triggered multiple ‘secondary’ OBIEE Agents. The advantage of this approach over SASeedQuery is that cache seeding is more likely to change as OBIEE usage patterns do, and it is easier for OBIEE developers to maintain all the cache seeding code within ‘their’ area (OBIEE Presentation Catalog) than put in a change request to the ODI developers each time to change a procedure.

Integrating it in the ODI batch

You’ve two options here: Packages or Load Plans. Load Plans are the more recently introduced of the two, and are a clearer and more flexible way of orchestrating the batch.

To use it in a load plan create a serial step that will call a mapping followed by the procedure to purge the affected table. In the procedure step in the load plan set the value for the variable. At the end of the load plan, call the OBIEE cache seed step:

Alternatively, to integrate the above procedures into a Package instead of a load plan you need to add two steps per mapping. First, the variable is updated to hold the name of the table just loaded, and then the OBIEE cache is purged for the affected table. At the end of the flow a call is made to reseed the cache:

These are some very simple examples, but hopefully illustrate the concept and the powerful nature of integrating OBIEE calls directly from ODI. For more information about OBIEE Cache Management, see my post here.

Categories: BI & Warehousing

OBIEE BI Server Cache Management Strategies

Wed, 2015-09-23 11:36

The OBIEE BI Server cache can be one of the most effective ways of improving response times of OBIEE dashboards. By using data already in the cache it reduces load on the database, the network, and the BI Server.

Should you be using it? I always describe it as the “icing on the cake” – it’s not a fix for a badly-designed OBIEE system, but it does make a lot of sense to use once you’re happy that the foundations for the system are in place. If the foundations are not in place? Then you’re just papering over the cracks and at some point it’s probably going to come back to bite you. As Mark Rittman put it nearly seven years ago, it’s “[…]usually the last desperate throw of the dice”. The phrase “technical debt”? Yeh, that. But BI Server caching used after performance review and optimisation, rather than instead of it – then it’s a Good Thing.

So you’ve decided to use the BI Server cache, and merrily trotted over to Enterprise Manager to enable it, restarted the BI Server, and now your work is done, right? Not quite. Because the BI Server cache will start to store data from all the queries that you run, and use it to satisfy subsequent queries. Not only will it match on a direct hit for the same query, it will use a subset of an existing cache entry where appropriate, and can even aggregate up from what’s in the cache to satisfy a query at a higher level. Clever stuff. But, what happens when you load new data into your data warehouse? Well, the BI Server continues to serve requests out of the cache, because why shouldn’t it? And herein lies the problem with “just turn caching on”. You have to have a cache management strategy.

A cache management strategy sounds grand doesn’t it? But it boils down to two things:

  1. Accuracy – Flush any data from the cache that is now stale
  2. Speed – Prime the cache so that as many queries get a hit on it, first time
Maintaining an Accurate Cache

Every query that is run through the BI Server, whether from a Dashboard, Answers, or more funky routes such as custom ODBC clients or JDBC, will end up in cache. It’s possible to “seed” (“prime”/“warmup”) the cache explicitly, and this is discussed later. The only time you won’t see data in the cache is if (a) you have BI Server caching disabled, or (b) you’ve disabled the Cacheable option for a physical table that is involved in providing the data for the query being run.

You can see metadata for the current contents of the cache in the Administration Tool when connected online to the BI Server, through the Manage -> Cache menu option. This gives you lots of useful information (particularly when you come to optimising cache usage) including the size of each entry, when it was created, when it was last used, and so on.

Purging Options

So we’ve a spread of queries run that hit various dimension and fact tables and created lots of cache entries. Now we’ve loaded data into our underlying database, so we need to make sure that the next time a user runs an OBIEE query that uses the new data they can see it. Otherwise we commit the cardinal sin of any analytical system and show the user incorrect data, which is a Bad Thing. It may be fast, but it’s WRONG….

We can purge the whole cache, but that’s a pretty brutal approach. The cache is persisted to disk and can hold lots of data stretching back months – to blitz all of that just because one table has some new data is overkill. A more targeted approach is to purge by physical database, physical table, or even logical query. When would you use these?

  • Purge entire cache – the nuclear option, but also the simplest. If your data model is small and a large proportion of the underlying physical tables may have changed data, then go for this
  • Purge by Physical Database – less brutal than clearing the whole cache; if you have various data sources that are loaded at different points in the batch schedule then targeting a particular physical database makes sense.
  • Purge by Physical Table – if many tables within your database have remained unchanged, whilst a large proportion of particular tables have changed (or it’s a small table) then this is a sensible option to run for each affected table
  • Purge by Query – If you add a few thousand rows to a billion row fact table, purging all references to that table from the cache would be a waste. Imagine you have a table with sales by day. You load new sales figures daily, so purging the cache by query for recent data is obviously necessary, but data from previous weeks and months may well remain untouched so it makes sense to leave queries against those in the cache. The specifics of this choice are down to you and your ETL process and business rules inherent in the data (maybe there shouldn’t be old data loaded, but what happens if there is? See above re. serving wrong data to users). This option is the most complex to maintain because you risk leaving behind in the cache data that may be stale but doesn’t match the precise set of queries that you purge against.

Which one is correct depends on

  1. your data load and how many tables you’ve changed
  2. your level of reliance on the cache (can you afford low cache hit ratio until it warms up again?)
  3. time to reseed new content

If you are heavily dependent on the cache and have large amounts of data in it, you are probably going to need to invest time in a precise and potentially complex cache purge strategy. Conversely, if you use caching as the ‘icing on the cake’ and/or it’s quick to seed new content, then the simplest option is to purge the entire cache. Simple is good; OBIEE has enough moving parts without adding to its complexity unnecessarily.

Note that OBIEE itself will perform cache purges in some situations including if a dynamic repository variable used by a Business Model (e.g. in a Logical Column) gets a new value through a scheduled initialisation block.

Performing the Purge

There are several ways in which we can purge the cache. First I’ll discuss the ones that I would not recommend except for manual testing:

  1. Administration Tool -> Manage -> Cache -> Purge. Doing this every time your ETL runs is not a sensible idea unless you enjoy watching paint dry (or need to manually purge it as part of a deployment of a new RPD etc).
  2. In the Physical table, setting Cache persistence time. Why not? Because this time period starts from when the data was loaded into the cache, not when the data was loaded into your database.
    An easy mistake to make would be to think that with a daily ETL run, setting the Cache persistence time to 1 day might be a good idea. It’s not, because if your ETL runs at 06:00 and someone runs a report at 05:00, there is going to be a stale cache entry present for another 23 hours. Even if you use cache seeding, you’re still relinquishing control of the data accuracy in your cache. What happens if the ETL batch overruns or underruns?
    The only scenario in which I would use this option is if I was querying directly against a transactional system and wanted to minimise the number of hits OBIEE made against it – the trade-off being users would deliberately be seeing stale data (but sometimes this is an acceptable compromise, so long as it’s made clear in the presentation of the data).

So the two viable options for cache purging are:

  1. BI Server Cache Purge Procedures
  2. Event Polling Table
BI Server Cache Purge Procedures

These are often called “ODBC” Procedures, but technically ODBC is just one – of several – ways that the commands can be sent to the BI Server.

As well as supporting queries for data from clients (such as Presentation Services) sent as Logical SQL, the BI Server also has its own set of procedures. Many of these are internal and mostly undocumented (Christian Berg does a great job of explaining them here, and they do creep into the documentation here and here), but there are some cache management ones that are fully supported and documented. They are:

  • SAPurgeCacheByQuery
  • SAPurgeCacheByTable
  • SAPurgeCacheByDatabase
  • SAPurgeAllCache
  • SAPurgeCacheBySubjectArea (newer releases only)
  • SAPurgeCacheEntryByIDVector (newer releases only)

The names of these match up to the purge processes that I describe above. The syntax is in the documentation, but what I am interested in here is how you can invoke them. They are my preferred method for managing the BI Server cache because they enable you to tightly couple your data load (ETL) to your cache purge. Setting the cache to purge based on a drop-dead timer (whether crontab, tivoli, Agent/iBot, whatever) gives you a huge margin of error if your ETL runtime does not remain consistent. Whether it organically increases in runtime as data volumes increase, or it fails and has to be fixed and restarted, ETL does not always finish bang-on when it is ‘supposed’ to.

You can call these procedures in several ways, including:

  1. nqcmd – one of the most common ways, repeated on many a blog, but requires nqcmd/OBIEE to be installed on the machine running it. nqcmd is a command-line ODBC client for connecting to the BI Server
  2. ODBC – requires BI to be installed on the machine running it in order to make the OBIEE ODBC driver available
  3. JDBC – just requires the OBIEE JDBC driver, which is a single .jar file and thus portable
  4. Web Service – the OBIEE BI Server Web Service can be used to invoke these procedures from any machine with no dependencies other than some WSM configuration on the OBIEE server side.

My preference is for JDBC or Web Service, because they can be called from anywhere. In larger organisations the team building the ETL may have very little to do with OBIEE, and so asking them to install OBIEE components on their server in order to trigger cache purging can be quite an ask. Using JDBC only a single .jar needs copying onto the server, and using the web service not even that:

curl --silent --header "Content-Type: text/xml;charset=UTF-8" 
--user weblogic:Admin123 
--data @purge_cache_soap.xml

[59118] Operation SAPurgeAllCache succeeded!
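
The curl call above posts a small SOAP envelope (purge_cache_soap.xml). A sketch of building such a payload in Python; note that the operation name and namespace below are placeholders, not the real OBIEE WSDL values, which you should take from your own server:

```python
# Placeholder SOAP envelope for invoking a BI Server procedure via web service.
# The element and namespace names are illustrative only -- consult the WSDL
# exposed by your own OBIEE installation for the real ones.
SOAP_TEMPLATE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <callProcedure xmlns="urn:placeholder-bi-server-service">
      <procedure>{procedure}</procedure>
    </callProcedure>
  </soapenv:Body>
</soapenv:Envelope>"""

def purge_payload(procedure="SAPurgeAllCache()"):
    """Render the envelope for the given BI Server procedure call."""
    return SOAP_TEMPLATE.format(procedure=procedure)
```
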

For details of configuring ODI to use the BI Server JDBC driver in order to tightly couple the cache management into an existing ODI load job, stay tuned for a future blog!

Event Polling Tables (EPT)

NB Not Event “Pooling” Tables as I’ve often seen this called

The second viable approach to automated cache purging is EPT, which is a decoupled approach to managing the cache purge, with two components:

  1. An application (your ETL) inserts a row into the table S_NQ_EPT (which is created at installation time by the RCU in the BIPLATFORM schema) with the name of the physical table in which data has been changed
  2. The BI Server polls (hence the name) the S_NQ_EPT table periodically, and if it finds entries in it, purges the cache of data that is from those tables.

So EPT is in a sense the equivalent of using SAPurgeCacheByTable, but in a manner that is not tightly coupled. It relies on configuring the BI Server for EPT, and there is no easy way to know from your ETL if the cache purge has actually happened. It also means that the cache remains stale potentially as long as the polling interval that you’ve configured. Depending on when you’re running your ETL and the usage patterns of your users this may not be an issue, but if you are running ETL whilst users are on the system (for example intra-day micro ETL batches) you could end up with users seeing stale data. Oracle themselves recommend not setting the polling interval any lower than 10 minutes.

EPT has the benefit of being very easy to implement on the ETL side, because it is simply a database table into which the ETL developers need to insert a row for each table that they update during the ETL.
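
The insert itself is trivial for the ETL to do. A sketch of the pattern, demonstrated against an in-memory SQLite database (the column list for S_NQ_EPT below is an assumption; check the real definition in your BIPLATFORM schema):

```python
import sqlite3
from datetime import datetime

# Illustrative stand-in for the real S_NQ_EPT table -- column names here are
# an assumption, not taken from the RCU scripts.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE S_NQ_EPT (
    UPDATE_TYPE INTEGER, UPDATE_TS TEXT,
    DATABASE_NAME TEXT, CATALOG_NAME TEXT,
    SCHEMA_NAME TEXT, TABLE_NAME TEXT, OTHER_RESERVED TEXT)""")

def mark_table_updated(conn, database, catalog, schema, table):
    """What the ETL does at the end of a load: one row per changed table."""
    conn.execute(
        "INSERT INTO S_NQ_EPT VALUES (1, ?, ?, ?, ?, ?, NULL)",
        (datetime.now().isoformat(), database, catalog, schema, table))

mark_table_updated(conn, "orcl", "", "BISAMPLE", "FA_SALES")
```

The BI Server then polls the table and purges cache entries for the listed tables itself.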

Seeding the Cache

Bob runs an OBIEE dashboard, and the results are added to the cache so that when Bill runs the same dashboard Bill gets a great response rate because his dashboard runs straight from cache. Kinda sucks for Bob though, because his query ran slow as it wasn’t in the cache yet. What’d be nice would be that for the first user on a dashboard the results were already in cache. This is known as seeding the cache, or ‘priming’ it. Because the BI Server cache is not dumb and will hit the cache for queries that aren’t necessarily direct replicas of what previously ran, working out the optimal way to seed the cache can take some careful research (and a bit of trial and error). The documentation does a good job of explaining what will and won’t qualify for a cache hit, and it’s worth reading this first.

There are several options for seeding the cache. These all assume you’ve figured out the queries that you want to run in order to load the results into cache.

  1. Run the analysis manually, which will return the analysis data to you and insert it into the BI Server Cache too.
  2. Create an Agent to run the analysis with destination set to Oracle BI Server Cache (For seeding cache), and then either:
    1. Schedule the analysis to run from an Agent on a schedule
    2. Trigger it from a Web Service in order to couple it to your ETL data load / cache purge batch steps.
  3. Use the BI Server Procedure SASeedQuery (which is what the Agent does in the background) to load the given query into cache without returning the data to the client. This is useful for doing over JDBC/ODBC/Web Service (as discussed for purging above). You could just run the Logical SQL itself, but you probably don’t want to pull the actual data back to the client, hence using the procedure call instead.
Sidenote – Checking the RPD for Cacheable Tables

The RPD Query Tool is great for finding objects matching certain criteria. However, it seems to invert results when looking for Cacheable Physical tables – if you add a filter of Cacheable = false you get physical tables where Cacheable is enabled! And the same in reverse (Cacheable = true -> shows Physical tables where Cacheable is disabled)

Day in the Life of an OBIEE Cache Entry (Who Said BI Was Boring?)

In this example here I’m running a very simple report from SampleApp v406:

The Logical SQL for this is:

SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Why’s that useful to know? Because when working with the cache, resubmitting queries is needed frequently, and doing so directly from an interface like nqcmd is much faster (for me) than a web GUI. Horses for courses…

So I’ve run the query and now we have a cache entry for it. How do we know? Because we see it in the nqquery.log (and if you don’t have it enabled, go and enable it now):

[2015-09-23T15:58:18.000+01:00] [OracleBIServerComponent] [TRACE:3] [USER-42]  
[] [ecid: 00586hFR07mFw000jzwkno0005Qx00007U,0] [tid: 84a35700]  
[requestid: a9730015] [sessionid: a9730000] [username: weblogic]  
Query Result Cache: [59124] The query for user 'weblogic' was inserted into 
the query result cache.  
The filename is '/app/oracle/biee/instances/instance1/bifoundation/OracleBIServerComponent/coreapplication_obis1/cache/NQS__735866_57498_0.TBL'.
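
As an aside, entries like this are easy to scrape if you want to keep an eye on cache inserts programmatically; a sketch that assumes the message format shown above:

```python
import re

def cache_insert_entries(log_text):
    """Find Query Result Cache inserts in nqquery.log text, returning
    (username, cache_filename) pairs. Assumes the message wording above."""
    pattern = re.compile(
        r"The query for user '([^']+)' was inserted into\s*"
        r"the query result cache\.\s*"
        r"The filename is '([^']+)'")
    return pattern.findall(log_text)
```
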

We see it in Usage Tracking (again, if you don’t have this enabled, go and enable it now):


We can also see it in the Administration Tool (when connected online to the BI Server):

We can even see it and touch it (figuratively) on disk:

So we have the data in the cache. The same query run again will now use the cache entry, as seen in nqquery.log:

[2015-09-23T16:09:24.000+01:00] [OracleBIServerComponent] [TRACE:3] [USER-21]
[] [ecid: 11d1def534ea1be0:6066a19d:14f636f1dea:-8000-000000000000b948,0:1:1:5]  
[tid: 87455700] 
[requestid: a9730017] [sessionid: a9730000] [username: weblogic]  
Cache Hit on query: [[  
Matching Query: SET VARIABLE QUERY_SRC_CD='Report',SAW_SRC_PATH='/users/weblogic/Cache Test 01',PREFERRED_CURRENCY='USD';SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Created by:     weblogic

and in Usage Tracking:

“Interestingly” Usage Tracking shows a count of 1 for number of DB queries run, which we would not expect for a cache hit. The nqquery.log shows the same, but no query logged as being sent to the database, so I’m minded to dismiss this as an instrumentation bug.

Now what about if we want to run a query but not use the BI Server Cache? This is an easy one, plenty blogged about it elsewhere – use the Request Variable DISABLE_CACHE_HIT=1. This overrides the built-in system session variable of the same name. Here I’m running it directly against the BI Server, prefixed onto my Logical SQL – if you want to run it from within OBIEE you need the Advanced tab in the Answers editor.

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',DISABLE_CACHE_HIT=1;SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

Now we get a cache ‘miss’, because we’ve specifically told the BI Server to not use the cache. As you’d expect, Usage Tracking shows no cache hit, but it does show a cache insert – because why shouldn’t it?

If you want to run a query without seeding the cache either, you can use DISABLE_CACHE_SEED=1:

SET VARIABLE SAW_SRC_PATH='/users/weblogic/Cache Test 01',DISABLE_CACHE_SEED=1;SELECT  
   0 s_0,  
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,  
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2  
FROM "A - Sample Sales"  

These request variables can be set per analysis, or per user by creating a session initialisation block to assign the required values to the respective variables.
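
Mechanically, prefixing the Logical SQL is simple string assembly; a sketch (variable values are emitted verbatim, so quote string values yourself):

```python
def with_request_variables(logical_sql, **variables):
    """Prefix a Logical SQL statement with a SET VARIABLE clause, e.g. to
    bypass the cache for a single execution."""
    assignments = ",".join("{}={}".format(k, v) for k, v in variables.items())
    return "SET VARIABLE {};{}".format(assignments, logical_sql)

# with_request_variables('SELECT ... FROM "A - Sample Sales"', DISABLE_CACHE_HIT=1)
```
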

Cache Location

The BI Server cache is held on disk, so it goes without saying that storing it on fast (eg SSD) disk is a Good Idea. There’s no harm in giving it its own filesystem on *nix to isolate it from other work (in terms of filesystems filling up) and to make monitoring it super easy.

Use the DATA_STORAGE_PATHS configuration element in NQSConfig.ini to change the location of the BI Server cache.
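
As a minimal sketch of the relevant NQSConfig.ini entries (the path and size here are illustrative, not recommendations):

```ini
[CACHE]
ENABLE = YES;
DATA_STORAGE_PATHS = "/obiee_cache" 500 MB;
```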

Summary

  1. Use BI Server Caching as the ‘icing on the cake’ for performance in your OBIEE system. Make sure you have your house in order first – don’t use it to try to get around bad design.
  2. Use the SAPurgeCache procedures to directly invoke a purge, or the Event Polling Tables for a more loosely-coupled approach. Decide carefully which purge approach is best for your particular caching strategy.
  3. If using the SAPurgeCache procedures, use JDBC or Web Services to call them so that there is minimal/no installation required to call them from your ETL server.
  4. Invest time in working out an optimal cache seeding strategy, making use of Usage Tracking to track cache hit ratios.
  5. Integrate both purge and seeding into your ETL. Don’t use a schedule-based approach because it will come back to haunt you with its inflexibility and scope for error.
Categories: BI & Warehousing

Tracing Data Lineage and Impact Analysis in Oracle Enterprise Metadata Management 12c

Fri, 2015-09-18 22:11

At the beginning of the year, I wrote a couple of posts about Oracle Enterprise Metadata Management, one about how to install the tool and another one explaining the harvesting (importing) and stitching (relating) processes. A lot of things have happened since then, including a new version of OEMM released in May with some nice features such as HTML 5 support, a workflow for the business glossary, and the concept of groups in security.

OEMM is the Oracle solution for data governance and lifecycle management. “Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise”. So the process of knowing how my data is related, its integrity, who has access to which part of the data and with which application, from where people can access the data, defining the stewards of the data, etc. are all part of data governance.

To refresh the main concepts, the first thing that you have to do when you start working with OEMM is the harvesting or importing process. We import the metadata creating one model for each different metadata source. After the successful importing process, we need to create a configuration to relate all the required models.

One of the main features of OEMM is to perform Data Flow Lineage and Impact Analysis. Why is this so important? Because the typical problem that we face in IT is to understand how all the company data is related, what will happen if I modify one column in a source schema, or how I can find out where a particular piece of data comes from. In this post I will talk about these processes, which are exactly the same as in the previous version.

I created an environment where I installed the latest release of OEMM, Oracle Database 12c that contains the source, staging and DW schemas, and ODI 12c which is the ELT tool to load and transform the data into our DW schema. In addition I also have OBIEE 11g, whose repository is based on the DW schema, and some OBI analyses to query the data. I imported all the metadata from these different sources and created a configuration that includes all the models created.


Data Impact Analysis

The process of Data Impact Analysis is to find out which data items in the target database and/or applications are affected if we change a source object, such as a column or a table.

There are several ways to start the data impact tracing process in OEMM: from the model to which the object belongs, or from a diagram.

We can open the model from the repository panel or from the configuration Architecture diagram. Once the model is opened, select the object on which you want to perform the impact analysis, right click on it, and select the Trace Data Impact or Trace Lineage (Advanced) option. Either option opens the same dialog window.


Then, we need to choose between the model and the configuration as the scope for this impact analysis. The most common option is to select one configuration to see the consequences of a possible change in all the target models present in it. We also have the option to see the result in graphical or text mode.

In our example, we are going to trace the data impact of the ORDERS table which is in the database source model and is the source for the fact table in the DW.

In the following picture, you will find the resulting diagram of the data impact analysis. In order to see the diagrams better in OEMM, you can collapse the Properties window on your right and press the Fit to Content button in the diagram menu.


There are plenty of interesting things that you can do from here. First, if you take a closer look you will see that the lines between the Orders Source and Orders Staging, as well as the ones from Orders Staging to Orders_DW, are thicker than the ones that connect Orders_DW to the OBIEE Model. This means there are some ETL processes in the middle of these connections. When we right click on one of these arrows and select Trace ETL details, a new tab is opened inside the configuration showing the mappings involved and the operations (a JOIN in our example). Also, if we expand the Properties window and select the JOIN element we can see its condition.


Coming back to the data impact diagram, you can select one particular object, such as a column, and press the highlight path button to emphasise a specific data flow. This is very useful when you have many objects in the diagram and you need to focus on a particular data path.


If you want to keep this diagram saved so you can come back to it easily, you can add a bookmark. You can create folders to organise your bookmarks, and they will appear at the bottom left of the screen.




Tracing Data Lineage

The Trace Data Lineage process finds the path from the sources to the target object that you selected.

In OEMM, tracing data lineage is a very similar process to impact analysis. In fact, it uses the same dialog window but now you should select the Trace Data Lineage or Trace Lineage (Advanced) option. Again, either option opens the same dialog window.

In our example we are going to use the table view from a particular OBI analysis. We will first perform the data lineage using the model as the scope, and then run another one in the context of the configuration, in order to make clear the differences between these two data lineages.

Once I have the OBIEE model open, I double-click the required analysis, and the analysis with all its components (criteria and layout) appears in the Metadata Browser. I will run the model-based data lineage first.


Another tab is shown with the data lineage based on the model. You can select one measure, for example, and see its properties; these include details such as the expression or formula (if it is a calculated item) and the default aggregation. Three main objects appear in the diagram (each can be collapsed or expanded as required): one representing the physical layer of the OBI repository (RPD), another for the Presentation layer, and finally the table view in the OBI analysis.

As you have probably realised by now, a lot of tabs and panels appear as you open different objects and execute tasks. The panels are collapsible and can be resized, so you will need to do this often in order to see what you want.


We are going to repeat the same process, but now selecting the configuration that contains the OBIEE model. In this new diagram you will see where the data in the OBI analysis comes from, and its relationship with the other models in the configuration, including the ODI model and the different database schemas. You can expand or collapse the objects in the diagram, trace ETL details, highlight paths, and use all the other features we have seen for the data impact diagram, as it is the same type of diagram.


Another useful new feature of this release is quick find, which allows you to search for specific words and shows a list of all the objects in the diagram that match the search criteria. If you double-click one of these results, the object is highlighted in the diagram.


Model Connection Overview

The Model Connection Overview diagram shows the connections between the objects inside a model. As its name suggests, it is just an overview of the connections: you cannot go deeper into the object details, but it is useful for getting a rough idea of the object relationships in a model. In the next pictures you will see an example of the Model Connection diagram for the OBIEE model, and the diagram for a particular mapping in the ODI model. As with the other diagrams, you can also save it as a bookmark.


And this is how you can perform data lineage and impact analysis in OEMM, analysing the relationships among your company's data.


Categories: BI & Warehousing

Managing Impala and Other Mixed Workloads on the Oracle Big Data Appliance

Tue, 2015-09-15 04:56

One of our current client projects uses Cloudera Impala to provide fast ad-hoc querying of the data we're loading into their Oracle Big Data Appliance Hadoop environment. Impala bypasses MapReduce to provide faster queries than Hive, but to do so it does a lot of processing in-memory and runs server processes on each node in the cluster, leading in some cases to runaway queries blocking other workloads, in the same way that OBIEE queries on an Oracle database can sometimes block ETL and application workloads. Several projects share this Big Data Appliance, so to try and limit the impact Impala could have on other cluster workloads the client had disabled the Impala daemons on nine of the twelve nodes in their Big Data Appliance. Our concern with this approach was that an Impala query can access data from any datanode in the cluster; HDFS data is typically stored and replicated across three nodes, so running the Impala daemons on just a quarter of the available nodes was likely to lead to data-locality issues for Impala, with blocks getting shipped across the network unnecessarily.

Going back to OBIEE and the Oracle database, Oracle has a resource management feature for the database that allows you to put users and queries into separate resource pools and manage the share of overall resources that each pool gets. I covered this concept on the blog a few years ago. The version of Cloudera Hadoop (CDH5.3) used on the client's Big Data Appliance has a feature called "YARN" (Yet Another Resource Negotiator) that splits out the resource management and scheduling parts that were bound into MapReduce in Hadoop 1.0, so that MapReduce just runs as one workload type on Hadoop, making it possible to run other workload types, for example Apache Spark, on the same cluster management framework.



Impala isn't configured to use YARN by default, however, and uses an internal scheduler to govern how concurrent queries run and use cluster resources. It can be configured to use YARN in what Cloudera terms "Integrated Resource Management", and our initial response was to recommend this approach; however, YARN is really optimised for longer-running batch jobs rather than the shorter jobs that Impala generates (such that Cloudera actually recommends you don't use YARN, and instead control Impala resource usage via service-level process constraints or through a newer Impala feature called Admission Control). Taking a step back though, how do we actually see what resources Impala is using across the cluster when a query runs, and is there a feature similar to the Oracle database's SQL explain plan to help us understand how an Impala SQL query is executed? Then, using this and the various resource management options available to us, can we understand how YARN and the other options will affect the Impala users on the client's cluster if we enable them? And, given that we were going to test this all out on one of our development Hadoop clusters running back at the office on VMware, how well could we simulate the multiple concurrent queries and mixed workload we'd then encounter on the real customer Big Data Appliance?

When trying to understand what goes on when a Cloudera Impala SQL query runs, the two main tools in your toolbox are EXPLAIN plans and query profiles. The concept of EXPLAIN plans will be familiar to Oracle developers, and putting “explain” before your Impala SQL query when you’re using the Impala Shell (or pressing the “Explain” button when you’re using the Impala Editor in Hue) will display an output like the one below, showing the steps the optimiser plans to take to return the query results:

[] > explain
select sum(f.flights) as total_flights, d.dest_city
from flight_delays f join geog_dest d on f.dest = d.dest
join geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and   o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000;
Query: explain select sum(f.flights) as total_flights, d.dest_city
from flight_delays f join geog_dest d on f.dest = d.dest
join geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and   o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000
| Explain String                                                      |
| Estimated Per-Host Requirements: Memory=154.01MB VCores=2           |
|                                                                     |
| 10:EXCHANGE [UNPARTITIONED]                                         |
| |                                                                   |
| 09:AGGREGATE [FINALIZE]                                             |
| |  output: sum:merge(f.flights)                                     |
| |  group by: d.dest_city                                            |
| |  having: sum(f.flights) > 3000                                    |
| |                                                                   |
| 08:EXCHANGE [HASH(d.dest_city)]                                     |
| |                                                                   |
| 05:AGGREGATE                                                        |
| |  output: sum(f.flights)                                           |
| |  group by: d.dest_city                                            |
| |                                                                   |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.orig = o.orig                                 |
| |                                                                   |
| |--07:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  02:SCAN HDFS [airlines.geog_origin o]                            |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: o.orig_state IN ('Florida', 'New York', 'Alaska') |
| |                                                                   |
| 03:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.dest = d.dest                                 |
| |                                                                   |
| |--06:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  01:SCAN HDFS [airlines.geog_dest d]                              |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: d.dest_state = 'California'                       |
| |                                                                   |
| 00:SCAN HDFS [airlines.flight_delays f]                             |
|    partitions=1/1 files=1 size=64.00MB                              |
Fetched 35 row(s) in 0.21s

Like an Oracle SQL explain plan, Impala's cost-based optimiser uses table and partition stats, which you should have gathered previously using Impala's "compute stats" command, to determine what it thinks is the optimal execution plan for your query. To see the actual cost and timings for the various plan steps, you can use the "summary" statement after your query has run (or, for more detail, the "profile" statement) to see the actual timings and stats for each step in the query execution.
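Gathering those stats for the tables in this example would look something like the following in the Impala shell (table names taken from the query above):

```sql
-- gather table and column stats so the planner's row estimates are realistic
COMPUTE STATS airlines.flight_delays;
COMPUTE STATS airlines.geog_dest;
COMPUTE STATS airlines.geog_origin;

-- verify what the planner will see
SHOW TABLE STATS airlines.flight_delays;
SHOW COLUMN STATS airlines.flight_delays;
```

Without stats Impala falls back on rough guesses, which can show up as wildly wrong "Est. #Rows" figures in the Summary output.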

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 20.35us  | 20.35us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 142.18ms | 180.81ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 59.86us  | 123.39us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 171.72ms | 208.36ms | 60      | 1.93K      | 22.73 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 89.42ms  | 101.82ms | 540.04K | 131.88M    | 12.79 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 16.32us  | 19.63us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 302.83ms | 302.83ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 936.71ms | 1.10s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 19.02us  | 46.49us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 266.99ms | 266.99ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.07s    | 1.90s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

Output from the Summary statement gives us some useful information in working out the impact of the various resource management options for the Oracle Big Data Appliance, at least in terms of its impact on individual Impala queries – we’ll look at the impact on the overall Hadoop cluster and individual nodes later on. From the output of the above Summary report I can see that my query ran on all six nodes in the cluster (queries I ran earlier on a smaller version of the fact table ran on just a single node), and I can see how long each step in the query actually took to run. So what happens if I run the same query again on the cluster but disable the Impala daemon service role on three of the nodes, using Cloudera Manager?


Here’s the Summary output after running the query again:

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 22.01us  | 22.01us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 3      | 111.12ms | 117.24ms | 7       | 193        | 6.27 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 3      | 30.09us  | 39.02us  | 30      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 3      | 161.26ms | 173.57ms | 30      | 1.93K      | 22.84 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 3      | 156.50ms | 238.90ms | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 3      | 20.19us  | 28.93us  | 1.41K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 477.38ms | 477.38ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 3      | 1.48s    | 1.66s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 3      | 12.07us  | 14.89us  | 519     | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 308.83ms | 308.83ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 3      | 3.39s    | 6.85s    | 131.88M | 131.88M    | 74.11 MB  | 480.00 MB     | airlines.flight_delays_full f |

What the Summary statement doesn't show you is the overall time the query took to run: the query run against three nodes took 9.48s, compared to 3.59s for the earlier run where all six nodes had the Impala daemon enabled. In fact I'd expect a query running on the client's BDA with just three out of twelve nodes enabled to run even slower because of the block locality issue. Impala has a feature called block locality tracking, which keeps track of where HDFS data blocks are actually located on the cluster and tries to run impalad tasks on the right nodes, but with only three out of twelve nodes running that job becomes really hard. The other factor we need to consider is how running multiple queries concurrently affects things when only a few nodes are handling all the Impala user queries.
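When comparing plan steps between the two runs, it helps to normalise the Summary output's mixed microsecond/millisecond/second timings into one unit; a minimal helper sketch (not part of Impala, just a convenience):

```python
import re

def to_seconds(timing):
    """Convert a SUMMARY timing string such as '20.35us', '142.18ms' or
    '1.07s' into float seconds."""
    units = {'us': 1e-6, 'ms': 1e-3, 's': 1.0}
    match = re.fullmatch(r'([\d.]+)(us|ms|s)', timing)
    return float(match.group(1)) * units[match.group(2)]

# e.g. the average time for the full-table scan step went from 1.07s with
# six nodes to 3.39s with three:
slowdown = to_seconds('3.39s') / to_seconds('1.07s')  # roughly 3.2x
```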

To try and simulate concurrent queries I opened six terminal sessions against nodes actually running Impala daemon service roles and submitted the same query from each session, with a second or two's gap between each query; with all six nodes enabled the average response time rose to about 6s, but with just three enabled the response time rose fairly consistently to around 27s.


This is of course what you’d expect when everything was trying to run on the same three (now resource-starved) server nodes, and again I’d expect this to be even more pronounced on the client’s twelve-node BDA. What this test of course didn’t cover was running workloads other than Impala on the same cluster, or running queries against different datasets, but it did at least show us how response-time increases fairly dramatically (albeit consistently) as more Impala users come onto the system.
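Juggling six terminal sessions by hand gets old quickly; the same staggered-concurrency test can be scripted, for example with a small Python harness like this sketch (the impala-shell invocation in the comment is illustrative, with hypothetical host and file names):

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(cmd, sessions=6, stagger=1.0):
    """Launch `sessions` copies of a shell command, staggered slightly apart,
    and return each session's elapsed wall-clock time in seconds."""
    def one_session(i):
        time.sleep(i * stagger)            # a second or two between submissions
        start = time.time()
        subprocess.run(cmd, shell=True, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return time.time() - start
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        return list(pool.map(one_session, range(sessions)))

# In practice the command would be an impala-shell invocation, e.g.
# (hypothetical node name and query file):
#   timings = run_concurrent("impala-shell -i node1 -f flight_query.sql")
```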

Now that we have some baseline benchmarking figures, let's configure Impala to use YARN, using Cloudera Manager on the CDH5.3 setup used both on the client's BDA and on our development cluster back in the office. There are actually two parts to running Impala on YARN in CDH5.x: YARN itself as the overall cluster resource management layer, and another component called Llama (Low-Latency, or "Long-Lived", Application Master) that sits between YARN and Impala and reduces the time each Impala query takes to obtain YARN resource allocations.

llama arch

Enabling YARN and Llama (and, if you want, configuring Llama and thereby Impala for high availability) is done through a wizard in CDH5.3 that also offers to set up a Linux feature called cgroups, which YARN can use to limit the "containers" it uses for resource management at the OS level.

Once you've run through the wizard and restarted the cluster, Impala should be configured to use YARN instead of its own scheduler to request resources, which in theory will allow Hadoop and the Big Data Appliance to consider Impala workloads alongside MapReduce, Spark and HBase when scheduling jobs across the cluster. Before we get into the options YARN gives us for managing these workloads, I ran the same Impala queries again, first as a single query and then with six running concurrently, to see what impact YARN on its own had on query response times.

The single query on its own took around the same time to run as without YARN (3-4s), but when I ran six concurrent queries together the response time went up from the 3-4s I saw without YARN enabled to between 5s and 18s depending on the session, with quite a bit of variation between response times compared to the consistent times I saw when YARN wasn't being used. This surprised me, as one of the stated benefits of YARN is making job execution times more predictable and smooth, though this could be more of an overall-cluster thing, and there are also recommendations in the Cloudera docs around making YARN and Llama's resource estimation more efficient for Impala.

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 41.38us  | 41.38us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 115.28ms | 123.04ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 44.44us  | 67.62us  | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 170.91ms | 201.47ms | 60      | 1.93K      | 22.82 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 82.25ms  | 98.34ms  | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.39us  | 18.99us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 244.40ms | 244.40ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 850.55ms | 942.47ms | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 13.99us  | 19.05us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 222.03ms | 222.03ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.54s    | 2.88s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

But it seems clear that users of Impala on the client cluster should expect some overhead from using YARN to manage Impala's resources, with the payoff being a better balance between Impala workloads and the other uses they're putting the BDA cluster to. However, I think there's more we can do to fine-tune how Llama and YARN allocate memory to Impala queries up-front (allocating a set amount of memory for all queries, rather than making an estimate and then adding more memory mid-query if it's needed), and of course we've not really tested it on a cluster with a full, mixed workload running. But what about our original scenario, where only a certain percentage of the overall cluster resources or nodes are allocated to Impala query processing? To set up that sort of division of resources we can use another feature of YARN called dynamic allocation, with dynamic resource pools that we can again set up through Cloudera Manager.

Dynamic allocation is one of the ways that YARN can be configured to manage multiple workloads on a Hadoop cluster (the other way is through static service pools, and I’ll come to those in a moment). Using dynamic allocation I can set up a resource pool for the airline flight delays application that my Impala SQL queries are associated with and allocate it 25% of overall cluster resources, with the remainder of cluster resources allocated to other applications. I can keep that weighting simple as I have done in the screenshot below, or I can allocate resources based on virtual cores and memory, but I found it simpler to just set these overall weightings and let YARN worry about cores and RAM. 


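Under the covers these dynamic resource pools are expressed as YARN Fair Scheduler allocations; a 25%/75% weighting along the lines described would look roughly like this (a sketch; the pool names are assumed):

```xml
<allocations>
  <!-- the airlines pool gets a weight of 1 against 3 for everything else,
       i.e. roughly 25% of cluster resources under contention -->
  <queue name="airlines">
    <weight>1.0</weight>
  </queue>
  <queue name="default">
    <weight>3.0</weight>
  </queue>
  <!-- Dominant Resource Fairness balances CPU and RAM between pools -->
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>
```

Cloudera Manager writes and distributes this file for you when you save the pools, so you would normally never edit it by hand.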
Depending on the scheduling policy you select, YARN will prioritise Impala and other jobs in different ways, but the recommended scheduling policy for mixed workloads is Dominant Resource Fairness, which balances RAM and CPU depending on which resource pool needs them most at a particular time. Note also that Impala can either be managed as part of the overall YARN workload or separately, a choice you can make in the Impala service configuration settings in Cloudera Manager (the "Enable Dynamic Resource Pools" setting that's checked below, but was unchecked for the screenshot above).


There's also a separate control you can place on Impala queries called Admission Control, which limits the number of queries that can run or be queued for a resource pool at any particular time. The docs are a bit vague on when to use admission control versus YARN, but my take is this: if it's just Impala queries you're worried about, and throttling their use solves the problem, then use this feature and leave Impala outside of YARN; if you need to manage overall mixed workloads, then do it all through YARN. For my testing example, though, I just went with simple resource pool weighting, and you can see from the screenshot below that where multiple queries are running at once for my pool, CPU and RAM resources are constrained as expected.


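If you do go the admission control route, the simplest setup throttles the default pool via impalad startup flags; something like the following, with the limit values chosen purely for illustration:

```
--default_pool_max_requests=10
--default_pool_max_queued=50
--default_pool_mem_limit=8g
```

Queries beyond the running limit queue up rather than fail, and queries beyond the queue limit are rejected outright.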
To make a particular Impala query run within a specific resource pool you can either allocate that user to a named resource pool, or you can specify the resource pool in your Impala shell session like this:

[] > set request_pool = airlines; 
REQUEST_POOL set to airlines
[] > select sum(f.flights) as total_flights, d.dest_city
from airlines.flight_delays_full f join airlines.geog_dest d on f.dest = d.dest
join airlines.geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000;

Looking then at a typical summary output for a query running with these restrictions (25% of resources overall) and other queries running concurrently, the numbers don't look all that different to before, and results took between 8s and 30s to return. Again I was surprised at the variance, but I think YARN is more about overall cluster performance than individual queries, and you shouldn't read too much into specific times on a dev server with an unrepresentative overall workload.

[] > summary;
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
| 10:EXCHANGE     | 1      | 26.78us  | 26.78us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 209.10ms | 262.02ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 63.20us  | 118.89us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 282.56ms | 401.37ms | 60      | 1.93K      | 22.76 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 99.56ms  | 114.14ms | 540.04K | 131.88M    | 12.85 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.49us  | 17.94us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 531.08ms | 531.08ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 1.20s    | 1.54s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 24.29us  | 68.23us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 287.31ms | 287.31ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 2.34s    | 3.13s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |

A point to note is that I found it very hard to get Impala queries to run once I got down to specifying virtual core and memory limits rather than just overall weightings, so I'd go with the high-level resource pool prioritisations, which seemed to work and didn't unduly affect query response times. For example, the setting below looked clever, but queries always seemed to time out and I never got a satisfactory setup working properly.


Note that for YARN dynamic resource pools to be used, all Linux/CDH users will need to be assigned to resource pools so they don’t run as “unconstrained”; this can also be done from the Dynamic Resource Pools configuration page.

Finally, though, all of this management through dynamic resource pools might not be the best way to control Impala's resource usage. The Cloudera docs say quite clearly on the Integrated Resource Management page that:

“When using YARN with Impala, Cloudera recommends using the static partitioning technique (through a static service pool) rather than the combination of YARN and Llama. YARN is a central, synchronous scheduler and thus introduces higher latency and variance which is better suited for batch processing than for interactive workloads like Impala (especially with higher concurrency). Currently, YARN allocates memory throughout the query, making it hard to reason about out-of-memory and timeout conditions.”

What this means in practice is that, if you've got a single project using the Big Data Appliance and you just want to specify at a high level what proportion of resources Impala, HBase, MapReduce and the other services under YARN management use, you can define this as static service pool settings in Cloudera Manager and have the restrictions enforced by Linux cgroups. In the screenshot below I unwound all of the dynamic resource pool settings I created a moment ago and allocated 25% of overall cluster resources to Impala, with the wizard then using those top-level values to set limits for services across all nodes in the cluster, based on their actual RAM and CPU, the services running on them, and so on.


Then, going back to Cloudera Manager and running some queries, you can see these static service pool limits being applied in real-time and their effect in the form of graphs for particular cluster resources.


So given all of this, what was our recommendation to the client about how best to set up resource management for Impala and other workloads on their Big Data Appliance? Not too much should be read into individual numbers here: it's hard to simulate a proper mixed workload on a development server, and of course their BDA has twelve nodes, more memory and faster CPUs. However, it's probably fair to say these are the obvious conclusions:

  • Running Impala daemons on just a subset of nodes isn't actually a bad way to constrain the resources used by Impala, but it will only work on clusters with a small number of nodes, so that there's a good chance one node will have one of the three copies of a given data block. On a system of the scale of our customer's we'll probably hit unacceptable overheads in terms of block locality, so I would not carry on with this approach.
  • If the customer's BDA will be running a mixed workload, i.e. data loading and long-running Hive/Pig/Spark jobs as well as short-running Impala jobs, enabling Impala for YARN and setting overall resource pools for applications would be the best approach, though individual Impala queries will probably run slower than now (even allowing for the restriction in resources) due to the overhead YARN imposes when scheduling and running jobs. But this will be the best way to allocate resources between applications and provide a generally "smoother" experience for users.
  • If the BDA needs to be optimised mostly for Impala queries, then don't manage Impala under YARN; leave it outside and just use static service pools to allocate Impala roughly 25% of resources across all nodes. In both this and the previous case (Impala on YARN), all nodes should be re-enabled for Impala so as to minimise issues over block locality.
  • If the only real issue is Impala queries for a particular application taking all resources or becoming runaway, Impala could be left outside of YARN but enabled for admission control, so as to limit the total number of running/queued queries for a particular application.
Categories: BI & Warehousing

Oracle Big Data Discovery 1.1 now GA, and Available as part of BigDataLite 4.2.1

Sat, 2015-09-05 01:53

The new Oracle Big Data Discovery 1.1 release went GA a couple of weeks ago, and came with a number of features that address show-stoppers in the original 1.0 release: the ability to refresh and reload datasets from Hive, compatibility with both the Cloudera CDH and Hortonworks HDP Hadoop platforms, Kerberos integration, and the ability to bring in datasets from remote JDBC datasources. If you're new to Big Data Discovery, I covered the initial release in a number of blog posts over the past year or so.

So let's start by loading some data into Big Data Discovery so that we can explore what's in it, see the range of attributes and their values, and do some basic data clean-up and enrichment. As with Big Data Discovery 1.0, you import (or "sample") data into Big Data Discovery's DGraph engine either via file upload or by using a command-line utility. Data in Hadoop has to be registered in the Hive HCatalog metadata layer, and I'll start by importing a Hive table mapped to some webserver log files via a Hive SerDe:


To import or sample this table into BDD's DGraph engine I use the following command to invoke the Big Data Discovery data processing engine, which reads the Hive table metadata, loads the Hive table data into the DGraph engine (either all rows, or a representative sample) and processes/enriches the data, adding geocoding for example:

[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t apachelog_parsed

This then runs as an Apache Spark job under YARN, progress of which you can track either from the console or through Cloudera Manager / Hue.
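If you prefer the command line to Cloudera Manager or Hue, the standard Hadoop YARN CLI can track the same job. The application ID below is the one from the log output further down; it will of course differ in your environment.

```shell
# Check the status of the Data Processing job submitted to YARN
# (application ID taken from the data_processing_CLI log output).
yarn application -status application_1441385366282_0001

# Once the job has finished, retrieve its aggregated logs:
yarn logs -applicationId application_1441385366282_0001
```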

[2015-09-04T14:20:43.792-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:oracle] 
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 start time: 1441390841404
 final status: UNDEFINED
 tracking URL: http://bigdatalite.localdomain:8088/proxy/application_1441385366282_0001/
 user: oracle
[2015-09-04T14:20:45.794-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:oracle] Application report for application_1441385366282_0001 (state: ACCEPTED)

Going over to Big Data Discovery Studio, I can then see the dataset within the catalog, load it into a project and start exploring and transforming the dataset. In the screenshots below I’m cleaning up the date and time field to turn it into a timestamp, and arranging the automatically-derived country, city, region and state attributes into a geography hierarchy. BDD 1.1 comes with a bunch of other transformation enhancements, including new options for enrichment, the ability to tag values via a whitelist and so on; a full list of new features for BDD 1.1 can be found in MOS Doc.ID 2044712.1.


Crucially, in BDD 1.1 you can now either refresh a dataset with new data from Hive (re-load), or do an incremental update after you’ve selected an attribute as the update (identity) column. In the screenshot below I’m doing this for a dataset uploaded from a file, but you can reload and update datasets from the command line too, which opens up the possibility of scripting, scheduling and so on.
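A scripted reload might then look something like the sketch below. The flag names here are assumptions based on the BDD 1.1 Data Processing CLI and may differ in your release, so confirm them against the CLI’s own help output before relying on them; the paths and dataset name are the ones used earlier in this post.

```shell
# Sketch only: flag names are assumptions -- check ./data_processing_CLI --help.
cd /path/to/edp_cli

# Full refresh of a dataset from its source Hive table:
./data_processing_CLI --refreshData apachelog_parsed

# Incremental update, once an identity column has been set on the dataset:
./data_processing_CLI --incrementalUpdate apachelog_parsed
```

Either command could then be dropped into cron or an enterprise scheduler for the nightly-refresh scenario described above.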


You can also define JDBC data connections in the administration part of BDD Studio, and then type in SQL queries to define data sources that can be added into your project as datasets, loading their data directly into the DGraph engine rather than having to stage it in Hadoop beforehand.


Then, as with the initial release of Big Data Discovery, you can define joins between the datasets in a project based on common attributes. In this case I’m joining the page URLs in the webserver logs with the page and article data extracted from our WordPress install, sourced from both Hive and Oracle (via JDBC).


Other new features in BDD 1.1 include the ability to define “applications”: projects and datasets that are considered “production quality” and include details on how to refresh and incrementally load their datasets (presumably designed to facilitate migrations from Endeca Information Discovery), along with a number of other new features around usability, data exploration and security. You can download BDD 1.1 from Oracle’s eDelivery site, or download it pre-installed and configured as part of the new Big Data Lite 4.2.1 VirtualBox virtual machine.

Categories: BI & Warehousing

Rittman Mead’s First Courses in Spanish in Latin America

Fri, 2015-09-04 08:04

Last week, as part of the OTN Tour LA 2015, we delivered our first courses in Spanish, in Quito, Ecuador and Cali, Colombia.

It was a one-day (8-hour) workshop on OBIEE 11g Front-End Development.

The courses were delivered in person, with all the theoretical and practical material in Spanish. During the training, each student had a dedicated virtual machine on which to complete the hands-on exercises, with the instructor’s support and assistance available at all times.

I have been teaching courses for many years. I gave my first course back in 1999 (ahem, yes, in the last century). Since then I have delivered many more on various Oracle products, including more than 10 years as an Oracle University instructor. In total, I must have trained more than 500 people in my life.

For the last year and a half I have been happily working at Rittman Mead, developing and delivering courses in English, which has been a wonderful challenge. But teaching in Spanish again was, as they say, like coming home. A truly lovely experience.


The feedback we received on the course was very good, and it delights us to have once again exceeded the participants’ expectations. Here are some examples of the feedback we received.


We would like to thank all the course attendees for their participation, as well as our partners in Ecuador, Bluefrog and Refundation, and the Oracle User Groups of Ecuador and Colombia for organising the events.

If you would like to take our courses in your own language, whether Spanish or Portuguese, or would like to become one of our training partners, send us an email and we will get in touch with you right away.

Categories: BI & Warehousing