BI & Warehousing

New Oracle Big Data Quick-Start Packages from Rittman Mead

Rittman Mead Consulting - Wed, 2015-03-25 05:00

Many organisations using Oracle’s business intelligence and data warehousing tools are now looking to extend their capabilities using “big data” technologies. Customers running their data warehouses on Oracle Databases want to use Hadoop to extend their storage capacity whilst offloading initial data loading and ETL to this complementary platform; other customers are using Hadoop and Oracle’s Big Data Appliance to add new capabilities around unstructured and sensor data analysis, all at considerably lower cost than traditional database storage.


In addition, as data and analytics technologies and capabilities have evolved, there has never been a better opportunity to reach further into your data to exploit more value. Big Data platforms, Data Science methods and data discovery technologies make it possible to unlock the power of your data and put it in the hands of your executives and team members – but what is it worth to you? What’s the value to your organisation of exploring deeper into the data you have, and how do you show return?

Many organisations have begun to explore Big Data technologies to understand where they can exploit value and extend their existing analytics platforms, but what’s the business case? The good news is, using current platforms, and following architectures like the Oracle Information Management and Big Data Reference Architecture written in conjunction with Rittman Mead, the foundation is in place to unlock a range of growth opportunities. Finding new value in existing data, predictive analytics, data discovery, reducing the cost of data storage and ETL offloading are all starter business cases proven to return value quickly.


To help you start on the Oracle big data journey, Rittman Mead have put together two quick-start packages focused on the most popular Oracle customer use-cases.

If this sounds like something you or your organization might be interested in, take a look at our new Quick Start Oracle Big Data and Big Data Discovery packages from Rittman Mead home page, or drop me an email and I’ll let you know how we can help.

Categories: BI & Warehousing

Source Dependent Extract and Source Independent Load

Dylan's BI Notes - Tue, 2015-03-24 17:40
A typical data warehousing ETL process involves Extract, Transform, and Load. The concepts of Source Dependent Extract (SDE) and Source Independent Load (SIL) are a unique part of the BI Apps ETL, since BI Apps has a universal data warehouse. Since the staging schemas are designed according to the universal data warehouse design, the logic of loading data […]
Categories: BI & Warehousing

RM BI Forum 2015 : Justification Letters for Employers

Rittman Mead Consulting - Tue, 2015-03-24 03:48

(Thanks to Christian Berg @Nephentur for the suggestion, and acknowledgements to ODTUG KScope for the original idea – our favourite conference after the BI Forum)

The Rittman Mead BI Forum 2015 promises to be our best BI Forum yet, with fantastic speakers at each event, keynotes and guest speakers from Oracle and John Foreman, author of the bestselling book “Data Smart”, a data visualisation challenge and an optional one-day masterclass on delivering Oracle’s new Information Management and Big Data reference architecture by Rittman Mead’s Mark Rittman and Jordan Meyer. Uniquely amongst Oracle BI events we keep the numbers attending very limited and run just a single stream at each event, so everyone takes part in the same sessions and gets to meet all the attendees and speakers over the three days.

Sometimes though, management within organizations require special justification for team members to attend events like these, and to help you put your case together and get across the unique education and networking benefits of the Rittman Mead BI Forum, we’ve prepared justification letters for you to complete with your details, one each for the Brighton and Atlanta events. Click on the links below to download sample justification letters for the Brighton BI Forum running on May 6th-8th 2015, and the Atlanta one running the week after on May 13th-15th 2015:

Full details on the BI Forum 2015 agenda and how to register can be found on the Rittman Mead BI Forum 2015 home page, with registration open until the weekend before each event – hurry though as attendee numbers are strictly limited.

Categories: BI & Warehousing

OBIEE nqcmd Tidbits

Rittman Mead Consulting - Mon, 2015-03-23 21:42

nqcmd is the ODBC command line tool that has always shipped with OBIEE, and hopefully always will. It enables you to manually fire queries directly at the BI Server, rather than going the usual way of Presentation Services generating Logical SQL and sending it to the BI Server. This can be useful in several cases:

  1. Automated cache purging, by sending one of the SAPurge[…] ODBC commands to the BI Server, usually done as part of a script
  2. Automated execution of Logical SQL, often done to support testing scenarios
  3. Load Testing the BI Server (via a magic undocumented switch, SA_NQCMD_ADVANCED)
  4. Manual interrogation of the BI Server – if you want to poke and prod nqsserver without launching a web browser, nqcmd is your friend :)
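
As a rough illustration of the first use case, here is a minimal Python sketch that prepares a scripted cache purge. It only builds the command line, using the -d, -u, -p and -s switches that appear later in this post, and writes a one-line script calling SAPurgeAllCache. The file name purge_cache.lsql and the helper names are made up for the example, and the credentials are the demo ones used elsewhere in this post.

```python
import os
import tempfile


def build_nqcmd_args(dsn, user, password, script_path):
    """Return the argv list for a non-interactive nqcmd call.

    Only the -d, -u, -p and -s switches used elsewhere in this post appear.
    """
    return ["nqcmd", "-d", dsn, "-u", user, "-p", password, "-s", script_path]


def write_purge_script(directory):
    """Write a one-line script asking the BI Server to purge its whole cache."""
    path = os.path.join(directory, "purge_cache.lsql")  # hypothetical file name
    with open(path, "w") as handle:
        handle.write("call SAPurgeAllCache();\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        script = write_purge_script(workdir)
        # Printed rather than executed, so the sketch runs without OBIEE installed.
        # On a machine with nqcmd on the PATH the list could go to subprocess.run().
        print(" ".join(build_nqcmd_args("AnalyticsWeb", "weblogic", "Admin123", script)))
```

Keeping the command as a list rather than a single string avoids shell quoting problems if the password or path contains spaces.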

In using nqcmd there’re a couple of things I want to demonstrate here that I find useful but haven’t seen discussed [in detail] elsewhere.

Query Log via nqcmd

All BI Server queries run with a LOGLEVEL>=1 will write some log details to nqquery.log. The usual route to view this is either on the server directly itself, transferring it off with a tool such as WinSCP, or through the Administration page of OBIEE. Another option that is available is from nqcmd itself. You need to do two things:

  1. Set the environment variable SA_NQCMD_ADVANCED to Yes
  2. Include the command line arguments -ShowQueryLog -H when you invoke nqcmd. I don’t know what -H does – it’s just specified as being required for this to work.

Here’s a simple example in action:

[oracle@demo ~]$ export SA_NQCMD_ADVANCED=Yes
[oracle@demo ~]$ nqcmd -d AnalyticsWeb -u prodney -p Admin123 -ShowQueryLog -H

          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved

Connection open with info:
[0][State: 01000] [DataDirect][ODBC lib] Application's WCHAR type must be UTF16, because odbc driver's unicode type is UTF16

        [T]able info
        [C]olumn info
        [D]ata type info
        [F]oreign keys info
        [P]rimary key info
        [K]ey statistics info
        [S]pecial columns info
        [Q]uery statement
Select Option: Q

Give SQL Statement: SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"
SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"
Row count: 1
[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-0] [] [ecid: 0054Sw944KmFw000jzwkno0003ac0000rl,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] ###
########################################### [[
-------------------- SQL Request, logical request hash:
SET VARIABLE LOGLEVEL=1:SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"

[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-34] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Query Status: Successful Completion [[

[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-28] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Physical query response time 0 (seconds), id <<333971>> [[


[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-29] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Physical Query Summary Stats: Number of physical queries 1, Cumulative time 0, DB-connect time 0 (seconds) [[

[2015-03-21T16:36:31.000+00:00] [OracleBIServerComponent] [TRACE:1] [USER-33] [] [ecid: 0054Sw94mRzFw000jzwkno0003ac0000ro,0] [tid: 56660700] [requestid: 201f0002] [sessionid: 201f0000] [username: prodney] -------------------- Logical Query Summary Stats: Elapsed time 0, Response time 0, Compilation time 0 (seconds) [[


Neat! But so what? Well, I see two uses straight away:

  1. In some situations you may not have access to the filesystem of the server on which the BI Server is running. For example, as a consultant I’ve been to clients where I’m given the Administration Tool client installation only. If I want to debug an RPD that I’m developing I’ll usually want to poke around in nqquery.log to see quite what physical SQL is being generated – and now I can.
  2. There was a discussion on the EMG mailing list recently about generating Physical SQL without executing it on the database. I’m going to discuss this in the next section of this article, and to do the analysis for this rapidly I’m using the inline query log.
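
If you capture the inline query log from a script, the header of each entry can be picked apart quite easily. The following is a minimal Python sketch, assuming the bracketed layout seen in the output above (timestamp, component, trace level, message id, then optional "key: value" fields); the function name and the dictionary keys are my own choices, not anything defined by OBIEE.

```python
import re

HEADER_FIELD = re.compile(r"\[([^\]]*)\]\s*")


def parse_log_header(line):
    """Split one query log header line into its bracketed fields and trailing text.

    Returns None if the line does not start with at least four [..] fields.
    """
    fields = []
    pos = 0
    while True:
        match = HEADER_FIELD.match(line, pos)
        if match is None:
            break
        fields.append(match.group(1))
        pos = match.end()
    if len(fields) < 4:
        return None
    # Drop the trailing "[[" marker and the leading run of dashes from the message.
    text = line[pos:].strip().rstrip("[").strip().lstrip("-").strip()
    record = {
        "timestamp": fields[0],
        "component": fields[1],
        "level": fields[2],
        "message_id": fields[3],
        "text": text,
    }
    # Remaining fields look like "key: value"; empty ones such as [] are skipped.
    for field in fields[4:]:
        key, sep, value = field.partition(": ")
        if sep:
            record[key] = value
    return record
```

Fed the "Query Status" line from the example above, this gives a record whose username is prodney and whose text is "Query Status: Successful Completion".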

Generating Physical SQL for OBIEE without Executing it – SKIP_PHYSICAL_QUERY_EXEC

OBIEE generates the Physical SQL that it runs against the database dynamically, at runtime. It takes the Logical request (“Logical SQL”), runs it through the RPD and generates one or more “Physical SQL” statements to be executed on the database as required to pull back the necessary data. A question arose recently on the EMG mailing list as to whether it is possible to get the Physical SQL – without executing it. You can imagine the benefits of this (namely, regression testing) since executing the database query each time is typically going to be expensive in machine resource and time consuming.

In SampleApp v406 there is a /home/oracle/scripts/PhysicalSQLGenerator, which does two things. First off it generates the Logical SQL for a given analysis, presumably using the generateReportSQL web service. It then takes that and runs it through nqcmd, scraping the nqquery.log for the resulting Physical SQL. In all of this no database queries get run. Very cool. But what’s the “secret sauce” at play here – can we distill it down in order to use it ourselves?

First, let’s look at how the SampleApp script does it. It sets some additional request variables in the Logical SQL:

[oracle@demo PhysicalSQLGenerator]$ cat lsql-out-dir/q1.lsql
SELECT
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"

And if we extract the relevant part out of the bash script we can see that it also uses a couple of extra command line arguments (-q -NoFetch) when invoking nqcmd:

nqcmd -q -NoFetch -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

When it’s run we check nqquery.log and lo-and-behold we get this: (edited for brevity)

------------------- Sending query to database named 01 - Sample App Data (ORCL) (id: <<69923>>), connection pool named Sample Relational Connection, logical request hash dd4fb54f, physical request hash 8d6f363d: [[
WITH
SAWITH0 AS (select sum(T42442.Revenue) as c1
from
     BISAMPLE.SAMP_REVENUE_FA2 T42442 /* F21 Rev. (Aggregate 2) */ )
select D1.c1 as c1, D1.c2 as c2 from ( select distinct 0 as c1,
     D1.c1 as c2
from
     SAWITH0 D1 ) D1 where rownum <= 5000001


Query Status: Successful Completion [[

Rows 0, bytes 24 retrieved from database query id: <<69923>> Simulation Gateway 

Physical query response time 0 (seconds), id <<69923>> Simulation Gateway

Whilst the log says it is “Sending query to database” it does no such thing, and the “Simulation Gateway” is the giveaway clue. Proof that it doesn’t connect to the database? I shut the database down, and it still worked just fine. Crude, yes, but effective.
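
For regression testing you would want to pull the Physical SQL out of the log automatically. Here is a minimal Python sketch of that idea, assuming the log wording shown above ("Sending query to database named …", "Simulation Gateway") and assuming each SQL block ends at a blank line, a closing "]]" or the next log header. The function name and the dictionary keys are hypothetical.

```python
import re

SENDING = re.compile(
    r"Sending query to database named (?P<database>.+?) \(id: <<(?P<id>\d+)>>\), "
    r"connection pool named (?P<pool>.+?), "
    r"logical request hash (?P<logical_hash>\w+), "
    r"physical request hash (?P<physical_hash>\w+): \[\["
)
QUERY_ID = re.compile(r"<<(\d+)>>")


def extract_physical_queries(log_text):
    """Return one dict per physical query found in the given query log text."""
    queries = {}
    collecting = None  # the query whose SQL text is currently being read
    for line in log_text.splitlines():
        match = SENDING.search(line)
        if match:
            collecting = dict(match.groupdict(), sql="", simulated=False)
            queries[collecting["id"]] = collecting
            continue
        if collecting is not None:
            # SQL runs until a blank line, a closing "]]" or the next log header.
            if line.strip() and not line.startswith(("[", "]]")):
                collecting["sql"] += line.strip() + "\n"
                continue
            collecting = None
        if "Simulation Gateway" in line:
            # Flag the query as not having been sent to the database.
            id_match = QUERY_ID.search(line)
            if id_match and id_match.group(1) in queries:
                queries[id_match.group(1)]["simulated"] = True
    return list(queries.values())
```

Comparing the sql values (or the physical_hash) between two RPD versions would then show which analyses generate different Physical SQL, and the simulated flag confirms that nothing actually hit the database.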

I’ll intersperse here the little trick that I mentioned in the first part of this article : -ShowQueryLog. It’s tedious switching back and forth between nqcmd and the nqquery.log when doing this kind of testing, so let’s do it all as one:

nqcmd -H -ShowQueryLog -q -NoFetch -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

Unfortunately it looks like -ShowQueryLog is mutually exclusive with -q and -NoFetch since it doesn’t return anything, even though the nqquery.log did get additional entries. But that’s fine, since by removing these two flags in order to get -ShowQueryLog to work we’re whittling down what is actually needed to generate the physical SQL on its own without database execution. Here’s the nqcmd, showing the query log inline and showing still the “Simulation Gateway” indicative of no physical query execution:

[oracle@demo PhysicalSQLGenerator]$ export SA_NQCMD_ADVANCED=Yes
[oracle@demo PhysicalSQLGenerator]$ nqcmd -H -ShowQueryLog -d AnalyticsWeb -u weblogic -p Admin123 -s lsql-out-dir/q1.lsql

          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved


s_0          s_1
Row count: 0
[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-0] [] [ecid: 0054Ut7AJ33Fw000jzwkno0005UZ00005Q,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] ############################################## [[
-------------------- SQL Request, logical request hash:
SELECT
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"


[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-18] [] [ecid: 0054Ut7AK5DFw000jzwkno0005UZ00005S,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Sending query to database named 01 - Sample App Data (ORCL) (id: <<70983>>), connection pool named Sample Relational Connection, logical request hash dd4fb54f, physical request hash 8d6f363d: [[
WITH
SAWITH0 AS (select sum(T42442.Revenue) as c1
from
     BISAMPLE.SAMP_REVENUE_FA2 T42442 /* F21 Rev. (Aggregate 2) */ )
select D1.c1 as c1, D1.c2 as c2 from ( select distinct 0 as c1,
     D1.c1 as c2
from
     SAWITH0 D1 ) D1 where rownum <= 5000001

[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-34] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Query Status: Successful Completion [[

[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-26] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Rows 0, bytes 24 retrieved from database query id: <<70983>> Simulation Gateway [[

[2015-03-23T05:52:57.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-28] [] [ecid: 0054Ut7AYi0Fw000jzwkno0005UZ00005T,0] [tid: 8f194700] [requestid: 8a1e0002] [sessionid: 8a1e0000] [username: weblogic] -------------------- Physical query response time 0 (seconds), id <<70983>> Simulation Gateway [[


It’s clear that the “-q -NoFetch” parameters used in nqcmd don’t have an effect on whether the physical query is executed (they’re to do with whether nqcmd as an ODBC client pulls back and displays the data you ask for). It’s actually just a single request variable that does the job, and it goes under the rather obvious name of SKIP_PHYSICAL_QUERY_EXEC. When set to 1 it generates all the necessary physical SQL but doesn’t execute it, and the presence of “Simulation Gateway” in the log signals this.
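
To apply this to your own Logical SQL, the request variable just needs prefixing onto the statement. Below is a minimal Python sketch that does so, following the "SET VARIABLE name=value, …;" layout seen in the query log elsewhere in this post. The helper name is hypothetical, and doubling embedded single quotes in string values is my assumption about escaping rather than something verified here.

```python
def with_request_variables(logical_sql, **variables):
    """Prefix Logical SQL with a SET VARIABLE clause for the given request variables."""
    if not variables:
        return logical_sql
    assignments = []
    for name, value in variables.items():
        if isinstance(value, str):
            # Assumption: string values are single-quoted, embedded quotes doubled.
            value = "'" + value.replace("'", "''") + "'"
        assignments.append(f"{name}={value}")
    return "SET VARIABLE " + ", ".join(assignments) + ";" + logical_sql


if __name__ == "__main__":
    lsql = 'SELECT "A - Sample Sales"."Base Facts"."1- Revenue" s_1 FROM "A - Sample Sales"'
    # LOGLEVEL=2 is included so that the physical SQL is written to the query log.
    print(with_request_variables(lsql, SKIP_PHYSICAL_QUERY_EXEC=1, LOGLEVEL=2))
```

The output can be saved to a file and passed to nqcmd with -s, as in the examples above.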

Categories: BI & Warehousing

BI Apps has an Universal Data Warehouse

Dylan's BI Notes - Mon, 2015-03-23 13:48
The BI Apps data warehouse design is based on the assumption that the data warehouse schema design is independent of the OLTP system. The staging schema is a universal staging schema and the data warehouse is a universal data warehouse. The assumption is that no matter what source system you are using, the business questions the BI […]
Categories: BI & Warehousing

Announcing the BI Forum 2015 Data Visualisation Challenge

Rittman Mead Consulting - Mon, 2015-03-23 03:00

The Rittman Mead BI Forum 2015 is running in Brighton from May 6th-8th 2015, and Atlanta from May 13th – 15th 2015. At this year’s events we’re introducing our first “data visualization challenge”, open to all attendees and with the dataset and scenario open from now until the start of each event. Using Oracle Business Intelligence 11g and any plugins or graphics libraries that embed and interact with OBIEE (full details and rules below), we challenge you to create the most effective dashboard or visualisation and bring it along to demo on the Friday of each event.

Help Donors Use their Funds Most Effectively

This year’s inaugural data visualisation challenge is based around the Donorschoose project and dataset, an online charity that makes it easy for anyone to help public school classroom projects that need funding (Rittman Mead will be making donations on behalf of the Brighton and Atlanta BI Forums to show our support for this great initiative). The project and dataset have been used in several hackathons and data crunching contests around the world, with analysis and visualisations helping to answer questions such as:

  • Why do some projects get funded, while others don’t?
  • Who donates to projects from different subjects?
  • Does proximity to schools change donation behavior?
  • What types of materials are teachers lacking the most? (eg chalk, paper, markers, etc)
  • Do poorer schools ask for more or less money from their donors?
  • If I need product x, what is the difference between projects asking for x that were successful and those that weren’t?

More details on uses of the dataset can be found on the Donorschoose data blog, and example visualisations you could use to get some ideas and inspiration are on the Data Gallery showcase page.


Your challenge is to import this dataset into your analytical database of choice, and then create the best visualisation or dashboard in OBIEE to answer the following question: “Which project can I donate to, where my donation will have most impact?”

How Do I Take Part?

For more on the BI Forum 2015 Data Visualization Challenge including how to download the dataset and the rules of the challenge, take a look at the Rittman Mead BI Forum 2015 Data Visualisation Challenge web page where we’ve provided full details. You can either enter as an individual or as part of a team, but you must be registered for either the Brighton or Atlanta BI Forum events and come along in-person to demonstrate your solution – numbers at each event are strictly limited though, so make sure you register soon at the Rittman Mead BI Forum 2015 home page.

Categories: BI & Warehousing

Instrumenting OBIEE Database Connections For Improved Performance Diagnostics

Rittman Mead Consulting - Sun, 2015-03-22 19:30

Nearly four years ago I wrote a blog post entitled “Instrumenting OBIEE – The Final Chapter”. With hindsight, that title suffix (“The Final Chapter”) may have been a tad presumptuous and naïve of me (or perhaps I can just pretend to be ironic now and go for a five-part-trilogy style approach…). Back then OBIEE 11g had only just been released (who remembers that first release in all its buggy glory?), and in the subsequent years we’ve had significant patchset releases of OBIEE 11g, with talk of OBIEE 12c around the corner.

As a fanboi of Cary Millsap and his approach to measuring and improving performance, instrumenting code in general – and OBIEE specifically – is something that’s interested me for a long time. The article was the final one that I wrote on my personal blog before joining Rittman Mead and it’s one that I’ve been meaning to re-publish here for a while. A recent client engagement gave me cause to revisit the instrumentation approach and refine it slightly, as well as update it for a significant change made in a later OBIEE release.

What do I mean by instrumentation? Instrumentation is making your program expose information about what is being done, as well as actually doing it. Crudely put, it’s something like this:

40 GOTO 10

Rather than just firing some SQL at the database, instead we associate with that SQL information about what program sent it, and what that program was doing, who was using it, and so on. Instrumentation enables you to start analysing performance metrics against tangible actions rather than just amorphous clumps of SQL. It enables you to understand the workload profile on your system and how that’s affecting end users.
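
To make the idea concrete outside of OBIEE, here is a minimal Python sketch of instrumented code: the work itself is unchanged, but it is wrapped so that what was done, for whom, and how long it took gets recorded. The names (instrumented, ACTIVITY_LOG, the dashboard and analysis labels) are all invented for the illustration, and the in-memory list simply stands in for wherever you would really send the metadata.

```python
import time
from contextlib import contextmanager

ACTIVITY_LOG = []  # in-memory stand-in for wherever the metadata would be recorded


@contextmanager
def instrumented(module, action, client_info):
    """Record what was being done, and for how long, around a block of work."""
    started = time.perf_counter()
    try:
        yield
    finally:
        ACTIVITY_LOG.append({
            "module": module,
            "action": action,
            "client_info": client_info,
            "elapsed_seconds": time.perf_counter() - started,
        })


# The work is the same as it would be without instrumentation;
# the wrapper just says what it is part of.
with instrumented("Sales Dashboard/Overview", "run analysis", "Revenue by Region"):
    total = sum(range(1000))
```

The connection pool scripts described below do the equivalent for OBIEE's database sessions, with Oracle's session metadata playing the part of ACTIVITY_LOG.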

Pop quiz: which of these is going to be easier to work with for building up an understanding of a system’s behaviour and workload?

CLIENT_INFO          MODULE                    ACTION       CPU_TIME DISK_READS 
-------------------- ------------------------  ---------- ---------- ---------- 
                                               a17ff8e1         2999          1 
                                               fe6abd92         1000          6 
                                               a264593a         5999          2 
                                               571fe814         5000         12 
                                               63ea4181         7998          4 
                                               7b2fcb68        11999          5


CLIENT_INFO          MODULE                    ACTION       CPU_TIME DISK_READS
-------------------- ------------------------  ---------- ---------- ----------
06 Column Selector   GCBC Dashboard/Performan  a17ff8e1         2999          1
05 Table with condit GCBC Dashboard/Performan  a264593a         5999          2
06 View Selector     GCBC Dashboard/Performan  571fe814         5000         12
05 Table with condit GCBC Dashboard/Performan  63ea4181         7998          4
<unsaved analysis>   nqsserver@obi11-01        fe6abd92         1000          6
<unsaved analysis>   nqsserver@obi11-01        7b2fcb68        11999          5

The second one gives us the same information as before, plus the analysis being run by OBIEE, and the dashboard and page.

The benefits of instrumentation work both ways. It makes DBAs happy because they can look at resource usage on the database and trace it back easily to the originating OBIEE dashboard and user. Instrumentation also makes life much easier for troubleshooting OBIEE performance because it’s easy to trace a user’s entire session through from browser, through the BI Stack, and down into the database.

Instrumentation for OBIEE – Step By Step

If you want the ‘tl;dr’ version, the “how” rather than the “why”, here we go. For full details of why it works, see later in the article.

  1. In your RPD create three session variables. These are going to be the default values for variables that we’re going to send to the database. Make sure you set “Enable any user to set the value”.

  2. Set up a session variable initialization block to populate these variables. It is just a “dummy” init block as all you’re doing is setting them to empty/default values, so a ‘SELECT … FROM DUAL’ is just fine:

  3. For each Connection Pool you want to instrument, go to the Connection Scripts tab and add these three scripts to the Execute before query section:

    -- Pass the OBIEE user's name to CLIENT_IDENTIFIER
    call dbms_session.set_identifier('VALUEOF(NQ_SESSION.USER)')

    -- Pass the Analysis name to CLIENT_INFO
    call dbms_application_info.set_client_info(client_info=>SUBSTR('VALUEOF(NQ_SESSION.SAW_SRC_PATH)',(LENGTH('VALUEOF(NQ_SESSION.SAW_SRC_PATH)')-instr('VALUEOF(NQ_SESSION.SAW_SRC_PATH)','/',-1,1))*-1))

    -- Pass the dashboard name & page to MODULE
    -- NB recent OBIEE versions set ACTION themselves so there is no point setting it here (it will get overridden)
    call dbms_application_info.set_module(module_name=> SUBSTR('VALUEOF(NQ_SESSION.SAW_DASHBOARD)', ( LENGTH('VALUEOF(NQ_SESSION.SAW_DASHBOARD)') - INSTR('VALUEOF(NQ_SESSION.SAW_DASHBOARD)', '/', -1, 1) ) *- 1) || '/' || 'VALUEOF(NQ_SESSION.SAW_DASHBOARD_PG)' ,action_name=> '' );

    You can leave the comments in there, and in fact I’d recommend doing so to make it clear for future RPD developers what these scripts are for.

    Your connection pool should look like this:

    An important point to note is that you generally should not be adding these scripts to connection pools that are used for executing initialisation blocks. Initialisation block queries won’t have these request variables so if you did want to instrument them you’d need to find something else to include in the instrumentation.
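
The SUBSTR/LENGTH/INSTR expression in those scripts is a little dense, so here is a Python sketch of what it computes: the part of a catalog path after the final '/'. This is an emulation for illustration only, assuming Oracle's documented behaviour that INSTR returns 0 when there is no match and that SUBSTR treats position 0 as 1; the example paths in the usage note are hypothetical.

```python
def last_path_segment(path):
    """Emulate SUBSTR(p, (LENGTH(p) - INSTR(p, '/', -1, 1)) * -1) from the scripts."""
    last_slash = path.rfind("/") + 1      # 1-based like INSTR; 0 when no '/' is found
    tail_length = len(path) - last_slash  # number of characters after the last '/'
    if tail_length == 0:
        # SUBSTR(p, 0) is treated like SUBSTR(p, 1), i.e. the whole string
        return path
    return path[-tail_length:]            # SUBSTR with a negative position counts from the end


def module_name(dashboard_path, dashboard_page):
    """Build the MODULE value: dashboard name, a slash, then the page name."""
    return last_path_segment(dashboard_path) + "/" + dashboard_page
```

So an analysis saved at /users/prodney/request variable example shows up in CLIENT_INFO as "request variable example", and a dashboard whose catalog path ends in "11.10 Flights Delay" with page "Overview" gives a MODULE of "11.10 Flights Delay/Overview".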

Once you’ve made the above changes you should see MODULE, CLIENT_IDENTIFIER and CLIENT_INFO being populated in the Oracle system views:


SID PROGRAM CLIENT_ CLIENT_INFO              MODULE                       ACTION
--- ------- ------- ------------------------ ---------------------------- --------
 17 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 32846912
 65 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 4bc2a368
 74 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 35c9af67
193 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 10bdad6c
302 nqsserv prodney Geographical Analysis 1  11.10 Flights Delay/Overview 3a39d178
308 nqsserv prodney Delayed Fligth % history 11.10 Flights Delay/Overview 1fad81e0
421 nqsserv prodney Geographical Analysis 2  11.10 Flights Delay/Overview 4e5d36c1

You’ll note that we don’t set ACTION – that’s because OBIEE now sends a hash of the physical query text across in this column, meaning we can’t use it ourselves. Unfortunately the current version of OBIEE doesn’t store the physical query hash anywhere other than in nqquery.log, meaning that you can’t take advantage of it (i.e. link it back to data from Usage Tracking) within the database alone.

That’s all there is to it – easy! If you want to understand exactly how and why it works, read on…

Instrumentation for OBIEE – How Does it Work?

Connection Pools

When OBIEE runs a dashboard, it does so by taking each analysis on that dashboard and sending a Logical Request for that analysis to the BI Server (nqsserver). The BI Server parses and compiles that Logical request into one or more Physical requests which it then sends to the source database(s).

OBIEE connects to the database via a Connection Pool, which specifies the database-specific connection information, including credentials and the data source name (such as a TNS name for Oracle). The Connection Pool, as the name suggests, pools connections so that OBIEE is not going through the overhead of connecting and disconnecting for every single query that it needs to run. Instead it will open one or more connections as needed, and share them between queries.

As well as the obvious configuration options in a connection pool such as database credentials, OBIEE also supports the option to send additional SQL to the database when it opens a connection and/or sends a new query. It’s this nice functionality that we piggy-back to enable our instrumentation.


The information that OBIEE can send back through its database connection is limited by what we can expose in variables. From the BI Server’s point of view there are three types of variables:

  1. Repository
  2. Session
  3. Request

The first two are fairly simple concepts; they’re defined within the RPD and populated with Initialisation Blocks (often known as “init blocks”) that are run by the BI Server either on a schedule (repository variables) or per user (session variables). There’s a special type of session variables known as System Session Variables, of which USER is a nice obvious example. These variables are pre-defined in OBIEE and are generally populated automatically when the user session begins (although some, like LOGLEVEL, still need an init block to set them explicitly).

The third type of variable, request variable, is slightly less obvious in function. In a nutshell, they are variables that are specified in the logical request sent to the BI Server, and are passed through to the internals of the BI Server. They’re often used for activating or disabling certain functionality. For example, you can tell OBIEE to specifically not use its cache for a request (even if it finds a match) by setting the request variable DISABLE_CACHE_HIT.

Request variables can be set manually inline in an analysis from the Advanced tab:

And they can also be set from Variable Prompts either within a report prompt or as a standalone dashboard prompt object. The point about request variables is that they are freeform; if they specify the name of an existing session variable then they will override it (if permitted), but they do not require the session variable to exist. We can see this easily enough – and see a variable request prompt in action at the same time. From the Prompts tab of an analysis I’ve added a Variable Prompt (rather than the usual Column Prompt) and given it a made up name, FOO:

Now when I run the analysis I specify a value for it:

and in the query log there’s the request variable:

-------------------- SQL Request, logical request hash:
SET VARIABLE FOO='BAR';
SELECT
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"

I’ve cut the quoted Logical SQL down to illustrate the point about the variable, because what was actually there is this:

-------------------- SQL Request, logical request hash:
SET VARIABLE QUERY_SRC_CD='Report',SAW_SRC_PATH='/users/prodney/request variable example',FOO='BAR', PREFERRED_CURRENCY='USD';
SELECT
   0 s_0,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_1
FROM "A - Sample Sales"

which brings me on very nicely to the key point here. When Presentation Services sends a query to the BI Server it does so with a bunch of request variables set, including QUERY_SRC_CD and SAW_SRC_PATH. If you’ve worked with OBIEE for a while then you’ll recognise these names – they’re present in the Usage Tracking table S_NQ_ACCT. Ever wondered how OBIEE knows what values to store in Usage Tracking? Now you know. It’s whatever Presentation Services tells it to. You can easily test this yourself by playing around in nqcmd:

[oracle@demo ~]$ rlwrap nqcmd -d AnalyticsWeb -u prodney -p Admin123 -NoFetch

          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved



Statement execute succeeded

and looking at the results in S_NQ_ACCT:

BIEE_BIPLATFORM@pdborcl > select to_char(start_ts,'YYYY-MM-DD HH24:MI:SS') as start_ts,saw_src_path,query_src_cd from biee_biplatform.s_nq_acct where start_ts > sysdate -1 order by start_ts;

START_TS            SAW_SRC_PATH                             QUERY_SRC_CD
------------------- ---------------------------------------- --------------------
2015-03-21 11:55:10 /users/prodney/request variable example  Report
2015-03-21 12:44:41 BAR                                      FOO
2015-03-21 12:45:26 BAR                                      FOO
2015-03-21 12:45:28 BAR                                      FOO
2015-03-21 12:46:23 BAR                                      FOO

Key takeaway here: Presentation Services defines a bunch of useful request variables when it sends Logical SQL to the BI Server, including QUERY_SRC_CD, SAW_SRC_PATH, SAW_DASHBOARD and SAW_DASHBOARD_PG.
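
To see which request variables a given piece of Logical SQL carries, the SET VARIABLE prefix can be parsed out. The following is a minimal Python sketch, assuming the semicolon-terminated form seen in the query log above (the colon-terminated variant used in the first nqcmd example is not handled); the function name is hypothetical.

```python
import re

# One "NAME=value" assignment, followed by either a comma or the closing semicolon.
ASSIGNMENT = re.compile(r"\s*(\w+)\s*=\s*('(?:[^']|'')*'|[^,;]*)\s*(,|;)")


def parse_request_variables(logical_sql):
    """Return (variables, remaining_sql) for Logical SQL with an optional SET VARIABLE prefix."""
    prefix = "SET VARIABLE"
    text = logical_sql.lstrip()
    if not text.upper().startswith(prefix):
        return {}, text
    variables = {}
    pos = len(prefix)
    while True:
        match = ASSIGNMENT.match(text, pos)
        if match is None:
            raise ValueError("malformed SET VARIABLE clause")
        name, raw, terminator = match.groups()
        if raw.startswith("'"):
            # Quoted value: drop the outer quotes and undo any doubled quotes.
            variables[name] = raw[1:-1].replace("''", "'")
        else:
            variables[name] = raw.strip()
        pos = match.end()
        if terminator == ";":
            return variables, text[pos:].lstrip()
```

Run against the full Logical SQL from the query log above, this returns QUERY_SRC_CD, SAW_SRC_PATH, FOO and PREFERRED_CURRENCY with their values, plus the SELECT statement on its own.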

Embedding Variables in Connection Script Calls

There are four options that we can configure when connecting to the database from OBIEE. These are CLIENT_IDENTIFIER, CLIENT_INFO, MODULE and ACTION.


Recent versions of OBIEE automatically set the ACTION field to a hash of the physical query – for more information see Doc ID 1941378.1. That leaves us with three remaining fields, CLIENT_IDENTIFIER, CLIENT_INFO and MODULE (since OBIEE sets ACTION after anything we do with the Connection Pool).


The syntax of the command in a Connection Script is physical SQL, with the VALUEOF function used to extract the OBIEE variable, in the form VALUEOF(NQ_SESSION.<variable name>).


As a simple example here is passing the userid of the OBIEE user, using the Execute before query connection script:

-- Pass the OBIEE user's name to CLIENT_IDENTIFIER
call dbms_session.set_identifier('VALUEOF(NQ_SESSION.USER)')

This would be set for every Connection Pool – but only those used for query execution – not init blocks. Run a query that is routed through the Connection Pool you defined the script against and check out V$SESSION:

SQL> select sid,program,client_identifier from v$session where program like 'nqsserver%';

       SID PROGRAM                                          CLIENT_IDENTIFIER
---------- ------------------------------------------------ ----------------------------------------------------------------
        22 (TNS V1-V3)         prodney

The USER session variable is always present, so this is a safe thing to do. But, what about SAW_SRC_PATH? This is the path in the Presentation Catalog of the analysis being executed. Let’s add this into the Connection Pool script, passing it through as the CLIENT_INFO:

-- Pass the Analysis name to CLIENT_INFO
call dbms_application_info.set_client_info(client_info=>'VALUEOF(NQ_SESSION.SAW_SRC_PATH)')

This works just fine for analyses within a dashboard, or standalone analyses that have been saved. But what about a new analysis that hasn’t been saved yet? Unfortunately the result is not pretty:

[10058][State: S1000] [NQODBC] [SQL_STATE: S1000] [nQSError: 10058] A general error has occurred.
[nQSError: 43113] Message returned from OBIS.
[nQSError: 43119] Query Failed:
[nQSError: 23006] The session variable, NQ_SESSION.SAW_SRC_PATH, has no value definition.
Statement execute failed

That’s because SAW_SRC_PATH is a request variable and since the analysis has not been saved Presentation Services does not pass it to BI Server as a request variable. The same holds true for SAW_DASHBOARD and SAW_DASHBOARD_PG if you run an analysis outside of a dashboard – the respective request variables are not set and hence the connection pool script causes the query itself to fail.

The way around this is we cheat, slightly. If you create a session variable with the names of these request variables that we want to use in the connection pool scripts then we avoid the above nasty failures. If the request variables are set then all is well, and if they are not then we fall back on whatever value we initialise the session variable with.
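The fallback described above can be sketched as a simple lookup. This is only an illustration of the precedence (the request variable wins if Presentation Services sent one, otherwise the session variable's initial value is used), not how the BI Server is implemented; the function name `resolve_variable` and the default values are assumptions made up for the example.

```python
# Illustrative only: models the lookup order, not OBIEE internals.
# Hypothetical initial values that the session variables would be given.
SESSION_DEFAULTS = {
    "SAW_SRC_PATH": "unsaved analysis",
    "SAW_DASHBOARD": "no dashboard",
    "SAW_DASHBOARD_PG": "no page",
}


def resolve_variable(name, request_variables, session_defaults=SESSION_DEFAULTS):
    """Return the request variable if it was sent, else the session default."""
    if name in request_variables:
        return request_variables[name]
    if name in session_defaults:
        return session_defaults[name]
    # Mirrors the nQSError 23006 case: no value definition anywhere.
    raise KeyError(f"The session variable, NQ_SESSION.{name}, has no value definition.")


# A saved analysis: Presentation Services passes SAW_SRC_PATH.
saved = {"SAW_SRC_PATH": "/users/prodney/request variable example"}
print(resolve_variable("SAW_SRC_PATH", saved))

# An unsaved analysis: nothing is passed, so the default is used.
print(resolve_variable("SAW_SRC_PATH", {}))
```

Without the session variable defaults the second call would raise, which is the equivalent of the query failure shown earlier.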

The final icing on the cake of the solution given above is a bit of string munging with INSTR and SUBSTR to convert and concatenate the dashboard path and page into a single string, so instead of:

/shared/01. QuickStart/_portal/1.30 Quickstart/Overview

we get:

1.30 Quickstart/Overview

Which is much easier on the eye when looking at dashboard names. Similarly with the analysis path we strip all but the last section of it.
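The same trimming logic can be sketched outside the database. The connection script itself does this with INSTR and SUBSTR in physical SQL; the Python below only illustrates the idea, and it assumes that the dashboard path and the page name arrive as two separate values (the function names are hypothetical).

```python
def short_dashboard_name(dashboard_path, page):
    """Keep only the last section of the dashboard path and append the page."""
    last_section = dashboard_path.rstrip("/").rsplit("/", 1)[-1]
    return f"{last_section}/{page}"


def short_analysis_name(analysis_path):
    """Strip all but the last section of an analysis path."""
    return analysis_path.rstrip("/").rsplit("/", 1)[-1]


print(short_dashboard_name("/shared/01. QuickStart/_portal/1.30 Quickstart", "Overview"))
# prints: 1.30 Quickstart/Overview

print(short_analysis_name("/users/prodney/request variable example"))
# prints: request variable example
```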

Granular monitoring of OBIEE on the database

Once OBIEE has been configured to be more articulate in its connection to the database, it enables the use of DBMS_MONITOR to understand more about the performance of given dashboards, analyses, or queries for a given user. Through DBMS_MONITOR the collection of statistics such as DB time, DB CPU, and so on can be triggered, as well as trace-file generation for queries matching the criteria specified.

As an example, here is switching on system statistics collection for just one dashboard in OBIEE, using SERV_MOD_ACT_STAT_ENABLE:

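-- service_name is required; 'orcl' is an assumed value, substitute your own service
call dbms_monitor.SERV_MOD_ACT_STAT_ENABLE(
    service_name=>'orcl',
    module_name=>'GCBC Dashboard/Overview'
);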

Now Oracle starts to collect information whenever that particular dashboard is run, which we can use to understand more about how it is performing from a database point of view:

SYS@orcl AS SYSDBA> select module,stat_name,value from V$SERV_MOD_ACT_STATS;

MODULE                   STAT_NAME                           VALUE
------------------------ ------------------------------ ----------
GCBC Dashboard/Overview  user calls                             60
GCBC Dashboard/Overview  DB time                              6789
GCBC Dashboard/Overview  DB CPU                               9996
GCBC Dashboard/Overview  parse count (total)                    15
GCBC Dashboard/Overview  parse time elapsed                    476
GCBC Dashboard/Overview  execute count                          15
GCBC Dashboard/Overview  sql execute elapsed time             3887

Similarly the CLIENT_IDENTIFIER field can be used to collect statistics with CLIENT_ID_STAT_ENABLE or trigger trace file generation with CLIENT_ID_TRACE_ENABLE. What you populate CLIENT_IDENTIFIER with is up to you – by default the script I’ve detailed at the top of this article inserts the OBIEE username in it, but you may want to put the analysis here if that’s of more use from a diagnostics point of view on the database side. The CLIENT_INFO field is still available for the other item, but cannot be used with DBMS_MONITOR for identifying queries.

Categories: BI & Warehousing

More on the Rittman Mead BI Forum 2015 Masterclass : “Delivering the Oracle Big Data and Information Management Reference Architecture”

Rittman Mead Consulting - Thu, 2015-03-12 05:29

Each year at the Rittman Mead BI Forum we host an optional one-day masterclass before the event opens properly on Wednesday evening, with guest speakers over the year including Kurt Wolff, Kevin McGinley and last year, Cloudera’s Lars George. This year I’m particularly excited that together with Jordan Meyer, our Head of R&D, I’ll be presenting the masterclass on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture”.

Last year we launched at the Brighton BI Forum event a new reference architecture that Rittman Mead had collaborated with Oracle on, that incorporated big data and schema-on-read databases into the Oracle data warehouse and BI reference architecture. In two subsequent blog posts, and in a white paper published on the Oracle website a few weeks after, concepts such as the “Discovery Lab”, “Data Reservoirs” and the “Data Factory” were introduced as a way of incorporating the latest thinking, and product capabilities, into the reference architecture for Oracle-based BI, data warehousing and big data systems.

One of the problems I always feel with reference architectures though is that they tell you what you should create, but they don’t tell you how. Just how do you go from a set of example files and a vague requirement from the client to do something interesting with Hadoop and data science, and how do you turn the insights produced by that process into a production-ready, enterprise Big Data system? How do you implement the data factory, and how do you use new tools such as Oracle Big Data Discovery and Oracle Big Data SQL as part of this architecture? In this masterclass we’re looking to explain the “how” and “why” to go with this new reference architecture, based on experiences working with clients over the past couple of years.

The masterclass will be divided into two sections; the first, led by Jordan Meyer, will focus on the data discovery and “data science” parts of the Information Management architecture, going through initial analysis and discovery of datasets using R and Oracle R Enterprise. Jordan will share techniques he uses from both his work at Rittman Mead and his work with Slacker Radio, a Silicon Valley startup, and will introduce the R and Oracle R Enterprise toolset for uncovering insights, correlations and patterns in sample datasets and productionizing them as database routines. Over his three hours he’ll cover topics including: 

Session #1 – Data exploration and discovery with R (2 hours) 

1.1 Introduction to R 

1.2 Tidy Data  

1.3 Data transformations 

1.4 Data Visualization 

Session #2 – Predictive Modeling in the enterprise (1 hr)

2.1 Classification

2.2 Regression

2.3 Deploying models to the data warehouse with ORE

After lunch, I’ll take the insights and analysis patterns identified in the Discovery Lab and turn them into production big data pipelines and datasets using Oracle Data Integrator 12c, Oracle Big Data Discovery and Oracle Big Data SQL. For a flavour of the topics I’ll be covering take a look at this Slideshare presentation from a recent Oracle event, and in the masterclass itself I’ll concentrate on techniques and approaches for ingesting and transforming streaming and semi-structured data, storing it in Hadoop-based data stores, and presenting it out to users using BI tools like OBIEE, and Oracle’s new Big Data Discovery.

Session #3 – Building the Data Reservoir and Data Factory (2 hr)

3.1 Designing and Building the Data Reservoir using Cloudera CDH5 / Hortonworks HDP, Oracle BDA and Oracle Database 12c

3.2 Building the Data Factory using ODI12c & new component Hadoop KM modules, real-time loading using Apache Kafka, Spark and Spark Streaming

Session #4 – Accessing and visualising the data (1 hr)

4.1 Discovering and Analyzing the Data Reservoir using Oracle Big Data Discovery

4.2 Reporting and Dashboards across the Data Reservoir using Oracle Big Data SQL + OBIEE

You can register for a place at the two masterclasses when booking your BI Forum 2015 place, but you’ll need to hurry as we limit the number of attendees at each event in order to maximise interaction and networking within each group. Registration is open now and the two events take place in May – hopefully we’ll see you there!

Categories: BI & Warehousing

An Introduction to Analysing ODI Runtime Data Through Elasticsearch and Kibana 4

Rittman Mead Consulting - Thu, 2015-03-12 01:39

An important part of working with ODI is analysing the performance when it runs, and identifying steps that might be inefficient as well as variations in runtime against a baseline trend. The Operator tool in ODI itself is great for digging down into individual sessions and load plan executions, but for broader analysis we need a different approach. We also need to make sure we keep the data available for trend analysis, as it’s often the case that tables behind Operator are frequently purged for performance reasons.

In this article I’m going to show how we can make use of a generic method of pulling information out of an RDBMS such as Oracle and storing it in Elasticsearch, from where it can be explored and analysed through Kibana. It’s standalone, it’s easy to do, it’s free open source – and it looks and works great! Here I’m going to use it for supporting the analysis of ODI runtime information, but it is equally applicable to any time-based data you’ve got in an RDBMS (e.g. OBIEE Usage Tracking data).

Kibana is an open-source data visualisation and analysis tool, working with data stored in Elasticsearch. These tools work really well for very rapid analysis of any kind of data that you want to chuck at them quickly and work with. By skipping the process of schema definition and data modelling the time taken to the first results is drastically reduced. It enables you to quickly start “chucking about” data and getting meaning out of it before you commit full-scale to how you want to analyse it, which is what the traditional modelling route can sometimes force you to do prematurely.

ODI writes runtime information to the database, about sessions run, steps executed, time taken and rows processed. This data is important for analysing things like performance issues, and batch run times. Whilst with the equivalent runtime data (Usage Tracking) from OBIEE there is the superb RPD/Dashboard content that Oracle ship in SampleApp v406, for ODI the options aren’t as vast, ultimately being based on home-brew SQL against the repository tables using the repository schema documentation from Oracle. Building an OBIEE metadata model against the ODI schema is one option, but then requires an OBIEE server on which to run it – or merging into an existing OBIEE deployment – which means that it can become more hassle than it’s worth. It also means a bunch of up-front modelling before you get any kind of visualisations and data out. By copying the data across into Elasticsearch it’s easy to quickly build analyses against it, and has the additional benefit of retaining the data as long as you’d like meaning that it’s still available for long-term trend analysis once the data’s been purged from the ODI repository itself.

Let’s take a bit of a walk through the ODI dashboard that I’ve put together. First up is a view on the number of sessions that have run over time, along with their duration. For duration I’ve shown 50th (median), 75th and 95th percentiles to get an idea of the spread of session runtimes. At the moment we’re looking at all sessions, so it’s not surprising that there is a wide range since there’ll always be small sessions and longer ones:

Next up on the dashboard comes a summary of top sessions by runtime, both cumulative and per-session. The longest running sessions are an obvious point of interest, but cumulative runtime is also important; something may only take a short while to run when compared to some singular long-running sessions, but if it runs hundreds of times then it all adds up and can give a big performance boost if time is shaved off it.
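To make the difference between the two rankings concrete, here is a minimal sketch that ranks sessions both by cumulative runtime and by their longest single execution. The session names and durations are invented sample data, and the function name `top_sessions` is hypothetical; in the dashboard itself this aggregation is done by Elasticsearch, not by code like this.

```python
from collections import defaultdict

# Hypothetical sample: (session name, duration in seconds) per execution.
executions = [
    ("LOAD_SALES", 1800),
    ("LOOKUP_REFRESH", 20),
    ("LOOKUP_REFRESH", 25),
    ("LOAD_CUSTOMERS", 600),
] + [("LOOKUP_REFRESH", 20)] * 200


def top_sessions(rows, n=3):
    """Return the top n sessions by cumulative runtime and by longest single run."""
    cumulative = defaultdict(int)
    longest = defaultdict(int)
    for name, duration in rows:
        cumulative[name] += duration
        longest[name] = max(longest[name], duration)
    by_cumulative = sorted(cumulative.items(), key=lambda kv: kv[1], reverse=True)[:n]
    by_single = sorted(longest.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return by_cumulative, by_single


by_cumulative, by_single = top_sessions(executions)
print(by_cumulative)  # the short but frequent session comes out on top here
print(by_single)      # the single long-running session comes out on top here
```

In this sample the session that only takes around 20 seconds per run has the largest cumulative runtime, simply because it runs a couple of hundred times.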

Plotting out session execution times is useful for seeing when the longest running sessions ran:

The final element on this first dashboard is one giving the detail for each of the top x long-running session executions, including the session number so that it can be examined in further detail through the Operator tool.

Kibana dashboards are interactive, so you can click on a point in a graph to zoom in on that time period, as well as click & drag to select an arbitrary range. The latter technique is sometimes known as “Brushing”, and if I’m not describing it very well have a look at this example here and you’ll see in an instant what I mean.

As you focus on a time period in one graph the whole dashboard’s time filter changes, so where you have a table of detail data it then just shows it for the range you’ve selected. Notice also that the granularity of the aggregation changes as well, from a summary of every three hours in the first of the screenshots through to 30 seconds in the last. This is a nice way of presenting a summary of data, but isn’t always desirable (it can mask extremes and abnormalities) so can be configured to be fixed as well.

Time isn’t the only interaction on the dashboard – anything that’s not a metric can be clicked on to apply a filter. So in the above example where the top sessions by cumulative time are listed out we might want to find out more about the one with several thousand executions.

Simply clicking on it then filters the dashboard and now the session details table and graph show information just for that session, including duration, and rows processed:

Session performance analysis

As an example of the benefit of using a spread of percentiles, here is a particular session that had an erratic runtime with great variation, which then stabilised. The purple line is the 95th percentile response time; the green and blue are 50th and 75th respectively. It’s clear that whilst up to 75% of the sessions completed in about the same time each time they ran, the remaining quarter took anything up to five times as long.
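A small worked example shows why a spread of percentiles reveals this kind of behaviour where an average would hide it. The durations are invented, and the nearest-rank method used here is just one simple definition of a percentile chosen for illustration; it is not claimed to be the algorithm Elasticsearch uses.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical runtimes in seconds for one session: mostly ~1 minute, a few slow runs.
durations = [60, 62, 61, 63, 60, 64, 180, 300, 61, 62, 63, 240]

for pct in (50, 75, 95):
    print(pct, percentile(durations, pct))
```

With this sample the 50th and 75th percentiles sit close together while the 95th is several times larger, which is the same shape as the erratic session described above.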

One of the most important things in performance is ensuring consistent performance, and that is what happens here from about half way along the horizontal axis at c.February:

But what was causing the variation? By digging a notch deeper and looking at the runtime of the individual steps within the given session it can be seen that the inconsistent runtime was caused by a single step (the green line in this graph) within the execution. When this step’s runtime stabilises, so does the overall performance of the session:

This is performing a post-mortem on a resolved performance problem to illustrate how useful the data is – obviously if there were still a performance problem we’d have a clear path of investigation to pursue thanks to this data.


Data’s pulled from the ODI repository tables using Elasticsearch JDBC river, from where it’s stored and indexed in Elasticsearch, and presented through Kibana 4 dashboards.


The data load from the repository tables into Elasticsearch is incremental, meaning that the solution works for both historical analysis and more immediate monitoring too. Because the data’s actually stored in Elasticsearch for analysis it means the ODI repository tables can be purged if required and you can still work with a full history of runtime data in Kibana.
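The idea behind an incremental load that survives purging of the source can be sketched with a high-water mark. This is a simplified illustration under stated assumptions, not how the Elasticsearch JDBC river is implemented: the `sess_no` key, the in-memory `index` dictionary standing in for Elasticsearch, and the function name `incremental_load` are all made up for the example.

```python
def incremental_load(source_rows, index, last_loaded_id):
    """Copy rows newer than the high-water mark into the index; return the new mark."""
    new_rows = [row for row in source_rows if row["sess_no"] > last_loaded_id]
    for row in new_rows:
        # Keyed by id, so re-loading the same row overwrites rather than duplicates.
        index[row["sess_no"]] = row
    return max((row["sess_no"] for row in new_rows), default=last_loaded_id)


index = {}  # stands in for the Elasticsearch index
repository = [
    {"sess_no": 1, "name": "LOAD_SALES", "duration": 1800},
    {"sess_no": 2, "name": "LOOKUP_REFRESH", "duration": 20},
]

mark = incremental_load(repository, index, last_loaded_id=0)

# The repository is purged and a new session runs afterwards.
repository = [{"sess_no": 3, "name": "LOAD_CUSTOMERS", "duration": 600}]
mark = incremental_load(repository, index, mark)

print(sorted(index))  # all three sessions remain available for analysis
print(mark)
```

Because only rows above the mark are copied, each run is cheap, and because the copy is kept separately the history is unaffected when the source tables are purged.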

If you’re interested in finding out more about this solution and how Rittman Mead can help you with your ODI and OBIEE implementation and monitoring needs, please do get in touch.

Categories: BI & Warehousing

Announcing the Special Guest Speakers for Brighton & Atlanta BI Forum 2015

Rittman Mead Consulting - Mon, 2015-03-09 08:13

As well as a great line-up of speakers and sessions at each of the Brighton & Atlanta Rittman Mead BI Forum 2015 events in May, I’m very pleased to announce our two guest speakers who’ll give the second keynotes, on the Thursday evening of the two events just before we leave for the restaurant and the appreciation events. This year our special guest speaker in Atlanta is John Foreman, Chief Data Scientist at MailChimp and author of the book “Data Smart: Using Data Science to Transform Information into Insight”; and in Brighton we’re delighted to have Reiner Zimmerman, Senior Director of Product Management at Oracle US and the person behind the Oracle DW & Big Data Global Leaders program.


I first came across John Foreman when somebody recommended his book to me, “Data Smart”, a year or so ago. At that time Rittman Mead were getting more-and-more requests from our customers asking us to help with their advanced analytics and predictive modelling needs, and I was looking around for resources to help myself and the team get to grips with some of the more advanced modelling and statistical techniques Oracle’s tools now support – techniques such as clustering and pattern matching, linear regression and genetic algorithms.

One of the challenges when learning these sorts of techniques is not getting too caught up in the tools and technology – R was our favoured technology at the time, and there’s lots to it – so John’s book was particularly well-timed as it goes through these types of “data science” techniques but focuses on Microsoft Excel as the analysis tool, with simple examples and a very readable style.

Back in his day job, John is Chief Data Scientist at MailChimp and has become a particularly in-demand speaker following the success of his book, and I was very excited to hear from Charles Elliott, our Practice Manager for Rittman Mead America, that he lived near John in Atlanta and had arranged for him to keynote at our Atlanta BI Forum event. His Keynote will be entitled “How Mailchimp used qualitative and quantitative analysis to build their next product” and we’re very much looking forward to meeting him at our event in Atlanta on May 13th-15th 2015.


Our second keynote speaker at the Brighton Rittman Mead BI Forum 2015 event is none other than Reiner Zimmerman, best known in EMEA for organising the Oracle DW Global Leaders Program. We’ve known Reiner for several years now as Rittman Mead are one of the associate sponsors for the program, which aims to bring together the leading organizations building data warehouse and big data systems on the Oracle Engineered Systems platform.

A bit like the BI Forum (but even more exclusive), the DW Global Leaders program holds meetings in the US, EMEA and AsiaPac over the year and is a fantastic networking and knowledge-sharing group for an exclusive set of customers putting together the most cutting-edge DW and big data systems on the latest Oracle technology. Reiner’s also an excellent speaker and a past visitor to the BI Forum, and his session entitled “Hadoop and Oracle BDA customer cases from around the world” will be a look at what customers are really doing, and the value they’re getting, from building big data systems on the Oracle platform.

Registration is now open for both the Brighton and Atlanta BI Forum 2015 events, with full details including the speaker line-up and how to register on the event website. Keep an eye on the blog for more details of both events later this week including more on the masterclass by myself and Jordan Meyer, and a data visualisation “bake-off” we’re going to run on the second day of each event. Watch this space…!

Categories: BI & Warehousing

Extracting Data from Cloud Apps

Dylan's BI Notes - Thu, 2015-03-05 15:20
I think that it would be easier if the cloud application can be aware of the data integration needs and publish the interfaces proactively. Here are some basic requirements for the applications that can be considered as data integration friendly: 1. Publish the object data model This is required for source analysis. For example, here is […]
Categories: BI & Warehousing

Rittman Mead BI Forum 2015 Now Open for Registration!

Rittman Mead Consulting - Wed, 2015-03-04 09:46

I’m very pleased to announce that the Rittman Mead BI Forum 2015, running in Brighton and Atlanta in May 2015, is now open for registration.

Back for its seventh successful year, the Rittman Mead BI Forum once again will be showcasing the best speakers and presentations on topics around Oracle Business Intelligence and data warehousing, with two events running in Brighton, UK and Atlanta, USA in May 2015. The Rittman Mead BI Forum is different to other Oracle tech events in that we keep the numbers attending limited, topics are all at the intermediate-to-expert level, and we concentrate on just one topic – Oracle Business Intelligence Enterprise Edition, and the technologies and products that support it.


As in previous years, the BI Forum will run on two consecutive weeks, starting in Brighton and then moving over to Atlanta for the following week. Here’s the dates and venue locations:

This year our optional one-day masterclass will be delivered by Jordan Meyer, our Head of R&D, and myself and will be on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture” that we launched last year at our Brighton event. Details of the masterclass, and the speaker and session line up at the two events are on the Rittman Mead BI Forum 2015 homepage.

Each event has its own agenda, but both will focus on the technology and implementation aspects of Oracle BI, DW, Big Data and Analytics. Most of the sessions run for 45 minutes, but on the first day we’ll be holding a debate and on the second we’ll be running a data visualization “bake-off” – details on this, the masterclass and the keynotes and our special guest speakers will be revealed on this blog over the next few weeks – watch this space!

Categories: BI & Warehousing

Creating Real-Time Search Dashboards using Apache Solr, Hue, Flume and Cloudera Morphlines

Rittman Mead Consulting - Wed, 2015-03-04 01:19

Late last week Cloudera published a blog post on their developer site on building a real-time log analytics dashboard using Apache Kafka, Cloudera Search and Hue. As I’d recently been playing around with Oracle Big Data Discovery with our website log data as the data source, and as we’ve also been doing the same exercise in our development labs using ElasticSearch and Kibana I thought it’d be interesting to give it a go; partly out of curiosity around how Solr, Kafka and Hue search works and compares to Elasticsearch, but also to try and work out what extra benefit Big Data Discovery gives you above and beyond free and open-source tools.


In the example, Apache web log data is read from the Linux server via a Flume syslog source, then fed into Apache Kafka as the transport mechanism before being loaded into Solr using a data transformation framework called “morphlines”. I’ve been looking at Kafka as an alternative to Flume for ingesting data into a Hadoop system for a while, mainly because of the tireless advocacy of Cloudera’s Gwen Shapira (Oracle ACE, ex-Pythian, now at Cloudera), who I respect immensely and who has a great background in Oracle database administration as well as Hadoop. Kafka also potentially offers some useful benefits if used instead of, or more likely alongside, Flume – a publish-subscribe model vs. push, the ability to have multiple consumers as well as publishers, and a more robust transport mechanism that should avoid data loss when an agent node goes down. Kafka is now available as a parcel and service descriptor that you can download and then install within CDH5, and so I set up a separate VM in my Hadoop cluster as a Kafka broker and also installed Solr at the same time.


Working through the example, in the end I went with a slightly different and simplified approach that swapped the syslog Flume source for an Apache Server file tailing source, as our webserver was on a different host to the Flume agent and I’d previously set this up for an earlier blog post. I also dropped the Kafka element, as it wasn’t clear to me from the Cloudera article whether it’d work in its published form or needed amending to use with Kafka (“To get data from Kafka, parse it with Morphlines, and index it into Solr, you can use an almost identical configuration”), and so I went with an architecture that looked like this:


Compared to Big Data Discovery, this approach has got some drawbacks, but some interesting benefits. From a drawback perspective, Apache Solr (or Cloudera Search as it’s called in CDH5, where Cloudera have integrated Solr with HDFS storage) needs some quite fiddly manual setup that’s definitely an IT task, rather than the point-and-click dataset setup that you get with Big Data Discovery. In terms of benefits though, apart from being free it’s potentially more scalable than Big Data Discovery as BDD has to sample the full Hadoop dataset and fit that sample (typically 1m rows, or 1-5% of the full dataset) into BDD’s Endeca Server-based DGraph engine; Solr, however, indexes the whole Hadoop dataset and can store its indexes and log files within HDFS across the cluster – potentially very interesting if it works.

Back to drawbacks though, the first complication is that Solr’s configuration settings in this Cloudera Search incarnation are stored in Apache Zookeeper, so you first have to download a template copy of the collection files (schema, index etc) from Zookeeper using solrctl, the command-line tool for SolrCloud (Solr running on a distributed cluster, as it is with Cloudera Search)

solrctl --zk bda5node2:2181/solr instancedir --generate $HOME/accessCollection

Then – and this again is a tricky part compared to Big Data Discovery – you have to edit the schema.xml file that Solr uses to determine which fields to index, what their datatypes are and so on. The Cloudera blog post points to a Github repo with the required schema.xml file for Apache Combined Log Format input files, but I found I had to add an extra entry for the “text” field name before Solr would index properly. It’s added at the end of the file excerpt here:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="time" type="tdate" indexed="true" stored="true" />
   <field name="record" type="text_general" indexed="true" stored="false" multiValued="true"/>
   <field name="client_ip" type="string" indexed="true" stored="true" />
   <field name="code" type="string" indexed="true" stored="true" />
   <field name="user_agent" type="string" indexed="true" stored="true" />
   <field name="protocol" type="string" indexed="true" stored="true" />   
   <field name="url" type="string" indexed="true" stored="true" />   
   <field name="request" type="string" indexed="true" stored="true" />
   <field name="referer" type="string" indexed="true" stored="true" />
   <field name="bytes" type="string" indexed="true" stored="true" />
   <field name="method" type="string" indexed="true" stored="true" />
   <field name="extension" type="string" indexed="true" stored="true" />   
   <field name="app" type="string" indexed="true" stored="true" />      
   <field name="subapp" type="string" indexed="true" stored="true" />
   <field name="device_family" type="string" indexed="true" stored="true" />
   <field name="user_agent_major" type="string" indexed="true" stored="true" />   
   <field name="user_agent_family" type="string" indexed="true" stored="true" />
   <field name="os_family" type="string" indexed="true" stored="true" />   
   <field name="os_major" type="string" indexed="true" stored="true" />
   <field name="region_code" type="string" indexed="true" stored="true" />
   <field name="country_code" type="string" indexed="true" stored="true" />
   <field name="city" type="string" indexed="true" stored="true" />
   <field name="latitude" type="float" indexed="true" stored="true" />
   <field name="longitude" type="float" indexed="true" stored="true" />
   <field name="country_name" type="string" indexed="true" stored="true" />
   <field name="country_code3" type="string" indexed="true" stored="true" />
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
   <dynamicField name="ignored_*" type="ignored"/>

Then you have to upload the Solr configuration settings to Zookeeper, and then configure Solr to use this particular set of Zookeeper Solr settings (note the “--create” before the accessCollection collection name in the second command; this was missing from the Cloudera steps but is needed for a valid solrctl command):

solrctl --zk bda5node2:2181/solr instancedir --create accessCollection $HOME/accessCollection
solrctl --zk bda5node2:2181/solr collection --create accessCollection -s 1

At this point you should be able to go to the Solr web admin page within the CDH cluster, and see the collection (a distributed Solr index) listed with the updated index schema.


Next I configure the Flume source agent on the RM webserver, using this Flume conf file:

## Local installation: /etc/flume1.5.0
## configuration file location:  /etc/flume1.5.0/conf/conf
## bin file location: /etc/flume1.5.0/conf/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent
source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /etc/httpd/logs/access_log
source_agent.sources.apache_server.batchSize = 1
source_agent.sources.apache_server.channels = memoryChannel
source_agent.sources.apache_server.interceptors = itime ihost itype
source_agent.sources.apache_server.interceptors.itime.type = timestamp
source_agent.sources.apache_server.interceptors.ihost.type = host
source_agent.sources.apache_server.interceptors.ihost.useIP = false
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host
source_agent.sources.apache_server.interceptors.itype.type = static
source_agent.sources.apache_server.interceptors.itype.key = log_type
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100
## Send to Flume Collector on Hadoop Node
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname =
source_agent.sinks.avro_sink.port = 4545

and then I set up a Flume sink agent as part of the Flume service using Cloudera Manager, initially set as “stopped”.


The Flume configuration file for this sink agent is where the clever stuff happens.

collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = bda5node5
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

collector.channels = mc1 mc2
collector.channels.mc1.type = memory
collector.channels.mc1.transactionCapacity = 1000
collector.channels.mc1.capacity = 100000
collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100000
collector.channels.mc2.transactionCapacity = 1000

collector.sinks = LocalOut MorphlineSolrSink

collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /tmp/flume/website_logs
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

collector.sinks.MorphlineSolrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
collector.sinks.MorphlineSolrSink.morphlineFile = /tmp/morphline.conf
collector.sinks.MorphlineSolrSink.channel = mc2
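
Each Flume sink reads from exactly one channel (set with its channel property), and every channel named by a source or sink has to be declared in the agent’s channels list, so a quick wiring check can catch typos before the agent is started. The sketch below is a minimal, hypothetical helper in Python: parse_properties and check_wiring are names made up for illustration and are not part of Flume, and it is run against a trimmed copy of the collector’s wiring with each sink’s channel property written out in full.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of source/channel/sink wiring problems for one agent."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    # Every sink must name exactly one declared channel.
    for sink in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{sink}.channel")
        if channel is None:
            problems.append(f"sink {sink} has no channel")
        elif channel not in channels:
            problems.append(f"sink {sink} uses undeclared channel {channel}")
    # Every source must feed at least one declared channel.
    for source in props.get(f"{agent}.sources", "").split():
        wired = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not wired:
            problems.append(f"source {source} has no channels")
        for channel in wired:
            if channel not in channels:
                problems.append(f"source {source} uses undeclared channel {channel}")
    return problems


# Trimmed copy of the collector agent's wiring (types and paths left out).
COLLECTOR_CONF = """
collector.sources = AvroIn
collector.sources.AvroIn.channels = mc1 mc2
collector.channels = mc1 mc2
collector.sinks = LocalOut MorphlineSolrSink
collector.sinks.LocalOut.channel = mc1
collector.sinks.MorphlineSolrSink.channel = mc2
"""

if __name__ == "__main__":
    for problem in check_wiring(parse_properties(COLLECTOR_CONF), "collector"):
        print(problem)
```

An empty result means the AvroIn source fans out to both memory channels and each sink drains one of them, which is the layout that lets the same events go both to local files and to Solr.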

The interesting bit here is the MorphlineSolrSink Flume sink. This sink type routes Flume events to a morphline script that in turn copies the log data into the HDFS storage area used by Solr, and passes it to Solr for immediate indexing. Cloudera Morphlines is a command-based lightweight ETL framework designed to transform streaming data from Flume, Spark and other sources and load it into HDFS, HBase or, in our case, Solr. Morphlines config files define ETL routines that call extensible Kite SDK morphline commands to perform transformations on incoming data streams, such as:

  • Splitting webserver request fields into HTTP protocol, method and URL requested
  • Generating the country, city and geocode for a given IP address, in conjunction with the Maxmind GeoIP database
  • Converting dates and times in string format to a Solr-format date and timestamp

with the output then being passed to Solr in this instance, along with the UUID and other metadata Solr needs, for loading to the Solr index, or “collection” as it’s termed when it’s running across the cluster (note the full log files aren’t stored by this process into HDFS, just the Solr indexes and transaction logs). The morphlines config file I used is below, based on the one provided in the Github repo accompanying the Cloudera blog post – note though that you need to download and set up the Maxmind GeoIP database file, and install the Python pip utility and a couple of pip packages before this will work:

# Specify server locations in a SOLR_LOCATOR variable;
# used later in variable substitutions
# Change the zkHost to point to your own Zookeeper quorum
SOLR_LOCATOR : {
    # Name of solr collection
    collection : accessCollection
    # ZooKeeper ensemble
    zkHost : "bda5node2:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more (potentially
# nested) commands. A morphline is a way to consume records (e.g. Flume events,
# HDFS files or blocks), turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on its way to
# Solr (or a MapReduceIndexerTool RecordWriter that feeds via a Reducer into Solr).
morphlines : [
  {
    # Name used to identify a morphline. E.g. used if there are multiple morphlines in a
    # morphline config file
    id : morphline1
    # Import all morphline commands in these java packages and their subpackages.
    # Other commands that may be present on the classpath are not visible to this morphline.
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        ## Read the email stream and break it up into individual messages.
        ## The beginning of a message is marked by regex clause below
        ## The reason we use this command is that one event can have multiple
        ## messages
        readCSV {
          separator : " "
          columns : [client_ip,C1,C2,time,dummy1,request,code,bytes,referer,user_agent,C3]
          ignoreFirstLine : false
          quoteChar : "\""
          commentPrefix : ""
          trim : true
          charset : UTF-8
        }
      }
      {
        split {
          inputField : request
          outputFields : [method, url, protocol]
          separator : " "
          isRegex : false
          #separator : """\s*,\s*"""
          #isRegex : true
          addEmptyStrings : false
          trim : true
        }
      }
      {
        split {
          inputField : url
          outputFields : ["", app, subapp]
          separator : "\/"
          isRegex : false
          #separator : """\s*,\s*"""
          #isRegex : true
          addEmptyStrings : false
          trim : true
        }
      }
      {
        userAgent {
          inputField : user_agent
          outputFields : {
            user_agent_family : "@{ua_family}"
            user_agent_major  : "@{ua_major}"
            device_family     : "@{device_family}"
            os_family         : "@{os_family}"
            os_major          : "@{os_major}"
          }
        }
      }
      # Extract GEO information
      {
        geoIP {
          inputField : client_ip
          database : "/tmp/GeoLite2-City.mmdb"
        }
      }
      # extract parts of the geolocation info from the Jackson JsonNode Java
      # object contained in the _attachment_body field and store the parts in
      # the given record output fields:
      {
        extractJsonPaths {
          flatten : false
          paths : {
            country_code : /country/iso_code
            country_name : /country/names/en
            region_code  : /continent/code
            #"/subdivisions[]/names/en" : "/subdivisions[]/names/en"
            #"/subdivisions[]/iso_code" : "/subdivisions[]/iso_code"
            city : /city/names/en
            #/postal/code : /postal/code
            latitude : /location/latitude
            longitude : /location/longitude
            #/location/latitude_longitude : /location/latitude_longitude
            #/location/longitude_latitude : /location/longitude_latitude
          }
        }
      }
      #{logInfo { format : "BODY : {}", args : ["@{}"] } }
      # add Unique ID, in case our message_id field from above is not present
      {
        generateUUID {
        }
      }
      # convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ" format
      #  21/Nov/2014:22:08:27
      {
        convertTimestamp {
          field : time
          inputFormats : ["[dd/MMM/yyyy:HH:mm:ss", "EEE, d MMM yyyy HH:mm:ss Z", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
          inputTimezone : America/Los_Angeles
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }
      # Consume the output record of the previous command and pipe another
      # record downstream.
      # This command sanitizes record fields that are unknown to Solr schema.xml
      # by deleting them. Recall that Solr throws an exception on any attempt to
      # load a document that contains a field that isn't specified in schema.xml
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      # load the record into a SolrServer or MapReduce SolrOutputFormat.
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
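
To make the first few morphline steps easier to follow, here is a rough Python equivalent of the readCSV, split and convertTimestamp commands applied to a single log line. It is only a sketch: the sample line is made up, parse_line is a hypothetical helper, the userAgent and geoIP steps are left out because they need external lookup data, and, unlike the morphline (which assumes the America/Los_Angeles input timezone), it reads the UTC offset from the log line itself. Note how splitting on spaces breaks the bracketed timestamp into two columns, time and dummy1, which is why the first morphline input format starts with a literal “[”.

```python
import csv
from datetime import datetime, timezone

# Same column names as the readCSV command in the morphline.
COLUMNS = ["client_ip", "C1", "C2", "time", "dummy1", "request",
           "code", "bytes", "referer", "user_agent", "C3"]

# Hypothetical log line in Apache combined format, made up for illustration.
SAMPLE = ('203.0.113.7 - - [21/Nov/2014:22:08:27 -0800] '
          '"GET /blog/2014/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0"')


def parse_line(line):
    """Turn one access-log line into a dict of Solr-style fields."""
    # readCSV: split on single spaces, keeping double-quoted fields whole.
    values = next(csv.reader([line], delimiter=" ", quotechar='"'))
    record = dict(zip(COLUMNS, values))

    # split: request -> method, url, protocol (padded if the request is malformed).
    method, url, protocol = (record["request"].split(" ") + ["", "", ""])[:3]
    record.update(method=method, url=url, protocol=protocol)

    # split: url on "/" -> (ignored leading element), app, subapp.
    parts = url.split("/") + ["", "", ""]
    record["app"], record["subapp"] = parts[1], parts[2]

    # convertTimestamp: rejoin the two timestamp halves and convert to UTC.
    stamp = datetime.strptime(record["time"] + " " + record["dummy1"],
                              "[%d/%b/%Y:%H:%M:%S %z]")
    record["time"] = stamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return record


if __name__ == "__main__":
    for key, value in parse_line(SAMPLE).items():
        print(f"{key}: {value}")
```

Running a handful of real log lines through something like this before touching the Flume agents is one way to check the column list and timestamp format without waiting for documents to appear (or not) in Solr.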

Then it’s just a case of starting the target sink agent using Cloudera Manager, and the source agent on the RM webserver using the flume-ng command-line utility, and then (hopefully) watching the web activity log entries start to arrive as documents in the Solr index/collection – which, after a bit of fiddling around and correcting typos, they did:


What’s neat here is that instead of having to use either an ETL tool such as ODI to process and parse the log entries (as I did here, in an earlier blog post series on ODI on Hadoop), or use the Hive-to-DGraph data reload feature in BDD, I’ve instead just got a Flume sink running this morphlines process and my data is added in real-time to my Solr index, and as you’ll see in a moment, a Hue Search dashboard.

To get Hue to work with my Solr service and new index, you first have to add the Solr service URL details to the Hue configuration settings using Cloudera Manager, like this:


Then, you can select the index from the list presented by the Search application within Hue, and start creating your data discovery and faceted search dashboard.


with the end result, after a few minutes of setup, looking like this for me:


So how do Solr, Hue, Flume and Morphlines compare to Oracle Big Data Discovery as a potential search-and-discovery solution on Hadoop? What’s impressive is how little work, once I’d figured it out, it took to set this up, including the real-time loading and indexing of data for the dashboard. Compared to loading HDFS and Hive using ODI, and manually refreshing the BDD DGraph data store, it’s much more lightweight and pretty elegant. But it’s clearly an IT / developer solution, and I spent a fair few late nights getting it all to work – getting the Solr schema.xml right was a tricky task, and the morphlines / Solr ingestion process was particularly hard to debug when it wasn’t working.

Oracle Big Data Discovery, by contrast, makes the data loading, transformation and enrichment process available to the business or data analyst, and provides much richer tools for cataloging and exploring the full universe of datasets on the Hadoop cluster. Morphlines compares well to the Groovy transformations provided by Big Data Discovery and Solr is extensible to add functionality such as sentiment analysis and text parsing, but again these are IT tasks and not something the average data analyst will want to do.

In summary then – Hue, Solr and the Morphlines transformation framework can be an excellent tool in the hands of IT professionals and can create surprisingly featureful and elegant solutions with just a bit of code and process configuration – but where Big Data Discovery comes into its own is putting significant parts of this capability in the hands of the business and the data analyst, and providing tools for data upload and wrangling, combining that data with other datasets, analyzing that whole dataset (or “data reservoir”) and then collaborating with others around the organization.

Categories: BI & Warehousing

Australian March Training Offer

Rittman Mead Consulting - Fri, 2015-02-27 21:31

Autumn is almost upon us here in Australia, so why not hold off going into hibernation and head into the classroom instead?

For March and April only, Rittman Mead courses in Australia* are being offered at significantly discounted prices.

Heading up this promotion is the popular TRN202 OBIEE 11g Bootcamp course which will be held in Melbourne, Australia* on March 16th-20th 2015.

This is not a cut-down version of the regular course but the entire 5-day content.

To enrol for this specially priced course, visit the Rittman Mead website training page. Registration is only open between March 1st – March 9th 2015 so register quickly to secure a spot.

Further specially priced courses will be advertised in the coming weeks.

*This offer is only available for courses run in Australia.
Registration Period: 01/03/2015 12:00am – 09/03/2015 11:59:59pm
Further Terms and Conditions can be found during registration


Introducing Oracle Big Data Discovery Part 3: Data Exploration and Visualization

Rittman Mead Consulting - Thu, 2015-02-26 17:08

In the first two posts in this series, we looked at what Oracle Big Data Discovery is and how you can use it to sample, cleanse and then catalog data in your Hadoop-based data reservoir. At the end of that second post we’d loaded some webserver log data into BDD, and then uploaded some additional reference data that we then joined to the log file dataset to provide descriptive attributes to add to the base log activity. Once you’ve loaded the datasets into BDD you can do some basic searching and graphing of your data directly from the “Explore” part of the interface, selecting and locating attribute values from the search bar and displaying individual attributes in the “Scratchpad” area.


With Big Data Discovery though you can go one step further and build complete applications to search and analyse your data, using the “Discover” part of the application. Using this feature you can add one or more charts to a dashboard page that go much further than the simple data visualisations you get on the Explore part of the application, based on the chart types and UI interactions that you first saw in Oracle Endeca Information Discovery Studio.


Components you can add include thematic maps, summary bars (like OBIEE’s performance tiles, but for multiple measures), various bar, line and bubble charts, all of which can then be faceted-searched using an OEID-like search component.


Each visualisation component is tied to a particular “view” that points to one or more underlying BDD datasets – samples of the full dataset held in the Hadoop cluster stored in the Endeca Server-based DGraph engine. For example, the thematic map above was created against the post comments dataset, with the theme colours defined using the number of comments metric and each country defined by a country name attribute derived from the calling host IP address.


Views are auto-generated by BDD when you import a dataset, or when you join two or more datasets together. You can also use the Endeca EQL language to define your own views using a SQL-type language, and then define which columns represent attributes, which ones are metrics (measures) and how those metrics are aggregated.


Like OEID before it, Big Data Discovery isn’t a substitute for a regular BI tool like OBIEE – beyond simple charts and visualizations it’s tricky to create more complex data selections, drill-paths in hierarchies, subtotals and so forth, and users will need to understand the concept of multiple views and datatypes, when to drop into EQL and so on – but for non-technical users working in an organization’s big data team it’s a great way to put a visual front-end onto the data in the data reservoir without having to understand tools like R Studio.

So that’s it for this three-part overview of Oracle Big Data Discovery and how it works with the Hadoop-based data reservoir. Keep an eye on the blog over the next few weeks as we get to grips with this new tool, and we’ll be covering it as part of the optional masterclass at the Brighton and Atlanta Rittman Mead BI Forum 2015 events this May.


Why and How to use Oracle Metadata Management 12c. Part 2: Importing and Relating Metadata

Rittman Mead Consulting - Thu, 2015-02-26 05:27

In the first post of this series we saw how to install and configure OEMM to start working with it. In this new post we are going to see how to import metadata into OEMM from different sources like Oracle Database, OBIEE, ODI and OWB, and then relate all of them inside OEMM.


After we have installed and configured OEMM, we need to start adding all the metadata from the different sources and applications that we use. In this example the sources will be some Oracle schemas and our applications will be ODI, OWB and OBIEE. To import the metadata for all of them we need to create one model in OEMM for each. A model in OEMM has all the connection details for a specific source or metadata provider (e.g. a database schema or an ODI repository), and is also the container for the metadata of that specific source after the import process. So one model can connect to one specific source or application.

First, for organisational purposes, we will create a folder to contain the future models. You can also create your models first, then create the folders that you want and move the models under the corresponding folders. In addition, you can create folders within another folder.

To create a folder, right-click on the Repository entry under the Repository panel in the OEMM main page. Select New > Folder in the pop-up menu, enter a name and press the Create button.



The next step is to create the models and import the metadata from the different sources. The import or reverse-engineering process is named harvesting in OEMM. We will start with the model for the Oracle Database. In this particular example I used Oracle 12c.

To create a model, right-click on the folder or the repository entry and select New > Model. In the Create Model window that appears, enter a name for the new model and select the type of source that you want to import or, to be more precise, the Import Bridge that this model will use.

The Import Bridge is part of the Meta Integration® Model Bridge (MIMB) software and is the way that OEMM connects to the sources and applications to reverse-engineer the metadata. You will find import bridges for a wide range of technologies such as different databases, Business Intelligence and Data Integration products from different vendors, Big Data stores, etc.


For this first example we will select the Oracle Database (via JDBC) import bridge and in the Import Setup tab we will add all the usual connection details: host, port, service, and the user and password to connect to the database. This user should have at least the CONNECT privilege and the SELECT_CATALOG_ROLE role. We can also restrict this model to specific schemas, either using the magnifying glass to choose from the schemas shown or by writing the schema names (in uppercase) separated by “;”. We can also decide whether stored procedures are included in the imported metadata.


After all the connection details have been set, we test the connection and wait until we receive the Connection Successful message, and finally press the Create button. A message window will appear asking if we want to “Import a new version now?” Press Yes to start the harvesting process. A log window will show you the progress of the import process, which can take several minutes. After the process has finished, a new message window asks if we want to open the model.


Choose yes to see all the objects that are imported for this model as it is shown in the figure below.


We need to repeat the process explained above to create the models for the rest of the sources and applications that we are going to use in this example. The process is the same for all of them, but of course there are some differences in the connection details required once we choose the specific Import Bridge for each one.

In the next screenshot you will find the connection details for the ODI model after you choose the Oracle Data Integrator (ODI) Import Bridge. In the Import Setup tab, you need to select the appropriate driver to connect to the database that holds the ODI Repository (which could be Oracle, SQL Server, DB2, etc.), the ODI Home folder, the URL to connect to the database, the schema and password for the Master Repository, the user and password for the ODI user (SUPERVISOR, for example), the name of the Work Repository from which we want to select the ODI objects, and the Context.


We need to select the Scope for this model from two options: Projects, which includes packages and mappings (or interfaces for versions before 12c), or Load Plans and Scenarios, which includes the Load Plans and Scenarios.

After we choose the Scope we can also filter its content by pressing the magnifying glass icon and selecting the specific objects that we want for this model.

After you press the Create button to start the harvesting process, open the model created; it will look similar to this if you chose Projects as the Scope.


For the connection details to create the OWB model, you need to take a couple of things into account. First, the version of OWB from which you want to import the metadata. If it is or later you will need to complete these two steps first:

  1. Copy the following .JAR files from MetaIntegrationInstallationDir\java\ to %OWB_HOME%\owb\lib\ext\:
     • jsr173_1.0_api.jar
     • MIR.jar
     • MIRModelBridge.jar
     • MIROracleWarehouseBuilderOmb.jar
     • MIRUtil.jar
     • stax-1.1.1-dev.jar
  2. Copy the MetaIntegrationInstallationDir\bin\mimbexec.bat file into the same OWB directory.

As the version that I have is, I copied the files detailed above, set the connection parameters as shown in the following image and tested the connection.

When I started the import process, the following error message appeared in the log window:



After trying many things unsuccessfully, I asked David Allan for help and he sent me another mimbexec.bat because apparently between and there were directory name changes. This is a temporary fix and a proper one is being worked on.

I substituted the bat file and I received another error message as it is shown in the next screenshot.



After a while, I realised that the issue that OEMM reported was because I was using an external table in one of the mappings. I replaced it with a regular table and the import process worked well. This has been reported as a bug and a solution is being worked on. I really want to thank David for all his invaluable help on that.

This is how the OWB model looks after the import of the metadata.


The last model that we need to create is the one based on OBIEE. There are different import bridges depending on the metadata that we need to import from OBIEE: Oracle Business Intelligence (OBI) Server, Oracle Business Intelligence (OBI) Enterprise Edition and Oracle Business Intelligence (OBI) Answers.

The OBI Server import bridge needs the OBI repository in xml format as a parameter to import it, and the result model will contain all the objects defined in the three layers of the repository (Presentation, Business Model, Physical) as well as the repository connections, the variables and the initialisation blocks defined in the OBI repository.


To use the OBI Enterprise Edition import bridge we need to set the login user and password to connect to OBIEE (usually weblogic or a user with admin privileges), the repository file in xml format, and we can also filter the amount of reports retrieved from the OBI Presentation Server.

There are a couple of interesting optional settings. One optimises the import of large models: if it is set to true, some objects such as joins, relationships and logical foreign keys are not returned, so that less memory is consumed at run time. The other sets whether we want to do an incremental import, importing only the changes in the source, or import everything each time.


The last import bridge to use with OBI is OBI Answers, which imports the content of a particular analysis or KPI report. This bridge needs the specific analysis in XML format.



About models, there are a couple of additional things that you need to note. First, if you want to see the configuration details, right-click the model and choose the Settings option from the pop-up menu. If you want to open the model to see the objects that it contains, double-click on it.

Also, for every parameter in a model you will find very detailed help on the right of the Import Setup tab; and if you click on the name of the Import Bridge in the same tab, you get the documentation for that particular bridge, which I find very useful.

There are two more tabs in the folder and model definition that we won’t use in this example but that we will talk about in future posts: one for security and another for executing scripts when an event happens to the object. Models also have an additional tab, Import Schedule, to create a plan for the harvesting process.

Relate the models

Once we have defined our models we need to relate them and validate their relationships. The automated process of relating these models through validation is named stitching. In order to do that we must first create a Configuration. A configuration in OEMM is a collection of models and other objects such as mappings, glossaries, etc., that are related in some way.

According to the online documentation we need to consider a configuration as any of these options:

  • Repository workspace: a collection of Repository Objects to be analyzed together (search, browse, reports, etc.) as a technical scope, or business area under the same access permission scope.
  • Enterprise architecture – a collection of data store Models (ODS, data staging areas, data warehouses, data marts, etc.) and data process Models (ETL/DI, and BI) connected together through data flow stitching.
  • Design workflow – a collection of conceptual, logical and physical Models connected (semantically stitched) together through semantic mappings modeling the design process.

To create a Configuration, just right-click on a selected folder or the repository entry and choose New > Configuration. Enter a name for the configuration and press the Create button.


The configuration is opened and you need to drag the models that you want to be stitched into this configuration, as shown in the following screenshot.


As you drag and drop your models, you can see that some of them have a warning icon after you include them in the configuration, and that is because we need to connect that model with the appropriate source of data.

To do that, select the model in the configuration and press Edit Connection. Choose the corresponding store for each connection and press OK.


After you finish with all the models, press the Validate button to start stitching (relating) them.


In most cases, OEMM can assign the corresponding default schema for each of the connections in the model. Where it cannot, as with OWB, you need to do it manually.


In the following image you will see all the models validated. For this example, I’ve created four database models: one that contains the source (transactional system), one for the staging schema, and another two that contain different data warehouses. There is also an ODI model, an OWB model and an OBIEE model.


You can also see the relationships between the models that belong to a configuration in a graphical view in the Architecture Diagram tab of the Configuration. If the diagram looks a little messy, you can press the Edit button and then Layout to tidy up the way the components are shown.


In summary, we create and harvest (reverse-engineer) the models and then relate, or stitch, them so that we can analyse them together. In the next post, we will see some interesting things that we can do with configurations and models, such as tracing data lineage and data impact.


Introducing Oracle Big Data Discovery Part 2: Data Transformation, Wrangling and Exploration

Rittman Mead Consulting - Wed, 2015-02-25 09:51

In yesterday’s post I looked at Oracle Big Data Discovery and how it brought the search and analytic capabilities of Endeca to Hadoop. We looked at how the Oracle Endeca Information Discovery Studio application works with a version of the Endeca Server engine to analyse and visualise sample sets of data from the Hadoop cluster, and how it uses Apache Spark to retrieve data from Hadoop and then transform that data to make it more suitable for data discovery and data analysis applications. Oracle Big Data Discovery is designed to work alongside ODI and GoldenGate for Big Data once you’ve decided on your main data flows, and Oracle Big Data SQL for BI tool and application access to the entire “data reservoir”. So how does Big Data Discovery work, and what role does it play in the overall big data project workflow?

The best way to think of Big Data Discovery, to my mind, is “Endeca on Hadoop”. Endeca Information Discovery had three main parts to it; the data loading part performed using Endeca Information Discovery Integrator and more recently, the personal data upload feature in Endeca Information Discovery Studio. Data was then ingested into the Endeca Server engine and stored in a key/value-store NoSQL database, indexed, parsed and enriched, and then analyzed using the graphical user interface provided by Studio. As I explained in more detail in my first post in the series yesterday, Big Data Discovery runs the Studio and DGraph (Endeca Server) elements on one or more dedicated nodes, and then reads data in from Hadoop and then writes it back in transformed states using Apache Spark, as shown in the diagram below:


As the data discovery and analysis features in Big Data Discovery rely on getting data into the DGraph (Endeca Server) engine first of all, this implies two things; first, we’ll need to take a subset or sample of the entire Hadoop dataset and load just that into the DGraph engine, and second we’ll need some means of transforming and “massaging” that data so it works well as a data discovery set, and then writing those changes back to the full Hadoop dataset if we want to use it with some other tool – OBIEE or Big Data SQL, for example. To see how this process works, let’s use the same Rittman Mead Apache webserver logs that I’ve used in my previous examples, and bring that data and some additional reference data into Big Data Discovery.

The log data from the RM webserver is in Apache Combined Log Format and a sample of the rows looks like this:


For data to be eligible to be ingested into Big Data Discovery, it has to be registered in the Hive Metastore and with the metadata available to use by external tools using the HCatalog service. This means that you already need to have created a Hive table over each datasource, either pointing this table to regular fixed-width or delimited files, or using a SerDe to translate another file format – say a compressed/column-store format like Parquet – into a format that Hive can understand. In our case I can use the RegEx SerDe that I first used in this blog post a while ago to create a Hive table over the log file and split out the various log file elements, with the resulting DDL looking like this:

CREATE EXTERNAL TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
LOCATION '/user/oracle/rm_logs';
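
To see what the SerDe’s regular expression does to a log line, the same pattern can be tried outside Hive. The Python sketch below is for illustration only: the sample line is made up and parse_row is a hypothetical helper. It uses the same nine capture groups (with the Java string escaping removed) and, as I understand the RegexSerDe to do, requires the whole line to match.

```python
import re

# Same pattern as input.regex in the DDL, without the Java string escaping.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) '
    r'([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
    r'(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names from the Hive table, in capture-group order.
FIELDS = ["host", "identity", "user", "time", "request",
          "status", "size", "referer", "agent"]

# Hypothetical log line in Apache combined format, made up for illustration.
SAMPLE = ('203.0.113.7 - - [21/Nov/2014:22:08:27 -0800] '
          '"GET /blog/2014/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux)"')


def parse_row(line):
    """Return a dict of column values, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


if __name__ == "__main__":
    for name, value in parse_row(SAMPLE).items():
        print(f"{name}: {value}")
```

Two things stand out: the time column keeps its square brackets and the request, referer and agent columns keep their double quotes, and everything arrives as a string. That is part of why the raw table needs further transformation before it is useful for analysis.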

If I then register the SerDe with Big Data Discovery I could ingest the table and file at this point, or I can use a Hive CTAS statement to remove the dependency on the SerDe and ingest into BDD without any further configuration.

create table access_logs as
select * 
from apachelog;

At this point, if you’ve got the BDD Hive Table Detector running, it should pick up the presence of the new hive table and ingest it into BDD (you can whitelist table names, and restrict it to certain Hive databases if needed). Or, you can manually trigger the ingestion from the Data Processing CLI on the BDD node, like this:

[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_logs;

The data processing process then creates an Apache Oozie job to sample a statistically relevant sample set of data into Apache Spark – with a 1% sample providing 95% sample accuracy – that is then profiled, enriched and loaded into the Big Data Discovery DGraph engine for further transformation, then exploration and analysis within Big Data Discovery Studio.


The profiling step in this process scans the incoming data and helps BDD determine the datatype of each Hive table column, the distribution of values within the column and so on, whilst the enrichment part identifies key words and phrases and other key lexical facts about the dataset. A key concept here also is that BDD typically works with a representative sample of your Hive table contents, not the whole contents, as all the data you analyse has to fit within the memory space of the DGraph engine, just like it used to with Endeca Server. At some point it’s likely that the functionality of the DGraph engine will be unbundled from the Endeca Server and run natively across the actual Hadoop cluster, but for now you have to separately ingest data into the DGraph engine (which can run clustered on BDD nodes) and analyse it there – however the rules of sampling are that if you’ve got a sufficiently big sample – say, 1m rows – regardless of the actual main dataset size this sample set is considered sufficiently representative – 95% in this case – as to make loading a bigger sample set not really worth the effort. But bear in mind when working with a BDD dataset that you’re working with a sample, not the full set, so if a value you’re looking for is missing it might be because it’s not in this particular sample.
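
As a rough illustration of why a large sample is “good enough” for proportions but can still miss rare values, the arithmetic can be sketched in Python. This is my own reading of the sampling argument using textbook formulas, not something taken from the BDD documentation: it assumes simple random sampling and interprets the 95% figure as a confidence level, and both function names are made up.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Half-width of a 95% confidence interval for an estimated proportion."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


def chance_value_missing(frequency, sample_size):
    """Probability that a value with the given row frequency never appears in the sample."""
    return (1 - frequency) ** sample_size


if __name__ == "__main__":
    n = 1_000_000
    # Worst case (proportion = 0.5): the estimate is within about 0.1 percentage points.
    print(f"margin of error with {n:,} rows: {margin_of_error(n):.5f}")
    # A value present in one row per million is absent from the sample about 37% of the time.
    print(f"chance a 1-in-a-million value is missing: {chance_value_missing(1e-6, n):.2f}")
```

The margin of error depends on the sample size rather than on the size of the full dataset, which is the reasoning behind not loading a bigger sample, while the second figure shows why a rare value can still be absent from any one sample.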

Once you’ve ingested the new dataset into BDD, you see it listed amongst the others that have previously been ingested, like this:


At this point you can explore the dataset, to take an initial look at the patterns and values in the dataset in its raw form.


Unfortunately, in this raw form the data in the access_logs table isn’t all that useful – details of the page request URL are mixed in with the HTTP protocol and method, for example; dates are in strings; details of the person accessing the site are in IP address format rather than a geographical location, and so on. In previous examples on this blog I’ve looked at various methods to cleanse, transform and enhance the data in log file tables like this, using tools and techniques such as Hive table transformations, Pig and Apache Spark scripts, and ODI mappings, but all of these typically require some IT involvement, whereas one of the hallmarks of recent versions of Endeca Information Discovery Studio was giving power-users the ability to transform and enrich data themselves. Big Data Discovery provides tools to cleanse, transform and enrich data, with menu items for common transformations and a Groovy script editor for more complex ones, including deriving sentiment values from textual data and stripping out HTML and formatting characters from text.
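To make the kind of cleanup involved concrete, here is a minimal sketch of two of the transformations mentioned: separating the request URL from the HTTP method and protocol, and turning a date string into a proper timestamp. It is written in Python purely for illustration (BDD itself uses menu items and Groovy), it assumes Apache-style log values, and the function names are hypothetical.

```python
from datetime import datetime

def split_request(request):
    """Split a request string such as 'GET /blog/ HTTP/1.1' into its parts."""
    parts = request.split(" ")
    if len(parts) != 3:
        # Malformed or truncated request lines are kept whole in the URL slot.
        return None, request, None
    method, url, protocol = parts
    return method, url, protocol

def parse_log_timestamp(raw):
    """Parse an Apache-style timestamp such as '25/Mar/2015:05:00:12 +0000'."""
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

method, url, protocol = split_request("GET /blog/ HTTP/1.1")
print(method, url, protocol)
print(parse_log_timestamp("25/Mar/2015:05:00:12 +0000").isoformat())
```

The month abbreviation is parsed with `%b`, which depends on the process locale, so this sketch assumes an English or C locale.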


Once you’ve finished transforming and enriching the dataset, you can either save (commit) the changes back to the sample dataset in the BDD DGraph engine, or you can use the transformation rules you’ve defined to apply those transformations to the entire Hive table contents back on Hadoop, with the transformation work being done using Apache Spark. Datasets are loaded into “projects” and each project can have its own transformed view of the raw data, with copies of the dataset being kept in the BDD DGraph engine to represent each team’s specific view onto the raw datasets.


In practice I found this didn’t, at the current product state, completely replace the need for a Hadoop developer or R data analyst – you need to get your data files into Hive and HCatalog at the start, which involves parsing and interpreting semi-structured data files, and I often did some transformations in BDD, then applied the transformations to the whole Hive dataset and then re-imported the results back into BDD to start from a simple known state. But it certainly made tasks such as turning IP addresses into countries and cities, splitting out URLs and removing HTML tags much easier, and I got the data cleansing process done in a matter of hours compared to the days it would take with manual Hive, Pig and Spark scripting.

Now the data in my log file dataset is much more usable and easy to understand, with URLs split out, status codes grouped into high-level descriptors, and other descriptive and formatting changes made.
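As an illustration of grouping status codes into high-level descriptors, the following minimal sketch maps an HTTP status code to its standard class. The labels follow the usual HTTP status classes; the exact descriptors used in the dataset above are not shown in the post, so treat the names here as assumptions.

```python
# Standard HTTP status classes, keyed by the first digit of the code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_group(status_code):
    """Return a high-level descriptor for a status code such as 404 or '200'."""
    code = int(status_code)  # log tables often hold the code as a string
    return STATUS_CLASSES.get(code // 100, "Unknown")

for code in ("200", "301", "404", "503"):
    print(code, status_group(code))
```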


I can also at this point bring in additional datasets, either created manually outside of BDD and ingested into the DGraph from Hive, or manually uploaded using the Studio interface. These dataset uploads then live in the BDD DGraph engine, and are then written back to Hive for long-term persistence or for sharing with other tools and processes.


These datasets can then be joined to the main dataset on matching dataset columns, giving you a table-join interface not unlike OBIEE’s physical model editor.
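Conceptually, joining on matching columns works like a keyed lookup. The sketch below shows a left join in plain Python, assuming each dataset is a list of dictionaries; the column names and sample rows are hypothetical and this is not how BDD implements the join internally.

```python
def left_join(main_rows, lookup_rows, main_key, lookup_key):
    """Attach matching lookup columns to every row of the main dataset."""
    # Index the lookup dataset by its join column for constant-time matching.
    lookup = {row[lookup_key]: row for row in lookup_rows}
    joined = []
    for row in main_rows:
        match = lookup.get(row[main_key], {})
        # Main dataset values win if both sides share a column name.
        joined.append({**match, **row})
    return joined

# Hypothetical data: page views joined to page metadata on the URL column.
views = [{"url": "/blog/bdd-part-1", "hits": 42}, {"url": "/about", "hits": 7}]
pages = [{"url": "/blog/bdd-part-1", "author": "Mark"}]

for row in left_join(views, pages, "url", "url"):
    print(row)
```

Rows in the main dataset with no match are kept unchanged, which mirrors the usual behaviour of a left outer join.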


So now we’re in a position where our datasets have been ingested into BDD, and we’ve cleansed, transformed and joined them into a combined web activity dataset. In tomorrow’s final post I’ll look at the data visualisation part of Big Data Discovery and see how it brings the capabilities of Endeca Information Discovery Studio to Hadoop.

Categories: BI & Warehousing

Introducing Oracle Big Data Discovery Part 1: “The Visual Face of Hadoop”

Rittman Mead Consulting - Mon, 2015-02-23 20:38

Oracle Big Data Discovery was released last week, the latest addition to Oracle’s big data tools suite that includes Oracle Big Data SQL, ODI and its Hadoop capabilities and Oracle GoldenGate for Big Data 12c. Introduced by Oracle as “the visual face of Hadoop”, Big Data Discovery combines the data discovery and visualisation elements of Oracle Endeca Information Discovery with data loading and transformation features built on Apache Spark to deliver a tool aimed at the “Discovery Lab” part of the Oracle Big Data and Information Management Reference Architecture.


Most readers of this blog will probably be aware of Oracle Endeca Information Discovery, based on the Endeca Latitude product acquired as part of the Endeca acquisition. Oracle positioned Endeca Information Discovery (OEID) in two main ways: on the one hand as a data discovery tool for textual and unstructured data that complemented the more structured analysis capabilities of Oracle Business Intelligence, and on the other hand, as a fast click-and-refine data exploration tool similar to Qlikview and Tableau.

The problem for Oracle though was that data discovery against files and documents is a bit of a “solution looking for a problem” and doesn’t have a naturally huge market (especially considering the license cost of OEID Studio and the Endeca Server engine that stores and analyzes the data), whereas Qlikview and Tableau are significantly cheaper than OEID (at least at the start) and are more focused on BI-type tasks, making OEID a good tool but not one with a mass market. To address this, whilst OEID will continue as a standalone tool, the data discovery and unstructured data analysis parts of OEID are making their way into this new product called Oracle Big Data Discovery, whilst the fast click-and-refine features will surface as part of Visual Analyzer in OBIEE12c.

More importantly, Big Data Discovery will run on Hadoop, making it a solution for a real problem – how to catalog, explore, refine and visualise the data in the data reservoir, where data has been landed that might be in schema-on-read databases, might need further analysis and understanding, and users need large-scale tooling to extract the nuggets of information that in time make their way into the “Execution” part of the Big Data and Information Management Reference Architecture. As someone who’s admired the technology behind Endeca Information Discovery but sometimes struggled to find real-life use-cases or customers for it, I’m really pleased to see its core technology applied to a problem space that I’m encountering every day with Rittman Mead’s customers.


In this first post, I’ll look at how Big Data Discovery is architected and how it works with Cloudera CDH5, the Hadoop distribution we use with our customers (Hortonworks HDP support is coming soon). In the next post I’ll look at how data is loaded into Big Data Discovery and then cataloged and transformed using the BDD front-end; then finally, we’ll take a look at exploring and analysing data using the visual capabilities of BDD evolved from the Studio tool within OEID. Oracle Big Data Discovery 1.0 is now GA (Generally Available) but as you’ll see in a moment you do need a fairly powerful setup to run it, at least until such time as Oracle release a compact install version running on a VM.

To run Big Data Discovery you’ll need access to a Hadoop install, which in most cases will consist of 6 (minimum 3 or 4, but 6 is the minimum we use) to 18 or so Hadoop nodes running Cloudera CDH5.3. BDD generally runs on its own server nodes and itself can be clustered, but for our setup we ran 1 BDD node alongside 6 CDH5.3 Hadoop nodes looking like this:


Oracle Big Data Discovery is made up of three component types highlighted in red in the above diagram, two of which typically run on their own dedicated BDD nodes and another which runs on each node in the Hadoop cluster (though there are various install types including all on one node, for demo purposes):

  • The Studio web user interface, which combines the faceted search and data discovery parts of Endeca Information Discovery Studio with a lightweight data transformation capability
  • The DGraph Gateway, which brings Endeca Server search/analytics capabilities to the world of Hadoop, and
  • The Data Processing component that runs on each of the Hadoop nodes, and uses Hive’s HCatalog feature to read Hive table metadata and Apache Spark to load and transform data in the cluster

The Studio component can run across several nodes for high-availability and load-balancing, whilst the DGraph element can run on a single node as I’ve set it up, or in a cluster with a single “leader” node and multiple “follower” nodes, again for enhanced availability and throughput. The DGraph part then works alongside Apache Spark to run intensive search and analytics on subsets of the whole Hadoop dataset, with sample sets of data being moved into the DGraph engine and any resulting transformations then being applied to the whole Hadoop dataset using Apache Spark. All of this then runs as part of the wider Oracle Big Data product architecture, which uses Big Data Discovery and Oracle R for the discovery lab and Oracle Exadata, Oracle Big Data Appliance and Oracle Big Data SQL to take discovery lab innovations to the wider enterprise audience.


So how does Oracle Big Data Discovery work in practice, and what’s a typical workflow? How does it give us the capability to make sense of structured, semi-structured and unstructured data in the Hadoop data reservoir, and how does it look from the perspective of an Oracle Endeca Information Discovery developer, or an OBIEE/ODI developer? Check back for the next parts in this three part series where I’ll first look at the data transformation and exploration capabilities of Big Data Discovery, and then look at how the Studio web interface brings data discovery and data visualisation to Hadoop.

Categories: BI & Warehousing

Next Generation Outline Extractor - New Version Available

Tim Tow - Mon, 2015-02-16 14:42
Today we released a new version of the Next Generation Outline Extractor. Here are the release notes from this new version:

Version supports the following Essbase versions:


Issues resolved in version

2015.02.15 - Issue 1355 - All Writers - Add functionality to replace all line feeds, carriage returns, tabs, and extraneous spaces in formulas

2015.02.13 - Issue 1354 - RelationalWriter - Changed the default database name from dodeca to extractor


2015.02.13 - Issue 1352 - RelationalWriter - Added Server, Application, and Cube columns to the CACHED_OUTLINE_VERSIONS table

2015.02.13 - Issue 1351 - Fixed issue with LoadFileWriter where UDA column headers were incorrectly written in the form UDAS0,DimName instead of UDA0,DimName
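The formula cleanup described in Issue 1355 can be pictured with a short sketch. This is an assumption about the behaviour, written in Python for illustration; it is not the Extractor's own code, and the function name and sample formula are hypothetical.

```python
import re

def normalize_formula(formula):
    """Collapse line feeds, carriage returns, tabs and repeated spaces.

    Every run of whitespace becomes a single space and the ends are trimmed.
    Note that \\s also matches form feeds and vertical tabs, and that this
    simple version would also collapse whitespace inside quoted member names.
    """
    return re.sub(r"\s+", " ", formula).strip()

raw = 'IF(@ISMBR("Jan"))\r\n\t"Sales" * 2;  \nENDIF'
print(normalize_formula(raw))
```

Flattening formulas this way keeps each one on a single line, which matters when the output is a delimited file with one member per row.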

In addition, a number of fixes, etc, were put into 2.0.2 and earlier releases and those releases went unannounced.  Those updates included the following items:

  1. There is no longer a default .properties file for the Extractor.  This will force a user to specify a .properties file.  (
  2. Removed the "/" character as a switch for command line arguments as it causes problems in Linux. (
  3. Fixed issue when combining MaxL input with relational output where a "not supported" error message would appear because certain properties were not being read correctly from the XML file (
  4. Command line operations resulted in an error due to an improper attempt to interact with the GUI progress bar. (
  5. Shared member attributes were not being properly written, resulting in a delimiter/column count mismatch. (
  6. Added encoding options where a user can choose between UTF-8 and ANSI encodings.  The Extractor will attempt to detect encoding from selected outline and, if the detected outline encoding is different from the user selected outline encoding, a warning message appears.
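The detect-and-warn behaviour in item 6 can be sketched as follows. This is only an illustration in Python of one common heuristic (try UTF-8 first, fall back to ANSI), operating on raw bytes; how the Extractor actually detects the outline encoding is not described here, and both function names are hypothetical.

```python
import codecs

def detect_encoding(raw_bytes):
    """Guess whether the bytes are UTF-8 or a single-byte ANSI code page."""
    if raw_bytes.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return "ANSI"
    # Pure ASCII is valid in both encodings; this sketch reports it as UTF-8.
    return "UTF-8"

def encoding_warning(raw_bytes, selected):
    """Return a warning string when detection disagrees with the user's choice."""
    detected = detect_encoding(raw_bytes)
    if detected != selected:
        return f"Warning: detected {detected} but {selected} was selected"
    return None

print(encoding_warning("café".encode("cp1252"), "UTF-8"))
```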
Categories: BI & Warehousing

Ephemeral Port Issue with Essbase Has Been Fixed!

Tim Tow - Fri, 2015-02-13 09:24
The issue that has plagued a number of Essbase customers over the years related to running out of available ports has finally been fixed!

This issue, which often manifested itself with errors in the Essbase error 10420xx range, was caused by how the Essbase Java API communicated with the server. In essence, whenever a piece of information was needed, the Essbase Java API grabbed a port from the pool of available ports, did its business, and then released the port back to the pool. That doesn’t sound bad, but the problem occurs due to how Windows handles this pool of ports. Windows will put the port into a timeout status for a period of time before it makes the port available for reuse and the default timeout in Windows is 4 minutes! Further, the size of the available pool of ports is only about 16,000 ports in the later versions of Windows. That may sound like a lot of ports, but the speed of modern computers makes it possible, and even likely, that certain operations, such as the outline APIs, that call Essbase many, many times to get information would be subject to this issue. Frankly, we see this issue quite often with both VB and the Java Essbase Outline Extractors.
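A quick back-of-the-envelope calculation shows how easily the pool can run dry. The sketch below uses the two figures from the text (about 16,000 ports and a 4 minute timeout) and assumes, for simplicity, that every API call takes one fresh port at a steady rate; the call rates are hypothetical.

```python
POOL_SIZE = 16_000          # approximate ephemeral port pool, from the text
TIME_WAIT_SECONDS = 4 * 60  # default Windows timeout, from the text

def sustainable_rate(pool_size, time_wait_seconds):
    """Highest steady rate of new connections per second the pool can absorb."""
    return pool_size / time_wait_seconds

def seconds_until_exhaustion(pool_size, time_wait_seconds, connections_per_second):
    """Return how long the pool lasts, or None if it never runs out."""
    if connections_per_second <= sustainable_rate(pool_size, time_wait_seconds):
        return None
    # Above the sustainable rate the pool empties before the first port
    # has finished its timeout, so no ports have been returned yet.
    return pool_size / connections_per_second

print(f"sustainable: {sustainable_rate(POOL_SIZE, TIME_WAIT_SECONDS):.1f} calls/sec")
for rate in (50, 500):  # hypothetical call rates
    print(rate, seconds_until_exhaustion(POOL_SIZE, TIME_WAIT_SECONDS, rate))
```

Under these assumptions anything much above a few dozen calls per second cannot be sustained, which is well within reach of an outline extraction that calls Essbase many times per member.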

We brought this issue to the attention of the Java API team and assisted them by testing a prerelease version of the Java API jars. I am happy to report the fix was released with Essbase In addition, there is a new setting that allows you to turn the optimization on or off:


It is our understanding that this optimization is turned on by default. I also checked the default files shipped with both Essbase and and did not see that setting in those files. It may be one of those settings that is there in case it messes something else up. The work of our own Jay Zuercher in our labs and searching Oracle Support seems to have confirmed that thought. There is apparently an issue where EIS drill-through reports don't work in Smart View if socket optimization is turned on. It is documented in Oracle Support Doc ID 1959533.1.

There is also another undocumented setting:


According to Oracle development, this value defaults to 300 ms but there should be little need to ever change it. The only reason it is there is to tune socket optimization in case more than 2 sockets are used per Java API session.

Jay also tested the version in our labs with the Next Generation Outline Extractor. With the default settings, one large test outline we have, "BigBad", with about 120,000 members in it, extracted in 1 minute and 50 seconds.  With socket optimization turned off, the same outline was only about 25% complete after 2 hours.   In summary, this fix will be very useful for a lot of Oracle customers.
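To get a feel for the size of that difference, the sketch below extrapolates the unoptimized run to completion. It assumes the extraction would have continued at the same pace for the remaining 75%, which the test did not actually measure, so the result is only a rough estimate; the function name is hypothetical.

```python
def projected_speedup(optimized_seconds, elapsed_seconds, fraction_complete):
    """Extrapolate a partial run linearly and compare it with the optimized run."""
    projected_total = elapsed_seconds / fraction_complete
    return projected_total, projected_total / optimized_seconds

# Figures from the test above: 1 min 50 sec versus about 25% after 2 hours.
total, speedup = projected_speedup(1 * 60 + 50, 2 * 60 * 60, 0.25)
print(f"projected unoptimized run: {total / 3600:.1f} hours")
print(f"approximate speedup: {speedup:.0f}x")
```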
Categories: BI & Warehousing