Skip navigation.

Rittman Mead Consulting

Syndicate content
Delivering Oracle Business Intelligence
Updated: 4 hours 31 min ago

Getting The Users’ Trust – Part 2

Thu, 2014-09-18 04:35

Last time I wrote about the performance aspects of a BI system and how they could affect a user’s confidence. I concluded by mentioning that incorrect data might be generated by poorly coded ETL routines causing data loss or duplication. This time I am looking more at the quality of the data we load (or don’t load).

Back in the 1990’s I worked with a 4.5 TB DWH that had a single source for fact and reference data, that is the data loaded was self-consistent. Less and less these days we find a single source DWH to be the case; we are adding multiple data sources (both internal and external). Customers can now appear on CRM, ERP, social media, credit referencing, loyalty, and a whole host of other systems. This proliferation of data sources gives rise to a variety of issues we need to be at least aware of, and in reality, should be actively managing. Some of these issues require us to work out processing rules within our data warehouse such as what do we do with fact data that arrives before its supporting reference data; I once had a system where our customer source could only be extracted once a week but purchases made by new customers would appear in our fact feed immediately after customer registration. Obviously, it is a business call on whether we publish facts that involve yet to be loaded customers straight away or defer those loads until the customer has been processed in the DWH. In the case of my example we needed to auto-create new customers in the data warehouse with just the minimum of data, the surrogate key and the business key and then do a SCD type 1update when the full customer data profile is loaded the following week. Technical issues such as these are trivial, we formulate and agree a business rule to define our actions and we implement it in our ETL or, possibly, the reporting code. In my opinion the bigger issues to resolve are in Data Governance and Data Quality.

Some people combine Data Quality and Governance together as a single topic and believe that a single solution will put all right. However, to my mind, they are completely separate issues. Data quality is about the content of the data and governance is about ownership, providence and business management of the data. Today, Data Governance is increasingly becoming a regulatory requirement, especially in finance.

Governance is much more than the data lineage tools we might access in ETL tools such as ODI and even OWB. ETL lineage is about source to target mappings; our ability to say that ‘bank branch name’ comes from this source attribute, travels through these multiple ODI mappings and finally updates that column in our BANK_BRANCH dimension table. In true Data Governance we probably do some or all of these:

  • Create a dictionary of approved business terms. This will define every attribute in business terms and also provide translations between geographic and business-unit centric ways of viewing data. In finance one division may talk about “customer”, another division will say “investor”, a third says “borrower”; in all three cases we are really talking about the same kind of object, a person. This dictionary should go down to the level of individual attribute and measures and include the type of data being held such as text, currency, date-time, these data types are logical types and not physical types as seen on the actual sources. It is important that this dictionary is shared throughout the organisation and is “the true definition” of what is reported.
  • Define ownership (or stewardship) for the approved business data item.
  • Map business data sources and targets to our approved list of terms (at attribute level). It is very possible that some attributes will have multiple potential sources, in such cases we must specify which source will be the master source.
  • Define processes to keep our business data aligned.  
  • Define ownership for the sources for design (and for static data such as ISO country codes, content) change accountability. Possibility integrate into change notification mechanism of change process.
  • Define data release processes for approved external reference data.
  • Define data access and redaction rules for compliance purposes.
  • Build-in audit and control.
As you can see we are not, in the main, talking data content, instead we are improving our description of the business data over that are already held in database data dictionaries and XSD files. This is still metadata and is almost certainly best managed in some kind of Data Governance application. One tool we might consider for this is Oracle Data Relationship Manager from the Hyperion family of products. If we want to go more DIY it may be possible to leverage some of the data responsibility features of Oracle SQL Developer Data Modeller.

Whereas governance is about using the right data and having processes and people to guarantee it is correctly sourced, Data Quality is much finer in grain and looks at the actual content. Here a tool such as Oracle Enterprise Data Quality is invaluable. By the way I have noticed that OEDQ version 12 has recently been released, I have a blog on this in the pipeline.

I tend to divide Data Quality into three disciplines:

  • Data Profiling is always going to be our first step. Before we fix things we need to know what to fix! Generally, we try to profile a sample of the data and assess it column by column, row by row to build a picture of the actual content. Typically we look at data range, nulls, number of distinct values and in the case of text data: character types used (alpha, letter case, numeric, accents, punctuation etc), regular expressions. From this we develop a plan to tackle quality, for example on a data entry web-page we may want to tighten processing rules to prevent certain “anticipated” errors; more usually we come up with business rules to apply in our next stage. 
  • Data Assessment. Here we test the full dataset against the developed rules to identify data that conforms or needs remedy. This remedy could be referring the data back to the source system owner for correction, providing a set of data fixes to apply to the source which can be validated and applied as a batch, creating processes to “fix” data on the source at initial data entry, or (and I would strongly advise against this for governance reasons) dynamically fix in an ETL process. The reason I am against fixing data downstream in ETL is that the data we report on in our Data Warehouse is not going to match the source and this will be problematic when we try to validate if our data warehouse fits reality.
  • Data de-duplication. This final discipline of our DQ process is the most difficult, identifying data that is potentially duplicated in our data feed. In data quality terms a duplicate is where two or more rows refer to what is probably (statistically) the same item, this is a lot more fuzzy than an exact match in database terms; people miskey data, call centre staff mis-hear names, companies merge and combine data sets, I have even seen customers registering a new email address because they can not be bothered to reset their password on a e-selling website. De-duplication is important to improve the accuracy of BI in general, it is nigh-on mandatory for organisations that need to manage risk and prevent fraud.
Data Quality is so important to trusted BI; without it we run the risk that our dimensions do not roll-up correctly and that we under-report by separating our duplicates. However, being correct at the data warehouse is only part of the story, these corrections also need to be on the sources; to do that we have to implement processes and disciplines throughout the organisation.   For BI that users can trust we need to combine both data management disciplines. From governance we need to be sure that we are using the correct business terms for all attributes and that the data displayed in those attributes has made the correct journey from the original source. From quality we gain confidence that we are correctly aggregating data in our reporting.   At the end of the day we need to be right to be trusted.

 

 

Categories: BI & Warehousing

Getting The Users’ Trust – Part 1

Wed, 2014-09-17 03:02

Looking back over some of my truly ancient Rittman Mead blogs (so old in fact that they came with me when I joined the company soon after Rittman Mead was launched), I see recurrent themes on why people “do” BI and what makes for successful implementations. After all, why would an organisation wish to invest serious money in a project if it does not give value either in terms of cost reduction or increasing profitability through smart decisions. This requires technology to provide answers and a workforce that is both able to use this technology and has faith that the answers returned allow them to do their jobs better. Giving users this trust in the BI platform generally boils down to resolving these three issues: ease of use of the reporting tool, quickness of data return and “accuracy” or validity of the response. These last two issues are a fundamental part of my work here at Rittman Mead and underpin all that I do in terms of BI architecture, performance, and data quality. Even today as we adapt our BI systems to include Big Data and Advanced Analytics I follow the same sound approaches to ensure usable, reliable data and the ability to analyse it in a reasonable time.

Storage is cheap so don’t aggregate away your knowledge. If my raw data feed is sales by item by store by customer by day and I only store it in my data warehouse as sales by month by state I can’t go back to do any analysis on my customers, my stores, my products. Remember that the UNGROUP BY only existed in my April Fools’ post. Where you choose to store your ‘unaggregated’ data may well be different these days; Hadoop and schema on read paradigms often being a sensible approach. Mark Rittman has been looking at architectures where both the traditional DWH and Big Data happily co-exist.

When improving performance I tend to avoid tuning specific queries, instead I aim to make frequent access patterns work well. Tuning individual queries is almost always not a sustainable approach in BI; this week’s hot, ‘we need the answer immediately’ query may have no business focus next week. Indexes that we create to make a specific query fly may have no positive effect on other queries; indeed, indexes may degrade other aspects of BI performance such as increased data load times and have subtle effects such as changing a query plan cost so that groups of materialized views are no longer candidates in query re-write (this is especially true when you use nested views and the base view is no longer accessed).

My favoured performance improvement techniques are: correctly placing the data be it clustering, partitioning, compressing, table pinning, in-memory or whatever, and making sure that the query optimiser knows all about the nature of the data; again and again “right” optimiser information is key to good performance. Right is not just about running DBMS_STATS.gather_XXX over tables or schemas every now and then; it is also about telling the optimiser about data relationships between data items. Constraints describe the data, for example which columns allow NULL values, which columns are part of parent-child relationships (foreign keys). Extended table statistics can help describe relationships between columns in a single table for example in a product dimensions table the product sub-category and the product category columns will have an interdependence, without that knowledge cardinality estimates can be very wrong and favour nested loop style plans that could be very poor performing on large data sets.

Sometimes we will need to create aggregates to answer queries quickly; I tend to build ‘generic’ aggregates, those that can be used by many queries. Often I find that although data is loaded frequently, even near-real-time, many business users wish to look at larger time windows such as week, month, or quarter; In practice I see little need for day level aggregates over the whole data warehouse timespan, however, there will always be specific cases that might require day-level summaries. If I build summary tables or use Materialized Views I would aim to make tables that are at least 80% smaller than the base table and to avoid aggregates that partially roll up many dimensional hierarchies; customer category by product category by store region by month would probably not be the ideal aggregate for most real-user queries. That said Oracle does allow us to use fancy grouping semantics in the building of aggregates (grouping sets, group by rollup and group by cube.) The in-database Oracle OLAP cube functionality is still alive and well (and was given a performance boost in Oracle 12c); it may be more appropriate to aggregate in a cube (or relational-look-alike) rather than individual summaries.

Getting the wrong results quickly is no good, we must be sure that the results we display are correct. As professional developers we test to prove that we are not losing or gaining data through incorrect joins and filters, but ETL coding is often the smallest factor in “incorrect results” and this brings me to part 2, Data Quality.

Categories: BI & Warehousing

Rittman Mead/Oracle Data Integration Speakeasy @ Oracle Open World

Thu, 2014-09-11 10:59

If you are attending Oracle Open World this year and fancy bit of a different experience, come and join Rittman Mead and Oracle’s Data Integration teams for drinks and networking at 7pm on Tuesday 30th September at the Local Edition speakeasy on Market Street.

We will be providing a couple of hours of free drinks with the opportunity to quiz our leading data integration experts and Oracle’s data integration team about any aspect of the data integration toolset, architecture and our innovative implementation approaches, and to relax and kick back at the end of a long day. So whether you want to know about how ODI can facilitate your big data strategy, or implement data quality and data governance across your enterprise data architecture, please come along.

The Local Edition is located at 691 Market St, San Francisco, CA and the event runs from 7pm to 9pm. Please register here.

For further information on this event and the sessions we are presenting at Oracle Open World contact us at info@rittmanmead.com.

Categories: BI & Warehousing

Using Oracle GoldenGate for Trickle-Feeding RDBMS Transactions into Hive and HDFS

Wed, 2014-09-10 15:13

A few months ago I wrote a post on the blog around using Apache Flume to trickle-feed log data into HDFS and Hive, using the Rittman Mead website as the source for the log entries. Flume is a good technology to use for this type of capture requirement as it captures log entries, HTTP calls, JMS queue entries and other “event” sources easily, has a resilient architecture and integrates well with HDFS and Hive. But what if the source you want to capture activity for is a relational database, for example Oracle Database 12c? With Flume you’d need to spool the database transactions to file, whereas what you really want is a way to directly connect to the database engine and capture the changes from source.

Which is exactly what Oracle GoldenGate does, and what most people don’t realise is that GoldenGate can also load data into HDFS and Hive, as well as the usual database targets. Hive and HDFS aren’t fully-supported targets yet, you can use the Oracle GoldenGate for Java adapter to act as the handler process and then land the data in HDFS files or Hive tables on your target Hadoop platform. My Oracle Support has two tech nodes, “Integrating OGG Adapter with Hive (Doc ID 1586188.1)” and “Integrating OGG Adapter with HDFS (Doc ID 1586210.1)” that give example implementations of the Java adapters you’d need for these two target types, with the overall end-to-end process for landing Hive data looking like the diagram below (and the HDFS one just swapping out HDFS for Hive at the handler adapter stage)

NewImage

This is also a good example of the sorts of technology we’d use to implement the “data factory” concept within the new Oracle Information Management Reference Architecture, the part of the architecture that moves data between the Hadoop and NoSQL-based Data Reservoir, and the relationally-stored enterprise information store; in this case, trickle-feeding transactional data from the Oracle database into Hadoop, perhaps to archive it at lower-cost than we could do in an Oracle database, or to add transaction activity data to a Hadoop-based application

NewImage

So I asked my colleague Nelio Guimaraes to set up a GoldenGate capture process on our Cloudera CDH5.1 Hadoop cluster, using GoldenGate 12.1.2.0.0 for our source Oracle 11gR2 database and Oracle GoldenGate for Java, downloadable separately on edelivery.oracle.com under Oracle Fusion Middleware > Oracle GoldenGate Application Adapters 11.2.1.0.0 for JMS and Flat File Media Pack. In our example, we’re going to capture activity on the SCOTT.EMP table in the Oracle database, and then perform the following step to set up replication from it into a replica Hive table:

  1. Create a table in Hive that corresponds to the table in Oracle database.
  2. Create a table in the Oracle database and prepare the table for replication.
  3. Configure the Oracle GoldenGate Capture to extract transactions from the Oracle database and create the trail file.
  4. Configure the Oracle GoldenGate Pump to read the trail and invoke the custom adapter
  5. Configure the property file for the Hive handler
  6. Code, Compile and package the custom Hive handler
  7. Execute a test. 
Setting up the Oracle Database Source Capture

Let’s go into the Oracle database first, check the table definition, and then connect to Hadoop to create a Hive table of the same column definition.

[oracle@centraldb11gr2 ~]$ sqlplus scott/tiger
SQL*Plus: Release 11.2.0.3.0 Production on Thu Sep 11 01:08:49 2014
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL> describe DEPT
 Name Null? Type
 ----------------------------------------- -------- ----------------------------
 DEPTNO NOT NULL NUMBER(2)
 DNAME VARCHAR2(14)
 LOC VARCHAR2(13)
SQL> exit
...
[oracle@centraldb11gr2 ~]$ ssh oracle@cdh51-node1
Last login: Sun Sep 7 16:11:36 2014 from officeimac.rittmandev.com
[oracle@cdh51-node1 ~]$ hive
...
create external table dept
(
 DEPTNO string, 
 DNAME string, 
 LOC string
) row format delimited fields terminated by '\;' stored as textfile
location '/user/hive/warehouse/department'; 
exit
...

Then I install Oracle Golden Gate 12.1.2 on the source Oracle database, just as you’d do for any Golden Gate install, and make sure supplemental logging is enabled for the table I’m looking to capture. Then I go into the ggsci Golden Gate command-line utility, to first register the user it’ll be connecting as, and what table it needs to capture activity for.

[oracle@centraldb11gr2 12.1.2]$ cd /u01/app/oracle/product/ggs/12.1.2/
[oracle@centraldb11gr2 12.1.2]$ ./ggsci
$ggsci> DBLOGIN USERID sys@ctrl11g, PASSWORD password sysdba
$ggsci> ADD TRANDATA SCOTT.DEPT COLS(DEPTNO), NOKEY

GoldenGate uses a number of components to replicate data from source to targets, as shown in the diagram below.

NewImageFor our purposes, though, there are just three that we need to configure; the Extract component, which captures table activity on the source; the Pump process that moves data (or the “trail”) from source database to the Hadoop cluster; and the Replicat component that takes that activity and applies it to the target tables. In our example, the extract and pump processes will be as normal, but we need to create a custom “handler” for the target Hive table that uses the Golden Gate Java API and the Hadoop FS Java API.

The tool we use to set up the extract and capture process is ggsci, the command-line Golden Gate Software Command Interface. I’ll use it first to set up the Manager process that runs on both source and target servers, giving it a port number and connection details into the source Oracle database.

$ggsci> edit params mgr
PORT 7809
USERID sys@ctrl11g, PASSWORD password sysdba
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/12.1.2/dirdat/*, USECHECKPOINTS

Then I create two configuration files, one for the extract process and one for the pump process, and then use those to start those two processes.

$ggsci> edit params ehive
EXTRACT ehive
USERID sys@ctrl11g, PASSWORD password sysdba
EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, FORMAT RELEASE 11.2
TABLE SCOTT.DEPT;
$ggsci> edit params phive
EXTRACT phive
RMTHOST cdh51-node1.rittmandev.com, MGRPORT 7809
RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, FORMAT RELEASE 11.2
PASSTHRU
TABLE SCOTT.DEPT;
$ggsci> ADD EXTRACT ehive, TRANLOG, BEGIN NOW
$ggsci> ADD EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, EXTRACT ehive
$ggsci> ADD EXTRACT phive, EXTTRAILSOURCE /u01/app/oracle/product/ggs/12.1.2/dirdat/et
$ggsci> ADD RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, EXTRACT phive

As the Java event handler on the target Hadoop platform won’t be able to ordinarily get table metadata for the source Oracle database, we’ll use the defgen utility on the source platform to create the parameter file that the replicat process will need.

$ggsci> edit params dept
defsfile ./dirsql/DEPT.sql
USERID ggsrc@ctrl11g, PASSWORD ggsrc
TABLE SCOTT.DEPT;
./defgen paramfile ./dirprm/dept.prm NOEXTATTR

Note that NOEXTATTR means no extra attributes; because the version on target is a generic and minimal version, the definition file with extra attributes won’t be interpreted. Then, this DEPT.sql file will need to be copied across to the target Hadoop platform where you’ve installed Oracle GoldenGate for Java, to the /dirsql folder within the GoldenGate install. 

[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password: 
Last login: Wed Sep 10 17:05:49 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]
$ pwd/u01/app/oracle/product/ggs/11.2.1
[oracle@cdh51-node1 11.2.1]$ ls dirsql/
DEPT.sql

Then, going back to the source Oracle database platform, we’ll start the Golden Gate Monitor process, and then the extract and pump processes.

[oracle@cdh51-node1 11.2.1]$ ssh oracle@centraldb11gr2
oracle@centraldb11gr2's password: 
Last login: Thu Sep 11 01:08:18 2014 from bdanode1.rittmandev.com
GGSCI (centraldb11gr2.rittmandev.com) 7> start mgr
Manager started.
 
GGSCI (centraldb11gr2.rittmandev.com) 8> start ehive
 
Sending START request to MANAGER ...
EXTRACT EHIVE starting
 
GGSCI (centraldb11gr2.rittmandev.com) 9> start phive
 
Sending START request to MANAGER ...
EXTRACT PHIVE starting

Setting up the Hadoop / Hive Replicat Process

Setting up the Hadoop side involves a couple of similar steps to the source capture side; first we configure the parameters for the Manager process, then configure the extract process that will pull table activity off of the trail file, sent over by the pump process on the source Oracle database.

[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password: 
Last login: Wed Sep 10 21:09:38 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]$ ./ggsci
$ggsci> edit params mgr
PORT 7809
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/11.2.1/dirdat/*, usecheckpoints, minkeepdays 3
$ggsci> add extract tphive, exttrailsource /u01/app/oracle/product/ggs/11.2.1/dirdat/rt
$ggsci> edit params tphive
EXTRACT tphive
SOURCEDEFS ./dirsql/DEPT.sql
CUserExit ./libggjava_ue.so CUSEREXIT PassThru IncludeUpdateBefores
GETUPDATEBEFORES
TABLE SCOTT.DEPT;

Now it’s time to create the Java hander that will write the trail data to the HDFS files and Hive table. The My Oracle Support Doc.ID 1586188.1 I mentioned at the start of the article has a sample Java program called SampleHandlerHive.java that writes incoming transactions into an HDFS file within the Hive directory, and also writes it to a file on the local filesystem. To get this working on our Hadoop system, we created a new java source code file from the content in SampleHandlerHive.java, updated the path from hadoopConf.addResource to point the the correct location of core-site.xml, hdfs-site.xml and mapred-site.xml, and then compiled it as follows:

export CLASSPATH=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/*
javac -d . SampleHandlerHive.java

Successfully executing the above command created the SampleHiveHandler.class under /u01/app/oracle/product/ggs/11.2.1//dirprm/com/mycompany/bigdata. To create the JAR file that the GoldenGate for Java adapter will need, I then need to change directory to the /dirprm directory under the Golden Gate install, and then run the following commands:

jar cvf myhivehandler.jar com
chmod 755 myhivehandler.jar

I also need to create a properties file for this JAR to use, in the same /dirprm directory. This properties file amongst other things tells the Golden Gate for Java adapter where in HDFS to write the data to (the location where the Hive table keeps its data files), and also references any other JAR files from the Hadoop distribution that it’ll need to get access to.

[oracle@cdh51-node1 dirprm]$ cat tphive.properties 
#Adapter Logging parameters. 
gg.log=log4j
gg.log.level=info
 
#Adapter Check pointing  parameters
goldengate.userexit.chkptprefix=HIVECHKP_
goldengate.userexit.nochkpt=true
 
# Java User Exit Property
goldengate.userexit.writers=jvm
jvm.bootoptions=-Xms64m -Xmx512M -Djava.class.path=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/u01/app/oracle/product/ggs/11.2.1/dirprm:/u01/app/oracle/product/ggs/11.2.1/dirprm/myhivehandler.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-common-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-configuration-1.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-logging-1.1.3.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-lang-2.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop/conf.dist:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/guava-11.0.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/hadoop-auth-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-hdfs-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/commons-cli-1.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/protobuf-java-2.5.0.jar
 
#Properties for reporting statistics
# Minimum number of {records, seconds} before generating a report
jvm.stats.time=3600
jvm.stats.numrecs=5000
jvm.stats.display=TRUE
jvm.stats.full=TRUE
 
#Hive Handler.  
gg.handlerlist=hivehandler
gg.handler.hivehandler.type=com.mycompany.bigdata.SampleHandlerHive
gg.handler.hivehandler.HDFSFileName=/user/hive/warehouse/department/dep_data
gg.handler.hivehandler.RegularFileName=cinfo_hive.txt
gg.handler.hivehandler.RecordDelimiter=;
gg.handler.hivehandler.mode=tx

Now, the final step on the Hadoop side is to start its Golden Gate Manager process, and then start the Replicat and apply process.

GGSCI (cdh51-node1.rittmandev.com) 5> start mgr
 
Manager started. 
 
GGSCI (cdh51-node1.rittmandev.com) 6> start tphive
 
Sending START request to MANAGER ...
EXTRACT TPHIVE starting

Testing it All Out

So now I’ve got the extract and pump processes running on the Oracle Database side, and the apply process running on the Hadoop side, let’s do a quick test and see if it’s working. I’ll start by looking at what data is in each table at the beginning.

SQL> select * from dept;     

    DEPTNO DNAME  LOC
 ---------- -------------- -------------

10 ACCOUNTING  NEW YORK
20 RESEARCH  DALLAS
30 SALES  CHICAGO
40 OPERATIONS  BOSTON
50 TESTE  PORTO
60 NELIO  STS
70 RAQUEL  AVES
 
7 rows selected.

Over on the Hadoop side, there’s just one row in the Hive table:

hive> select * from customer;

OK 80MARCIA   ST

Now I’ll go back to Oracle and insert a new row in the DEPT table:

SQL> insert into dept (deptno, dname, loc)
  2  values (75, 'EXEC','BRIGHTON'); 

1 row created. 
SQL> commit; 

Commit complete.

And, going back over to Hadoop, I can see Golden Gate has added that record to the Hive table, by the Golden Gate for Java adapter writing the transaction to the underlying HDFS file.

hive> select * from customer;

OK 80MARCIA   ST
75 EXEC       BRIGHTON

So there you have it; Golden Gate replicating Oracle RBDMS transactions into HDFS and Hive, to complement Apache Flume’s ability to replicate log and event data into Hadoop. Moreover, as Michael Rainey explained in this three part blog series, Golden Gate is closely integrated into the new 12c release of Oracle Data Integrator, making it even easier to manage Golden Gate replication processes into your overall data loading project, and giving Hadoop developers and Golden Gate users access to the full set of load orchestration and data quality features in that product rather than having to rely on home-grown scripting, or Oozie.

Categories: BI & Warehousing

OBIEE SampleApp in The Cloud: Importing VirtualBox Machines to AWS EC2

Wed, 2014-09-10 01:40

Virtualisation has revolutionised how we work as developers. A decade ago, using new software would mean trying to find room on a real tin server to install it, hoping it worked, and if it didn’t, picking apart the pieces probably leaving the server in a worse state than it was to begin with. Nowadays, we can just launch a virtual machine to give a clean environment and if it doesn’t work – trash it and start again.
The sting in the tail of virtualisation is that full-blown VMs are heavy – for disk we need several GB just for a blank OS, and dozens of GB if you’re talking about a software stack such as Fusion MiddleWare (FMW), and the host machine needs to have the RAM and CPU to support it all too. Technologies such as Linux Containers go some way to making things lighter by abstracting out a chunk of the OS, but this isn’t something that’s reached the common desktop yet.

So whilst VMs are awesome, it’s not always practical to maintain a library of all of them on your local laptop (even 1TB drives fill up pretty quickly), nor will your laptop have the grunt to run more than one or two VMs at most. VMs like this are also local to your laptop or server – but wouldn’t it be neat if you could duplicate that VM and make a server based on it instantly available to anyone in the world with an internet connection? And that’s where The Cloud comes in, because it enables us to store as much data as we can eat (and pay for), and provision “hardware” at the click of a button for just as long as we need it, accessible from anywhere.

Here at Rittman Mead we make extensive use of Amazon Web Services (AWS) and their Elastic Computing Cloud (EC2) offering. Our website runs on it, our training servers run on it, and it scales just as we need it to. A class of 3 students is as easy to provision for as a class of 24 – no hunting around for spare servers or laptops, no hardware sat idle in a cupboard as spare capacity “just in case”.

One of the challenges that we’ve faced up until now is that all servers have had to be built from scratch in the cloud. Obviously we work with development VMs on local machines too, so wouldn’t it be nice if we could build VMs locally and then push them to the cloud? Well, now we can. Amazon offer a route to import virtual machines, and in this article I’m going to show how that works. I’ll use the superb SampleApp v406 VM that Oracle provide, because this is a great real-life example of a VM that is so useful, but many developers can find too memory-intensive to be able to run on their local machines all the time.

This tutorial is based on exporting a Linux guest VM from a Linux host server. A Windows guest probably behaves differently, but a Mac or Windows host should work fine since VirtualBox is supported on both. The specifics are based on SampleApp, but the process should be broadly the same for all VMs. 

Obtain the VM

We’re going to use SampleApp, which can be downloaded from Oracle.

  1. Download the six-part archive from http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples–167534.html
  2. Verify the md5 checksums against those published on the download page:
    [oracle@asgard sampleapp406]$ ll
    total 30490752
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 01:33 SampleAppv406.zip.001
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 01:30 SampleAppv406.zip.002
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:03 SampleAppv406.zip.003
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:34 SampleAppv406.zip.004
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:19 SampleAppv406.zip.005
    -rw-r--r-- 1 oracle oinstall 4977591522 Sep  9 02:53 SampleAppv406.zip.006
    [oracle@asgard sampleapp406]$ md5sum *
    2b9e11f69ada5f889088dd74b5229322  SampleAppv406.zip.001
    f8a1a5ae6162b20b3e9c6c888698c071  SampleAppv406.zip.002
    68438cfea87e8d3a2e2f15ff00dadf12  SampleAppv406.zip.003
    b71d9ace4f75951198fc8197da1cfe62  SampleAppv406.zip.004
    4f1a5389c9e0addc19dce6bbc759ec20  SampleAppv406.zip.005
    2c430f87e22ff9718d5528247eff2da4  SampleAppv406.zip.006
  3. Unpack the archive using 7zip — the instructions for SampleApp are very clear that you must use 7zip, and not another archive tool such as winzip.
    [oracle@asgard sampleapp406]$ time 7za x SampleAppv406.zip.001</code>7-Zip (A) [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
    p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,80 CPUs)
    
    Processing archive: SampleAppv406.zip.001
    
    Extracting SampleAppv406Appliance
    Extracting SampleAppv406Appliance/SampleAppv406ga-disk1.vmdk
    Extracting SampleAppv406Appliance/SampleAppv406ga.ovf
    
    Everything is Ok
    
    Folders: 1
    Files: 2
    Size: 31191990916
    Compressed: 5242880000
    
    real 1m53.685s
    user 0m16.562s
    sys 1m15.578s
  4. Because we need to change a couple of things on the VM first (see below), we’ll have to import the VM to VirtualBox so that we can boot it up and make these changes.You can import using the VirtualBox GUI, or as I prefer, the VBoxManage command line interface. I like to time all these things (just because, numbers), so stick a time command on the front:
    time VBoxManage import --vsys 0 --eula accept SampleAppv406Appliance/SampleAppv406ga.ovf

    This took 12 minutes or so, but that was on a high-spec system, so YMMV.
    [...]
    0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
    Successfully imported the appliance.
    
    real    12m15.434s
    user    0m1.674s
    sys     0m2.807s
Preparing the VM

Importing Linux VMs to Amazon EC2 will only work if the kernel is supported, which according to an AWS blog post includes Red Hat Enterprise Linux 5.1 – 6.5. Whilst SampleApp v406 is built on Oracle Linux 6.5 (which isn’t listed by AWS as supported), we have the option of telling the VM to use a kernel that is Red Hat Enterprise Linux compatible (instead of the default Unbreakable Enterprise Kernel – UEK). There are some other pre-requisites that you need to check if you’re trying this with your own VM, including a network adaptor configured to use DHCP. The aforementioned blog post has details.

  1. Boot the VirtualBox VM, which should land you straight in the desktop environment, logged in as the oracle user.
  2. We need to modify a file as root (superuser). Here’s how to do it graphically, or use vi if you’re a real programmer:
    1. Open a Terminal window from the toolbar at the top of the screen
    2. Enter
      sudo gedit /etc/grub.conf

      The sudo bit is important, because it tells Linux to run the command as root. (I’m on an xkcd-roll here: 1, 2)

    3. In the text editor that opens, you will see a header to the file and then a set of repeating sections beginning with title. These are the available kernels that the machine can run under. The default is 3, which is zero-based, so it’s the fourth title section. Note that the kernel version details include uek which stands for Unbreakable Enterprise Kernel – and is not going to work on EC2.
    4. Change the default to 0, so that we’ll instead boot to a Red Hat Compatible Kernel, which will work on EC2
    5. Save the file
  3. Optional steps:
    1. Whilst you’ve got the server running, add your SSH key to the image so that you can connect to it easily once it is up on EC2. For more information about SSH keys, see my previous blog post here, and a step-by-step for doing it on SampleApp here.
    2. Disable non-SSH key logins (in /etc/ssh/sshd_config, set PasswordAuthentication no and PubkeyAuthentication yes), so that your server once on EC2 is less vulnerable to attack. Particularly important if you’re using the stock image with Admin123 as the root password.
    3. Set up screen, and OBIEE and the database as a Linux service, both covered in my article here.
  4. Shutdown the instance by entering this at a Terminal window:

    sudo shutdown -h now

Export the VirtualBox VM to Amazon EC2

Now we’re ready to really get going. The first step is to export the VirtualBox VM to a format that Amazon EC2 can work with. Whilst they don’t explicitly support VMs from VirtualBox, they do support the VMDK format – which VirtualBox can create. You can do the export from the graphical interface, or as before, from the command line:

time VBoxManage export "OBIEE SampleApp v406" --output OBIEE-SampleApp-v406.ovf

0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Successfully exported 1 machine(s).

real    56m51.426s
user    0m6.971s
sys     0m12.162s

If you compare the result of this to what we downloaded from Oracle it looks pretty similar – an OVF file and a VMDK file. The only difference is that the VMDK file is updated with the changes we made above, including the modified kernel settings which are crucial for the success of the next step.

[oracle@asgard sampleapp406]$ ls -lh
total 59G
-rw------- 1 oracle oinstall  30G Sep  9 10:55 OBIEE-SampleApp-v406-disk1.vmdk
-rw------- 1 oracle oinstall  15K Sep  9 09:58 OBIEE-SampleApp-v406.ovf

We’re ready now to get all cloudy. For this, you’ll need:

  1. An AWS account
    1. You’ll also need your AWS account’s Access Key and Secret Key
  2. AWS EC2 commandline tools installed, along with a Java Runtime Environment (JRE) 1.7 or greater:

    wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
    sudo mkdir /usr/local/ec2
    sudo unzip ec2-api-tools.zip -d /usr/local/ec2
    # You might need to fiddle with the following paths and version numbers: 
    sudo yum install -y java-1.7.0-openjdk.x86_64
    cat >> ~/.bash_profile <<EOF
    export JAVA_HOME="/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre"
    export EC2_HOME=/usr/local/ec2/ec2-api-tools-1.7.1.1/
    export PATH=$PATH:$EC2_HOME/bin
    EOF<

  3. Set your credentials as environment variables:
    export AWS_ACCESS_KEY=xxxxxxxxxxxxxx
    export AWS_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxx
  4. Ideally a nice fat pipe to upload the VM file over, because at 30GB it is not trivial (not in 2014, anyway)

What’s going to happen now is we use an EC2 command line tool to upload our VMDK (virtual disk) file to Amazon S3 (a storage platform), from where it gets converted into an EBS volume (Elastic Block Store, i.e. a EC2 virtual disk), and from there attached to a new EC2 instance (a “server”/”VM”).

Before we can do the upload we need an S3 “bucket” to put the disk image in that we’re uploading. You can create one from https://console.aws.amazon.com/s3/. In this example, I’ve got one called rmc-vms – but you’ll need your own.

Once the bucket has been created, we build the command line upload statement using ec2-import-instance:

time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY

Points to note:

  • m3.large is the spec for the VM. You can see the available list here. In the AWS blog post it suggests only a subset will work with the import method, but I’ve not hit this limitation yet.
  • region is the AWS Region in which the EBS volume and EC2 instance will be built. I’m using ew-west-1 (Ireland), and it makes sense to use the one geographically closest to where you or your users are located. Still waiting for uk-yorks-1
  • architecture and platform relate to the type of VM you’re importing.

The upload process took just over 45 minutes for me, and that’s from a data centre with a decent upload:

[oracle@asgard sampleapp406]$ time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY
Requesting volume size: 200 GB
TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       0       Status       active  StatusMessage   Pending : Downloaded 0
Creating new manifest at rmc-vms/d77672aa-0e0b-4555-b368-79d386842112/OBIEE-SampleApp-v406-disk1.vmdkmanifest.xml
Uploading the manifest file
Uploading 31191914496 bytes across 2975 parts
0% |--------------------------------------------------| 100%
   |==================================================|
Done
Average speed was 11.088 MBps
The disk image for import-i-fh08xcya has been uploaded to Amazon S3
where it is being converted into an EC2 instance.  You may monitor the
progress of this task by running ec2-describe-conversion-tasks.  When
the task is completed, you may use ec2-delete-disk-image to remove the
image from S3.

real    46m59.871s
user    10m31.996s
sys     3m2.560s

Once the upload has finished Amazon automatically converts the VMDK (now residing on S3) into a EBS volume, and then attaches it to a new EC2 instance (i.e. a VM). You can monitor the status of this task using ec2-describe-conversion-tasks, optionally filtered on the TaskId returned by the import command above:

ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       3898992128
Status  active  StatusMessage   Pending : Downloaded 31149971456

This is now an ideal time to mention as a side note the Linux utility watch, which simply re-issues a command for you every x seconds (2 by default). This way you can leave a window open and keep an eye on the progress of what is going to be a long-running job

watch ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

Every 2.0s: ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya                                                             Tue Sep  9 12:03:24 2014

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       5848511808
Status  active  StatusMessage   Pending : Downloaded 31149971456

And whilst we’re at it, if you’re using a remote server to do this (as I am, to take advantage of the large bandwidth), you will find screen invaluable for keeping tasks running and being able to reconnect at will. You can read more about screen and watch here.

So back to our EC2 import job. To start with, the task will be Pending: (NB unlike lots of CLI tools, you read the output of this one left-to-right, rather than as columns with headings)

$ ec2-describe-conversion-tasks --region eu-west-1
TaskType        IMPORTINSTANCE  TaskId  import-i-ffvx6z86       ExpirationTime  2014-09-12T15:32:01Z    Status  active  StatusMessage   Pending InstanceID      i-b2245ef2
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5021144064      VolumeSize      60      AvailabilityZone        eu-west-1a      ApproximateBytesConverted       4707330352      Status  active  StatusMessage   Pending : Downloaded 5010658304

After a few moments it gets underway, and you can see a Progress percentage indicator: (scroll right in the code snippet below to see)

TaskType        IMPORTINSTANCE  TaskId  import-i-fgr0djcc       ExpirationTime  2014-09-15T15:39:28Z    Status  active  StatusMessage   Progress: 53%   InstanceID      i-c7692e87
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5582545920      VolumeId        vol-f71368f0    VolumeSize      20      AvailabilityZone        eu-west-1a      ApproximateBytesConverted       5582536640      Status  completed

Note that at this point you’ll see also see an Instance in the EC2 list, but it won’t launch (no attached disk – because it’s still being imported!)

If something goes wrong you’ll see the Status as cancelled, such as in this example here where the kernel in the VM was not a supported one (observe it is the UEK kernel, which isn’t supported by Amazon):

TaskType        IMPORTINSTANCE  TaskId  import-i-ffvx6z86       ExpirationTime  2014-09-12T15:32:01Z    Status  cancelled       StatusMessage   ClientError: Unsupported kernel version 2.6.32-300.32.1.el5uek       InstanceID      i-b2245ef2
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5021144064      VolumeId        vol-91b1c896    VolumeSize      60      AvailabilityZone        eu-west-1a      ApproximateBytesConverted    5021128688      Status  completed

After an hour or so, the task should complete:

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  completed       InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeId        vol-a383f8a4    VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBy
tesConverted    31191855472     Status  completed

At this point you can remove the VMDK from S3 (and should do, else you’ll continue to be charged for it), following the instructions for ec2-delete-disk-image

Booting the new server on EC2

Go to your EC2 control panel, where you should see an instance (EC2 term for “server”) in Stopped state and with no name.

Select the instance, and click Start on the Actions menu. After a few moments a Public IP will be shown in the details pane. But, we’re not home free quite yet…read on.

Firewalls

So this is where it gets a bit tricky. By default, the instance will have launched with Amazon’s Firewall (known as a Security Group) in place which – unless you have an existing AWS account and have modified the default security group’s configuration – is only open on port 22, which is for ssh traffic.

You need to head over to the Security Group configuration page, accessed in several ways but easiest is clicking on the security group name from the instance details pane:

Click on the Inbound tab and then Edit, and add “Custom TCP Rule” for the following ports:

  • 7780 (OBIEE front end)
  • 7001 (WLS Console / EM)
  • 5902 (oracle VNC)

You can make things more secure by allowing access to the WLS admin (7001) and VNC port (5902) to a specific IP address or range only.

Whilst we’re talking about security, your server is now open to the internet and all the nefarious persons out there, so you’ll be wanting to harden your server not least by resetting all the passwords to ones which aren’t publicly documented in the SampleApp user documentation!

Once you’ve updated your Security Group, you can connect to your server! If you installed the OBIEE and database auto start scripts (and if not, why not??) you should find OBIEE running just nicely on http://[your ip]:7780/analytics – note that the port is 7780, not 9704.

2014-09-09_20-21-23

If you didn’t install the script, you will need to start the services manually per the SampleApp documentation. To connect to the server you can ssh (using Terminal, PuTTY, etc) to the server or connect on VNC (Admin123 is the password). For VNC clients try Screen Share on Macs (installed by default), or RealVNC on Windows.

Caveats & Disclaimers
  • Running a server on AWS EC2 costs real money, so watch out. Once you’ve put your credit card details in, Amazon will continue to charge your card whilst there are chargeable items on your account (EBS volumes, instances – running or not- , and so on). You can get an idea of the scale of charges here.
  • As mentioned above, a server on the open internet is a lot more vulnerable than one virtualised on your local machine. You will get poked and probed, usually by automated scripts looking for open ports, weak passwords, and so on. SampleApp is designed to open the toybox of a pimped-out OBIEE deployment to you, it is not “hardened”, and you risk learning the tough way about the need for it if you’re not careful.
Cloning

Amazon EC2 supports taking a snapshot of a server, either for backup/rollback purposes or spinning up as a clone, using an Amazon Machine Image (AMI). From the Instances page, simply select “Create an Image” to build your AMI. You can then build another instance (or ten) from this AMI as needed, exact replicas of the server as it was at the point that you created the image.

Lather, Rinse, and Repeat

There’s a whole host of VirtualBox “appliances” out there, and some of them such as the developer-tools-focused ones only really make sense as local VMs. But there are plenty that would benefit from a bit of “Cloud-isation”, where they’re too big or heavy to keep on your laptop all the time, but are handy to be able to spin up at will. A prime example of this for me is the EBS Vision demo database that we use for our BI Apps training. Oracle used to provide an pre-built Amazon image (know as an AMI) of this, but since withdrew it. However, Oracle do publish Oracle VM VirtualBox templates for EBS 12.1.3 and 12.2.3 (related blog), so from this with a bit of leg-work and a big upload pipe, it’s a simple matter to brew your own AWS version of it — ready to run whenever you need it.

Categories: BI & Warehousing

Sunday Times Tech Track 100

Tue, 2014-09-09 14:35

Over the weekend, Rittman Mead was listed in the Sunday Times Tech Track 100. We are extremely proud to get recognition for the business as well as our technical capability and expertise.

A lot of the public face of Rittman Mead focuses on the tools and technologies we work with. Since day one we have had a core policy to share as much information as possible. Even before the advent of social media, we shared pretty much everything we knew through either our blog or by speaking at conferences, but we very rarely talk about the business itself. However, a lot of the journey we have gone through over the last 7 years has been about the growth and maintenance of a successful, sustainable, multi-national business. We have been able to talk about, educate and evangelise about the tools and technologies as a result of having the successful business to support this.

I remember during one interview we did several years ago the candidate asked (and I’m paraphrasing): “How do you guys make any money, all I see/read is people sitting in airports writing blog posts about leading edge technologies?”.

One massive benefit from this is we often face the same problems (albeit on a different scales) to those that we talk about with customers, so we have been able to better understand the underlying drivers and proposed solutions for our clients.

From a personal point of view, this has meant spending a lot more time looking at contracts as opposed to code and reading business books/blogs as opposed to technical ones. However, it has been well worth it and I would like to say thanks to all of those both inside and outside of the company who have helped contribute to this success.

Categories: BI & Warehousing

Analyzing Twitter Data using Datasift, MongoDB, Hive and ODI12c

Mon, 2014-09-08 14:39

Last week I posted an article on the blog around analysing Twitter data using Datasift, MongoDB and Pig, where I used the Datasift service to stream tweets about Rittman Mead into a MongoDB NoSQL database, and then queried the dataset using Pig. The context for this is the idea of a “data reservoir”, where we supplement the more traditional file and relational datasets we find in data warehouses with other data, typically machine generated, unstructured or very low-level, to add context to the numbers in our reporting system. In the example I quoted in the article, it’d be very interesting to take the activity we record against our blog and website and correlate that with the “conversation” that happens about it in the social media world; for example, were the hits for a particular article due to it been mentioned in a tweet, and did a spike in activity correspond to a particularly influential Twitter user retweeting something we’d tweeted?

NewImage

In that previous article I’d used Pig to access and analyse the data, in part because I saw a match between the nested datasets in a typical DataSift Twitter message and the relations, tuples and bags you get in a Pig schema. For example, if you look at the Tweet from Borkur in the screenshot below from RoboMongo, a Mac OS X client for MongoDB that I’ve found useful, you can see the author details nested inside the interaction details, and the Type attribute having many values under the Trends parent attribute – these map well onto Pig tuples and bags respectively.

NewImage

What I’d really like to do with this dataset, though, is to take certain elements of it and use that to supplement the data I’m loading using ODI12c. Whilst ODI can run arbitrary R, Pig and shell scripts using the ODI Procedure feature (as I did here to make use of Sqoop, before Oracle added Sqoop KMs to ODI12.1.3), it gets the best out of Hadoop when it can access data using Hive, the SQL layer over Hadoop that represents HDFS data as rows and columns, and allows us to SELECT and INSERT data using SQL commands – or to be precise, a dialect of SQL called HiveQL. But how will Hive cope with the nested and repeating data structures in a DataSift Twitter message, and allow us to get just the data out that we’re interested in?

In fact, the MongoDB connector for Hadoop that I used for Pig the other day also comes with Hive connectivity, in the form of a SerDe that lets Hive report against data in a MongoDB database (David Allen blogged about another MongoDB Hive storage handler a while ago, in an article about MongoDB and ODI). What’s more, this Hive connector for MongoDB is actually easier to work with that the Pig connector, as instead of worrying about Tuples and Bags you can just pick out the nested attributes that you’re interested in using a dot notation. For example, if I’m only interested in the InteractionID, username, tweet content and number of followers within a particular Twitter dataset, I can create a table that looks like this in Hive:

CREATE TABLE tweet_data(
  interactionId string,
  username string,
  content string,
  author_followers int)
ROW FORMAT SERDE 
  'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
  'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
  'mongo.columns.mapping'='{"interactionId":"interactionId",
  "username":"interaction.interaction.author.username",
  "content":\"interaction.interaction.content",
  "author_followers_count":"interaction.twitter.user.followers_count"}'
  )
TBLPROPERTIES (
  'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
  )

And at that point, it’s pretty easy to bring the dataset into ODI12c, through the IKM Hive to Hive Control Append knowledge module, and join up the Twitter dataset with the website log data that’s coming in via Flume. ODI can connect to Hive via JDBC drivers supplied with CDH4/5, and once you register the Hive connection and reverse-engineer the Hive metastore metadata into ODI’s repository, the complexity of the underlying Hive storage is hidden and you’re just presented with tables and columns, just like any other datastore type.

NewImage

Starting with the Twitter data first, I create a Hive table outside of ODI that returns the precise set of tweet attributes that I’m interested in, and then filter that dataset down to just the tweets that link to content on our website, by filtering on the tweet link’s URL matching the start of our website address.

NewImage

Then I load-up the hits from the Rittman Mead website, previously landed into Hadoop using Flume and exposed to ODI as another Hive table, filter out all the non-blog page accesses and keep just the URL part of the Apache Weblog request field, removing the transport mechanism and other bits around it.

NewImage

Then, I use a final ODI mapping to join the two datasets together, using ODI’s ability to apply HiveQL expressions to the incoming datasets so that’ve got the same format – trailing ‘/‘ at the end of the URL, no ampersand and query text at the end of the URL, and so on. Both this and the previous transformation are great examples of where ODI can help with this sort of work, making it pretty easy to munge and correct your data so that you’re then able to match-up the two different sources.

NewImage

Then it’s just a case of creating a package or load plan to sequence the mappings, and then run them using the local or standalone agent. You can see the individual KM steps running on the left-hand side, with ODI generating HiveQL queries which in turn are translated into MapReduce and run in parallel across the Hadoop cluster.

NewImage

And then, at the end of the process, I’ve got a Hive table of all of our blog articles that have been mentioned on Twitter (since we started consuming the tweet feed, a day or so ago), with the number of page requests and the number of times that page got mentioned in tweets.

NewImage

Obviously there’s a lot more we can do with this; we can access the number of followers each twitter user has, along with their location, gender and the sentiment (positive, negative, neutral) of the tweet. From that we can work out some impact from the twitter activity, and we can also add to it data from other sources such as Facebook, LinkedIn and so on to get a fuller picture of the activity around our site. Then, the data we’re gathering in can either be left in MongoDB, or I can use these ODI mappings to either archive it in Hive tables, or export the highlights out to Oracle Database using Sqoop or Oracle Loader for Hadoop.

Categories: BI & Warehousing

Analyzing Twitter Data using Datasift, MongoDB and Pig

Thu, 2014-09-04 17:08

If you followed our recent postings on the updated Oracle Information Management Reference Architecture, one of the key concepts we talk about is the “data reservoir”. This is a pool of additional data that you can add to your data warehouse, typically stored on Hadoop or NoSQL databases, where you store unstructured, semi-structured or unprocessed structured data in large volume and at low cost. Adding a data reservoir gives you the ability to leverage the types of data sources that previously were thought of as too “messy”, too high-volume to store for any serious amount of time, or require processing or storing by tools that aren’t in the usual relational data warehouse toolset.

NewImage

By formally including them in your overall information management architecture though, with common tools, security and data governance over the entire dataset, you give your users the ability to consider the whole “360-degree view” of their customers and their interactions with the market.

To take an example, a few weeks ago I posted a series of articles on the blog where I captured user activity on our website, http://www.rittmanmead.com, transported it to one of our Hadoop clusters using Apache Flume, and then analysed it using Hive, Pig and finally Spark. In one of the articles I used Pig and a geocoding API to determine the country that each website visitor came from, and then in a final five-part series I automated the whole process using ODI12c and then copied the final output tables to Oracle using Oracle Loader for Hadoop. This is quite a nice example of ETL-offloading into Hadoop, with an element of Hadoop-native event capture using Flume, but once the processing has finished the data moves out of Hadoop and into the Oracle database.

NewImage

What would be interesting though would be to start adding data into Hadoop that’s permanent, not transitory as part of an ETL process, to start building out this concept of the “data reservoir”. Taking our website activity dataset, something that would really add context to the visits to our site would be corresponding activity on social networks, to see who’s linking to our posts, who’s discussing them, whether those discussions are positive or negative, and which wider networks those people belong to. Twitter is a good place to start with this as it’s the place we see our articles and activities most discussed, but it’s be good to build out this picture over time to add in activity on social networks such as Facebook, Youtube, LinkedIn and Google+; if we did this, we’d be able to consider a much broader and richer picture when looking at activity around Rittman Mead, potentially correlating activity and visits to our website with mentions of us in the press, comments made by our team and the wider picture of what’s going on in our world.

NewImage

There are a number of ways you can bring Twitter data into your Hadoop cluster or data warehouse, but the most convenient way we’ve found is to use DataSift, a social media aggregation site and service that license raw feeds from the likes of Twitter, Facebook, WordPress and other social media platforms, enhancing the data feeds with sentiment scores and other attributes, and then sell access to the feeds via a number of formats and APIs. Accessing Twitter data through DataSift costs money, particularly if you want to go back and look at historical activity vs. just filtering on a few keywords in new Twitter activity, but they’re very developer-friendly and able to provide greater volumes of firehose activity than the standard Twitter developer API allows.

So assuming you can get access to a stream of Twitter data on a particular topic – in our case, all mentions of our website, our team’s Twitter handles, retweets of our content etc – the question then becomes one of how to store the data. Looking at the Datasift Sample Output page, each of these streams delivers their payloads via JSON documents, XML-like structures that nest categories of tweet metadata within parent structures that make up the total tweet data and metadata dataset.

NewImage

And there’s a good reason for this; individual tweets might not use every bit of possible tweet metadata, for example not including entries under “mentions” or “retweets” if those aren’t used in a  particular message. Certain bits of metadata might be repeated X numbers of times – @ mentions, for example, and the JSON document might have a different structure altogether if a different JSON schema version is used for a particular tweet. Altogether not an easy type of data structure for a relational database to hold – though Oracle 12.1.0.2.0 has just introduced native JSON support to the core Oracle database – but NoSQL databases in contrast find these sorts of data structures easy, and one of the most popular for this type of work is MongoDB.

MongoDB is a open-source “document” database that’s probably best known to the Oracle world through this internet cartoon; what the video is getting at is NoSQL advocates recommending databases such as MongoDB for large-scale web work when something much more mainstream like mySQL would do the job better, but where NoSQL and document-style database come into their own is storing just this type of semi-structured, schema-on-read datasets. In fact, Datasift support MongoDB as an API end-point for their Twitter feed, so let’s go ahead and set up a MongoDB database, prepare it for the Twitter data, and then set-up a Datasift feed into it.

MongoDB installation on Linux, for example to run alongside a Hadoop installation, is pretty straightforward and involves adding a YUM repository and then running “sudo yum install mongodb-org” (there’s an OS X installation too, but I wanted to run this server-side on my Hadoop cluster). Once you’ve installed the MongoDB software, you start the mongod service to enable the server element, and then log into the mongo command-shell to create a new database.

MongoDB, being a schema-on-read database, doesn’t require you to set up a database schema up-front; instead, the schema comes from the data you load into it, with MongoDB’s equivalent of tables called “collections”, and with those collections made up of documents, analogous to rows in Oracle. Where it gets interesting though is that collections and databases only get created when you first start using them, and individual documents can have slightly, or even completely, different schema structures to each other – which makes them ideal for holding the sorts of datasets generated by Twitter, Facebook and DataSift.

[root@cdh51-node1 ~]# mongo
MongoDB shell version: 2.6.4
connecting to: test
> use datasift2
switched to db datasift2

Let’s create a couple of simple documents, and then add those to a collection. Note that the document becomes available just by declaring it, as does the collection when I add documents to it. Note also that the query language we’re using to work with MongoDB is Javascript, again making it particularly suited to JSON documents, and web-type environments.

> a = { name : "mark" }
{ "name" : "mark" }
> b = { product : "chair", size : "L" }
{ "product" : "chair", "size" : "L" }
> db
datasift2
> db.testData.insert(a)
WriteResult({ "nInserted" : 1 })
> db.testData.insert(b)
WriteResult({ "nInserted" : 1 })
> db.testData.find()
{ "_id" : ObjectId("54094081b5b6021fe9bc8b10"), "name" : "mark" }
{ "_id" : ObjectId("54094088b5b6021fe9bc8b11"), "product" : "chair", "size" : "L" }

And note also how the second entry (document) in the collection has a different schema to the entry above it – perfect for our semi-structured Twitter data, and something we could store as-is in MongoDB in this loose data format and then apply more formal structures and schemas to when we come to access the data – as we’ll do in a moment using Pig, and more formally using ODI and Hive in the next article in this series.

Setting up the Twitter feed from DataSift is a two-stage process, once you’ve got an account with them and an API key; first you define your search terms against a nested document model for the data source, then you activate the feed, in this case into my MongoDB database, and wait for the tweets to roll in. For my feed I selected tweets written by myself and some of the Rittman Mead team, tweets mentioning us, and tweets that included links to our blog in the main tweet contents (there’s also a graphical query designer, but I prefer to write them by hand using what DataSift call their “curated stream definition language” (CSDL).

NewImage

You can then preview the feed, live, or go back and sample historic data if you’re interested in loading old tweets, rather than incoming new ones. Once you’re ready you then need to activate the feed, in my case by calling a URL using CURL with a bunch of parameters (our API key and other sensitive data has been masked):

curl -X POST 'https://api.datasift.com/v1/push/create' \
-d 'name=connectormongodb' \
-d 'hash=65bd9dc4943ec426b04819exxxxxxxxx' \
-d 'output_type=mongodb' \
-d 'output_params.host=rittmandev.com' \
-d 'output_params.port=27017' \
-d 'output_params.use_ssl=no' \
-d 'output_params.verify_ssl=no' \
-d 'output_params.db_name=datasiftmongodb' \
-d 'output_params.collection_name=rm_tweets' \
-H 'Auth: rittmanmead:xxxxxxxxxxxxxxxxxxxxxxxxxxx'

The “hash” in the parameter list is the specific feed to activate, and the output type is MongoDB. The collection name is new, and will be created by MongoDB when the first tweet comes in; let’s run the curl command now and sit back for a while, and wait for some twitter activity to arrive in MongoDB …

… and a couple of hours later, eight tweets have been captured by the DataSift filter, with the last of them being one from Michael Rainey about his trip tonight to the Seahawks game:

> db.rm_tweets.count()
8
> db.rm_tweets.findOne()
{
    "_id" : ObjectId("54089a879ad4ec99158b4d78"),
    "interactionId" : "1e43454b1a16a880e074e49c51369eac",
    "subscriptionId" : "f6cf211e03dca5da384786676c31fd3e",
    "hash" : "65bd9dc4943ec426b04819e6291ef1ce",
    "hashType" : "stream",
    "interaction" : {
        "demographic" : {
            "gender" : "male"
        },
        "interaction" : {
            "author" : {
                "avatar" : "http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "id" : 14551637,
                "language" : "en",
                "link" : "https://twitter.com/mRainey",
                "name" : "Michael Rainey",
                "username" : "mRainey"
            },
            "content" : "Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD",
            "created_at" : "Thu, 04 Sep 2014 16:58:29 +0000",
            "hashtags" : [
                "GoHawks",
                "kickoff2014",
                "GBvsSEA"
            ],
            "id" : "1e43454b1a16a880e074e49c51369eac",
            "link" : "https://twitter.com/mRainey/status/507573423334100992",
            "mention_ids" : [
                23642374
            ],
            "mentions" : [
                "Seahawks"
            ],
            "received_at" : 1409849909.2967,
            "schema" : {
                "version" : 3
            },
            "source" : "Instagram",
            "type" : "twitter"
        },
        "language" : {
            "tag" : "en",
                "tag_extended" : "en",
            "confidence" : 98
        },
        "links" : {
            "code" : [
                200
            ],
            "created_at" : [
                "Thu, 04 Sep 2014 16:58:29 +0000"
            ],
            "meta" : {
                "charset" : [
                    "CP1252"
                ],
                "lang" : [
                    "en"
                ],
                "opengraph" : [
                    {
                        "description" : "mrainey's photo on Instagram",
                        "image" : "http://photos-d.ak.instagram.com/hphotos-ak-xfa1/10655141_1470641446544147_1761180844_n.jpg",
                        "site_name" : "Instagram",
                        "type" : "instapp:photo",
                        "url" : "http://instagram.com/p/sh_h6sQBYT/"
                    }
                ]
            },
            "normalized_url" : [
                "http://instagram.com/p/sh_h6sQBYT"
            ],
            "title" : [
                "Instagram"
            ],
            "url" : [
                "http://instagram.com/p/sh_h6sQBYT/"
            ]
        },
        "salience" : {
            "content" : {
                "sentiment" : 0,
                "topics" : [
                    {
                        "name" : "Video Games",
                        "hits" : 0,
                        "score" : 0.5354745388031,
                        "additional" : "Greyson and I will be ready for the @Seahawks game tonight!"
                    }
                ]
            }
        },
        "trends" : {
            "type" : [
                "San Jose",
                "United States"
            ],
            "content" : [
                "seahawks"
            ],
            "source" : [
                "twitter"
            ]
        },
        "twitter" : {
            "created_at" : "Thu, 04 Sep 2014 16:58:29 +0000",
            "display_urls" : [
                "instagram.com/p/sh_h6sQBYT/"
            ],
            "domains" : [
                "instagram.com"
            ],
            "filter_level" : "medium",
            "hashtags" : [
                "GoHawks",
                "kickoff2014",
                "GBvsSEA"
            ],
            "id" : "507573423334100992",
            "lang" : "en",
            "links" : [
                "http://instagram.com/p/sh_h6sQBYT/"
            ],
            "mention_ids" : [
                23642374
            ],
            "mentions" : [
                "Seahawks"
            ],
            "source" : "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>",
            "text" : "Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD",
            "user" : {
                "created_at" : "Sat, 26 Apr 2008 21:18:01 +0000",
                "description" : "Data Integration (#ODI #GoldenGate #OBIA) consultant / blogger / speaker @RittmanMead.\nOracle ACE.\n#cycling #Seahawks #travel w/ @XiomaraRainey\n#GoCougs!",
                "favourites_count" : 746,
                "followers_count" : 486,
                "friends_count" : 349,
                "geo_enabled" : true,
                "id" : 14551637,
                "id_str" : "14551637",
                "lang" : "en",
                "listed_count" : 28,
                "location" : "Pasco, WA",
                "name" : "Michael Rainey",
                "profile_image_url" : "http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "profile_image_url_https" : "https://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "screen_name" : "mRainey",
                "statuses_count" : 8549,
                "time_zone" : "Pacific Time (US & Canada)",
                "url" : "http://www.linkedin.com/in/rainey",
                "utc_offset" : -25200,
                "verified" : false
            }
        }
    }
}

If you’ve not looked at Twitter metadata before, it’s surprising how much metadata accompanies what’s ostensibly an 140-character tweet. As well as details on the author, where the tweet was sent from, what Twitter client sent the tweet and details of the tweet itself, there’s details and statistics on the sender, the number of followers they’ve got and where they’re located, a list of all other Twitter users mentioned in the tweet and any URLs and images referenced.

Not every tweet will use every element of metadata, and some tweets will repeat certain attributes – other Twitter users you’ve mentioned in the tweet, for example – as many times as there are mentions. Which makes Twitter data a prime candidate for analysis using Pig and Spark, which handle easily the concept of nested data structures, tuples (ordered lists of data, such as attribute sets for an entity such as “author”), and bags (sets of unordered attributes, such as the list of @ mentions in a tweet).

There’s a MongoDB connector for Hadoop on Github which allows MapReduce to connect to MongoDB databases, running MapReduce jobs on MongoDB storage rather than HDFS (or S3, or whatever). This gives us the ability to use languages such as Pig and Hive to filter and aggregate our MongoDB data rather than MongoDB’s Javascript API, which isn’t as fully-featured and scaleable as MapReduce and has limitations in terms of the number of documents you can include in aggregations; let’s start then by connecting Pig to our MongoDB database, and reading in the documents with no Pig schema applied:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' using com.mongodb.hadoop.pig.MongoLoader;                                                                                                                                                  2014-09-05 06:40:51,773 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.                                                                                                                                        
2014-09-05 06:40:51,838 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> tweets_count = FOREACH (GROUP tweets ALL) GENERATE COUNT (tweets);                                             
2014-09-05 06:41:07,772 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:41:07,817 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> dump tweets_count
...
(9)
grunt>

So there’s nine tweets in the MongoDB database now. Let’s take a look at one of the documents by creating a Pig alias containing just a single record.

grunt> tweets_limit_1 = LIMIT tweets 1;
2014-09-05 06:43:12,351 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:43:12,443 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> dump tweets_limit_1
...
([interaction#{trends={source=(twitter), content=(seahawks), type=(San Jose,United States)}, twitter={filter_level=medium, text=Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD, mention_ids=(23642374), domains=(instagram.com), links=(http://instagram.com/p/sh_h6sQBYT/), lang=en, id=507573423334100992, source=<a href="http://instagram.com" rel="nofollow">Instagram</a>, created_at=Thu, 04 Sep 2014 16:58:29 +0000, hashtags=(GoHawks,kickoff2014,GBvsSEA), mentions=(Seahawks), user={profile_image_url_https=https://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, location=Pasco, WA, geo_enabled=true, statuses_count=8549, lang=en, url=http://www.linkedin.com/in/rainey, utc_offset=-25200, id=14551637, time_zone=Pacific Time (US & Canada), favourites_count=746, verified=false, friends_count=349, description=Data Integration (#ODI #GoldenGate #OBIA) consultant / blogger / speaker @RittmanMead.
Oracle ACE.
#cycling #Seahawks #travel w/ @XiomaraRainey
#GoCougs!, name=Michael Rainey, created_at=Sat, 26 Apr 2008 21:18:01 +0000, screen_name=mRainey, id_str=14551637, profile_image_url=http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, followers_count=486, listed_count=28}, display_urls=(instagram.com/p/sh_h6sQBYT/)}, salience={content={topics=([score#0.5354745388031,additional#Greyson and I will be ready for the @Seahawks game tonight!,hits#0,name#Video Games]), sentiment=0}}, links={created_at=(Thu, 04 Sep 2014 16:58:29 +0000), title=(Instagram), code=(200), normalized_url=(http://instagram.com/p/sh_h6sQBYT), url=(http://instagram.com/p/sh_h6sQBYT/), meta={lang=(en), charset=(CP1252), opengraph=([image#http://photos-d.ak.instagram.com/hphotos-ak-xfa1/10655141_1470641446544147_1761180844_n.jpg,type#instapp:photo,site_name#Instagram,url#http://instagram.com/p/sh_h6sQBYT/,description#mrainey's photo on Instagram])}}, interaction={schema={version=3}, id=1e43454b1a16a880e074e49c51369eac, content=Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD, author={id=14551637, username=mRainey, language=en, avatar=http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, name=Michael Rainey, link=https://twitter.com/mRainey}, received_at=1.4098499092967E9, source=Instagram, mention_ids=(23642374), link=https://twitter.com/mRainey/status/507573423334100992, created_at=Thu, 04 Sep 2014 16:58:29 +0000, hashtags=(GoHawks,kickoff2014,GBvsSEA), type=twitter, mentions=(Seahawks)}, language={tag=en, confidence=98, tag_extended=en}, demographic={gender=male}},interactionId#1e43454b1a16a880e074e49c51369eac,_id#54089a879ad4ec99158b4d78,hash#65bd9dc4943ec426b04819e6291ef1ce,subscriptionId#f6cf211e03dca5da384786676c31fd3e,hashType#stream])

And there’s Michael’s tweet again, with all the attributes from the MongoDB JSON document appended together into a single record. But in this format the data isn’t all that useful as we can’t easily access individual elements in the Twitter record; what would be better would be to apply a Pig schema definition to the LOAD statement, using the MongoDB document field listing that we saw when we displayed a single record from the MongoDB collection earlier.

I can start by referencing the document fields that become simple Pig dataypes; ID and interactionId, for example:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.my_first_test' using com.mongodb.hadoop.pig.MongoLoader('id:chararray,interactionId:chararray','id');
2014-09-05 06:57:57,985 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:57:58,022 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> describe tweets
2014-09-05 06:58:11,611 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
tweets: {id: chararray,interactionId: chararray}
grunt> tweets_limit_1 = LIMIT tweets 1;
...
(53fae22e9ad4ec93658b513e,1e42c2747542a100e074fff55100414a)
grunt>

Where the MongoDB document has fields nested within other fields, you can reference these as a tuple if they’re a set of attributes under a common header, or a bag if they’re just a list of values for a single attribute; for example, the “username” field is contained within the author tuple, which in-turn is contained within the interaction tuple, so to count tweets by author I’d need to first flatten the author tuple to turn its fields into scalar fields, then project out the username and other details; then I can group the relation in the normal way on those author details, and generate a count of tweets, like this:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' using com.mongodb.hadoop.pig.MongoLoader('id:chararray,interactionId:chararray,interaction:tuple(interaction:tuple(author:tuple(id:int,language:chararray,link:chararray,name:chararray,username:chararray)))','id');
grunt> tweets_author_tuple_flattened = FOREACH tweets GENERATE id, FLATTEN(interaction.$0);                                            
grunt> tweets_with_authors = FOREACH tweets_author_tuple_flattened GENERATE id, interaction::author.username, interaction::author.name;
grunt> tweets_author_group = GROUP tweets_with_authors by username; 
grunt> tweets_author_count = FOREACH tweets_author_group GENERATE group, COUNT(tweets_with_authors); 
...
(rmoff,1)
(dw_pete,1)
(mRainey,3)
(P_J_FLYNN,3)
(davidhuey,7)
(EdelweissK,1)
(JamesOickle,3)
(markrittman,3)
(rittmanmead,2)
(RedgraveChris,1)
grunt>

So there’s obviously a lot more we can do with the Twitter dataset as it stands, but where it’ll get really interesting is combining this with other social media interaction data – for example from Facebook, LinkedIn and so on – and then correlating that with out main site activity data. Check back in a few days when we’ll be covering this second stage in a further blog article, using ODI12c to orchestrate the process.

Categories: BI & Warehousing

Oracle 12.1.0.2 and Data Warehouses

Mon, 2014-09-01 09:21

If you follow Blogs and Tweets from the Oracle community you won’t have missed hearing about the recent release of the first patch-set for Oracle 12c. With this release there are some significant pieces of new functionality that will be of interest to Data Warehouse DBAs and architects. The headline feature that most Oracle followers will know of is the new in-memory option. In my opinion this is a game-changer for how we design reporting architectures; it gives us an effective way to build operational reporting over the reference data architecture described by Mark Rittman a few weeks ago. Of course, the database team here at Rittman Mead have been rolling up our sleeves and getting into in-memory technology for quite a while now, Mark even featured in the official launch presentation by Larry Ellison with the now famous “so easy it’s boring” quote. Last week Mark published the first of our Rittman Mead in-memory articles, with the promise of more in-memory articles to come including my article for the next edition of UKOUG’s “Oracle Scene”.

However, the in-memory option is not the only new feature that is going to be a benefit to us in the BI/DW world. One of the new features I am going to describe is Exadata only, but the first one I am going to mention is generally available in the 12.1.0.2 database.

Typically, data warehouse queries are different from those seen in the OLTP world – in DW we tend to access a large number of rows and probably aggregate things up to answer some business question. Often we are not using indexes and instead scanning tables or table partitions is the norm. Usually, the data we need to aggregate is widely scattered across the table or partition. Data Warehouse queries often look at records that share a set of common attributes; we look at the sales for the ‘ACME’ widget or the value of items shipped to Arizona. For us there can be great advantage if data we use together is stored together, and this is where Attribute Clustering can pay a part.

Attribute Clustering is usually configured on the table at at DDL time and in-effect controls the ordering of data inserted by DIRECT PATH operations, Oracle does not enforce this ordering for conventional inserts, this may not be an issue in data warehouses as bulk-batch operations typically use APPEND inserts, which are direct path inserts, or partition operations, it may be more of an issue with some of the real-time conventional path loading paradigms. In addition to Direct Path load operations Attribute Clustering can also occur when you do Alter table MOVE type operations (this also includes operations such as PARTITION SPLIT). On the surface, Attribute Clustering sounds little different to using an ORDER by on an append insert and hoping that Oracle actually stores the data where you expect it to. However, Attribute Clustering gives us two other possibilities in how we can order the data in the cluster.

Firstly, we can cluster on columns from JOINED dimension tables, for example in a SALES DW we may have a sales fact with a product key at the SKU level, but we often join to the product dimension and report at the Product Category level. In this case we can cluster our sales fact table so that each product category appears in the same cluster. For example, we have just opened a chain a supermarkets with a wide but uninspiring range of brands and products (see the tiny piece of our product dimension table below)

NewImage

As you can see, our Product PK has no relationship at all to the type of product being sold. In our Kimball-style data warehouse we typically store the product key on the fact table and join to the product dimension to obtain all of the other product attributes and hierarchy members. This is essentially what we can do with join Attribute Clustering, in our example we can cluster our fact table on PRODUCT_CATEGORY so that all of the Laundry sales are physically close to each other in the Fact table.

CREATE TABLE rm_sales (
product_idNUMBER NOT NULL,
store_id        NUMBER NOT NULL,
sales_date      DATE NOT NULL,
loyalty_card_id NUMBER ,
quantity_sold  NUMBER(3) NOT NULL,
value_sold    NUMBER(10,2) NOT NULL
)
CLUSTERING
rm_sales JOIN products ON (rm_sales.product_id = products.product_pk)
BY LINEAR ORDER (sales_date, product_category, store_id);

Notice we are clustering on a join to the product dimension table’s “product_category” column, we are also clustering on sales_date, this is especially important in the case of partitioned fact tables so that the benefits of clustering align to the partitioning strategy.  We are also not restricted in our clustering to just one join, if we wanted to we could also cluster our sales by store region e.g. the Colorado laundry product sales are located in the same area of the sales table. To use Join Attribute Clustering we need to define the PK / FK relationships between fact and dimension, however it is always good practice to have that in place as it helps the CBO so much with query plan evaluation

Secondly, notice the BY LINEAR ORDER clause in the table DDL. Of the two ordering options, Linear Order is the most basic form of clustering, it this case we have our data structured so that all the items for a sales day are clustered together and within that cluster we order by product category and those categories are in turn ordered by store_id. The other way we can cluster is BY INTERLEAVED ORDER; here, Oracle maps a combination of dimensional values to a single cluster value using a z-order curve fitting approach. This sounds complex but it ensures that items that are frequently queried together are co-located in the disk blocks in the storage.

Interleaved ordering is probably the best choice for data warehousing at it aligns well with how we access data in our queries. Although we could include all of the dimension keys in our ordering list, it is going to be more benefit to just include a subset of dimensions; typically for retail I’d go with DATE (or something that correlates to the time based partition key of the fact table), the product  and the store. Of course we can again join to the dimension tables and cluster at higher hierarchy levels such as product category and store region. The Oracle 12c Data Warehousing guide gives some good advice, but you can’t go far wrong if you think about clustering items together that will be queried together

Clustering data can give us some advantages in performance. Better data compression and improved index range scans spring to mind, but to get most benefits we should also look at another new feature, zone-maps. Unlike Attribute Clustering, Zone Maps are Engineered Systems only, In a way they are similar to storage indexes already found on Exadata, but they have some additional advantages, they are also somewhat different from zone maps encountered in other DB vendors’ products such as Netezza.

In Exadata, a storage index can provide the maximum and minimum values encountered for a column in storage cell. I say “can” as there is no guarantee that range for a given column is held in the storage index. Zone Maps on the other hand will always provide maxima and minima for all of the columns specified at zone map creation. The zone map is orientated in terms of contiguous database blocks and is materialized so that it is physically persisted in the database and thus survives DB startups. Like Materialized views Materialized zone maps can become stale and need to be maintained.

We can define a zone map on one or more table columns and just like Attribute Clustering we may also create zone maps on table joins. As a table can only have one zone map it is important to include all of the columns you wish to track. Zone Maps are designed to work well with attribute clustering, in fact it is just a simple DDL statement to add a zone-map to an Attribute Clustered table so that the zone map tracks the same attributes as the clustering. This is where we get the major performance boost from attribute clustering, Instead of looking at the whole table the zone map tells us which ranges of database blocks contain data that matches our query predicates.

Zone Maps with Attribute Clustering gives us another powerful tool to boost DW performance on Exadata – we can do star queries without resorting to bitmap indexes and we minimise IO when scanning fact tables as we only need look where we know the data to be. Exciting times!

Categories: BI & Warehousing

Taking a Look at the Oracle Database 12c In-Memory Option

Fri, 2014-08-29 13:43

The In-Memory Option for Oracle Database 12c became available a few weeks ago with the 12.1.0.2 database patchset, adding column-store and in-memory capabilities to the Oracle Database. Unlike pure in-memory databases such as Oracle TimesTen, the in-memory option adds an in-memory column-store feature to the regular row-based storage in the Oracle database, creating in-memory copies of selected row-store tables in a compressed column-based storage format, with the whole process being automatic and and enabled by issuing just a couple of commands – to the point where my feedback at the end of beta testing was that it was “almost boring” – said slightly tongue-in-cheek…

NewImage

But of course adding in-memory capabilities to the Oracle database is anything but boring – whilst TimesTen on Exalytics has given us the ability to store aggregate tables, and “hot data” tables  in a dedicated in-memory database co-located with the BI mid-tier, TimesTen is another set of skills to learn and another product to manage. Moreover, you’ve got to somehow get the data you want to aggregate or otherwise cache out of your data warehouse and into TimesTen, and then you’ve got the even more difficult task of keeping that TimesTen dataset in-sync with your main data warehouse data. Most importantly though, even with the 2TB and 4TB versions of Exalytics many data warehouses won’t fit entirely into memory, so you’re going to end-up with some of your data in TimesTen and some in your data warehouse, and with only the TimesTen dataset benefiting from in-memory speeds.

So what if you could enable the in-memory option on your data warehouse, and then just run it all in-memory there? To see how this worked, I thought it’d be interesting to take the flight delays dataset that comes with the latest OBIEE 11g v406 SampleApp, and copy that across to an Oracle 12.1.0.2.0 database to test it out with the in-memory option. This dataset comes with a set of dashboard pages and analyses and contains around 6m rows of data, and in SampleApp is accelerated by an aggregate schema, also stored in Oracle, that generally reduces query times down to just a few seconds.

NewImage

So what I thought would be interesting to check would be first, whether enabling the in-memory option sped-up the dashboards, and second, whether having the in-memory option removed the need for the aggregate tables altogether. I was also interested to see whether putting the aggregate tables into memory improved their access time significantly, as anecdotally I’d heard that there wasn’t much clock-time difference between accessing aggregates in a TimesTen database vs. just creating them in the same data warehouse database as the source data. To set this all up then I created the new 12.1.0.2.0 database,. exported the BI_AIRLINES and BI_AIRLINES_AGG schemas out of the 12c database that comes with SampleApp v406, and then imported them into the same schema names in the new 12.1.0.2.0 instance, on the new VM (rather than on the SampleApp VM), and then amended the connection pool details in the SampleApp RPD to point to this new, external Oracle 12.1.0.2.0 database instance. The 12.1.0.2.0 database at this point had the following memory allocation:

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
 
SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
 
Total System Global Area 6325010432 bytes
Fixed Size               2938448 bytes
Variable Size            1207962032 bytes
Database Buffers         5100273664 bytes
Redo Buffers             13836288 bytes
Database mounted.
Database opened.

So I don’t end-up comparing the larger database instance on my 12.1.0.2.0 VM with the smaller one the airlines data came from on the VM, I created my baseline response time measurements on this new VM and with none of the in-memory features enabled, and ran some of the dashboard pages and clicked-on some of the prompt values – and the response time wasn’t bad, with just the odd analysis that took an excessive time to return. Mostly though, the results for each dashboard came back straight-away, what you’d expect when summary tables have been deployed to speed-up reporting.

chart1

Looking at the SQL being generated by the BI Server, you can quickly see why this is the case; the BI Server is using aggregate navigation and actually querying the BI_AIRLINES_AGG schema to return the aggregated results the dashboard, as you can see from one of the SQL statements retrieved from the NQQuery.log file below:

select sum(T255906.Dep_Delay_00039B26) as c1,
     sum(T255906.Arr_Delay_00039B22) as c2,
     sum(T255906.Z_of_Fligh00039B28) as c3,
     substr(T255216.Carrier00039BA9 , 1, 25) as c4,
     T255216.Carrier00039BA9 as c5,
     T255216.Carrier_Co00039BAA as c6
from 
     BI_AIRLINES_AGG.SA_16_Dest00039D06 T255357,
     BI_AIRLINES_AGG.SA_Time_Mo00039CFB T255737,
     BI_AIRLINES_AGG.SA_31_Carr00039CEB T255216,
     BI_AIRLINES_AGG.FACT_AGG_OR_06M T255906
where  ( T255216.Carrier_Co00039BAA = T255906.Carrier_Co00039BAA and T255357.Dest_Airpo00039C2A = T255906.Dest_Airpo00039C2A and T255737.Dep_Month00039C07 = T255906.Dep_Month00039C07 and substr(T255216.Carrier00039BA9 , 1, 25) = 'SunFlower Airlines' and (T255357.Dest_Regio00039C31 in ('Midwest Region', 'Northeast Region', 'South Region', 'West Region')) and T255737.Month_of_Y00039C0F between 3 and 12 ) 
group by T255216.Carrier00039BA9, T255216.Carrier_Co00039BAA
order by c5, c6, c4

OBIEE will always use these aggregate tables if they’re available in the repository, so to enable me to test the reports with these aggregates disabled I borrowed the technique Robin introduced in his blog on testing aggregate navigation, and added a request variable prompt to the dashboard page that allows me to pass an INACTIVE_SCHEMAS value to the logical SQL queries issued for the dashboard analyses, and therefore tell the BI Server to ignore the aggregate schema and just use the detail-level BI_AIRLINES schema instead.

NewImage

I therefore test the dashboard again, this time with the BI_AIRLINES_AGG schema disabled; this time, every dashboard took a while to display properly, with every analysis showing the “spinning clock” for twenty, thirty seconds a time. Comparing the response times to the ones a moment ago when aggregates were enabled, and you can see the difference.

chart2

So let’s set-up the in-memory option and see how it affects these two scenarios. The in-memory option for Oracle Database 12c 12.1.0.2.0 is installed by default as part of the core database, but doesn’t start working until you set a value for the INMEMORY_SIZE initialisation parameter – as you can see from the SQL*Plus output below, it’s currently set to zero, effectively disabling the feature:

SQL> conn / as sysdba
Connected.

SQL> show parameter INMEMORY

NAME                     TYPE    VALUE
------------------------------------ ----------- ------------------------------
inmemory_clause_default          string
inmemory_force               string  DEFAULT
inmemory_max_populate_servers        integer     0
inmemory_query               string  ENABLE
inmemory_size                big integer 0
inmemory_trickle_repopulate_servers_ integer     1
percent
optimizer_inmemory_aware         boolean     TRUE

The memory Oracle allocates to the in-memory option, via this initialisation parameter, comes out of the SGA and therefore before I set the INMEMORY_SIZE parameter, it’d be a good idea to increase the SGA_TARGET value to accommodate the 1G I’m then going to assign to INMEMORY_SIZE. I do both from the SQL*Plus command-prompt and then bounce the database to bring the new values into use:

SQL> alter system set inmemory_size = 1G scope=spfile;

System altered.

SQL> show parameter sga_target

NAME                     TYPE    VALUE
------------------------------------ ----------- ------------------------------
sga_target               big integer 6032M
SQL> alter system set sga_target = 7032M scope=spfile;

System altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.

Total System Global Area 7381975040 bytes
Fixed Size          2941480 bytes
Variable Size        1207963096 bytes
Database Buffers     5083496448 bytes
Redo Buffers           13832192 bytes
In-Memory Area       1073741824 bytes
Database mounted.
Database opened.

Now we’re at the point where we can enable the tables for in-memory access. Tables to be copied into memory can either be created like that at the start, or you can ALTER TABLE them afterwards and specify that they go into memory (or you can configure the in-memory settings at the tablespace level, or even specify particular columns or partitions to go into memory). The underlying data still gets stored row-wise on disk, but enabling a table for in-memory access tells the Oracle database to create column-store in-memory representations of the table and direct suitable queries to those copies, all the time in the background keeping the copy in-sync with the row-store base data.

I’ll start by enabling all of the BI_AIRLINES schema for in-memory access, as the biggest pay-off would be if then meant we didn’t need to maintain the aggregate tables. After running a SELECT table_name FROM user_tables to list out the table names, I then run a script to enable all the tables for in-memory access, with the in-memory copies being populated immediately:

alter table AIRCRAFT_GROUP inmemory priority high;
alter table AIRCRAFT_TYPE inmemory priority high;
alter table AIRLINES_USER_DATA inmemory priority high;
alter table AIRLINE_ID inmemory priority high;
alter table CANCELLATION inmemory priority high;
alter table CARRIER_GROUP_NEW inmemory priority high;
alter table CARRIER_REGION inmemory priority high;
alter table DEPARBLK inmemory priority high;
alter table DISTANCE_GROUP_250 inmemory priority high;
alter table DOMESTIC_SEGMENT inmemory priority high;
alter table OBIEE_COUNTY_HIER inmemory priority high;
alter table OBIEE_GEO_AIRPORT_BRIDGE inmemory priority high;
alter table OBIEE_GEO_ORIG inmemory priority high;
alter table OBIEE_ROUTE inmemory priority high;
alter table OBIEE_TIME_DAY_D inmemory priority high;
alter table OBIEE_TIME_MTH_D inmemory priority high;
alter table ONTIME_DELAY_GROUPS inmemory priority high;
alter table PERFORMANCE inmemory priority high;
alter table PERFORMANCE_ENDECA_MV inmemory priority high;
alter table ROUTES_FOR_LINKS inmemory priority high;
alter table SCHEDULES inmemory priority high;
alter table SERVICE_CLASS inmemory priority high;
alter table UNIQUE_CARRIERS inmemory priority high;

Looking at USER_TABLES from this schema, I can see all of the tables I selected now marked for in-memory access, for immediate loading:

SQL> @display_table_inmem_details.sql
SQL> select table_name
  2  ,    inmemory
  3  ,    inmemory_priority
  4  from   user_tables
  5  /
 
TABLE_NAME           INMEMORY INMEMORY
-------------------- -------- --------
AIRCRAFT_GROUP       ENABLED  HIGH
UNIQUE_CARRIERS      ENABLED  HIGH
SERVICE_CLASS        ENABLED  HIGH
SCHEDULES            ENABLED  HIGH
ROUTES_FOR_LINKS     ENABLED  HIGH
PERFORMANCE          ENABLED  HIGH
ONTIME_DELAY_GROUPS  ENABLED  HIGH
OBIEE_TIME_MTH_D     ENABLED  HIGH
OBIEE_TIME_DAY_D     ENABLED  HIGH
OBIEE_ROUTE          ENABLED  HIGH
OBIEE_GEO_ORIG       ENABLED  HIGH
 
TABLE_NAME    INMEMORY INMEMORY
-------------------- -------- --------
OBIEE_GEO_AIRPORT_BR ENABLED  HIGH
IDGE
 
OBIEE_COUNTY_HIER    ENABLED  HIGH
DOMESTIC_SEGMENT     ENABLED  HIGH
DISTANCE_GROUP_250   ENABLED  HIGH
DEPARBLK             ENABLED  HIGH
CARRIER_REGION       ENABLED  HIGH
CARRIER_GROUP_NEW    ENABLED  HIGH
CANCELLATION         ENABLED  HIGH
AIRLINE_ID           ENABLED  HIGH
 
TABLE_NAME           INMEMORY INMEMORY
-------------------- -------- --------
AIRLINES_USER_DATA   ENABLED  HIGH
AIRLINES_PBLOB$      DISABLED
AIRLINES_PART$       DISABLED
AIRLINES_NODE_TZ$    DISABLED
AIRLINES_NODE$       DISABLED
AIRLINES_LINK_TZ$    DISABLED
AIRLINES_LINK_SCH$   DISABLED
AIRLINES_LINK$       DISABLED
AIRLINES_AIRPORT_TZ$ DISABLED
AIRCRAFT_TYPE        ENABLED  HIGH
 
30 rows selected.

And I can track the progress of the tables being copied into memory using the V$IM_SEGMENTS v$ view, like this:

SQL> @display_im_segments.sql
SQL> set echo on
SQL> set pages 200
SQL> col owner for a20
SQL> col name for a20
SQL> col status for a10
SQL> select v.owner
  2  ,      v.segment_name name
  3  ,      v.populate_status status
  4  from   v$im_segments v;

OWNER            NAME         STATUS
-------------------- -------------------- ----------
BI_AIRLINES      OBIEE_COUNTY_HIER    COMPLETED
BI_AIRLINES      PERFORMANCE      STARTED
BI_AIRLINES      UNIQUE_CARRIERS      COMPLETED
BI_AIRLINES      AIRLINES_LINK_TZ$    COMPLETED
BI_AIRLINES      OBIEE_TIME_MTH_D     COMPLETED
BI_AIRLINES      AIRLINES_LINK_SCH$   COMPLETED
BI_AIRLINES      OBIEE_ROUTE      COMPLETED
BI_AIRLINES      DOMESTIC_SEGMENT     COMPLETED
BI_AIRLINES      AIRLINES_LINK$   COMPLETED
BI_AIRLINES      AIRLINE_ID       COMPLETED
BI_AIRLINES      OBIEE_GEO_ORIG   COMPLETED
BI_AIRLINES      AIRLINES_NODE$   COMPLETED
BI_AIRLINES      OBIEE_GEO_AIRPORT_BR COMPLETED
             IDGE

BI_AIRLINES      AIRLINES_NODE_TZ$    COMPLETED
BI_AIRLINES      OBIEE_TIME_DAY_D     COMPLETED

Note that most of the tables went into memory immediately, but one (PERFORMANCE) is taking a while because it’s so big. Also note that not all tables are listed in the v$ view yet, as the database hasn’t got around to adding them into memory yet, or it might choose not to populate them if it feels the memory could be used better elsewhere. You can alter the priority of these in-memory copy processes if you want, and decide whether the copying happens immediately, or when the table is first scanned (accessed).

Running the dashboards again, with the request variable prompt set to disallow the aggregate schema, gave me the response times below – the in-memory queries were much faster than the row-based non in-memory ones, but they weren’t down to the response time of the dashboards right at the start, where all data is stored row-wise but we’ve got aggregate tables to speed up the queries (note I’ve reordered the bars so the non in-memory queries with no aggregate tables are on the left of each series, as the slowest of all approaches)

chart3

Taking a look at one of the physical SQL queries for a cross-tab (hierarchical columns) analysis, you can see that in-memory table access is happening:

WITH 
SAWITH0 AS (select sum(T233937.ACTUALELAPSEDTIME) as c1,
     sum(T233937.ARRDELAYMINUTES) as c2,
     concat(concat(T233484.AIRPORT, ' - '), substr(T233484.TR_AIRPORT_NAME , instr(T233484.TR_AIRPORT_NAME , ': ') + 2 , 50)) as c3,
     T233820.D_NAME as c4,
     T233820.R_NAME as c5
from 
     BI_AIRLINES.OBIEE_GEO_AIRPORT_BRIDGE T233484 /* 10 GEO_AIRPORT_ORIGIN */ ,
     BI_AIRLINES.OBIEE_COUNTY_HIER T233820 /* 11 COUNTY_HIER_ORIGIN */ ,
     BI_AIRLINES.OBIEE_TIME_MTH_D T233732 /* 41 TIME MONTH */ ,
     BI_AIRLINES.PERFORMANCE T233937 /* 00 PERFORMANCE FULL */ 
where  ( T233484.AIRPORT = T233937.ORIGIN and T233484.STCTY_FIPS = T233820.SC_CODE and T233732.Cal_Month = T233937.MONTH and T233732.Cal_Qtr = T233937.QUARTER and T233732.Cal_Year = T233937.YEAR and T233732.Cal_Month between 6 and 12 and T233937.MONTH between 6 and 12 ) 
group by T233820.D_NAME, T233820.R_NAME, concat(concat(T233484.AIRPORT, ' - '), substr(T233484.TR_AIRPORT_NAME , instr(T233484.TR_AIRPORT_NAME , ': ') + 2 , 50))),
SAWITH1 AS (select sum(T233609.PASSENGERS) as c1,
     T233820.R_NAME as c2,
     T233820.D_NAME as c3,
     concat(concat(T233484.AIRPORT, ' - '), substr(T233484.TR_AIRPORT_NAME , instr(T233484.TR_AIRPORT_NAME , ': ') + 2 , 50)) as c4,
     sum(T233609.PASSENGERS_MILES) as c5
from 
     BI_AIRLINES.OBIEE_GEO_AIRPORT_BRIDGE T233484 /* 10 GEO_AIRPORT_ORIGIN */ ,
     BI_AIRLINES.OBIEE_COUNTY_HIER T233820 /* 11 COUNTY_HIER_ORIGIN */ ,
     BI_AIRLINES.OBIEE_TIME_MTH_D T233732 /* 41 TIME MONTH */ ,
     BI_AIRLINES.DOMESTIC_SEGMENT T233609 /* 01 DOMESTIC Segment */ 
where  ( T233484.AIRPORT = T233609.ORIGIN and T233484.STCTY_FIPS = T233820.SC_CODE and T233609.MONTH = T233732.Cal_Month and T233609.QUARTER = T233732.Cal_Qtr and T233609.YEAR = T233732.Cal_Year and T233609.MONTH between 6 and 12 and T233732.Cal_Month between 6 and 12 ) 
group by T233820.D_NAME, T233820.R_NAME, concat(concat(T233484.AIRPORT, ' - '), substr(T233484.TR_AIRPORT_NAME , instr(T233484.TR_AIRPORT_NAME , ': ') + 2 , 50)))
select 2 as c1,
     case  when D1.c3 is not null then D1.c3 when D2.c4 is not null then D2.c4 end  as c2,
     case  when D1.c4 is not null then D1.c4 when D2.c3 is not null then D2.c3 end  as c3,
     case  when D1.c5 is not null then D1.c5 when D2.c2 is not null then D2.c2 end  as c4,
     'All Orig Airports' as c5,
     1 as c6,
     case  when D1.c4 is not null then D1.c4 when D2.c3 is not null then D2.c3 end  as c7,
     case  when D1.c5 is not null then D1.c5 when D2.c2 is not null then D2.c2 end  as c8,
     cast(D1.c2 as  DOUBLE PRECISION  ) / nullif( D1.c1, 0) * 100 as c9,
     D2.c5 as c10,
     D2.c1 as c14
from 
     SAWITH0 D1 full outer join SAWITH1 D2 On D1.c3 = D2.c4 and  SYS_OP_MAP_NONNULL(D1.c4) = SYS_OP_MAP_NONNULL(D2.c3) 
order by c4, c3, c2
SQL> @complex_query_explain.sql
 
Explained.

SQL> set lines 200
SQL> set pages 0
SQL> select * from table(dbms_xplan.display);
Plan hash value: 3097908901

----------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |              |  2673K|  1392M|   |   112K  (1)| 00:00:05 |    |      |        |
|   1 |  PX COORDINATOR                    |              |   |   |   |        |      |    |      |        |
|   2 |   PX SEND QC (ORDER)                   | :TQ10006         |  2673K|  1392M|   |   112K  (1)| 00:00:05 |  Q1,06 | P->S | QC (ORDER) |
|   3 |    SORT ORDER BY                   |              |  2673K|  1392M|  1492M|   112K  (1)| 00:00:05 |  Q1,06 | PCWP |        |
|   4 |     PX RECEIVE                     |              |  2673K|  1392M|   | 26819   (1)| 00:00:02 |  Q1,06 | PCWP |        |
|   5 |      PX SEND RANGE                 | :TQ10005         |  2673K|  1392M|   | 26819   (1)| 00:00:02 |  Q1,05 | P->P | RANGE      |
|   6 |       VIEW                     | VW_FOJ_0         |  2673K|  1392M|   | 26819   (1)| 00:00:02 |  Q1,05 | PCWP |        |
|*  7 |        HASH JOIN FULL OUTER BUFFERED           |              |  2673K|  1392M|   | 26819   (1)| 00:00:02 |  Q1,05 | PCWP |        |
|   8 |     PX RECEIVE                 |              |   136 | 37128 |   |   144   (7)| 00:00:01 |  Q1,05 | PCWP |        |
|   9 |      PX SEND HASH                  | :TQ10003         |   136 | 37128 |   |   144   (7)| 00:00:01 |  Q1,03 | S->P | HASH       |
|  10 |       PX SELECTOR                  |              |   |   |   |        |      |  Q1,03 | SCWC |        |
|  11 |        VIEW                    |              |   136 | 37128 |   |   144   (7)| 00:00:01 |  Q1,03 | SCWC |        |
|  12 |         HASH GROUP BY              |              |   136 | 17408 |   |   144   (7)| 00:00:01 |  Q1,03 | SCWC |        |
|* 13 |          HASH JOIN                 |              |   136 | 17408 |   |   143   (6)| 00:00:01 |  Q1,03 | SCWC |        |
|  14 |           JOIN FILTER CREATE           | :BF0000          |   136 | 11288 |   |   142   (6)| 00:00:01 |  Q1,03 | PCWP |        |
|* 15 |            HASH JOIN               |              |   136 | 11288 |   |   142   (6)| 00:00:01 |  Q1,03 | SCWC |        |
|  16 |         JOIN FILTER CREATE         | :BF0001          |   136 |  4896 |   |   142   (6)| 00:00:01 |  Q1,03 | PCWP |        |
|* 17 |          HASH JOIN             |              |   136 |  4896 |   |   142   (6)| 00:00:01 |  Q1,03 | SCWC |        |
|* 18 |           TABLE ACCESS INMEMORY FULL   | OBIEE_TIME_MTH_D     |   161 |  1610 |   | 1   (0)| 00:00:01 |  Q1,03 | SCWP |        |
|* 19 |           TABLE ACCESS INMEMORY FULL   | DOMESTIC_SEGMENT     |   771 | 20046 |   |   141   (6)| 00:00:01 |  Q1,03 | SCWP |        |
|  20 |         JOIN FILTER USE            | :BF0001          |  1787 | 83989 |   | 1   (0)| 00:00:01 |  Q1,03 | PCWP |        |
|* 21 |          TABLE ACCESS INMEMORY FULL    | OBIEE_GEO_AIRPORT_BRIDGE |  1787 | 83989 |   | 1   (0)| 00:00:01 |  Q1,03 | SCWP |        |
|  22 |           JOIN FILTER USE              | :BF0000          |  3221 |   141K|   | 1   (0)| 00:00:01 |  Q1,03 | PCWP |        |
|* 23 |            TABLE ACCESS INMEMORY FULL      | OBIEE_COUNTY_HIER    |  3221 |   141K|   | 1   (0)| 00:00:01 |  Q1,03 | SCWP |        |
|  24 |     PX RECEIVE                 |              |  2255K|   587M|   | 26673   (1)| 00:00:02 |  Q1,05 | PCWP |        |
|  25 |      PX SEND HASH                  | :TQ10004         |  2255K|   587M|   | 26673   (1)| 00:00:02 |  Q1,04 | P->P | HASH       |
|  26 |       VIEW                     |              |  2255K|   587M|   | 26673   (1)| 00:00:02 |  Q1,04 | PCWP |        |
|  27 |        HASH GROUP BY               |              |  2255K|   264M|   489M| 26673   (1)| 00:00:02 |  Q1,04 | PCWP |        |
|  28 |         PX RECEIVE                 |              |  2255K|   264M|   | 26673   (1)| 00:00:02 |  Q1,04 | PCWP |        |
|  29 |          PX SEND HASH              | :TQ10002         |  2255K|   264M|   | 26673   (1)| 00:00:02 |  Q1,02 | P->P | HASH       |
|  30 |           HASH GROUP BY            |              |  2255K|   264M|   489M| 26673   (1)| 00:00:02 |  Q1,02 | PCWP |        |
|* 31 |            HASH JOIN               |              |  3761K|   441M|   |   841   (9)| 00:00:01 |  Q1,02 | PCWP |        |
|  32 |         PX RECEIVE             |              |   161 |  1610 |   | 1   (0)| 00:00:01 |  Q1,02 | PCWP |        |
|  33 |          PX SEND BROADCAST         | :TQ10000         |   161 |  1610 |   | 1   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  34 |           PX SELECTOR              |              |   |   |   |        |      |  Q1,00 | SCWC |        |
|* 35 |            TABLE ACCESS INMEMORY FULL  | OBIEE_TIME_MTH_D     |   161 |  1610 |   | 1   (0)| 00:00:01 |  Q1,00 | SCWP |        |
|* 36 |         HASH JOIN              |              |  3773K|   406M|   |   838   (9)| 00:00:01 |  Q1,02 | PCWP |        |
|  37 |          PX RECEIVE            |              |  1787 |   160K|   | 2   (0)| 00:00:01 |  Q1,02 | PCWP |        |
|  38 |           PX SEND BROADCAST        | :TQ10001         |  1787 |   160K|   | 2   (0)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  39 |            PX SELECTOR             |              |   |   |   |        |      |  Q1,01 | SCWC |        |
|* 40 |             HASH JOIN              |              |  1787 |   160K|   | 2   (0)| 00:00:01 |  Q1,01 | SCWC |        |
|  41 |              TABLE ACCESS INMEMORY FULL| OBIEE_GEO_AIRPORT_BRIDGE |  1787 | 83989 |   | 1   (0)| 00:00:01 |  Q1,01 | SCWP |        |
|  42 |              TABLE ACCESS INMEMORY FULL| OBIEE_COUNTY_HIER    |  3221 |   141K|   | 1   (0)| 00:00:01 |  Q1,01 | SCWP |        |
|  43 |          PX BLOCK ITERATOR         |              |  3773K|    75M|   |   834   (9)| 00:00:01 |  Q1,02 | PCWC |        |
|* 44 |           TABLE ACCESS INMEMORY FULL   | PERFORMANCE          |  3773K|    75M|   |   834   (9)| 00:00:01 |  Q1,02 | PCWP |        |
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   7 - access("D1"."C3"="D2"."C4" AND SYS_OP_MAP_NONNULL("D1"."C4")=SYS_OP_MAP_NONNULL("D2"."C3"))
  13 - access("T233484"."STCTY_FIPS"="T233820"."SC_CODE")
  15 - access("T233484"."AIRPORT"="T233609"."ORIGIN")
  17 - access("T233732"."CAL_MONTH"=TO_NUMBER("T233609"."MONTH") AND "T233732"."CAL_QTR"=TO_NUMBER("T233609"."QUARTER") AND
          "T233732"."CAL_YEAR"=TO_NUMBER("T233609"."YEAR"))
  18 - inmemory("T233732"."CAL_MONTH">=6 AND "T233732"."CAL_MONTH"<=12)
       filter("T233732"."CAL_MONTH">=6 AND "T233732"."CAL_MONTH"<=12)
  19 - inmemory(TO_NUMBER("T233609"."MONTH")>=6 AND TO_NUMBER("T233609"."MONTH")<=12)
       filter(TO_NUMBER("T233609"."MONTH")>=6 AND TO_NUMBER("T233609"."MONTH")<=12)
  21 - inmemory(SYS_OP_BLOOM_FILTER(:BF0001,"T233484"."AIRPORT"))
       filter(SYS_OP_BLOOM_FILTER(:BF0001,"T233484"."AIRPORT"))
  23 - inmemory(SYS_OP_BLOOM_FILTER(:BF0000,"T233820"."SC_CODE"))
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T233820"."SC_CODE"))
  31 - access("T233732"."CAL_MONTH"="T233937"."MONTH" AND "T233732"."CAL_QTR"="T233937"."QUARTER" AND "T233732"."CAL_YEAR"="T233937"."YEAR")
  35 - inmemory("T233732"."CAL_MONTH">=6 AND "T233732"."CAL_MONTH"<=12)
       filter("T233732"."CAL_MONTH">=6 AND "T233732"."CAL_MONTH"<=12)
  36 - access("T233484"."AIRPORT"="T233937"."ORIGIN")
  40 - access("T233484"."STCTY_FIPS"="T233820"."SC_CODE")
  44 - inmemory("T233937"."MONTH">=6 AND "T233937"."MONTH"<=12)
       filter("T233937"."MONTH">=6 AND "T233937"."MONTH"<=12)

Note
-----
   - dynamic statistics used: dynamic sampling (level=AUTO)
   - Degree of Parallelism is 2 because of table property

80 rows selected.

Indeed, looking at the queries the BI Server is sending to the database it’s not too surprising the in-memory difference in this case wasn’t too dramatic. According to the docs (emphasis mine):

Storing a database object in the IM column store can improve performance significantly for the following types of operations performed on the database object 

  • A query that scans a large number of rows and applies filters that use operators such as the following: =, <, >, and IN
  • A query that selects a small number of columns from a table or materialized view with a large number of columns, such as a query that selects five columns from a table with 100 columns
  • A query that joins a small table to a large table
  • A query that aggregates data

The IM column store does not improve performance for the following types of operations:

  • Queries with complex predicates
  • Queries that select a large number of columns
  • Queries that return a large number of rows
  • Queries with multiple large table joins

and our query certainly has complex predicates, returns a fair few rows, has large and lots of joins etc. Taking a more simple query that you’d likely write yourself if querying a data warehouse, you can see the in-memory table access being used again but a much simpler, an cheaper explain plan:

SQL> @inmem_explain.sql
SQL> set echo on
SQL> explain plan for
  2  select /*+ INMEMORY */ sum(T233937.ACTUALELAPSEDTIME) as c1,
  3   sum(T233937.WEATHERDELAY) as c2,
  4   sum(T233937.SECURITYDELAY) as c3,
  5   sum(T233937.NASDELAY) as c4,
  6   sum(T233937.LATEAIRCRAFTDELAY) as c5,
  7   sum(T233937.ARRDELAYMINUTES) as c6,
  8   sum(T233937.CARRIERDELAY) as c7,
  9   sum(nvl(casewhen T233937.CANCELLED < 1 then T233937.FLIGHTS end  , 0)) as c8,
 10   T233820.D_NAME as c9
 11  from
 12   BI_AIRLINES.OBIEE_GEO_AIRPORT_BRIDGE T233484 /* 10 GEO_AIRPORT_ORIGIN */ ,
 13   BI_AIRLINES.OBIEE_COUNTY_HIER T233820 /* 11 COUNTY_HIER_ORIGIN */ ,
 14   BI_AIRLINES.OBIEE_GEO_AIRPORT_BRIDGE T233497 /* 12 GEO_AIPORT_DEST */ ,
 15   BI_AIRLINES.OBIEE_COUNTY_HIER T233831 /* 13 COUNTY_HIER_DEST */ ,
 16   BI_AIRLINES.OBIEE_TIME_MTH_D T233732 /* 41 TIME MONTH */ ,
 17   BI_AIRLINES.DISTANCE_GROUP_250 T233594 /* 19 DISTANCE_GROUP_250 */ ,
 18   BI_AIRLINES.PERFORMANCE T233937 /* 00 PERFORMANCE FULL */
 19  where  ( T233484.AIRPORT = T233937.ORIGIN and T233484.STCTY_FIPS = T233820.SC_CODE and T233497.AIRPORT = T233937.DEST and T233497.STCTY_FIPS = T233831.SC_CODE and T233594.DESCRIPTION = '1000-1249 Miles' and T233594.CODE = T233937.DISTANCEGROUP and T233732.Cal_Month = T233937.MONTH and T233732.Cal_Qtr = T233937.QUARTER and T233732.Cal_Year = T233937.YEAR and T233831.R_NAME = 'Northeast Region' and T233732.Cal_Month between 6 and 12 and T233937.MONTH between 6 and 12 )
 20  group by T233820.D_NAME
 21  order by c9
 22  /
 
Explained.
 
SQL> set lines 300
SQL> set pages 0
SQL> select * from table(dbms_xplan.display);
Plan hash value: 3055743864

-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |                  |     9 |  1314 |   883  (13)| 00:00:01 |    |      |        |
|   1 |  PX COORDINATOR                 |                  |       |       |        |          |    |      |        |
|   2 |   PX SEND QC (ORDER)                | :TQ10006             |     9 |  1314 |   883  (13)| 00:00:01 |  Q1,06 | P->S | QC (ORDER) |
|   3 |    SORT GROUP BY                |                  |     9 |  1314 |   883  (13)| 00:00:01 |  Q1,06 | PCWP |        |
|   4 |     PX RECEIVE                  |                  |     9 |  1314 |   883  (13)| 00:00:01 |  Q1,06 | PCWP |        |
|   5 |      PX SEND RANGE              | :TQ10005             |     9 |  1314 |   883  (13)| 00:00:01 |  Q1,05 | P->P | RANGE       |
|   6 |       HASH GROUP BY             |                  |     9 |  1314 |   883  (13)| 00:00:01 |  Q1,05 | PCWP |        |
|*  7 |        HASH JOIN                |                  | 60775 |  8665K|   882  (13)| 00:00:01 |  Q1,05 | PCWP |        |
|   8 |     PX RECEIVE              |                  |  3221 | 99851 |     1   (0)| 00:00:01 |  Q1,05 | PCWP |        |
|   9 |      PX SEND BROADCAST          | :TQ10000             |  3221 | 99851 |     1   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  10 |       PX SELECTOR               |                  |       |       |        |          |  Q1,00 | SCWC |        |
|  11 |        TABLE ACCESS INMEMORY FULL       | OBIEE_COUNTY_HIER        |  3221 | 99851 |     1   (0)| 00:00:01 |  Q1,00 | SCWP |        |
|* 12 |     HASH JOIN               |                  | 60775 |  6825K|   881  (13)| 00:00:01 |  Q1,05 | PCWP |        |
|  13 |      PX RECEIVE             |                  |  1787 | 17870 |     1   (0)| 00:00:01 |  Q1,05 | PCWP |        |
|  14 |       PX SEND BROADCAST         | :TQ10001             |  1787 | 17870 |     1   (0)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  15 |        PX SELECTOR              |                  |       |       |        |          |  Q1,01 | SCWC |        |
|  16 |         TABLE ACCESS INMEMORY FULL      | OBIEE_GEO_AIRPORT_BRIDGE |  1787 | 17870 |     1   (0)| 00:00:01 |  Q1,01 | SCWP |        |
|* 17 |      HASH JOIN              |                  | 60775 |  6231K|   880  (13)| 00:00:01 |  Q1,05 | PCWP |        |
|  18 |       PX RECEIVE                |                  |   161 |  1610 |     1   (0)| 00:00:01 |  Q1,05 | PCWP |        |
|  19 |        PX SEND BROADCAST            | :TQ10002             |   161 |  1610 |     1   (0)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  20 |         PX SELECTOR             |                  |       |       |        |          |  Q1,02 | SCWC |        |
|* 21 |          TABLE ACCESS INMEMORY FULL     | OBIEE_TIME_MTH_D         |   161 |  1610 |     1   (0)| 00:00:01 |  Q1,02 | SCWP |        |
|* 22 |       HASH JOIN             |                  | 60964 |  5655K|   879  (13)| 00:00:01 |  Q1,05 | PCWP |        |
|  23 |        JOIN FILTER CREATE           | :BF0000              |     1 |    19 |     1   (0)| 00:00:01 |  Q1,05 | PCWP |        |
|  24 |         PX RECEIVE              |                  |     1 |    19 |     1   (0)| 00:00:01 |  Q1,05 | PCWP |        |
|  25 |          PX SEND BROADCAST          | :TQ10003             |     1 |    19 |     1   (0)| 00:00:01 |  Q1,03 | S->P | BROADCAST  |
|  26 |           PX SELECTOR           |                  |       |       |        |          |  Q1,03 | SCWC |        |
|* 27 |            TABLE ACCESS INMEMORY FULL   | DISTANCE_GROUP_250       |     1 |    19 |     1   (0)| 00:00:01 |  Q1,03 | SCWP |        |
|* 28 |        HASH JOIN                |                  |   670K|    48M|   878  (13)| 00:00:01 |  Q1,05 | PCWP |        |
|  29 |         JOIN FILTER CREATE          | :BF0001              |   318 |  9540 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |        |
|  30 |          PX RECEIVE             |                  |   318 |  9540 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |        |
|  31 |           PX SEND BROADCAST         | :TQ10004             |   318 |  9540 |     2   (0)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  32 |            PX SELECTOR          |                  |       |       |        |          |  Q1,04 | SCWC |        |
|* 33 |         HASH JOIN           |                  |   318 |  9540 |     2   (0)| 00:00:01 |  Q1,04 | SCWC |        |
|  34 |          JOIN FILTER CREATE     | :BF0002              |   217 |  4340 |     1   (0)| 00:00:01 |  Q1,04 | PCWP |        |
|* 35 |           TABLE ACCESS INMEMORY FULL| OBIEE_COUNTY_HIER        |   217 |  4340 |     1   (0)| 00:00:01 |  Q1,04 | SCWP |        |
|  36 |          JOIN FILTER USE        | :BF0002              |  1787 | 17870 |     1   (0)| 00:00:01 |  Q1,04 | PCWP |        |
|* 37 |           TABLE ACCESS INMEMORY FULL| OBIEE_GEO_AIRPORT_BRIDGE |  1787 | 17870 |     1   (0)| 00:00:01 |  Q1,04 | SCWP |        |
|  38 |         JOIN FILTER USE         | :BF0000              |  3773K|   165M|   874  (13)| 00:00:01 |  Q1,05 | PCWP |        |
|  39 |          JOIN FILTER USE            | :BF0001              |  3773K|   165M|   874  (13)| 00:00:01 |  Q1,05 | PCWP |        |
|  40 |           PX BLOCK ITERATOR         |                  |  3773K|   165M|   874  (13)| 00:00:01 |  Q1,05 | PCWC |        |
|* 41 |            TABLE ACCESS INMEMORY FULL   | PERFORMANCE          |  3773K|   165M|   874  (13)| 00:00:01 |  Q1,05 | PCWP |        |
-----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   7 - access("T233484"."STCTY_FIPS"="T233820"."SC_CODE")
  12 - access("T233484"."AIRPORT"="T233937"."ORIGIN")
  17 - access("T233732"."CAL_MONTH"="T233937"."MONTH" AND "T233732"."CAL_QTR"="T233937"."QUARTER" AND "T233732"."CAL_YEAR"="T233937"."YEAR")
  21 - inmemory("T233732"."CAL_MONTH">=6 AND "T233732"."CAL_MONTH"<=12)
       filter("T233732"."CAL_MONTH">=6 AND "T233732"."CAL_MONTH"<=12)
  22 - access("T233594"."CODE"="T233937"."DISTANCEGROUP")
  27 - inmemory("T233594"."DESCRIPTION"='1000-1249 Miles')
       filter("T233594"."DESCRIPTION"='1000-1249 Miles')
  28 - access("T233497"."AIRPORT"="T233937"."DEST")
  33 - access("T233497"."STCTY_FIPS"="T233831"."SC_CODE")
  35 - inmemory("T233831"."R_NAME"='Northeast Region')
       filter("T233831"."R_NAME"='Northeast Region')
  37 - inmemory(SYS_OP_BLOOM_FILTER(:BF0002,"T233497"."STCTY_FIPS"))
       filter(SYS_OP_BLOOM_FILTER(:BF0002,"T233497"."STCTY_FIPS"))
  41 - inmemory("T233937"."MONTH">=6 AND "T233937"."MONTH"<=12 AND SYS_OP_BLOOM_FILTER_LIST(SYS_OP_BLOOM_FILTER(:BF0001,"T233937"."DEST"),SYS
          _OP_BLOOM_FILTER(:BF0000,"T233937"."DISTANCEGROUP")))
       filter("T233937"."MONTH">=6 AND "T233937"."MONTH"<=12 AND SYS_OP_BLOOM_FILTER_LIST(SYS_OP_BLOOM_FILTER(:BF0001,"T233937"."DEST"),SYS_O
          P_BLOOM_FILTER(:BF0000,"T233937"."DISTANCEGROUP")))

Note
-----
   - dynamic statistics used: dynamic sampling (level=AUTO)
   - Degree of Parallelism is 2 because of table property

75 rows selected.

If I then turn-off the in-memory feature and regenerate the execution plan, you can see without in-memory the plan is around 5x as expensive:

SQL> alter system set INMEMORY_SIZE = 0 scope = spfile;

System altered.

SQL> shutdown immediate;
ORA-01097: cannot shutdown while in a transaction - commit or rollback first
SQL> rollback;

Rollback complete.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area 7398752256 bytes
Fixed Size          2941528 bytes
Variable Size        1056968104 bytes
Database Buffers     6325010432 bytes
Redo Buffers           13832192 bytes
Database mounted.
Database opened.
SQL> @noinmem_explain.sql

Explained.

SQL> set lines 300
SQL> set pages 0
SQL> select * from table(dbms_xplan.display);
Plan hash value: 2990499928

--------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name              | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |               |     9 |  1341 |  4086   (1)| 00:00:01 |        |      |        |
|   1 |  TEMP TABLE TRANSFORMATION                |               |       |       |        |          |        |      |        |
|   2 |   LOAD AS SELECT                      | SYS_TEMP_0FD9D6605_275335 |       |       |        |          |        |      |        |
|*  3 |    HASH JOIN                          |               |   318 |  9540 |    22   (0)| 00:00:01 |        |      |        |
|*  4 |     TABLE ACCESS FULL                     | OBIEE_COUNTY_HIER         |   217 |  4340 |    13   (0)| 00:00:01 |        |      |        |
|   5 |     TABLE ACCESS FULL                     | OBIEE_GEO_AIRPORT_BRIDGE  |  1787 | 17870 |     9   (0)| 00:00:01 |        |      |        |
|   6 |   PX COORDINATOR                      |               |       |       |        |          |        |      |        |
|   7 |    PX SEND QC (ORDER)                     | :TQ10008              |     9 |  1341 |  4086   (1)| 00:00:01 |  Q1,08 | P->S | QC (ORDER) |
|   8 |     SORT GROUP BY                     |               |     9 |  1341 |  4086   (1)| 00:00:01 |  Q1,08 | PCWP |        |
|   9 |      PX RECEIVE                       |               |     9 |  1341 |  4086   (1)| 00:00:01 |  Q1,08 | PCWP |        |
|  10 |       PX SEND RANGE                   | :TQ10007              |     9 |  1341 |  4086   (1)| 00:00:01 |  Q1,07 | P->P | RANGE       |
|  11 |        HASH GROUP BY                      |               |     9 |  1341 |  4086   (1)| 00:00:01 |  Q1,07 | PCWP |        |
|* 12 |     HASH JOIN                     |               |   281 | 41869 |  4085   (1)| 00:00:01 |  Q1,07 | PCWP |        |
|  13 |      PX RECEIVE                   |               |   281 | 33158 |  4072   (1)| 00:00:01 |  Q1,07 | PCWP |        |
|  14 |       PX SEND HYBRID HASH                 | :TQ10005              |   281 | 33158 |  4072   (1)| 00:00:01 |  Q1,05 | P->P | HYBRID HASH|
|  15 |        STATISTICS COLLECTOR               |               |       |       |        |          |  Q1,05 | PCWC |        |
|* 16 |         HASH JOIN BUFFERED                |               |   281 | 33158 |  4072   (1)| 00:00:01 |  Q1,05 | PCWP |        |
|  17 |          VIEW                     | VW_GBC_29             |   281 | 30348 |  4063   (1)| 00:00:01 |  Q1,05 | PCWP |        |
|  18 |           HASH GROUP BY               |               |   281 | 79523 |  4041   (1)| 00:00:01 |  Q1,05 | PCWP |        |
|  19 |            PX RECEIVE                 |               |   281 | 79523 |  4041   (1)| 00:00:01 |  Q1,05 | PCWP |        |
|  20 |         PX SEND HASH                  | :TQ10003              |   281 | 79523 |  4041   (1)| 00:00:01 |  Q1,03 | P->P | HASH    |
|  21 |          HASH GROUP BY                |               |   281 | 79523 |  4041   (1)| 00:00:01 |  Q1,03 | PCWP |        |
|* 22 |           HASH JOIN               |               | 60853 |    16M|  4039   (1)| 00:00:01 |  Q1,03 | PCWP |        |
|  23 |            BUFFER SORT                |               |       |       |        |          |  Q1,03 | PCWC |        |
|  24 |             PX RECEIVE                |               |   318 |  1272 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |        |
|  25 |              PX SEND BROADCAST            | :TQ10000              |   318 |  1272 |     2   (0)| 00:00:01 |        | S->P | BROADCAST  |
|  26 |               TABLE ACCESS FULL           | SYS_TEMP_0FD9D6605_275335 |   318 |  1272 |     2   (0)| 00:00:01 |        |      |        |
|* 27 |            HASH JOIN                  |               | 60853 |    16M|  4037   (1)| 00:00:01 |  Q1,03 | PCWP |        |
|  28 |             PX RECEIVE                |               |   160 |  4640 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |        |
|  29 |              PX SEND BROADCAST            | :TQ10002              |   160 |  4640 |     4   (0)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  30 |               PX SELECTOR             |               |       |       |        |          |  Q1,02 | SCWC |        |
|  31 |                MERGE JOIN CARTESIAN       |               |   160 |  4640 |     4   (0)| 00:00:01 |  Q1,02 | SCWC |        |
|* 32 |             VIEW                  | index$_join$_006          |     1 |    19 |     2   (0)| 00:00:01 |  Q1,02 | SCWC |        |
|* 33 |              HASH JOIN            |               |       |       |        |          |  Q1,02 | SCWC |        |
|  34 |               BITMAP CONVERSION TO ROWIDS     |               |     1 |    19 |     1   (0)| 00:00:01 |  Q1,02 | SCWC |        |
|* 35 |                BITMAP INDEX SINGLE VALUE      | M_INDEX32             |       |       |        |          |  Q1,02 | SCWP |        |
|  36 |               BITMAP CONVERSION TO ROWIDS     |               |     1 |    19 |     1   (0)| 00:00:01 |  Q1,02 | SCWC |        |
|  37 |                BITMAP INDEX FULL SCAN     | INDEX4            |       |       |        |          |  Q1,02 | SCWP |        |
|  38 |             BUFFER SORT           |               |   161 |  1610 |     2   (0)| 00:00:01 |  Q1,02 | SCWC |        |
|  39 |              BITMAP CONVERSION TO ROWIDS      |               |   161 |  1610 |     2   (0)| 00:00:01 |  Q1,02 | SCWC |        |
|* 40 |               BITMAP INDEX FAST FULL SCAN     | M_INDEX28             |       |       |        |          |  Q1,02 | SCWP |        |
|* 41 |             VIEW                  | VW_ST_167D3604        | 61043 |    14M|  4033   (1)| 00:00:01 |  Q1,03 | PCWP |        |
|  42 |              NESTED LOOPS             |               | 61043 |  4768K|  4029   (1)| 00:00:01 |  Q1,03 | PCWP |        |
|  43 |               BUFFER SORT             |               |       |       |        |          |  Q1,03 | PCWC |        |
|  44 |                PX RECEIVE             |               |       |       |        |          |  Q1,03 | PCWP |        |
|  45 |             PX SEND HASH (BLOCK ADDRESS)      | :TQ10001              |       |       |        |          |        | S->P | HASH (BLOCK|
|  46 |              BITMAP CONVERSION TO ROWIDS      |               | 61042 |  1311K|   365   (1)| 00:00:01 |        |      |        |
|  47 |               BITMAP AND              |               |       |       |        |          |        |      |        |
|  48 |                BITMAP MERGE           |               |       |       |        |          |        |      |        |
|  49 |                 BITMAP KEY ITERATION      |               |       |       |        |          |        |      |        |
|* 50 |                  VIEW             | index$_join$_255          |     1 |    19 |     2   (0)| 00:00:01 |        |      |        |
|* 51 |                   HASH JOIN           |               |       |       |        |          |        |      |        |
|  52 |                    BITMAP CONVERSION TO ROWIDS|               |     1 |    19 |     1   (0)| 00:00:01 |        |      |        |
|* 53 |                 BITMAP INDEX SINGLE VALUE | M_INDEX32             |       |       |        |          |        |      |        |
|  54 |                    BITMAP CONVERSION TO ROWIDS|               |     1 |    19 |     1   (0)| 00:00:01 |        |      |        |
|  55 |                 BITMAP INDEX FULL SCAN    | INDEX4            |       |       |        |          |        |      |        |
|* 56 |                  BITMAP INDEX RANGE SCAN      | PERF_DISTANCEGRP          |       |       |        |          |        |      |        |
|  57 |                BITMAP MERGE           |               |       |       |        |          |        |      |        |
|  58 |                 BITMAP KEY ITERATION      |               |       |       |        |          |        |      |        |
|  59 |                  TABLE ACCESS FULL        | SYS_TEMP_0FD9D6605_275335 |   318 |  1272 |     2   (0)| 00:00:01 |        |      |        |
|* 60 |                  BITMAP INDEX RANGE SCAN      | PERF_DEST             |       |       |        |          |        |      |        |
|  61 |                BITMAP MERGE           |               |       |       |        |          |        |      |        |
|* 62 |                 BITMAP INDEX RANGE SCAN   | PERF_MONTH            |       |       |        |          |        |      |        |
|  63 |               TABLE ACCESS BY USER ROWID      | PERFORMANCE           |     1 |    58 |  3669   (1)| 00:00:01 |  Q1,03 | PCWP |        |
|  64 |          PX RECEIVE                   |               |  1787 | 17870 |     9   (0)| 00:00:01 |  Q1,05 | PCWP |        |
|  65 |           PX SEND BROADCAST               | :TQ10004              |  1787 | 17870 |     9   (0)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  66 |            PX SELECTOR                |               |       |       |        |          |  Q1,04 | SCWC |        |
|  67 |         TABLE ACCESS FULL             | OBIEE_GEO_AIRPORT_BRIDGE  |  1787 | 17870 |     9   (0)| 00:00:01 |  Q1,04 | SCWP |        |
|  68 |      PX RECEIVE                   |               |  3221 | 99851 |    13   (0)| 00:00:01 |  Q1,07 | PCWP |        |
|  69 |       PX SEND HYBRID HASH                 | :TQ10006              |  3221 | 99851 |    13   (0)| 00:00:01 |  Q1,06 | S->P | HYBRID HASH|
|  70 |        PX SELECTOR                    |               |       |       |        |          |  Q1,06 | SCWC |        |
|  71 |         TABLE ACCESS FULL                 | OBIEE_COUNTY_HIER         |  3221 | 99851 |    13   (0)| 00:00:01 |  Q1,06 | SCWP |        |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T233497"."STCTY_FIPS"="T233831"."SC_CODE")
   4 - filter("T233831"."R_NAME"='Northeast Region')
  12 - access("T233484"."STCTY_FIPS"="T233820"."SC_CODE")
  16 - access("T233484"."AIRPORT"="ITEM_1")
  22 - access("C0"="ITEM_1")
  27 - access("T233732"."CAL_YEAR"="ITEM_5" AND "T233732"."CAL_QTR"="ITEM_4" AND "T233732"."CAL_MONTH"="ITEM_3" AND "T233594"."CODE"="ITEM_2")
  32 - filter("T233594"."DESCRIPTION"='1000-1249 Miles')
  33 - access(ROWID=ROWID)
  35 - access("T233594"."DESCRIPTION"='1000-1249 Miles')
  40 - filter("T233732"."CAL_MONTH">=6 AND "T233732"."CAL_MONTH"<=12)
  41 - filter("ITEM_3"<=12 AND "ITEM_3">=6)
  50 - filter("T233594"."DESCRIPTION"='1000-1249 Miles')
  51 - access(ROWID=ROWID)
  53 - access("T233594"."DESCRIPTION"='1000-1249 Miles')
  56 - access("T233937"."DISTANCEGROUP"="T233594"."CODE")
  60 - access("T233937"."DEST"="C0")
  62 - access("T233937"."MONTH">=6 AND "T233937"."MONTH"<=12)

Note
-----
   - dynamic statistics used: dynamic sampling (level=AUTO)
   - Degree of Parallelism is 2 because of table property
   - star transformation used for this statement

105 rows selected.

Running the actual queries in this case gives me a wall-time of around 4 seconds for the in-memory version, and 14 seconds when in-memory query is disabled … but the response time isn’t anywhere near the initial run where we had data stored row-wise but with aggregate tables, so let’s finish-off the testing by putting the aggregate table in-memory too, and see if that makes a difference.

And of course, it absolutely flew:

chart4

So to conclude from my look at the Oracle Database 12c In-Memory option with OBIEE11g v406 SampleApp, I’d say the following based on my initial tests:

  • For BI-type reporting where you’re typically summarising lots of data, the in-memory option doesn’t remove the need for aggregate tables – you’ll still benefit significantly from having them, in my observation
  • Where the in-memory option does benefit you is when you’re querying the detail-level data – it helps with aggregation but it’s main strength is fast filtering against subsets of columns
  • Some of the more complex SQL queries issued by OBIEE’s BI Server, for example when creating lots of subtotals and totals against a dataset, reduce the effectiveness of the in-memory option – you’ll get the biggest speed improvement, at least at the moment, with queries with simpler predicates and not so complex joins

There might be more to the aggregation story in the end, though. Looking at the Oracle Database 12c In-Memory Option Technical White Paper, the in-memory option should in-fact help with aggregation through a new optimiser transformation called “vector group by”, a transformation that’s likened to a star transformation that uses CPU-efficient algorithms and a multi-dimensional array created on-the-fly in the PGA called an “in-memory accumulator”.

NewImage

In fact, what we’ve heard is that many of the old Oracle OLAP team have moved over to the in-memory option team and were responsible for this feature, so I’ll be taking a closer look at in-memory aggregation in this new feature over the next few months. In my examples though, I didn’t see any examples of vector group by in the query execution plans, so I’m assuming either conditions weren’t right for it, or like star transformations there’s some combination of setting and query factors that need to be in place before it’ll appear in the execution plan (and queries presumably run that much faster).

For now though – that’s my first run-through of the 12c In-Memory Option in the context of a typical BI workload. Check back later in the year for more in-memory option postings, including hopefully something more on the in-memory aggregation feature.

Categories: BI & Warehousing

Rittman Mead in the OTN TOUR Latin America 2014

Tue, 2014-08-26 11:28

Another OTN Tour Latin America has come and gone. This is the most important technical event in the region visiting 12 countries and with more than 2500 attendees in two weeks.

This year Rittman Mead was part of the OTN Tour in Buenos Aires (Argentina) and Montevideo (Uruguay) presenting about ODI and OGG.

We have started in Buenos Aires on August 11 for the first day of the OTN Tour in Argentina. I’ve talked about the integration of ODI and OGG 12c, explaining all the technical details to configure and how to implement it. Most of the attendees didn’t work with these tools (but were curious about them) so I personalised a little the presentation giving them first an introduction of ODI and OGG.

As the vice-president of the UYOUG (Uruguayan Oracle User Group) I’m part of the organisation of the OTN Tour in my country, so we needed to come back in the same Monday to adjust some last details to have everything ready for the event in Uruguay.

Most of the speakers came on Wednesday, and we have spent a great day with Michelle Malcher, Kamran Agayev, Hans Forbrich and Mike Dietrich. First, we went to lunch at the Mercado del Puerto, an emblematic place that has lot of “parrillas” (kind of barbecues) and then we gave them a little city tour which included a visit to El Cerro de Montevideo. Finally we visited one of the most important wineries in Uruguay, Bodega Bouza where we have a wine tour followed by an amazing wine tasting of a variety of wines including Tannat which is our insignia grape. You know…it is important to be relaxed before a conference :-)

luch_allCerro_Montevideo

parrilla

The first day of the event in Uruguay was dedicated exclusively to technical sessions and in the second one we had the hands-on labs. The conference covered a wide range of topics from BI Mobile, e-Business Suite to how to upgrade to Oracle Database12c, Oracle Virtualization and Oracle RAC. All the sessions were packed with attendees.

Mike session Edel session

The next day, we had labs with PCs with software already installed but attendees could came with their own laptops to install all the software needed for the hands-on. We had the famous RAC Attack! lead by Kamran and with the help of the ninjas Michelle, Hans and Nelson Calero, and an Oracle Virtualization lab by Hernan Petitti for 7 hours!

Rac AttackRAC Attack2

It was a great event. You can see more pictures here and download the presentations here. The attendees as well as all the speakers were really happy with the result. And so did we.

This  is only the beginning for Rittman Mead in Latin America. There are a lot of things to come, so stay tuned!

Categories: BI & Warehousing

Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations

Tue, 2014-08-26 03:34

If you’ve been following our postings on the blog over the past year, you’ll probably have seen quite a lot of activity around big data and Hadoop and in particular, what these technologies bring to the world of Oracle Business Intelligence, Oracle Data Warehousing and Oracle Data Integration. For anyone who’s not had a chance to read the posts and articles, the three links below are a great introduction to what we’ve been up to:

In addition, we recently took part in an OTN ArchBeat podcast with Stewart Bryson and Andrew Bond on the updated Oracle Information Management Reference Architecture we co-developed with Oracle’s Enterprise Architecture team, where you can hear me talk with Stewart and Andrew about how the updated architecture came about, the thinking behind it, and how concepts like the data reservoir and data factory can be delivered in an agile way.

I’m also pleased to be delivering a number of presentations and seminars over the next few months, on Oracle and Cloudera’s Hadoop technology and how it applies to Oracle BI, DW and DI developers – if you’re part of a local Oracle user group and you’d like me to deliver one of them for your group, drop me an email at mark.rittman@rittmanmead.com.

Slovenian Oracle User Group / Croatian Oracle User Group Conferences, October 2014

These two events run over consecutive days in Slovenia and Croatia, and I’m delivering the keynote at each on Analytics and Big Data, and a one-day seminar running on the Tuesday in Slovenia, and over the Wednesday and Thursday in Croatia. The theme of the seminar is around applying Hadoop and big data technologies to Oracle BI, DW and data integration, and is made up of four sessions:

Part 1 : Introduction to Hadoop and Big Data Technologies for Oracle BI & DW Developers

“In this session we’ll introduce some key Hadoop concepts including HDFS, MapReduce, Hive and NoSQL/HBase, with the focus on Oracle Big Data Appliance and Cloudera Distribution including Hadoop. We’ll explain how data is stored on a Hadoop system and the high-level ways it is accessed and analysed, and outline Oracle’s products in this area including the Big Data Connectors, Oracle Big Data SQL, and Oracle Business Intelligence (OBI) and Oracle Data Integrator (ODI).”

Part 2 : Hadoop and NoSQL Data Ingestion using Oracle Data Integrator 12c and Hadoop Technologies

“There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.”

Part 3 : Big Data Analysis using Hive, Pig, Spark and Oracle R Enterprise / Oracle R Advanced Analytics for Hadoop

“Data within a Hadoop cluster is typically analysed and processed using technologies such as Pig, Hive and Spark before being made available for wider use using products like Oracle Big Data SQL and Oracle Business Intelligence. In this session, we’ll introduce Pig and Hive as key analysis tools for working with Hadoop data using MapReduce, and then move on to Spark as the next-generation analysis platform typically being used on Hadoop clusters today. We’ll also look at the role of Oracle’s R technologies in this scenario, using Oracle R Enterprise and Oracle R Advanced Analytics for Hadoop to analyse and understand larger datasets than we could normally accommodate with desktop analysis environments.”

Part 4 : Visualizing Hadoop Datasets using Oracle Business Intelligence, Oracle BI Publisher and Oracle Endeca Information Discovery

“Once insights and analysis have been produced within your Hadoop cluster by analysts and technical staff, it’s usually the case that you want to share the output with a wider audience in the organisation. Oracle Business Intelligence has connectivity to Hadoop through Apache Hive compatibility, and other Oracle tools such as Oracle BI Publisher and Oracle Endeca Information Discovery can be used to visualise and publish Hadoop data. In this final session we’ll look at what’s involved in connecting these tools to your Hadoop environment, and also consider where data is optimally located when large amounts of Hadoop data need to be analysed alongside more traditional data warehouse datasets.”

Oracle Openworld 2014 (ODTUG Sunday Symposium), September 2014

Along with another session later in the week on the upcoming Oracle BI Cloud Services, I’m doing a session on the User Group Sunday for ODTUG on ODI12c and the Big Data Connectors for ETL on Hadoop:

Deep Dive into Big Data ETL with Oracle Data Integrator 12c and Oracle Big Data Connectors [UGF9481]

“Much of the time required to work with big data sources is spent in the data acquisition, preparation, and transformation stages of a project before your data reaches a state suitable for analysis by your users. Oracle Data Integrator, together with Oracle Big Data Connectors, provides a means to efficiently load and unload data to and from Oracle Database into a Hadoop cluster and perform transformations on the data, either in raw form or in technologies such as Apache Hive or R. This presentation looks at how Oracle Data Integrator can form the centerpiece of your big data ETL strategy, within either a custom-built big data environment or one based on Oracle Big Data Appliance.”

UK Oracle User Group Tech’14 Conference, December 2014

I’m delivering an extended version of my OOW presentation on the UKOUG Tech’14’s “Super Sunday” event, this time over 90 minutes rather than the 45 at OOW, giving me a bit more time for demos and discussion:

Deep-Dive into Big Data ETL using ODI12c and Oracle Big Data Connectors

“Much of the time required to work with Big Data sources is spent in the data aquisition, preparation and transformation stages of a project; before your data is in a state suitable for analysis by your users.Oracle Data Integrator, together with Oracle Big Data Connectors, provides a means to efficiently load and unload data from Oracle Database into a Hadoop cluster, and perform transformations on the data either in raw form or technologies such as Apache Hive or R. In this presentation, we will look at how ODI can form the centrepiece of your Big Data ETL strategy, either within a custom-built Big Data environment or one based on Oracle Big Data Appliance.”

Oracle DW Global Leaders’ Meeting, Dubai, December 2014

The Oracle DW Global Leaders forum is an invite-only group organised by Oracle and attended by select customers and associate partners, one of which is Rittman Mead. I’ll be delivering the technical seminar at the end of the second day, which will run over two sessions and will be based on the main points from the one-day seminars I’m running in Croatia and Slovenia.

From Hadoop to dashboards, via ODI and the BDA – the complete trail : Part 1 and Part 2

“Join Rittman Mead for this afternoon workshop, taking you through data acquisition and transformation in Hadoop using ODI, Cloudera CDH and Oracle Big Data Appliance, through to reporting on that data using OBIEE, Endeca and Oracle Big Data SQL. Hear our project experiences, and tips and techniques based on real-world implementations”

Keep an eye out for more Hadoop and big data content over the next few weeks, including a look at MongoDB and NoSQL-type databases, and how they can be used alongside Oracle BI, DW and data integration tools.

 

Categories: BI & Warehousing

Smoothing the Transition – The New Smart View 11.1.2.1.102 for Microsoft Office and OBIEE

Fri, 2014-08-15 12:24

Introductions

There’s a good chance that, if you’re reading this, you likely perform some reporting, analytics, data stewardship role or probably some combination of all three. And be it for a large corporation or a small company, there are likely standards and practices that pertain to how the above jobs are performed on a day to day basis; not easily changed and perpetually validated by big budgets and long careers. It is equally likely that deeply ingrained within these reporting practices lies some moderate to heavy implementation of Excel. It wasn’t long ago that I found myself utilizing the spreadsheet program on a daily basis and for hours upon hours at a time.

What this essentially amounted to:

  • Pulling down large amounts of data from our department’s data model using large SQL queries that themselves could take most of the day to elucidate, let alone waiting on the query to yield results, which could easily warrant a bathroom break, a phone call, or if you were feeling adventurous, catching up on email.
  • Validating your results
  • Exporting to Excel (key step here!)Massaging and formatting your data by implementing innumerable and often unwieldy functions that deserved their own time slot on your schedule for the day to figure out
  • Proofing your analysis so that it got to management in ship shape
  • Hoping that an analyst from another department who utilized the same metric on their report and who would be at the same meeting actually coincided with yours

                   

Fast forward a bit and I’m sitting here, writing this blog as a sort of proverbial white flag in the great battle between Excel and the behemoth that is OBIEE. And just what is this white flag? Why, it’s Oracle’s most recent iteration of Smart View, which provides expanded functionality and support for the Microsoft Office Suite of programs. Namely, its golden boy, Excel. That’s right, Excel, the darling of office staff everywhere, the program upon which empires rise and fall. In paraphrasing a quote from www.cfo.com, some 64% of public and private companies still use Excel and other “manual” solutions to perform their finance functions. So, in the world of the spreadsheet, when does it makes sense to cross that blurry line from cell to subject area? Smart View now makes answering that question much easier. It seems that they’ve really gotten a grasp on the formatting shortcomings of the last version and made up for it in spades. Or, so at least they claim.

The Test Run – OBIEE to Excel 

The example below illustrates a simple import via Smart View. I generated a dashboard in Answers which mimics that of an Excel design I found online. Thank the good folks over at www.chandoo.org for their excellent skills in Excel dashboarding and for providing plenty of great examples. The dashboard contains a table with a selection of KPI’s that the user may then choose to sort on via a View Selector (each view has been sorted on a different KPI and is on a different Compound Layout). Upon selecting a KPI, the analysis will then display the Top 10 products by the KPI selected. In addition, the table contains conditional formatting which simply alerts users to the variance between different KPI’s and their targets. Lastly, there is a scatter plot view which displays our Product dimension as seen through the lens of Revenue and Quantity. Per the most recent Oracle documentation, we shouldn’t have any trouble including the current selections of a dashboard prompt either. Let’s see how it performs when we move it over to Excel.

 “OBIEE report and page prompts are fully supported as part of the import process. Dashboards can be imported through Oracle Smart View on a per page basis or the entire dashboard. Prompts are applied at the current state of the logged in user. Future releases of the product will support dashboard prompts directly through Microsoft Office.”

SVB 2

The Results

And there you have it! Excel displays our table and graph views as per the most recent selection from the Dashboard prompt. But wait! Our conditional formatting seems to be missing and to prove this, this is even the case when exported directly from the analysis view as an Excel workbook.

SVB 3

Conditional Formatting

For our second scenario, let’s see how Excel handles a simpler, heat map style conditional formatting. I’ve made a simple table on our dashboard that measures Revenue, Quantity Sold, and the Average Order in $. I set up conditional formatting around the Average Order measure to see how Excel handles importing the color scheme for the currently selected Time parameters on the dashboard.

Contrastingly, we see that Smart View has preserved a simpler, Heatmap style of conditional formatting when imported from OBIEE through Smart View. So, perhaps it is Excel’s lack of corresponding graphic in the previous example that has caused the migration snafu? OBIEE doesn’t even seem to render our arrow graphics as per the documentation.

“Oracle BI Customizations and View Standards – The Import of Oracle BI content can leverage the customizations and view standards used within an OBIEE environment. All view designed modifications such as conditional formatting, background colors or data configuration is automatically translated to the Microsoft Office environment.”

SVB 4

SVB 5

 

Excel to OBIEE

Let’s see what the latest edition of Smart View offers when moving an analysis from Excel to OBIEE.
Because we weren’t able to import our full table view, why don’t we construct it using the View Designer? The interface looks clean and provides an intuitive approach to producing basic Answers views. Accessing our subject area, I simply selected the columns that matched those on our Answers analysis. After clicking ‘OK’, sorting on our Revenue column from largest to smallest and doing a little deleting, we have a pseudo ‘Top 10’ analysis by Revenue. Given the aesthetic attributes of our Answers analysis, lets see how we’re going to replicate this in Excel.

 

SVB 6

SVB 8

SVB 7

After selecting the table, we can navigate to the Design tab under ‘Table Tools’ and select an alternating Grey scheme which gives us the ‘Enable Alternate Styling’ design quality. Now lets add some formulas and conditional formatting that will give us our Calculated column equivalents. We can insert two rows, one between Revenue and Target, and between Qty and Target, to make room for conditional formatting and Excel’s Icon Sets feature. We then create a simple formula that subtracts Revenue and Quantity from their respective targets in the column between the two, assign conditional formatting and voila! Excel even has a check box that lets you show the arrow only.

 

SVB 9

SVB 10

From Excel, we can select Publish View to deposit our analysis into our Shared Folder. The results indicate a sort of ‘two way street’ between Smart View and Excel and vice versa. Neither totally supports the formatting capabilities of the other, as if to say Smart View is giving ground with every new release. In this blog, we’ve taken a look at how Smart View handles some mildly complex conditional formatting and what it takes to replicate this feature in native Excel. In a user environment where reports are flying back and forth between the two platforms, Smart View definitely makes sense, however it might be advisable to simply deliver the minimum of what is needed and let an end user make any formatting based modifications. After all, who would want to do all that work only to have it lost in translation?

Categories: BI & Warehousing

The Business Value In Training

Mon, 2014-08-11 14:59

One of the main things I get asked to do here at Rittman Mead, is deliver the OBIEE front-end training course (TRN 202). This a great course that has served both us, and our clients well over the years. It has always been in high demand and always delivered with great feedback from those in attendance. However, as with all things in life and business, there is going to be room for improvement and opportunities to provide even more value to our clients. Of all the feedback I receive from delivering the course, my favorite is that we do an incredible job delivering both the content and providing real business scenarios on how we have used this tool in the consulting field. Attendees will ask me how a feature works, and how I have used it with current and former clients, 100% of the time.

This year at KSCope ’14 in Seattle, we were asked to deliver a 2 hour front-end training course. Our normal front-end course runs a span of two days and covers just about every feature you can use all the way from Answers and Dashboards, to BI Publisher. Before the invitation to KScope ’14, we had bee tooling with the idea to deliver a course that not only teaches attendees on how to navigate OBIEE and use it’s features, but also emphasizes the business value behind why those features exist in the first place. We felt that too often users are given a quick overview of what the tool includes, but left figure out on their own how to extract the most value. It is one thing to create a graph in Answers, and another to know what the best graph to use might be. So in preparation for the KScope session, we decided to build the content around not only how to develop in OBIEE, but also why, as a business user, you would choose one layout/graph/feature over another. As you would expect, the turn out for the session was fantastic, we had over 70 plus pre-register, with another 10 on the waiting list. This was proof that there is an impending need to pull as much business value out of the tool as there is to simply learn how to use it. We were so encouraged by the attendance and feedback from this event, that we spent the next several weeks developing what is called the “Business Enablement Bootcamp”. It is a 3 day course that will cover Answers, Dashboards, Action Framework, BI Publisher, and the new Mobile App Designer. This is an exciting time for us in that we not only get show people how to use all of the great features that are built into the tool, but to also incorporate years of consulting experience and hundreds of client engagements right into the content. Below I have listed a breakdown of the material and the value it will provide.

Answers

Whenever we deliver our OBIEE 5-day bootcamp, which covers everything from infrastructure to the front end, Answers is one of the key components that we teach. Answers is the building block for analysis in OBIEE. While this portion of the tool is relatively intuitive to get started with, there are so many valuable nuances and settings that can get over looked without proper instruction. In order to get the most out of the tool, a business user needs be able to not only create basic analyses, but be able to use many of the advanced features such as hierarchical columns, master-detail, and selection steps. Knowing how and why to use these features is a key component to gaining valuable insight for your business users.

Dashboards

This one in particular is dear to my heart. To create an analysis and share it on a dashboard is one thing, but to tell a particular story with a series of visualizations strategically placed on a dashboard is something entirely different. Like anything else business intelligence, optimal visualization and best practices are learned skills that take time and practice. Valuable skills like making the most of your white space, choosing the correct visualizations, and formatting will be covered. When you provide your user base with the knowledge and skills to tell the best story, there will be no time wasted with clumsy iterations and guesswork as to what is the best way to present your data. This training will provide some simple parameters to work within, so that users can quickly gather requirements and develop dashboards that more polish and relevance than ever before.

 Dashboard

 Action Framework

Whenever I deliver any form of front end training, I always feel like this piece of OBIEE is either overlooked, undervalued, or both. This is because most users are either unaware of it’s use, or really don’t have a clear idea of its value and functionality. It’s as if it is viewed as an add-on in the sense that is just simply a nice feature. The action framework is something that when properly taught how to navigate, or given demonstration of its value, it will indeed become an invaluable piece of the stack. In order to get the most out of your catalog, users need to be shown how to strategically place action links to give the ability to drill across to analyses and add more context for discovery. These are just a few capabilities within the action framework that when shown how and when to use it, can add valuable insight (not to mention convenience) to an organization.

Bi Publisher/Mobile App Designer

Along with the action framework, this particular piece of the tool has the tendency to get overlooked, or simply give users cold feet about implementing it to complement answers. I actually would have agreed with these feelings before the release of 11.1.1.7. Before this release, a user would need to have a pretty advanced knowledge of data modeling. However, users can now simply pick any subject area, and use the report creation wizard to be off and running creating pixel perfect reports in no time. Also, the new Mobile App Designer on top of the publisher platform is another welcomed addition to this tool. Being the visual person that I am, I think that this is where this pixel perfect tool really shines. Objects just look a lot more polished right out of the box, without having to spend a lot of time formatting the same way you would have to in answers. During training, attendees will be exposed the many of the new features within BIP and MAD, as well as how to use them to complement answers and dashboards.

Third Party Visualizations

While having the ability to implement third party visualizations like D3 and Flot into OBIEE is more of an advanced skill, the market and need seems to be growing for this. While Oracle has done some good things in past releases with new visualizations like performance tiles and waterfall charts, we all know that business requirements can be demanding at times and may require going elsewhere to appease the masses. You can visit https://github.com/mbostock/d3/wiki/Gallery to see some of the other available visualizations beyond what is available in OBIEE. During training, attendees will learn the value of when and why external visualizations might be useful, as well as a high level view of how they can be implemented.

Bullet Chart

Users often make the mistake of viewing each piece of the front end stack as separate entities, and without proper training this is very understandable. Even though they are separate pieces of the product, they are all meant to work together and enhance the “Business Intelligence” of an organization. Without training the business to complement one piece to another, it will always be viewed as just another frustrating tool that they don’t have enough time to learn on their own. This tool is meant to empower your organization to have everything they need to make the most informed and timely decisions, let us use our experience to enable your business.

Categories: BI & Warehousing

Rittman Mead and Oracle Big Data Appliance

Mon, 2014-08-11 07:00

Over the past couple of years Rittman Mead have been broadening our skills and competencies out from core OBIEE, ODI and Oracle data warehousing into the new “emerging” analytic platforms: R and database advanced analytics, Hadoop, cloud and clustered/distributed systems. As we talked about in the recent series of updated Oracle Information Management Reference Architecture blog posts and my initial look at the Oracle Big Data SQL product, our customers are increasingly looking to complement their core Oracle analytics platform with ones to handle unstructured and big data, and as technologists we’re always interesting in what else we can use to help our customers get more insight out of their (total) dataset.

An area we’ve particularly focused on over the past year has been Hadoop and R analysis, with the recent announcement of our partnering with Cloudera and the recruitment of a big data and advanced analytics team operating our of our Brighton, UK office. We’ve also started to work on a number of projects and proof of concepts with customers in the UK and Europe, working mainly with core Oracle BI, DW and ETL customers looking to make their first move into Hadoop and big data. The usual pattern of engagement is for us to engage with some business users looking to analyse a dataset hitherto too large or too unstructured to load into their Oracle data warehouse, or where they recognise the need for more advanced analytics tools such as R, MapReduce and Spark but need some help getting started. Most often we put together a PoC Hadoop cluster for them using virtualization technology on existing hardware they own, allowing them to get started quickly and with no initial licensing outlay, with our preferred Hadoop distribution being Cloudera CDH, the same Hadoop distribution that comes on the Oracle Big Data Appliance. Projects then typically move on to Hadoop running directly on physical hardware, in a couple of cases Oracle’s Big Data Appliance, usually in conjunction with Oracle Database, Oracle Exadata and Oracle Exalytics for reporting.

One such project started off by the customer wanting to analyse a dataset that was too large for the space available in their Oracle database and that they couldn’t easily process or analyse using the SQL-based tools they usually used; in addition, like most large organisations, database and hardware provisioning took a long time and they needed to get the project moving quickly. We came in and quickly put together a virtualised Hadoop cluster together for them, on re-purposed hardware and using the free (Standard) edition of Cloudera CDH4, and then used the trial version of Oracle Big Data Connectors along with SFTP transfers to get data into the cluster and then analysed.

NewImage

The PoC itself then ran for just over a month with the bulk of the analysis being done using Oracle R Advanced Analytics for Hadoop, an extension to R that allows you to use Hive tables as a data source and create MapReduce jobs from within R itself; the output from the exercise was a series of specific-answer-to-specific-question R graphs that solved an immediate problem for the client, and showed the value of further investment in the technology and our services – the screenshot below shows a typical ORAAH session, in this case analyzing the flight delays dataset that you can also find on the Exalytics server and in smaller form in OBIEE 11g’s SampleApp dataset.

NewImage

That project has now moved onto a larger phase of work with Oracle Big Data Appliance used as the Hadoop platform rather than VMs, and Cloudera Hadoop upgraded from the free, unsupported Standard version to Cloudera Enterprise. The VMs in fact worked pretty well and had the advantage that they could be quickly spun-up and housed temporarily on an existing server, but were restricted by the RAM that we could assign to each VM – 2GB initially, quickly upgraded to 8GB per VM, and the fact that they were sharing CPU and IO resources. Big Data Appliance, by contrast, has 64GB or RAM per node – something that’s increasingly important now in-memory tools like Impala are begin used – and has InfiniBand networking between the nodes as well as fast network connections out to the wider network, something thats often overlooked when speccing up a Hadoop system.

The support setup for the BDA is pretty good as well; from a sysadmin perspective there’s a lights-out ILOM console for low-level administration, as well as plugins for Oracle Enterprise Manager 12c (screenshot below), and Oracle support the whole package, typically handling the hardware support themselves and delegating to Cloudera for more Hadoop-specific queries. I’ve raised several SRs on client support contracts since starting work on BDAs, and I’ve not had any problem with questions not being answered or buck-passing between Oracle and Cloudera.

NewImageOne thing that’s been interesting is the amount of actual work that you need to do with the Big Data Appliance beyond the actual installation and initial configuration by Oracle to “on-board” it into the typical enterprise environment. BDAs are left with customers in a fully-working state, but like Exalytics and Exadata though, initial install and configuration is just the start, and you’ve then got to integrate the platform in with your corporate systems and get developers on-boarded onto the platform. Tasks we’ve typically provided assistance with on projects like these include:

  • Configuring Cloudera Manager and Hue to connect to the corporate LDAP directory, and working with their security team to create LDAP groups for developer and administrative access that we then used to restrict and control access to these tools
  • Configuring other tools such as RStudio Server so that developers can be more productive on the platform
  • Putting in place an HDFS directory structure to support incoming data loads and data archiving, as well as directories to hold the output datasets from the analysis work we’re doing – all within the POSIX security setup that HDFS currently uses which limits us to just granting owner, group and world permissions on directories
  • Working with the client’s infrastructure team on things like alerting, troubleshooting and setting up backup and recovery – something that’s surprisingly tricky in the Hadoop world as Cloudera’s backup tools only backup from Hadoop-to-Hadoop, and by definition your Hadoop system is going to hold a lot of data, the volume of which your current backup tools aren’t going to easily handle

Once things are set up though you’ve got a pretty comprehensive platform that can be expanded up from the initial six nodes our customers’ systems typically start with to the full eighteen node cluster, and can use tools such as ODI to do data loading and movement, Spark and MapReduce to process and analyse data, and Hive, Impala and Pig to provide end-user access. The diagram below shows a typical future-state architecture we propose for clients on this initial BDA “starter config” where we’ve moved up to CDH5.x, with Spark and YARN generally used as the processing framework and with additional products such as MongoDB used for document-type storage and analysis:

NewImage

 

Something that’s turned out to be more of an issue on projects than I’d originally anticipated is complying with corporate security policies. By definition, most customers who buy an Oracle Big Data Appliance and going to be large customers with an existing Oracle database estate, and if they deal with the public they’re going to have pretty strict security and privacy rules you’ll need to adhere to. Something that’s surprising therefore to most customers new to Hadoop is how insecure or at least easily compromised the average Hadoop cluster is, with Hadoop FS shell security relying on trusted networks and incoming user connections and interfaces such as ODBC not checking passwords at all.

Hadoop and the BDA only becomes what’s termed “secure” when you link it to a Kerebos server, but not every customer has Kerebos set up and unless you enable this feature right at the start when you set up the BDA, it’s a fairly involved task to add retrospectively. Moreover, customers are used to fine-grained access control to their data, a single security model over their data and a good understanding in their heads as to how security works on their database, whereas Hadoop is still a collection of fairly-loosely coupled components with pretty primitive access controls, and no easy way to delete or redact data, for example, when a particular country’s privacy laws in-theory mandate this.

Like everything there’s a solution if you’re creative enough, with tools such as Apache Sentry providing role-based access control over Hive and Impala tables, alternative storage tools like HBase that permit read, write, update and delete operations on data rather than just HDFS’s insert and (table or partition-level) delete, and tools like Cloudera Navigator and BDA features like Oracle Audit Vault that provide administrators with some sort of oversight as to who’s accessing what data and when. As I mentioned in my blog post a couple of weeks ago, Oracle’s Big Data SQL product addresses this requirement pretty well, potentially allowing us to apply Oracle security over both relational, and Hadoop, datasets, but for now we’re working within current CDH4 capabilities and planning on introducing Apache Sentry for role-based access control to Hive and Impala in the coming weeks. We’re also looking at implementing Cloudera’s “secure gateway” cluster topology with all access restricted to just a single gateway Hadoop node, and the cluster itself firewalled-off with external access to just that gateway node and HTTP / REST API access to the various cluster services, for example as shown in the diagram below:

NewImage

My main focus on Hadoop projects has been on the overall Hadoop system architecture, and interacting with the client’s infrastructure and security teams to help them adopt the BDA and take over its maintenance. From the analysis side, it’s been equally as interesting, with a number of projects using tools such as R, Oracle R Advanced Analytics for Hadoop and core Hive/MapReduce for data analysis, Flume, Java and Python for data ingestion and processing, and most recently OBIEE11g for publishing the results out to a wider audience. Following the development model that we outlined in the second post in our updated Information Management Reference Architecture blog series, we typically split delivery of each project’s output into two distinct phases; a discovery phase, typically done using RStudio and Oracle R Advanced Analytics for Hadoop, where we explore and start understanding the dataset, presenting initial findings to the business and using their feedback and direction to inform the second phase; and a second, commercial exploitation phase where we use the discovery phases’ outputs and models to drive a more structured dimensional model with output begin in the form of OBIEE analyses and dashboards.

NewImage

We looked at several options for providing the datasets for OBIEE to query, with our initial idea being to connect OBIEE directly to Hive and Impala and let the users query the data in-place, directly on the Hadoop cluster, with an architecture like the one in the diagram below:

NewImage

In fact this turned out to not be possible, as whilst OBIEE 11.1.1.7 can access Apache Hive datasources, it currently only ships with HiveServer1 ODBC support, and no support for Cloudera Impala, which means we need to wait for a subsequent release of OBIEE11g to be able to report against the ODBC interfaces provided by CDH4 and CDH5 on the BDA (although ironically, you can get HiveServer2 and Impala working on OBIEE 11.1.1.7 on Windows, though this platform isn’t officially supported by Oracle for Hadoop access, only Linux). Whichever way though, it soon became apparent that even if we could get Hive and Impala access working, in reality it made more sense to use Hadoop as the data ingestion and processing platform – providing access to data analysts at this point if they wanted access to the raw datasets – but with the output of this then being loaded into an Oracle Exadata database, either via Sqoop or via Oracle Loader for Hadoop and ideally orchestrated by Oracle Data Integrator 12c, and users then querying these Oracle tables rather than the Hive and Impala ones on the BDA, as shown in the diagram below.

NewImage

In-practice, Oracle SQL is far more complete and expressive than HiveQL and Impala SQL and it makes more sense to use Oracle as the query platform for the vast majority of users, with data analysts and data scientists still able to access the raw data on Hadoop using tools like Hive, R and (when we move to CDH5) Spark.

The final thing that’s been interesting about working on Hadoop and Big Data Appliance projects is that 80% of it, in my opinion, is just the same as working on large enterprise data warehouse projects, with 20% being “the magic”. A large portion of your time is spent on analysing and setting up feeds into the system, just in this case you use tools like Flume instead of GoldenGate (though GoldenGate can also load into HDFS and Hive, something that’s useful for transactional database data sources vs. Flume’s focus on file and server log data sources). Another big part of the work is data processing, ingestion, reformatting and combining, again skills an ETL developer would have (though there’s much more reliance, at this point, on command-line tools and Unix utilities, albeit with a place for tools like ODI once you get to the set-based filtering, joining and aggregating phase). In most cases, the output of your analysis and processing will be Hive and Impala tables so that results can be analysed using tools such as OBIEE, and you therefore need skills in areas such as dimensional modelling, business analysis and dashboard prototyping as well as tool-specific skills such as OBIEE RPD development.

Where the “magic” happens, of course, is the data preparation and analysis that you do once the data is loaded, quite intensively and interactively in the discovery phase and then in the form of MapReduce and Spark jobs, Sqoop loads and Oozie workflows once you know what you’re after and need to process the data into something more tabular for tools like OBIEE to access. We’re building up a team competent in techniques such as large-scale data analysis, data visualisation, statistical analysis, text classification and sentiment analysis, and use of NoSQL and JSON-type data sources, which combined with our core BI, DW and ETL teams allows us to cover the project from end-to-end. It’s still relatively early days but we’re encouraged by the response from our project customers so far, and – to be honest – the quality of the Oracle big data products and the Cloudera platform they’re based around – and we’re looking forward to helping other Oracle customers get the most out of their adoption of these new technologies. 

If you’re an Oracle customer looking to make their first move into the worlds of Hadoop, big data and advanced analytics techniques, feel free to drop me an email at mark.rittman@rittmanmead.com  for some initial advice and guidance – the fact we come from an Oracle-centric background as well typically makes it easier for us to relate these new concepts to the ones you’re typically more familiar with. Similarly, if you’re about to bring on-board an Oracle Big Data Appliance system and want to know how best to integrate it in with your existing Oracle BI, DW, data integration and systems management estate, get in contact and I’d be happy to share experiences and our delivery approach.

Categories: BI & Warehousing

Vote for Rittman Mead at the UKOUG Partner of the Year Awards 2014!

Mon, 2014-08-11 03:00

Rittman Mead are proud to announce that we’ve been nominated by UKOUG members and Oracle customers for five categories in the upcoming UKOUG Parter of the Year Awards 2014;  Business Intelligence, Training, Managed Services, Operating Systems Storage and Hardware, and Emerging Partner, reflecting the range of products and services we now offer for customers in the UK and around the world.

NewImage

Although Rittman Mead are a worldwide organisation with offices in the US, India, Australia and now South Africa, our main operation is in the UK and for many years we’ve been a partner member of the UK Oracle User Group (UKOUG). Our consultants speak at UKOUG Special Interest Group events as well as the Tech and Apps conferences in December each year, we write articles for Oracle Scene, the UKOUG members’ magazine, and several of our team including Jon and myself have held various roles including SIG chair and deputy chair, board member and even editor of Oracle Scene.

Partners, along with Oracle customers and of course Oracle themselves, are a key part of the UK Oracle ecosystem and to recognise their contribution the UKOUG recently brought in their Partner of the Year Awards that are voted on by UKOUG members and Oracle customers in the region. As these awards are voted on by actual users and customers we’ve been especially pleased over the years to win several Oracle Business Intelligence Partner of the Year Gold awards, and last year we were honoured to receive awards in five categories, including Business Intelligence Partner of the Year, Training Partner of the Year and Engineered Systems Partner of the Year.

This year we’ve been nominated again in five categories, and if you like what we do we’d really appreciate your vote, which you can cast at any time up to the closing date, September 15th 2014. Voting is open to UKOUG members and Oracle customers and only takes a few minutes – the voting form is here and you don’t need to be a UKOUG member, only an Oracle end-user or customer – these awards are a great recognition for the hard work out team puts in, so thanks in advance for any votes you can put in for us!

Categories: BI & Warehousing

Why Oracle Big Data SQL Potentially Solves a Big Issue with Hadoop Security

Sat, 2014-07-26 07:21

Oracle announced their Big Data SQL product a couple of weeks ago, which effectively extends Exadata’s query-offloading to Hadoop data sources. I covered the launch a few days afterwards, focusing on how it implements Exadata’s SmartScan on Hive and NoSQL data sources and provides a single metadata catalog over both relational, and Hadoop, data sources. In a Twitter conversation later in the day though, I made the comment that in my opinion, the biggest benefit of Big Data SQL will be in its ability to extend Oracle’s security model to Hadoop data sources, because Hadoop security Hadoop security is still a bit of a mess:

lang=”en”>To me the greatest benefit of Big Data SQL is the single security model; even with Sentry, Hadoop security is fragmented and a mess (IMO)

— Mark Rittman (@markrittman) July 17, 2014

I’ve been working on an Oracle Big Data Appliance project over the past few weeks, as the technical architect and initial sysadmin for the cluster, and it’s given me a first-hand experience of what security’s like on a Hadoop cluster. Over the past few weeks I’ve had to come up with a security policy covering HDFS, Hive and the Cloudera management tools (Cloudera Manager, Hue etc), and try and implement an access and authorisation approach that ensures only designated people can log in, and when they’re in, they can only see the data they’re supposed to see. Hadoop at this point, to my mind, suffers from a couple of major issues when it comes to security:

  • It’s fragmented, in that each tool or Hadoop product tends to have its own security setup, and the documentation is all split up, rapidly goes out of date, and is more of a reference than a tutorial (Cloudera’s Security documentation is one of the better examples, but it still splits the key information you need over several sections and several other docs)
  • It’s full of holes, such that the default security setup is considered insecure in terms of users being able to spoof their details, and making it more secure is again an exercise in hunting through docs, with some pretty complex configuration steps you need to perform (for example, configuring Kerebos authentication, a prerequisite for things like Apache Sentry)

If we take a typical security policy that a large enterprise customer’s going to want to put in place, it’ll look something like this:

  • Users should only be able to log in via their corporate LDAP account, and we’ll want that login process to be secure so it can’t easily be bypassed
  • We want to be able to secure our datasets, so that only authorised users can view particular datasets, and there’s likely to be some groups we grant read-only access to, and others we grant read-write
  • The data loading processes for the Hadoop cluster need to be locked-down so they can’t overwrite the datasets of other applications
  • Our security policy ideally needs to sync-up, or be an extension of, our existing enterprise security policy, not something we maintain separately
  • We need to be able to audit and review who’s actually accessing what dataset, to ensure that these policies are being followed and enforced
  • We also need the ability to obfuscate or depersonalise data before it gets into the cluster, and also have the option of encrypting the data at-rest as well as on-the-wire

Back in the early days of Hadoop these types of security policy weren’t often needed, as the users of the Hadoop cluster were typically a small set of data scientists or analysts who’d been cleared already to view and work with the data in the cluster (or more likely, they did it and just didn’t tell anyone). But as we move to enterprise information management architectures such as the one outlined in my two-part blog post series a few weeks ago (pt.1, pt.2), the users of Hadoop and other “data reservoir” data sources are likely to increase significantly in number as data from these systems becomes just another part of the general enterprise data set.

NewImage

But in practice, this is hard to do. Let’s start with HDFS first, the Hadoop Distributed File System on which most Hadoop data is stored. HDFS aims to look as similar to a Linux or Unix-type filesystem as possible, with similar commands (mkdir, ls, chmod etc) and the same POSIX permissions model, where files and directories are associated with an owner and a group and where permissions are set for that owner, the group and all others. For example, in the HDFS file listing below, the “/user/cust_segment_analysis” directory is owned by the user “mrittman” and the group “marketing”, with the directory owner having full read, write and subdirectory traversal access to the directory, the group having read-only and subdirectory traversal access, and all others having no access at all.

[root@bdanode1 ~]# hadoop fs -ls /user
Found 13 items
drwxrwxrwx   - admin    admin               0 2014-06-02 16:06 /user/admin
drwxr-x---   - mrittman marketing           0 2014-07-26 21:31 /user/cust_segment_analysis
drwxr-xr-x   - hdfs     supergroup          0 2014-05-27 13:19 /user/hdfs
drwxrwxrwx   - mapred   hadoop              0 2014-05-25 20:47 /user/history
drwxrwxr-t   - hive     hive                0 2014-06-04 16:31 /user/hive
drwxr-xr-x   - hue      hue                 0 2014-05-31 18:51 /user/hue
drwxrwxr-x   - impala   impala              0 2014-05-25 20:54 /user/impala
drwxrwxr-x   - oozie    oozie               0 2014-05-25 20:52 /user/oozie
drwxrwxrwx   - oracle   oracle              0 2014-06-09 21:38 /user/oracle
drwxr-xr-x   - root     root                0 2014-06-06 16:25 /user/root
drwxr-xr-x   - sample   sample              0 2014-05-31 18:51 /user/sample
drwxr-x--x   - spark    spark               0 2014-05-25 20:45 /user/spark
drwxrwxr-x   - sqoop2   sqoop               0 2014-05-25 20:53 /user/sqoop2

Which all sounds great until you then have another group that needs read-write access to the directory, but you’re limited to just one group permissions setting for the directory which you’ve already used to set up read-only access for that particular group. If you therefore need to set up different sets of security access for different groups, you typically then end-up creating multiple HDFS directories and multiple copies of the dataset in question, assigning each copy to a different group, which isn’t all that convenient and gives you other problems in terms of maintenance and keeping it all in-sync.

What you of course need is something like the “access control lists” (ACLs) you get with operating systems like Windows NT and MacOS, where you can define an arbitrary number of user groups and then assign each of them their own permission set on the directory and the files it contains. The most recent versions of Hadoop actually implement a form of ACL for HDFS, with this feature making its way into the recently-released Cloudera CDH5.1, but these ACLs are an addition to the standard POSIX user, group, others model and aren’t recommended for all files in your HDFS filesystem as according to the Hadoop docs “Best practice is to rely on traditional permission bits to implement most permission requirements, and define a smaller number of ACLs to augment the permission bits with a few exceptional rules. A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only permission bits.” Still, it’s better than not having them at all, and I’d imagine using this feature for particular directories and sets of files that need more than one set of group permissions configured for them.

In most cases though, the way you’ll present data out to non-technical end-users and applications is through Hive and Impala tables, or through tools like Pig and Spark. Under the covers, these tools still use HDFS permissions to control access to the data within Hive and Impala tables, but again by default you’re limited to granting access to whole HDFS directories, or the files contained within those directories. Something that addresses this issue is a product called Apache Sentry, an open-source project within the Hadoop family that enables role-based access control for Hive and Impala tables. Oracle are one of the co-founders of the Sentry project and include it in the base software on the Big Data Appliance, and using Sentry you can grant SELECT, INSERT or ALL privileges to a group on a particular Hive or Impala table, rather than on the underlying HDFS directories and files. A form of fine-grained access control can be set up using Sentry by creating views with particular row-level security settings, giving you the basics of a database-like security policy that you can apply over the main way that users access data in the cluster.

But Sentry itself has a few significant prerequisites – you have to enable Kerebos authentication on your cluster, which you should do anyway because of the risk of account spoofing, but is still a significant thing to set up – and of course you need to link Hive and Impala to your corporate LDAP server and configure them to work in the way that Sentry requires. Most importantly though, you’re still left with the situation where you’ve got two separate security setups – the one for your corporate data warehouse and relational data sources, and another for data accessed on Hadoop, and it’s still hard to be sure, what with all the disparate products and partially-complete open-source products, whether data in your Hadoop cluster is still really secure (though products like Cloudera Navigator aim to provide some form of data governance and auditing over these datasets); and, there’s still no straightforward way to remove individual customers’ data out of the Hadoop dataset (“data redaction”), no easy way to obfuscate or mask data, and no easy way (apart from the Hive views mentioned before) to restrict users to accessing only certain columns in a Hive or Impala table.

And so this is where Oracle’s Big Data SQL product could be very interesting. Big Data SQL takes the Exadata model of moving as much filtering and column-projection as it can to the storage server, adding Oracle SmartScan functionality to the Hadoop node and allowing it to understand the full Oracle SQL dialect (and PL/SQL security functions), rather than just the subset of SQL provided by HiveQL and Impala SQL.

NewImage

More importantly, it’ll enable a single unified data dictionary over both Oracle and Hadoop data sources, presenting Hive tables and NoSQL data as regular Oracle tables and allowing the DBA to create data security, redaction and row-level filtering policies over both relational and Hadoop data – giving you potentially the ability to define a single security policy across all data in your overall information management architecture.

NewImage

So I think this is actually a “big deal”, and potentially even more game-changing that the SmartScan functionality that got most of the attention with the Big Data SQL product launch. How well it’ll work in-practice, and how much will be enabled on day one it’s hard to say, but this feature meets a real need that our customers are finding now, so I’ll be very interested to try it out when the product becomes available (presumably) later in the year.

Categories: BI & Warehousing

Taking a Look at the New Oracle Big Data SQL

Thu, 2014-07-17 13:15

Oracle launched their Oracle Big Data SQL product earlier this week, and it’ll be of interest to anyone who saw our series of posts a few weeks ago about the updated Oracle Information Management Reference Architecture, where Hadoop now sits alongside traditional Oracle data warehouses to provide what’s termed a “data reservoir”. In this type of architecture, Hadoop and its underlying technologies HDFS, Hive and schema-on-read databases provide an extension to the more structured relational Oracle data warehouses, making it possible to store and analyse much larger sets of data with much more diverse data types and structures; the issue that customers face when trying to implement this architecture is that Hadoop is a bit of a “wild west” in terms of data access methods, security and metadata, making it difficult for enterprises to come up with a consistent, over-arching data strategy that works for both types of data store.

Oracle Big Data SQL attempts to address this issue by providing a SQL access layer over Hadoop, managed by the Oracle database and integrated in with the regular SQL engine within the database. Where it differs from SQL on Hadoop technologies such as Apache Hive and Cloudera Impala is that there’s a single unified data dictionary, single Oracle SQL dialect and the full management capabilities of the Oracle database over both sources, giving you the ability to define access controls over both sources, use full Oracle SQL (including analytic functions, complex joins and the like) without having to drop down into HiveQL or other Hadoop SQL dialects. Those of you who follow the blog or work with Oracle’s big data connector products probably know of a couple of current technologies that sound like this; Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that copies Hive or HDFS data into an Oracle database typically faster than a tool like Sqoop, whilst Oracle Direct Connector for HDFS (ODCH) gives the database the ability to define external tables over Hive or HDFS data, and then query that data using regular Oracle SQL.

Where ODCH falls short is that it treats the HDFS and Hive data as a single stream, making it easy to read once but, like regular external tables, slow to access frequently as there’s no ability to define indexes over the Hadoop data; OLH is also good but you can only use it to bulk-load data into Oracle, you can’t use it to query data in-place. Oracle Big Data SQL uses an approach similar to ODCH but crucially, it uses some Exadata concepts to move processing down to the Hadoop cluster, just as Exadata moves processing down to the Exadata storage cells (so much so that the project was called “Project Exadoop” internally within Oracle up to the launch) – but also meaning that it’s Exadata only, and not available for Oracle Databases running on non-Exadata hardware.

As explained by the launch blog post by Oracle’s Dan McClary, Oracle Big Data SQL includes components that install on the Hadoop cluster nodes that provide the same “SmartScan” functionality that Exadata uses to reduce network traffic between storage servers and compute servers. In the case of Big Data SQL, this SmartScan functionality retrieves just the columns of data requested in the query (a process referred to as “column projection”), and also only sends back those rows that are requested by the query predicate.

NewImage

Combined with Hive’s ability to map unstructured data sources into regular columns and tables, and Big Data SQL’s support for Oracle NoSQL database, the promise of this new technology is the ability to run queries against both relational, Hadoop and NoSQL data sources using a common data dictionary and common set of identity and data access controls.

There’s a couple of potential downsides, though. First-off, Big Data SQL will only be available as part of Oracle Big Data Appliance, which though an impressive bit of hardware and software is a much smaller market than the total set of Oracle customers looking to combine relational and Hadoop-based data; it’s also restricted to Oracle 12c on Exadata meaning you’ll most probably need to do a database upgrade even if you’ve already got the required Exadata servers in-place. Finally, it’s also restricted to the Oracle-specific distribution of Cloudera Hadoop, though if you’re using the BDA you’ll be using this anyway.

My other concern though is that Oracle now focus on SQL as their only access mechanism into Hadoop and big data, in a similar way to how they focused on SQL as their access route into OLAP when they incorporated Oracle Express into the Oracle Database, back in the mid-2000’s. Focusing on SQL over multidimensional languages such as Express 4GL and MDX meant you missed the real point of using a multidimensional, OLAP database – which of course was being able to use a multidimensional query language, and my concern with Big Data SQL is that we’ll end up focusing on that rather than languages such as Spark, Pig and NoSQL query languages which, combined with schema-on-read, is the real differentiator for Hadoop-based systems. As long as Big Data SQL is positioned as a “bonus” – a convenient way of getting data out of Hadoop once it’s been processed and analysed using more Hadoop-native technologies – then Big Data SQL will be a great enabling and acceptance technology for enterprises, rather than one that ends up restricting them.

We’re not aware of any beta program and I don’t think the launch webcast mentioned a specific date or BDA version when Big Data SQL will be out, but with Openworld coming up soon I’d expect to hear more about this over the next few months. We’re involved in a couple of significant Oracle Big Data Appliance implementations at the moment and this product would address a real, pressing need at the moment with our customers, so I’m looking forward to getting more involved in it over the next few months.

This article was updated on 18th July to add the fact that Big Data SQL is only available on Exadata, and is not a generic Oracle Database 12c technology.

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 4: Start Journalizing!

Thu, 2014-07-10 15:26

In this post, the finale of the four-part series “GoldenGate and Oracle Data Integrator – A Perfect Match in 12c”, I’ll walk through the setup of the ODI Models and start journalizing in “online” mode. This will utilize our customized JKM to build the GoldenGate parameter files based on the ODI Metadata and deploy them to both the source and target GoldenGate installation locations. Before I get into all of the details, let’s recap the first 3 posts and see how we arrived at this point.

Part one of the series, Getting Started, led us through a quick review of the Oracle Reference Architecture for Information Management and the tasks we’re trying to accomplish; loading both the Raw Data Reservoir (RDR) and Foundation schemas simultaneously, using Oracle GoldenGate 12c replication, and set it all up via Oracle Data Integrator 12c. We also reviewed the setup of the GoldenGate JAgent process, necessary for communication between ODI and GoldenGate when using the “online” version of the Journalizing Knowledge Module.

In part two, we reviewed the Journalizing Knowledge Module, “JKM Oracle to Oracle Consistent (OGG Online)”, its new features, and how much it has been improved in ODI 12c. Full integration with Oracle GoldenGate, through the use of the new ODI Tool “OdiOggCommand”, allows for the setup and configuration of GoldenGate process groups, trail file directories, and table-level supplemental logging, all from the ODI JKM.

Most recently, part 3, titled Setup Journalizing, walked us through the customizations of the “JKM Oracle to Oracle Consistent (OGG Online)” that will allow us to create the source-to-foundation replication alongside the standard source-to-RDR setup. We added a set of options, the first to control whether or not we replicat to foundation and the second to capture the ODI Logical Schema corresponding to the foundation schema. Then we added the task that will create the source-to-foundation table mapping inside the GoldenGate replicat parameter file and set the options appropriately. I’ve been using the ODI 12c Getting Started VM, with the 12.1.2 version of ODI, for my demo setup. If you haven’t done so already, you can download the latest version of the VM, with ODI 12.1.3, from the Oracle Technical Network. I’d say that’s enough recap, now on to the final steps for GoldenGate and ODI 12c integration and let’s start journalizing!

Setup ODI Models Create Models

We first need to create the ODI Models and Datastores for the Source, Staging (Raw Data Reservoir) and Foundation tables. I will typically reverse engineer the source tables into a Model first, then copy them to the Staging and Foundation Models. This approach will ensure the column names and data types remain consistent with the source. I then execute a Groovy script to create the additional data warehouse audit columns in each of the Foundation Datastores.

models

Configure JKM

Unlike the 11g version of ODI, in 12c the “JKM Oracle to Oracle Consistent (OGG Online)” Knowledge Module will be set on the source Model. Open up the Model, in this example, PM_SRC, and switch to the Journalizing tab.

model-journalizing

We’ll set the Journalizing Mode to “Consistent Set” and then choose the customized JKM that we have been working with in this example, “JKM Oracle to Oracle Consistent (OGG Online) RM”, from the dropdown list. Now we are presented with the GoldenGate Process Selection parameters and a list of KM Options to configure.

Set the Process Selection parameters for the Capture Process and Delivery Process by selecting the Logical Schemas created in Part 3 - Setup GoldenGate Topology – Schemas. This setting drives the naming of the Extract, Pump, and Replicat parameter files and process groups in GoldenGate. If you plan to use GoldenGate for the initial load, select the processes here as well. I’m not setting mine, as I typically use a batch load tool, such as Oracle Datapump or insert across DBLink to perform the initial load of the target. As you can see in the image below, you can also create the Oracle GoldenGate Logical Schemas from the Model.

model-jkm-process

Next are the set of Options, including the 2 new Options added in the previous post. We can leave several of the values as the default, as they are specific to particular character sets or implementation on a multi-node Oracle RAC setup.

model-jkm-options

The Options we do want to set:

ONLINE - Set to “true” to enable the automatic GoldenGate configuration when Start Journal is run.
LOCAL_TEMP_DIR – Enter a directory local to the machine on which the Start Journal process will be executed. Be sure the user executing the Start Journal process has privileges to create/modify/remove directories and files.
APPLY_FOUNDATION – Custom Option, set to “true” to enable the addition of the source-to-foundation mapping to the GoldenGate Replicat parameter file.
FND_LSCHEMA – The Logical Schema for the Foundation layer, necessary when APPLY_FOUNDATION is true.

After the Options are set, the Model can be saved and closed. Back in the Designer Navigator, add the Datastores to CDC by either selecting each individual Datastore and adding it or by right-clicking the Model and choosing to add the entire set all at once.

model-add-to-cdc

Before we get on with using the JKM, there is one thing I forgot to mention in the previous post. When setting up the Logical Schema for the GoldenGate “Delivery” process, you must also set the target Logical Schema. If you fail to do so, you’ll get an error stating “SnpLSchema does not exist” when attempting any Change Data Capture commands on the source Model.

logical-schema-set-target

With the JKM set and the tables added to Change Data Capture, we can now add a Subscriber. Subscribers allow multiple mappings to consume the change data from the J$ tables at different intervals. For example, a table may be consumed by one mapping every hour, and then by an additional mapping each night. Two different Subscribers would be used in this case. In our example, I’ll create a single Subscriber named “PERFECT_MATCH”. Make sure the process runs successfully, then it’s time to start Journalizing.

Start Capturing Changes

With the setup and configuration out of the way, the rest is up to the JKM. We’re now able to right-click the source Model and select Change Data Capture–>Start Journal. This executes all of the Start Journal related steps in the JKM, which will create the CDC Framework (J$ tables, JV$ views, etc.), generate and deploy the GoldenGate parameter files (Extract, Pump, and Replicat), and configure and start the GoldenGate process groups. Be sure that the source and target GoldenGate Manager and JAgent processes are running prior to executing the Start Journal process. Also, make sure that the database is in ArchiveLog mode and ready for GoldenGate to capture transactions.

After the Start Journal process is successfully completed, you can browse to the source GoldenGate home directory, run GGSCI, and view the status of the OGG process groups.

ogg-src-info-all

It looks like the Extract and Pump are in place and running. Now let’s check out the Replicat on the target GoldenGate installation.

ogg-trg-info-all

Here we see that the Replicat process is in place, but not actually running. Remember from our JKM editing that we commented out the step that will start the Replicat process. This is to ensure we perform an initial load prior to applying any captured change data to the target.

The last bit of work that must be completed is the initial load. I won’t go into details here, but the way I like to do it is to load the source to the target data based on a captured SCN using Oracle Datapump or DBLink rather than using GoldenGate process groups to perform the load. Note: The ODI 12c Getting Started Guide doesn’t even use the OGG initial load! Then, we can start the replicat in GoldenGate after the captured SCN using the same approach I wrote about in a previous blog on ODI and GoldenGate 11g.

JKM “Online” Mode

Beyond adding the parameter files and process groups, what exactly did the “online” version of the JKM do for us? If you browse to the directory that we set in the JKM Options under LOCAL_TEMP_DIR, you’ll find all of the GoldenGate files generated by the JKM. These files were generated locally, then uploaded to their proper GoldenGate home directory. Without “online” mode, they would had to have been manually copied to GoldenGate.

jkm-steps

Once uploaded, the obey files (batch files for GoldenGate commands) were executed.

OdiOggCommand "-LSCHEMA=EXTPMSRC" "-OPERATION=EXECUTEOBEY" "-OBEY_FILE=/home/oracle/Oracle/Middleware/oggsrc/EXTPMSRC.oby"

And finally, JKM steps were generated to perform the creation of the process groups in GoldenGate.

OdiOggCommand "-LSCHEMA=EXTPMSRC" "-OPERATION=EXECUTECMD"
add extract EXTPMSRC, tranlog, begin now 
add exttrail /home/oracle/Oracle/Middleware/oggsrc/dirdat/oc, extract EXTPMSRC, megabytes 100
stop extract EXTPMSRC
start extract

If you ever need to stop the change data capture process, either to add additional tables or make modifications to the metadata, you can run the Drop Journal process. Not only will the CDC Framework of tables and views be removed, but the “online” mode also reaches into GoldenGate and drops the process groups that were generated by the JKM.

OdiOggCommand "-LSCHEMA=EXTPMSRC" "-OPERATION=EXECUTECMD"
stop extract RPMEDWP
delete extract RPMEDWP

In conclusion, the integration between GoldenGate and Oracle Data Integrator  in 12c has been vastly improved over the 11g version. The ability to manage the entire setup process from within ODI is a big step forward, and I can only see these two products being further integrated in future releases. If you have any questions or comments about ODI or GoldenGate, or would like some help with your own implementation, feel free to add a comment below or reach out to me at michael.rainey@rittmanmead.com.

 

Categories: BI & Warehousing

Simple Ways to Simplify: Quick Fixes to Enhance OBIEE Visuals

Tue, 2014-07-08 13:18

Over the past couple of months I worked with a few clients in order to conduct an assessment on dashboard and analysis design. In most cases I was noticing that the dashboards were overflowing with content to the point that there was really no discernible “story” or “flow” to the information. Most of the dashboards were an overwhelming hub of large tables, excess dashboard prompts, and complicated charts and graphs. Many times the client complaints were that the dashboards were not being fully adopted by the user base. The users knew that something was in fact wrong with what they were looking at, but just couldn’t put their finger on why the analysis process was so frustrating. While this may seem unavoidable during a time when you are trying to provide your users with what they want and “need,” there are a few simple ways to aid this issue. During the development process, simplicity is often overlooked and undervalued, and that is why a few key design principles can drastically improve the user experience.

In many cases that I’ve seen, all dashboards start with the good intention of quickly relaying information to the user. However, as the requirements process grows in length, more and more compromise is made in keeping clean, consistent, functional design. Many of the dashboards in my assessments often hand the user everything they ask for without ever questioning the value. Developers often settle on poor design just to have something accepted by their users and released into production.

While each client is unique and has their own set of issues to resolve, there are a few consistent principles of dashboard design that can be applied for everyone.

Top-Down Analysis—The practice of allowing users to drill from summary to detail, gives your user community the ability to make their dashboards as dynamic as they need to be. This method will never give the user too much, or too little information. With detail being a choice, not the default, the user that can log in and be prompted to have to drill into further detail, rather than being presented everything all at once. While this is a widely accepted best practice when it comes to dashboard design, this principle is ignored more than you would think. Odds are that your VP does not need to see 50 rows of detail level information in a table every time they view their dashboard. The primary purpose of presenting this detail level is to answer a very important question that should be posed by analyzing your data at the summary level. The user can examine something simple like a trend analysis, and then decide whether further examination is needed. The benefit of this common principle is that your user is never overwhelmed and irritated with an overload of information. The dashboard should be treated with the same care and consideration you would . Your organization’s dashboard (product) should be a joy to use, not an experience coupled with frustration.

top down image

Simplicity and Trimming the Fat-This is another very simple, yet often ignored design principle. From my observation, many developers  will create a chart or graph and will leave all the default settings, no modifications needed right? While there is nothing inherently wrong with this, the default will leave a lot of extra pixels on the graph such as unneeded axis titles, shadowing, canvas size, etc. With just a little effort here, you can remove all of these unnecessary data pixels (pictured below) and provide a more professional, clean design. The point that I’m trying to make here is, let’s not get lazy with our visualizations. Instead, we should try to give a lot of thought what is useful in the visualization, and what can be discarded without hindering the message. The less cluttered we make our visualizations, the more pleasant the user experience will be. And as we all know, the happier our user community is, the easier your life as a developer’s will be.

default to flat

 Options for Option’s Sake-The types of visualizations used for a dashboard are one of the most important criteria for user adoption. By choosing visualizations that do not adequately display the type of analysis needed, or tell the correct story about the data, can be a frustrating waste of time for the user. Just because there are a lot of available graphing options in your analytical tool, does not mean they need to be used. This mistake is often made in an attempt to visually enhance the dashboard, or add “variety” . Try to consider things like what type of scale is in my graph (nominal, or interval pictured below), or do I want to provide summary or detail? Be sure to choose your graph based on these factors, rather than picking a graph that you think will add to variety and then figuring out how you can make it useful.

interval graphs

Visually appealing dashboards are important, however this is only relevant when the graphs are enhancing the users analytical experience. These mistakes are very costly because the overall goal of a dashboard is not to provide variety for varieties sake, but to quickly and accurately relay a message. By focusing only on visualization variety, we run a terrible risk of rendering a dashboard useless.

There are a lot of great resources out there that can provide more detail that can surely take your dashboards to the next level so I certainly suggest reading up information design methodology. I think the principles I’ve listed above are a great way to get started and provide some quick fixes on the road to enhancing the user experience within your organization.

 

Categories: BI & Warehousing