Skip navigation.

BI & Warehousing

Why Oracle Big Data SQL Potentially Solves a Big Issue with Hadoop Security

Rittman Mead Consulting - Sat, 2014-07-26 07:21

Oracle announced their Big Data SQL product a couple of weeks ago, which effectively extends Exadata’s query-offloading to Hadoop data sources. I covered the launch a few days afterwards, focusing on how it implements Exadata’s SmartScan on Hive and NoSQL data sources and provides a single metadata catalog over both relational, and Hadoop, data sources. In a Twitter conversation later in the day though, I made the comment that in my opinion, the biggest benefit of Big Data SQL will be in its ability to extend Oracle’s security model to Hadoop data sources, because Hadoop security Hadoop security is still a bit of a mess:

lang=”en”>To me the greatest benefit of Big Data SQL is the single security model; even with Sentry, Hadoop security is fragmented and a mess (IMO)

— Mark Rittman (@markrittman) July 17, 2014

I’ve been working on an Oracle Big Data Appliance project over the past few weeks, as the technical architect and initial sysadmin for the cluster, and it’s given me a first-hand experience of what security’s like on a Hadoop cluster. Over the past few weeks I’ve had to come up with a security policy covering HDFS, Hive and the Cloudera management tools (Cloudera Manager, Hue etc), and try and implement an access and authorisation approach that ensures only designated people can log in, and when they’re in, they can only see the data they’re supposed to see. Hadoop at this point, to my mind, suffers from a couple of major issues when it comes to security:

  • It’s fragmented, in that each tool or Hadoop product tends to have its own security setup, and the documentation is all split up, rapidly goes out of date, and is more of a reference than a tutorial (Cloudera’s Security documentation is one of the better examples, but it still splits the key information you need over several sections and several other docs)
  • It’s full of holes, such that the default security setup is considered insecure in terms of users being able to spoof their details, and making it more secure is again an exercise in hunting through docs, with some pretty complex configuration steps you need to perform (for example, configuring Kerebos authentication, a prerequisite for things like Apache Sentry)

If we take a typical security policy that a large enterprise customer’s going to want to put in place, it’ll look something like this:

  • Users should only be able to log in via their corporate LDAP account, and we’ll want that login process to be secure so it can’t easily be bypassed
  • We want to be able to secure our datasets, so that only authorised users can view particular datasets, and there’s likely to be some groups we grant read-only access to, and others we grant read-write
  • The data loading processes for the Hadoop cluster need to be locked-down so they can’t overwrite the datasets of other applications
  • Our security policy ideally needs to sync-up, or be an extension of, our existing enterprise security policy, not something we maintain separately
  • We need to be able to audit and review who’s actually accessing what dataset, to ensure that these policies are being followed and enforced
  • We also need the ability to obfuscate or depersonalise data before it gets into the cluster, and also have the option of encrypting the data at-rest as well as on-the-wire

Back in the early days of Hadoop these types of security policy weren’t often needed, as the users of the Hadoop cluster were typically a small set of data scientists or analysts who’d been cleared already to view and work with the data in the cluster (or more likely, they did it and just didn’t tell anyone). But as we move to enterprise information management architectures such as the one outlined in my two-part blog post series a few weeks ago (pt.1, pt.2), the users of Hadoop and other “data reservoir” data sources are likely to increase significantly in number as data from these systems becomes just another part of the general enterprise data set.

NewImage

But in practice, this is hard to do. Let’s start with HDFS first, the Hadoop Distributed File System on which most Hadoop data is stored. HDFS aims to look as similar to a Linux or Unix-type filesystem as possible, with similar commands (mkdir, ls, chmod etc) and the same POSIX permissions model, where files and directories are associated with an owner and a group and where permissions are set for that owner, the group and all others. For example, in the HDFS file listing below, the “/user/cust_segment_analysis” directory is owned by the user “mrittman” and the group “marketing”, with the directory owner having full read, write and subdirectory traversal access to the directory, the group having read-only and subdirectory traversal access, and all others having no access at all.

[root@bdanode1 ~]# hadoop fs -ls /user
Found 13 items
drwxrwxrwx   - admin    admin               0 2014-06-02 16:06 /user/admin
drwxr-x---   - mrittman marketing           0 2014-07-26 21:31 /user/cust_segment_analysis
drwxr-xr-x   - hdfs     supergroup          0 2014-05-27 13:19 /user/hdfs
drwxrwxrwx   - mapred   hadoop              0 2014-05-25 20:47 /user/history
drwxrwxr-t   - hive     hive                0 2014-06-04 16:31 /user/hive
drwxr-xr-x   - hue      hue                 0 2014-05-31 18:51 /user/hue
drwxrwxr-x   - impala   impala              0 2014-05-25 20:54 /user/impala
drwxrwxr-x   - oozie    oozie               0 2014-05-25 20:52 /user/oozie
drwxrwxrwx   - oracle   oracle              0 2014-06-09 21:38 /user/oracle
drwxr-xr-x   - root     root                0 2014-06-06 16:25 /user/root
drwxr-xr-x   - sample   sample              0 2014-05-31 18:51 /user/sample
drwxr-x--x   - spark    spark               0 2014-05-25 20:45 /user/spark
drwxrwxr-x   - sqoop2   sqoop               0 2014-05-25 20:53 /user/sqoop2

Which all sounds great until you then have another group that needs read-write access to the directory, but you’re limited to just one group permissions setting for the directory which you’ve already used to set up read-only access for that particular group. If you therefore need to set up different sets of security access for different groups, you typically then end-up creating multiple HDFS directories and multiple copies of the dataset in question, assigning each copy to a different group, which isn’t all that convenient and gives you other problems in terms of maintenance and keeping it all in-sync.

What you of course need is something like the “access control lists” (ACLs) you get with operating systems like Windows NT and MacOS, where you can define an arbitrary number of user groups and then assign each of them their own permission set on the directory and the files it contains. The most recent versions of Hadoop actually implement a form of ACL for HDFS, with this feature making its way into the recently-released Cloudera CDH5.1, but these ACLs are an addition to the standard POSIX user, group, others model and aren’t recommended for all files in your HDFS filesystem as according to the Hadoop docs “Best practice is to rely on traditional permission bits to implement most permission requirements, and define a smaller number of ACLs to augment the permission bits with a few exceptional rules. A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only permission bits.” Still, it’s better than not having them at all, and I’d imagine using this feature for particular directories and sets of files that need more than one set of group permissions configured for them.

In most cases though, the way you’ll present data out to non-technical end-users and applications is through Hive and Impala tables, or through tools like Pig and Spark. Under the covers, these tools still use HDFS permissions to control access to the data within Hive and Impala tables, but again by default you’re limited to granting access to whole HDFS directories, or the files contained within those directories. Something that addresses this issue is a product called Apache Sentry, an open-source project within the Hadoop family that enables role-based access control for Hive and Impala tables. Oracle are one of the co-founders of the Sentry project and include it in the base software on the Big Data Appliance, and using Sentry you can grant SELECT, INSERT or ALL privileges to a group on a particular Hive or Impala table, rather than on the underlying HDFS directories and files. A form of fine-grained access control can be set up using Sentry by creating views with particular row-level security settings, giving you the basics of a database-like security policy that you can apply over the main way that users access data in the cluster.

But Sentry itself has a few significant prerequisites – you have to enable Kerebos authentication on your cluster, which you should do anyway because of the risk of account spoofing, but is still a significant thing to set up – and of course you need to link Hive and Impala to your corporate LDAP server and configure them to work in the way that Sentry requires. Most importantly though, you’re still left with the situation where you’ve got two separate security setups – the one for your corporate data warehouse and relational data sources, and another for data accessed on Hadoop, and it’s still hard to be sure, what with all the disparate products and partially-complete open-source products, whether data in your Hadoop cluster is still really secure (though products like Cloudera Navigator aim to provide some form of data governance and auditing over these datasets); and, there’s still no straightforward way to remove individual customers’ data out of the Hadoop dataset (“data redaction”), no easy way to obfuscate or mask data, and no easy way (apart from the Hive views mentioned before) to restrict users to accessing only certain columns in a Hive or Impala table.

And so this is where Oracle’s Big Data SQL product could be very interesting. Big Data SQL takes the Exadata model of moving as much filtering and column-projection as it can to the storage server, adding Oracle SmartScan functionality to the Hadoop node and allowing it to understand the full Oracle SQL dialect (and PL/SQL security functions), rather than just the subset of SQL provided by HiveQL and Impala SQL.

NewImage

More importantly, it’ll enable a single unified data dictionary over both Oracle and Hadoop data sources, presenting Hive tables and NoSQL data as regular Oracle tables and allowing the DBA to create data security, redaction and row-level filtering policies over both relational and Hadoop data – giving you potentially the ability to define a single security policy across all data in your overall information management architecture.

NewImage

So I think this is actually a “big deal”, and potentially even more game-changing that the SmartScan functionality that got most of the attention with the Big Data SQL product launch. How well it’ll work in-practice, and how much will be enabled on day one it’s hard to say, but this feature meets a real need that our customers are finding now, so I’ll be very interested to try it out when the product becomes available (presumably) later in the year.

Categories: BI & Warehousing

Taking a Look at the New Oracle Big Data SQL

Rittman Mead Consulting - Thu, 2014-07-17 13:15

Oracle launched their Oracle Big Data SQL product earlier this week, and it’ll be of interest to anyone who saw our series of posts a few weeks ago about the updated Oracle Information Management Reference Architecture, where Hadoop now sits alongside traditional Oracle data warehouses to provide what’s termed a “data reservoir”. In this type of architecture, Hadoop and its underlying technologies HDFS, Hive and schema-on-read databases provide an extension to the more structured relational Oracle data warehouses, making it possible to store and analyse much larger sets of data with much more diverse data types and structures; the issue that customers face when trying to implement this architecture is that Hadoop is a bit of a “wild west” in terms of data access methods, security and metadata, making it difficult for enterprises to come up with a consistent, over-arching data strategy that works for both types of data store.

Oracle Big Data SQL attempts to address this issue by providing a SQL access layer over Hadoop, managed by the Oracle database and integrated in with the regular SQL engine within the database. Where it differs from SQL on Hadoop technologies such as Apache Hive and Cloudera Impala is that there’s a single unified data dictionary, single Oracle SQL dialect and the full management capabilities of the Oracle database over both sources, giving you the ability to define access controls over both sources, use full Oracle SQL (including analytic functions, complex joins and the like) without having to drop down into HiveQL or other Hadoop SQL dialects. Those of you who follow the blog or work with Oracle’s big data connector products probably know of a couple of current technologies that sound like this; Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that copies Hive or HDFS data into an Oracle database typically faster than a tool like Sqoop, whilst Oracle Direct Connector for HDFS (ODCH) gives the database the ability to define external tables over Hive or HDFS data, and then query that data using regular Oracle SQL.

Where ODCH falls short is that it treats the HDFS and Hive data as a single stream, making it easy to read once but, like regular external tables, slow to access frequently as there’s no ability to define indexes over the Hadoop data; OLH is also good but you can only use it to bulk-load data into Oracle, you can’t use it to query data in-place. Oracle Big Data SQL uses an approach similar to ODCH but crucially, it uses some Exadata concepts to move processing down to the Hadoop cluster, just as Exadata moves processing down to the Exadata storage cells (so much so that the project was called “Project Exadoop” internally within Oracle up to the launch) – but also meaning that it’s Exadata only, and not available for Oracle Databases running on non-Exadata hardware.

As explained by the launch blog post by Oracle’s Dan McClary, Oracle Big Data SQL includes components that install on the Hadoop cluster nodes that provide the same “SmartScan” functionality that Exadata uses to reduce network traffic between storage servers and compute servers. In the case of Big Data SQL, this SmartScan functionality retrieves just the columns of data requested in the query (a process referred to as “column projection”), and also only sends back those rows that are requested by the query predicate.

NewImage

Combined with Hive’s ability to map unstructured data sources into regular columns and tables, and Big Data SQL’s support for Oracle NoSQL database, the promise of this new technology is the ability to run queries against both relational, Hadoop and NoSQL data sources using a common data dictionary and common set of identity and data access controls.

There’s a couple of potential downsides, though. First-off, Big Data SQL will only be available as part of Oracle Big Data Appliance, which though an impressive bit of hardware and software is a much smaller market than the total set of Oracle customers looking to combine relational and Hadoop-based data; it’s also restricted to Oracle 12c on Exadata meaning you’ll most probably need to do a database upgrade even if you’ve already got the required Exadata servers in-place. Finally, it’s also restricted to the Oracle-specific distribution of Cloudera Hadoop, though if you’re using the BDA you’ll be using this anyway.

My other concern though is that Oracle now focus on SQL as their only access mechanism into Hadoop and big data, in a similar way to how they focused on SQL as their access route into OLAP when they incorporated Oracle Express into the Oracle Database, back in the mid-2000’s. Focusing on SQL over multidimensional languages such as Express 4GL and MDX meant you missed the real point of using a multidimensional, OLAP database – which of course was being able to use a multidimensional query language, and my concern with Big Data SQL is that we’ll end up focusing on that rather than languages such as Spark, Pig and NoSQL query languages which, combined with schema-on-read, is the real differentiator for Hadoop-based systems. As long as Big Data SQL is positioned as a “bonus” – a convenient way of getting data out of Hadoop once it’s been processed and analysed using more Hadoop-native technologies – then Big Data SQL will be a great enabling and acceptance technology for enterprises, rather than one that ends up restricting them.

We’re not aware of any beta program and I don’t think the launch webcast mentioned a specific date or BDA version when Big Data SQL will be out, but with Openworld coming up soon I’d expect to hear more about this over the next few months. We’re involved in a couple of significant Oracle Big Data Appliance implementations at the moment and this product would address a real, pressing need at the moment with our customers, so I’m looking forward to getting more involved in it over the next few months.

This article was updated on 18th July to add the fact that Big Data SQL is only available on Exadata, and is not a generic Oracle Database 12c technology.

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 4: Start Journalizing!

Rittman Mead Consulting - Thu, 2014-07-10 15:26

In this post, the finale of the four-part series “GoldenGate and Oracle Data Integrator – A Perfect Match in 12c”, I’ll walk through the setup of the ODI Models and start journalizing in “online” mode. This will utilize our customized JKM to build the GoldenGate parameter files based on the ODI Metadata and deploy them to both the source and target GoldenGate installation locations. Before I get into all of the details, let’s recap the first 3 posts and see how we arrived at this point.

Part one of the series, Getting Started, led us through a quick review of the Oracle Reference Architecture for Information Management and the tasks we’re trying to accomplish; loading both the Raw Data Reservoir (RDR) and Foundation schemas simultaneously, using Oracle GoldenGate 12c replication, and set it all up via Oracle Data Integrator 12c. We also reviewed the setup of the GoldenGate JAgent process, necessary for communication between ODI and GoldenGate when using the “online” version of the Journalizing Knowledge Module.

In part two, we reviewed the Journalizing Knowledge Module, “JKM Oracle to Oracle Consistent (OGG Online)”, its new features, and how much it has been improved in ODI 12c. Full integration with Oracle GoldenGate, through the use of the new ODI Tool “OdiOggCommand”, allows for the setup and configuration of GoldenGate process groups, trail file directories, and table-level supplemental logging, all from the ODI JKM.

Most recently, part 3, titled Setup Journalizing, walked us through the customizations of the “JKM Oracle to Oracle Consistent (OGG Online)” that will allow us to create the source-to-foundation replication alongside the standard source-to-RDR setup. We added a set of options, the first to control whether or not we replicat to foundation and the second to capture the ODI Logical Schema corresponding to the foundation schema. Then we added the task that will create the source-to-foundation table mapping inside the GoldenGate replicat parameter file and set the options appropriately. I’ve been using the ODI 12c Getting Started VM, with the 12.1.2 version of ODI, for my demo setup. If you haven’t done so already, you can download the latest version of the VM, with ODI 12.1.3, from the Oracle Technical Network. I’d say that’s enough recap, now on to the final steps for GoldenGate and ODI 12c integration and let’s start journalizing!

Setup ODI Models Create Models

We first need to create the ODI Models and Datastores for the Source, Staging (Raw Data Reservoir) and Foundation tables. I will typically reverse engineer the source tables into a Model first, then copy them to the Staging and Foundation Models. This approach will ensure the column names and data types remain consistent with the source. I then execute a Groovy script to create the additional data warehouse audit columns in each of the Foundation Datastores.

models

Configure JKM

Unlike the 11g version of ODI, in 12c the “JKM Oracle to Oracle Consistent (OGG Online)” Knowledge Module will be set on the source Model. Open up the Model, in this example, PM_SRC, and switch to the Journalizing tab.

model-journalizing

We’ll set the Journalizing Mode to “Consistent Set” and then choose the customized JKM that we have been working with in this example, “JKM Oracle to Oracle Consistent (OGG Online) RM”, from the dropdown list. Now we are presented with the GoldenGate Process Selection parameters and a list of KM Options to configure.

Set the Process Selection parameters for the Capture Process and Delivery Process by selecting the Logical Schemas created in Part 3 - Setup GoldenGate Topology – Schemas. This setting drives the naming of the Extract, Pump, and Replicat parameter files and process groups in GoldenGate. If you plan to use GoldenGate for the initial load, select the processes here as well. I’m not setting mine, as I typically use a batch load tool, such as Oracle Datapump or insert across DBLink to perform the initial load of the target. As you can see in the image below, you can also create the Oracle GoldenGate Logical Schemas from the Model.

model-jkm-process

Next are the set of Options, including the 2 new Options added in the previous post. We can leave several of the values as the default, as they are specific to particular character sets or implementation on a multi-node Oracle RAC setup.

model-jkm-options

The Options we do want to set:

ONLINE - Set to “true” to enable the automatic GoldenGate configuration when Start Journal is run.
LOCAL_TEMP_DIR – Enter a directory local to the machine on which the Start Journal process will be executed. Be sure the user executing the Start Journal process has privileges to create/modify/remove directories and files.
APPLY_FOUNDATION – Custom Option, set to “true” to enable the addition of the source-to-foundation mapping to the GoldenGate Replicat parameter file.
FND_LSCHEMA – The Logical Schema for the Foundation layer, necessary when APPLY_FOUNDATION is true.

After the Options are set, the Model can be saved and closed. Back in the Designer Navigator, add the Datastores to CDC by either selecting each individual Datastore and adding it or by right-clicking the Model and choosing to add the entire set all at once.

model-add-to-cdc

Before we get on with using the JKM, there is one thing I forgot to mention in the previous post. When setting up the Logical Schema for the GoldenGate “Delivery” process, you must also set the target Logical Schema. If you fail to do so, you’ll get an error stating “SnpLSchema does not exist” when attempting any Change Data Capture commands on the source Model.

logical-schema-set-target

With the JKM set and the tables added to Change Data Capture, we can now add a Subscriber. Subscribers allow multiple mappings to consume the change data from the J$ tables at different intervals. For example, a table may be consumed by one mapping every hour, and then by an additional mapping each night. Two different Subscribers would be used in this case. In our example, I’ll create a single Subscriber named “PERFECT_MATCH”. Make sure the process runs successfully, then it’s time to start Journalizing.

Start Capturing Changes

With the setup and configuration out of the way, the rest is up to the JKM. We’re now able to right-click the source Model and select Change Data Capture–>Start Journal. This executes all of the Start Journal related steps in the JKM, which will create the CDC Framework (J$ tables, JV$ views, etc.), generate and deploy the GoldenGate parameter files (Extract, Pump, and Replicat), and configure and start the GoldenGate process groups. Be sure that the source and target GoldenGate Manager and JAgent processes are running prior to executing the Start Journal process. Also, make sure that the database is in ArchiveLog mode and ready for GoldenGate to capture transactions.

After the Start Journal process is successfully completed, you can browse to the source GoldenGate home directory, run GGSCI, and view the status of the OGG process groups.

ogg-src-info-all

It looks like the Extract and Pump are in place and running. Now let’s check out the Replicat on the target GoldenGate installation.

ogg-trg-info-all

Here we see that the Replicat process is in place, but not actually running. Remember from our JKM editing that we commented out the step that will start the Replicat process. This is to ensure we perform an initial load prior to applying any captured change data to the target.

The last bit of work that must be completed is the initial load. I won’t go into details here, but the way I like to do it is to load the source to the target data based on a captured SCN using Oracle Datapump or DBLink rather than using GoldenGate process groups to perform the load. Note: The ODI 12c Getting Started Guide doesn’t even use the OGG initial load! Then, we can start the replicat in GoldenGate after the captured SCN using the same approach I wrote about in a previous blog on ODI and GoldenGate 11g.

JKM “Online” Mode

Beyond adding the parameter files and process groups, what exactly did the “online” version of the JKM do for us? If you browse to the directory that we set in the JKM Options under LOCAL_TEMP_DIR, you’ll find all of the GoldenGate files generated by the JKM. These files were generated locally, then uploaded to their proper GoldenGate home directory. Without “online” mode, they would had to have been manually copied to GoldenGate.

jkm-steps

Once uploaded, the obey files (batch files for GoldenGate commands) were executed.

OdiOggCommand "-LSCHEMA=EXTPMSRC" "-OPERATION=EXECUTEOBEY" "-OBEY_FILE=/home/oracle/Oracle/Middleware/oggsrc/EXTPMSRC.oby"

And finally, JKM steps were generated to perform the creation of the process groups in GoldenGate.

OdiOggCommand "-LSCHEMA=EXTPMSRC" "-OPERATION=EXECUTECMD"
add extract EXTPMSRC, tranlog, begin now 
add exttrail /home/oracle/Oracle/Middleware/oggsrc/dirdat/oc, extract EXTPMSRC, megabytes 100
stop extract EXTPMSRC
start extract

If you ever need to stop the change data capture process, either to add additional tables or make modifications to the metadata, you can run the Drop Journal process. Not only will the CDC Framework of tables and views be removed, but the “online” mode also reaches into GoldenGate and drops the process groups that were generated by the JKM.

OdiOggCommand "-LSCHEMA=EXTPMSRC" "-OPERATION=EXECUTECMD"
stop extract RPMEDWP
delete extract RPMEDWP

In conclusion, the integration between GoldenGate and Oracle Data Integrator  in 12c has been vastly improved over the 11g version. The ability to manage the entire setup process from within ODI is a big step forward, and I can only see these two products being further integrated in future releases. If you have any questions or comments about ODI or GoldenGate, or would like some help with your own implementation, feel free to add a comment below or reach out to me at michael.rainey@rittmanmead.com.

 

Categories: BI & Warehousing

Simple Ways to Simplify: Quick Fixes to Enhance OBIEE Visuals

Rittman Mead Consulting - Tue, 2014-07-08 13:18

Over the past couple of months I worked with a few clients in order to conduct an assessment on dashboard and analysis design. In most cases I was noticing that the dashboards were overflowing with content to the point that there was really no discernible “story” or “flow” to the information. Most of the dashboards were an overwhelming hub of large tables, excess dashboard prompts, and complicated charts and graphs. Many times the client complaints were that the dashboards were not being fully adopted by the user base. The users knew that something was in fact wrong with what they were looking at, but just couldn’t put their finger on why the analysis process was so frustrating. While this may seem unavoidable during a time when you are trying to provide your users with what they want and “need,” there are a few simple ways to aid this issue. During the development process, simplicity is often overlooked and undervalued, and that is why a few key design principles can drastically improve the user experience.

In many cases that I’ve seen, all dashboards start with the good intention of quickly relaying information to the user. However, as the requirements process grows in length, more and more compromise is made in keeping clean, consistent, functional design. Many of the dashboards in my assessments often hand the user everything they ask for without ever questioning the value. Developers often settle on poor design just to have something accepted by their users and released into production.

While each client is unique and has their own set of issues to resolve, there are a few consistent principles of dashboard design that can be applied for everyone.

Top-Down Analysis—The practice of allowing users to drill from summary to detail, gives your user community the ability to make their dashboards as dynamic as they need to be. This method will never give the user too much, or too little information. With detail being a choice, not the default, the user that can log in and be prompted to have to drill into further detail, rather than being presented everything all at once. While this is a widely accepted best practice when it comes to dashboard design, this principle is ignored more than you would think. Odds are that your VP does not need to see 50 rows of detail level information in a table every time they view their dashboard. The primary purpose of presenting this detail level is to answer a very important question that should be posed by analyzing your data at the summary level. The user can examine something simple like a trend analysis, and then decide whether further examination is needed. The benefit of this common principle is that your user is never overwhelmed and irritated with an overload of information. The dashboard should be treated with the same care and consideration you would . Your organization’s dashboard (product) should be a joy to use, not an experience coupled with frustration.

top down image

Simplicity and Trimming the Fat-This is another very simple, yet often ignored design principle. From my observation, many developers  will create a chart or graph and will leave all the default settings, no modifications needed right? While there is nothing inherently wrong with this, the default will leave a lot of extra pixels on the graph such as unneeded axis titles, shadowing, canvas size, etc. With just a little effort here, you can remove all of these unnecessary data pixels (pictured below) and provide a more professional, clean design. The point that I’m trying to make here is, let’s not get lazy with our visualizations. Instead, we should try to give a lot of thought what is useful in the visualization, and what can be discarded without hindering the message. The less cluttered we make our visualizations, the more pleasant the user experience will be. And as we all know, the happier our user community is, the easier your life as a developer’s will be.

default to flat

 Options for Option’s Sake-The types of visualizations used for a dashboard are one of the most important criteria for user adoption. By choosing visualizations that do not adequately display the type of analysis needed, or tell the correct story about the data, can be a frustrating waste of time for the user. Just because there are a lot of available graphing options in your analytical tool, does not mean they need to be used. This mistake is often made in an attempt to visually enhance the dashboard, or add “variety” . Try to consider things like what type of scale is in my graph (nominal, or interval pictured below), or do I want to provide summary or detail? Be sure to choose your graph based on these factors, rather than picking a graph that you think will add to variety and then figuring out how you can make it useful.

interval graphs

Visually appealing dashboards are important, however this is only relevant when the graphs are enhancing the users analytical experience. These mistakes are very costly because the overall goal of a dashboard is not to provide variety for varieties sake, but to quickly and accurately relay a message. By focusing only on visualization variety, we run a terrible risk of rendering a dashboard useless.

There are a lot of great resources out there that can provide more detail that can surely take your dashboards to the next level so I certainly suggest reading up information design methodology. I think the principles I’ve listed above are a great way to get started and provide some quick fixes on the road to enhancing the user experience within your organization.

 

Categories: BI & Warehousing

GoldenGate and Oracle Data Integrator – A Perfect Match in 12c… Part 3: Setup Journalizing

Rittman Mead Consulting - Wed, 2014-07-02 23:39

After a short vacation, some exciting news, and a busy few weeks (including KScope14 in Seattle, WA), it’s time to get the “GoldenGate and Oracle Data Integrator – A Perfect Match in 12c” blog series rolling again. Hopefully readers can find some time between World Cup matches to try integrating ODI and GoldenGate on their own!

To recap my previous two posts on this subject, I first started by showing the latest Information Management Reference Architecture at a high-level (described in further detail by Mark Rittman) and worked through the JAgent configuration, necessary for communication between ODI and GoldenGate. In the second post, I walked through the changes made to the GoldenGate JKM in ODI 12c and laid out the necessary edits for loading the Foundation layer at a high-level. Now, it’s time to make the edits to the JKM and set up the ODI metadata.

Before I jump into the JKM customization, let’s go through a brief review of the foundation layer and its purpose. The foundation schema contains tables that are essentially duplicates of the source table structure, but with the addition of the foundation audit columns, described below, that allow for the storage of all transactional history in the tables.

FND_SCN (System Change Number)
FND_COMMIT_DATE (when the change was committed)
FND_DML_TYPE (DML type for the transaction: insert, update, delete)

The GoldenGate replicat parameter file must be setup to map the source transactions into the foundation tables using the INSERTALLRECORDS option. This is the same option that the replicat uses to load the J$ tables, allowing only inserts and no updates or deletes. A few changes to the JKM will allow us to choose whether or not we want to load the Foundation schema tables via GoldenGate.

Edit the Journalizing Knowledge Module

To start, make a copy of the “JKM Oracle to Oracle Consistent (OGG Online)” so we don’t modify the original. Now we’re ready to make our changes.

Add New Options

A couple of new Options will need to be added to enable the additional feature of loading the foundation schema, while still maintaining the original JKM code. Option values are set during the configuration of the JKM on the Model, but can also have a default in the JKM.

APPLY_FOUNDATION

new-option-apply-fnd

This option, when true, will enable this step during the Start Journal process, allowing it to generate the source-to-foundation mapping statement in the Replicat (apply) parameter file.

FND_LSCHEMA

new-option-fnd-schema

This option will be set with Logical Schema name for the Foundation layer, and will be used to find the physical database schema name when output in the GoldenGate replicat parameter file.

Add a New Task

With the options created, we can now add the additional task to the JKM that will create the source to foundation table mappings in the GoldenGate replicat parameter file. The quickest way to add the task is to duplicate a current task. Open the JKM to the Tasks tab and scroll down to the “Create apply prm (3)” step. Right click the task and select Duplicate. A copy of the task will be created and in the order that we want, just after the step we duplicated.

Rename the step to “Create apply prm (4) RM”, adding the additional RM tag so it’s easily identifiable as a custom step. From the properties, open the Edit Expression dialog for the Target Command. The map statement, just below the OdiOutFile line, will need to be modified. First, remove the IF statement code, as the execution of this step will be driven by the APPLY_FOUNDATION option being set to true.

Here’s a look at the final code after editing.

map <%= odiRef.getObjectName("L", odiRef.getJrnInfo("TABLE_NAME"), odiRef.getOggModelInfo("SRC_LSCHEMA"), "D") %>, TARGET <%= odiRef.getSchemaName("" + odiRef.getOption("FND_LSCHEMA") + "","D") %>.<%= odiRef.getJrnInfo("TABLE_NAME") %>, KEYCOLS (<%= odiRef.getColList("", "[COL_NAME]", ", ", "", "PK") %>, FND_SCN)<%if (!odiRef.getOption("NB_APPLY_PROCESS").equals("1")) {%>, FILTER (@RANGE(#ODI_APPLY_NUMBER,<%= nbApplyProcesses %>,<%= odiRef.getColList("", "[COL_NAME]", ", ", "", "PK") %>))<% } %> INSERTALLRECORDS,
COLMAP (
USEDEFAULTS,
FND_COMMIT_DATE = @GETENV('GGHEADER' , 'COMMITTIMESTAMP'),
FND_SCN = @GETENV('TRANSACTION' , 'CSN'),
FND_DML_TYPE = @GETENV('GGHEADER' , 'OPTYPE')
);

The output of this step is going to be a mapping for each source-to-foundation table in the GoldenGate replicat parameter file, similar to this:

map PM_SRC.SRC_CITY, TARGET EDW_FND.SRC_CITY, KEYCOLS (CITY_ID, FND_SCN) INSERTALLRECORDS,
COLMAP (
USEDEFAULTS,
FND_COMMIT_DATE = @GETENV('GGHEADER' , 'COMMITTIMESTAMP'),
FND_SCN = @GETENV('TRANSACTION' , 'CSN'),
FND_DML_TYPE = @GETENV('GGHEADER' , 'OPTYPE')
);

The column mappings (COLMAP clause) are hard-coded into the JKM, with the parameter USEDEFAULTS mapping each column one-to-one. We also hard-code each foundation audit column mapping to the appropriate environment variable from the GoldenGate trail file. Learn more about the GETENV GoldenGate function here.

The bulk of the editing on this step is done to the MAP statement. The out-of-the-box JKM is setup to apply transactional changes to both the J$, or change table, and fully replicated table. Now we need to add the mapping to the foundation table. In order to do so, we first need to identify the foundation schema and table name for the target table using the ODI Substitution API.

map ... TARGET <%= odiRef.getSchemaName("" + odiRef.getOption("FND_LSCHEMA") + "", "D") %> ...

The nested Substitution API call allows us to get the physical database schema name based on the ODI Logical Schema that we will set in the option FND_LSCHEMA, during setup of the JKM on the ODI Model. Then, we concatenate the target table name with a dot (.) in between to get the fully qualified table name (e.g. EDW_FND.SRC_CITY).

... KEYCOLS (<%= odiRef.getColList("", "[COL_NAME]", ", ", "", "PK") %>, FND_SCN) ...

We also added the FND_SCN to the KEYCOLS clause, forcing the uniqueness of each row in the foundation tables. Because we only insert records into this table, the natural key will most likely be duplicated numerous times should a record be updated or deleted on the source.

Set Options

The previously created task,  “Create apply prm (4) RM”, should be set to execute only when the APPLY_FOUNDATION option is “true”. On this step, go to the Properties window and choose the Options tab. Deselect all options except APPLY_FOUNDATION, and when Start Journal is run, this step will be skipped unless APPLY_FOUNDATION is true.

jkm-set-task-option

Edit Task

Finally, we need to make a simple change to the “Execute apply commands online” task. First, add the custom step indicator (in my example, RM) to the end of the task name. In the target command expression, comment out the “start replicat …” command by using a double-dash.

--start replicat ...

This prevents GoldenGate from starting the replicat process automatically, as we’ll first need to complete an initial load of the source data to the target before we can begin replication of new transactions.

Additional Setup

The GoldenGate Manager and JAgent are ready to go, as is the customized “JKM Oracle to Oracle Consistent (OGG Online)” Journalizing Knowledge Module. Now we need to setup the Topology for both GoldenGate and the data sources.

Setup GoldenGate Topology - Data Servers

In order to properly use the “online” integration between GoldenGate and Oracle Data Integrator, a connection must be setup for the GoldenGate source and target. These will be created as ODI Data Servers, just as you would create an Oracle database connection. But, rather than provide a JDBC url, we will enter connection information for the JAgent that we configured in the initial post in the series.

First, open up the Physical Architecture under the Topology navigator and find the Oracle GoldenGate technology. Right-click and create a new Data Server.

create-ogg-dataserver

Fill out the information regarding the GoldenGate JAgent and Manager. To find the JAgent port, browse to the GG_HOME/cfg directory and open “Config.properties” in a text viewer. Down towards the bottom, the “jagent.rmi.port”, which is used when OEM is enabled, can be found.

####################################################################
## jagent.rmi.port ###
## RMI Port which EM Agent will use to connect to JAgent ###
## RMI Port will only be used if agent.type.enabled=OEM ###
####################################################################
jagent.rmi.port=5572

The rest of the connection information can be recalled from the JAgent setup.

setup-ogg-dataserver

Once completed, test the connection to ensure all of the parameters are correct. Be sure to setup a Data Server for both the source and target, as each will have its own JAgent connection information.

Setup GoldenGate Topology - Schemas

Now that the connection is set, the Physical Schema for both the GoldenGate source and target must be created. These schemas tie directly to the GoldenGate process groups and will be the name of the generated parameter files. Under the source Data Server, create a new Physical Schema. Choose the process type of “Capture”, provide a name (8 characters or less due to GoldenGate restrictions), and enter the trail file paths for the source and target trail files.

Create the Logical Schema just as you would with any other ODI Technology, and the extract process group schema is set.

For the target, or replicat, process group, perform the same actions on the GoldenGate target Data Server. This time, we just need to specify the target trail file directory, the discard directory (where GoldenGate reporting and discarded records will be stored), and the source definitions directory. The source definitions file is a GoldenGate representation of the source table structure, used when the source and target table structures do not match. The Online JKM will create and place this file in the source definitions directory.

Again, setup the Logical Schema as usual and the connections and process group schemas are ready to go!

The final piece of the puzzle is to setup the source and target data warehouse Data Servers, Physical Schemas, and Logical Schemas. Use the standard best practices for this setup, and then it’s time to create ODI Models and start journalizing. In the next post, Part 4 of the series, we’ll walk through applying the JKM to the source Model and start journalizing using the Online approach to GoldenGate and ODI integration.

Categories: BI & Warehousing

Going Beyond the Summary Advisor with TimesTen for Exalytics

Rittman Mead Consulting - Sun, 2014-06-22 17:18

I’m over in Seattle at the moment for ODTUG KScope’14, and one of the topics I’m presenting on this week is the use of TimesTen with OBIEE; what I’m going to talk about is taking TimesTen beyond its current use as an in-memory cache for aggregates created by the Summary Advisor, and instead using it to store additional custom aggregates, “hot” data from the source data warehouse, and other custom structures created using SQL*Developer and potentially loaded using Oracle Data Integrator. The point of the session is to show one of the main benefits of TimesTen even in the world of the new In-Memory Option for Oracle Database – it’s a flexible but Oracle-compatible database that’s typically under the control of the BI team and open to much more customisation than most customers realise.

To illustrate the point, I’m going to run through an example using the Oracle Database Sales History (SH) schema as the source in this article, with the example going through five stages of development:

1. First, I’ll load TimesTen for Exalytics in the usual way, by having the Summary Advisor run some recommendations and then generate aggregates to load into TimesTen.

2. Then, I’ll add some of my own aggregates into TimesTen, by using the Aggregate Persistence Wizard, so that we’ve now got a base set of aggregates from me (ones that I know we’ll need) plus whatever the Summary Advisor recommends based on historic query patterns.

3. Next, I’ll use TimesTen to hold what we refer to as “hot data” – the actual transactional data from the source database in this case from the last six months. So now, TimesTen will contain both the full transaction-level data that users are most likely to query at detail-level, plus aggregations of the full dataset over time, giving us even more chance that TimesTen will be able to fulfil a user’s query.

4. Then I’ll supplement this data in TimesTen with some additional reference data, that I’ll bring in from file using a tool like Oracle Data Integrator or the Oracle Database external table feature, modifying the existing TimesTen table structure using SQL*Developer to hold these additional dimension attributes.

5. Finally, I’ll set up incremental refresh of all of this, in order to avoid the full table-drop-and-reload approach that the Summary Advisor and Aggregate Persistence Wizard use, and to make data refreshing more efficient

Let’s start with steps 1 and 2, creating the aggregate layer for the underlying Sales History Oracle data source.

1. Creating Initial Aggregate using the Summary Advisor

We can use OBIEE’s Summary Advisor, and the underlying Aggregate Persistence mechanism that the Summary Advisor uses, to build the in-memory aggregates for our underlying datasource. Starting with the simplest part first, we’ll use the Summary Advisor to create a set of aggregates based off of recommendations from historic query data.

Running the Summary Advisor in my case brings up a recommendation to create three aggregates:

Sshot 4

Running the script that the Summary Advisor generates brings up one of the issues though that you’re likely to hit when using this tool – your RPD has to be absolutely “rock solid” in terms of correctness, otherwise your aggregate build will fail. In the case of this example, the summary advisor aggregate persistence script starts-off running OK, but then errors at the end when one of the aggregate tables fails to build:

c:\Middleware\Oracle_BI1\bifoundation\server\bin>nqcmd -d coreapplication_OH799
36239 -u weblogic -p welcome1 -s c:\ttscripts\summ_adv_SH_full_blog.sql

-------------------------------------------------------------------------------
Oracle BI ODBC Client
Copyright (c) 1997-2013 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------


create aggregates

"ag_561184940"
for "Sales History (Oracle + TT)"."SALES"("AMOUNT_SOLD","QUANTITY_SOLD")
at levels ("Sales History (Oracle + TT)"."CustomerDim"."City", "Sales History
Oracle + TT)"."TimesDim"."Month", "Sales History (Oracle + TT)"."ProductsDim"."Category")
using connection pool "tt_aggr_store"."Connection Pool"
in "tt_aggr_store".."EXALYTICS",

…

"ag_1162984574"
for "Sales History (Oracle + TT)"."SALES"("AMOUNT_SOLD","QUANTITY_SOLD")
at levels ("Sales History (Oracle + TT)"."TimesDim"."Year", "Sales History (Or
cle + TT)"."ProductsDim"."Category", "Sales History (Oracle + TT)"."PromoDim"."Promo Subcategory")
using connection pool "tt_aggr_store"."Connection Pool"
in "tt_aggr_store".."EXALYTICS"
[10058][State: S1000] [NQODBC] [SQL_STATE: S1000] [nQSError: 10058] A general error has occurred.
Statement preparation failed


Processed: 1 queries
Encountered 1 errors

Looking at the list of tables left in TimesTen at that point, I can see that one of the dimensions failed to build, which then cascading down to the dependent fact table build failing:

C:\Users\Administrator>ttisql

Copyright (c) 1996, 2013, Oracle and/or its affiliates. All rights reserved.

Type ? or "help" for help, type "exit" to quit ttIsql.

Command> connect "DSN=tt_exalytics_3;uid=exalytics;pwd=welcome1";

Connection successful: DSN=tt_exalytics_3;UID=exalytics;DataStore=C:\TimesTen\tt_data_3;DatabaseCharacterSet=AL32UTF8;ConnectionCharacterSet=US7ASCII;DRIVER=C:\TimesTen\TT1122~2\bin\ttdv1122.dll;LogDir=C:\TimesTen\tt_data_3;PermSize=4000;TempSize=2000;LockWait=60010.0;SQLQueryTimeout=60000;TypeMode=0;QueryThreshold=60000;PLSCOPE_SETTINGS=IDENTIFIERS:NONE;RangeIndexType=1;(Default setting AutoCommit=1)

Command> tables;

  EXALYTICS.SA_CATEGOR0000AFE8
  EXALYTICS.SA_CHANNEL0000B8F9
  EXALYTICS.SA_CITY0000AC6A
  EXALYTICS.SA_MONTH0000AC81
  EXALYTICS.SA_PROMO_S0000B8E8
  EXALYTICS.TEST

6 tables found.

Summary Advisor aggregates failing to build is something that regular Exalytics developers will be used to, at least from when they first use the tool. The trick to it is to make sure you run the Model Checker, within the BI Administration Tool, before you try and generate some aggregates; whilst your RPD might pass the regular consistency check and be valid as a repository that can go online and be used for reporting, it may not be consistent or logically correct from an aggregation standpoint, and looking at the NQQuery.log file, I can see that the first aggregate failed to build because aggregate persistence rejected the shared product dimension all of the aggregate fact tables used.

Running the model checker on my business model, I can see there are two errors listed for the product dimension, and a bunch of warnings where I’ve not added dimension attribute columns to their relevant dimension level.

Sshot 5

Running the logical SQL given in the two error messages at the start, I can see that there’s an error in the data that I’m using for my Products dimension data source, in that one or more of the dimension levels contains keys that are duplicates – in this case because  there are two “Accessories” product subcategories with different IDs. I go into SQL*Developer and correct this issue, and also add the logical columns listed as “warnings” into their respective logical dimension levels, so that running the model checker again gives me this message instead:

Sshot 6

I then pick two of the aggregate recommendations and run the script for them, so in the end I’m left with the aggregate tables below in my RPD.

Sshot 7

Thereafter, I can run the Summary Advisor again to generate some more recommendations, or I can add my own specific aggregates into the TimesTen database using the Aggregate Persistence Wizard, for example to add an aggregate I know users are going to want to use.

2. Adding Custom Aggregates using the Aggregate Persistence Wizard

Generating aggregates using the Aggregate Persistence Wizard uses the same underlying mechanism to create and populate the aggregate tables, but in this instance I select the specific measures, dimensions and hierarchy levels I want to aggregate by. In this instance, I use the Aggregate Persistence Wizard to create an additional aggregate using the Promotions dimension, as I know some new reports coming along will make use of this pre-aggregation.

Sshot 9

So where I am at the end of all this is with a TimesTen aggregate data mart containing three aggregate fact tables – two recommended by the Summary Advisor, one I added myself – along with supporting dimension tables.

Sshot 11

And this is great for handling dashboard queries that request summarised data, through a graph or pivot table analysis. But what about when users want to drill-down to the detail, or run reports against recent transactional activity? That’s where the second TimesTen use-case, “hot data”, comes in.

3. Populating TimesTen with Transactional “Hot Data”

So far we’ve looked at populating our TimesTen database with aggregate data, but what about making the most recent set of transactions available to users also in TimesTen, so that analyses running against recent activity even at the detail level run fast too? To do this we can use the ttImportFromOracle utility to replicate into TimesTen the Sales History schema, and then edit the data loading script it generates to only load the last six months of data.

This utility only works for Oracle sources, and ships with recent versions of TimesTen in the /support directory. Using it I can have the utility scan the data in my source database in order to recommend the most space-efficient TimesTen datatypes, and I can also use it to recommend compression settings for use with TimesTen for Exalytics Columnar Compression feature.

As ttImportFromOracle requires you to have a TimesTen schema with the same name as the one you’re importing from in Oracle, I create a new schema in my TimesTen for Exalytics database, ready for importing from the Oracle source:

create user sh identified by welcome1;
grant create session to sh;
grant create table to sh;
grant select on SYS.OBJ$ to sh;
grant admin to sh;

Then I can use the utility to create the DDL and data loading scripts that I’ll then use to import the hot data from Oracle.

c:\TimesTen\tt1122_64_3\support>ttImportFromOracle.exe -oraconn sh/password@orcl  -tables CUSTOMERS PRODUCTS SALES CHANNELS PROMOTIONS TIMES -typeMap 2,1 -compression 1

Beginning processing
Resolving any tablename wildcards
Eliminating any duplicate tables
Getting metadata from source
Generating database user list
Assigning TimesTen datatypes
Analyzing source tables (this may take some time)
Analyzing table 'SH.CUSTOMERS' ...
Analyzing table 'SH.PRODUCTS' ...
Analyzing table 'SH.SALES' ...
Analyzing table 'SH.CHANNELS' ...
Analyzing table 'SH.PROMOTIONS' ...
Analyzing table 'SH.TIMES' ...
Optimizing TimesTen datatypes
Estimating compression ratios
Generating output files
Finished processing

This set of scripts creates the tables and indexes to hold the replicated data from Oracle, and I can edit the “LoadData.sql” script that comes with the DDL scripts to load just the last six months of data – to do this I split the “ttLoadFromOracle” TimesTen procedure calls in this script into ones for the dimensions and one for the fact table load, which after amending it to load just six months data looks like this:

timing 1; 

call ttLoadFromOracle('SH', 'SALES', 'SELECT S.PROD_ID, S.CUST_ID, S.TIME_ID, S.CHANNEL_ID, S.PROMO_ID, S.QUANTITY_SOLD, S.AMOUNT_SOLD 
FROM SH.SALES S, SH.TIMES T 
WHERE S.TIME_ID = T.TIME_ID 
AND  T.CALENDAR_MONTH_DESC > ''2001-06''');

Then I connect to TimesTen as the SH user, providing the password to the Oracle database as part of the connect string, and then run the scripts I’ve just generated:

Command> connect "DSN=tt_exalytics_3;uid=sh;password=welcome1;oraclepwd=password";
Command> run c:\ttimportscripts\CreateTables.sql
Command> run c:\ttimportscripts\LoadData_dims.sql
Command> run c:\ttimportscripts\LoadData_fact.sql
Command> run c:\ttimportscripts\CreateIndexes.sql
Command> run c:\ttimportscripts\UpdateStats.sql

I can then go over to SQL*Developer and see the new TimesTen tables there, with the optimised datatypes provided by ttImportFromOracle:

Sshot 13

and also import these new tables into the RPD, adding an extra schema alongside the aggregate tables:

Sshot 12

Then, the way to make use of this six months of transactional data in TimesTen is to set up “fragmentation” in the OBIEE RPD, so that the BI Server goes to the TimesTen data source when queries require data from the past six months, and the standard Oracle datasource if data further back from that is required; in addition, because of the aggregates we’ve also set up, it’ll go to the TimesTen aggregate tables if queries request aggregated data, so TimesTen performs the role of holding both “hot” data, and aggregate data.

The first step in setting this up is to map in the new TimesTen table as additional logical table sources into the existing business model. In the screenshot below, you can see the new TimesTen logical table sources added to the existing Oracle, and TimesTen aggregate logical table sources, and when you do this make sure you remove any additional logical table keys that might come across when you map in the new TimesTen tables.

Sshot 14

Then, for just the fact table within the logical iodel, edit the Oracle detail-level and the TimesTen “hot data” detail level table sources, add fragmentation conditions to define what time period each source covers, like this:

NewImage

Do this just for the fact table logical table source; then, when a query comes through to the BI Server, if it needs to refer to the TimesTen logical table source it’ll join-out to the relevant TimesTen hot data logical table sources to get the rest of the required data, and when it needs to go to the Oracle fact table logical table source, it’ll join-out to the Oracle LTSs, as shown in the diagram below.

NewImage

Then, when you incude the TimesTen aggregate tables into the architecture as well, you’ve got three potential sources of data for user reports, all handled seamlessly and automatically by the BI Server; the Oracle and TimesTen “hot” data sources for detail-level reporting, one or the other (or both) used for queries at the transaction level, and the TimesTen aggregate tables when they would answer a query faster than rolling-up the detail-level sources on-the-fly.

NewImage

4. Supplementing with Additional Reference Data

One of the advantages of using a mid-tier database like TimesTen, that’s under the control of the BI department, is that you can be a lot more “agile” in terms of bringing in new data sources, than is often the case when you’re reporting against a data warehouse that’s strictly governed and a shared corporate resource. TimesTen lends itself well to agile development too, in that you can work with it using tools like SQL*Developer and easily load data into it using ETL tools such as Oracle Data Integrator, or leverage the connectivity features in the Oracle Database and then use ttImportFromOracle.

Working with TimesTen database structures is more or less the same process as working with Oracle ones, but there’s a couple of differences you need to be aware of when developing your TimesTen tables. Adding new columns to a table is possible, but these can end up stored “out of line” with the main column set and can cause performance issues when queried, or compatibility issues when loading using tools like ttImportfFromOracle. If you’re using TimesTen to store “hot” data, make sure you’ve got corresponding tables for this reference data in the source Oracle database, and make sure you select the most appropriate TimesTen datatypes for your new data (for example, TT_TINYINT and TT_INTEGER) rather than just the default Oracle-like datatypes.

Most importantly, for all of your TimesTen tables, make sure you run the Index Advisor after you start to run user reports in order to ensure you’re using the right indexes for your particular query patterns, and make sure you gather stats on your TimesTen tables as well as the Oracle ones, so that the TimeTen query optimiser can generate the most efficient query plans.

5. Setting up Incremental Refresh

The last thing we want to set up for our TimesTen analytic data mart is some way to incrementally refresh the aggregates we built using the Summary Advisor and Aggregate Persistence Wizard. The way these tools work is that they drop, and then recreate the aggregate tables each time you load them, and add and remove them from the RPD at the start and end of the data load, as shown in the diagram below:

NewImage

What you can do, though, as outlined in this earlier blog post by Robin Moffatt, is take the POPULATE BI Server command that you’ll find in the nqquery.log file after an aggregate build, and use it to just refresh the aggregates in-place, without dropping the tables beforehand or removing them from the RPD, as shown in the diagram below.

NewImage

Looking through the nqquery.log file I can see various POPULATE commands for the TimesTen dimension tables, like this:

populate "SA_Promo_C0000B8E4" mode ( append table connection pool "tt_aggr_store"."Connection Pool") as  select_business_model "Sales History (Oracle + TT)"."PROMOTIONS"."PROMO_CATEGORY" as "PROMO_CATE0000B8C6","Sales History (Oracle + TT)"."PROMOTIONS"."PROMO_CATEGORY_ID" as "PROMO_CATE0000B8C7", RCOUNT (1) as "Promo_C_0000B8E4SK" from "Sales History (Oracle + TT)"; [[

If I then couple that with the INACTIVE_SCHEMAS variable being set to the TimesTen physical data source, add a command to truncate the dimension table before the load, and then alter the fact table POPULATE command’s SELECT statement to only load in just this past month’s data, deleting that data down from the target fact table beforehand, I’ve got a full incremental refresh I can run from a cron job or other scheduler. The script excerpt below shows such an example, with just a few of the dimensions being reloaded.

For the “hot data” tables it’s a similar case of taking the existing LoadData.sql and either modifying it to load in just the new transactions, or reload all six months if it’s not easier to do just that.

execute physical connection pool "tt_aggr_store"."Connection Pool" truncate table SA_Categor0000AFE8;
 
SET VARIABLE INACTIVE_SCHEMAS='"tt_aggr_store".."EXALYTICS_AGG_WIZ"':
populate "SA_Categor0000AFE8" mode ( append table connection pool "tt_aggr_store"."Connection Pool") as  select_business_model "SH (Oracle)"."PRODUCTS"."PROD_CATEGORY_DESC" as "PROD_CATEG0000AFB4", RCOUNT (1) as "Categor_0000AFE8SK" from "SH (Oracle)";
 
execute physical connection pool "tt_aggr_store"."Connection Pool" truncate table SA_Month0000AC81;
 
SET VARIABLE INACTIVE_SCHEMAS='"tt_aggr_store"..”EXALYTICS"':
populate "SA_Month0000AC81" mode ( append table connection pool "tt_aggr_store"."Connection Pool") as  select_business_model "SH (Oracle)"."TIMES"."CALENDAR_MONTH_DESC" as "CALENDAR_M0000AC0A","SH (Oracle)"."TIMES"."CALENDAR_QUARTER_DESC" as "CALENDAR_Q0000AC0E","SH (Oracle)"."TIMES"."CALENDAR_YEAR" as "CALENDAR_Y0000AC12", RCOUNT (1) as "Month_0000AC81SK" from "SH (Oracle)";
 
execute physical connection pool "tt_aggr_store"."Connection Pool" truncate table SA_City0000AC6A;
 
SET VARIABLE INACTIVE_SCHEMAS='"tt_aggr_store".."EXALYTICS"':
populate "SA_City0000AC6A" mode ( append table connection pool "tt_aggr_store"."Connection Pool") as  select_business_model "SH (Oracle)"."CUSTOMERS"."CUST_CITY" as "CUST_CITY0000ABC5","SH (Oracle)"."CUSTOMERS"."CUST_STATE_PROVINCE" as "CUST_STATE0000ABD4", RCOUNT (1) as "City_0000AC6ASK" from "SH (Oracle)";
 
execute physical connection pool "tt_aggr_store"."Connection Pool" delete from ag_SALES where CALENDAR_M0000AC0A = '2001-12';
 
SET VARIABLE INACTIVE_SCHEMAS='"tt_aggr_store".."EXALYTICS"':
populate "ag_SALES" mode ( append table connection pool "tt_aggr_store"."Connection Pool") as  
select_business_model "SH (Oracle)"."CUSTOMERS"."CUST_CITY" as "CUST_CITY0000ABC5",
                      "SH (Oracle)"."SALES"."AMOUNT_SOLD" as "AMOUNT_SOL0000ABF6",
                      "SH (Oracle)"."SALES"."QUANTITY_SOLD" as "QUANTITY_S0000ABFB",
                      "SH (Oracle)"."TIMES"."CALENDAR_MONTH_DESC" as "CALENDAR_M0000AC0A",
                      "SH (Oracle)"."PRODUCTS"."PROD_CATEGORY_DESC" as "PROD_CATEG0000AFB4" 
                      from "SH (Oracle)"
                      where "SH (Oracle)"."TIMES"."CALENDAR_MONTH_DESC" = '2001-12';

Conclusion

So what we’ve got here then is a much-expanded use of TimesTen in the context of Exalytics and OBIEE; we’ve not only used it as an in-memory cache for Summary Advisor aggregates, but we’ve used the same underlying mechanism to create other aggregates that we know will be useful for future queries.

We’ve also made use of the 1-2TB of RAM on the Exalytics server to also cache the last six months of detail-level transactional data, making it even more likely that TimesTen will be able to answer all of the users’ queries.

Once you’ve got your data in TimesTen, you’ve got access to the same full set of analysis functions that are available when reporting against regular Oracle databases, with TimesTen for Exalytics giving you more native aggregation and analysis functions compared to standard TimesTen, and the BI Server “functionally compensating” for anything not natively available in TimesTen by performing the calculation itself, using the raw data provided by TimesTen

 Finally, we’ve talked about how we can load in additional datasets into TimesTen via ODI or ttImportFromOracle, potentially using the latter in conjunction with external tables to bring in file data, and then looked at what’s involved in trickle-feeding the TimesTen cache rather than dropping and reloading it each time.

If you’re at KScope’14 in Seattle this week, I’m presenting on this topic on the Wednesday; if not, and you’re potentially interested in us helping you get more out of your TimesTen for Exalytics install, just drop me at line at mark.rittman@rittmanmead.com.

Categories: BI & Warehousing

Rittman Mead at ODTUG KScope’14, Seattle

Rittman Mead Consulting - Fri, 2014-06-20 03:42

NewImage

Next week is ODTUG KScope’14 in Seattle, USA, and Rittman Mead will be delivering a number of presentations over the week. Coming over from Europe we have Jérôme Françoisse and myself, and we’ll be joined by Michael Rainey, Daniel Adams, Charles Elliott from Rittman Mead in the US. We’re also very pleased to be co-presenting with Nicholas Hurt from IFPI who some of you might know from the Brighton BI Forum this year, who’s talking with Michael Rainey about their Oracle data warehouse upgrade story.

Here’s details of the Rittman Mead sessions and speakers during the week:

  • Daniel Adams : “Hands-On Training: OBIEE Data Visualization: The “How” and the “Why”?” – Monday June 23rd 2014 1.15pm – 3.30pm, Room 2A/2B
  • Jérôme Françoisse : “Oracle Data Integrator 12c New features” – Monday June 23rd 1.15pm – 2.15pm, Room 616/615
  • Mark Rittman : “Deploying OBIEE in the Cloud: Options and Deployment Scenarios” – Monday June 23rd 3.45pm – 4.45pm, Room 615/616
  • Mark Rittman : “OBIEE, Hadoop and Big Data Analysis” – Tuesday June 24th 11.15am – 12.15pm, Room 602/603/604
  • Michael Rainey and Nicholas Hurt (IFPI) : “Real-Time Data Warehouse Upgrade: Success Stories” – Tuesday June 24th 2014 2pm – 3pm, Room 614
  • Charles Elliott : “OBIEE and Essbase – The Circle of Life” – Wednesday June 25th 11.15am – 12.15pm
  • Mark Rittman : “TimesTen as your OBIEE Analyic “Sandbox” – Wednesday June 25th 3.15pm – 4.15pm, Room 615/616

We’ll also be around the event and Seattle all week, so if you’ve got any questions you’d like answered, or would like to talk to us about how we could help you with an Oracle BI, data integration, data warehousing or big data implementation, stop one of us for a chat or drop me a line at mark.rittman@rittmanmead.com.

Categories: BI & Warehousing

SampleApp v406 – Automatic startup of OBIEE

Rittman Mead Consulting - Tue, 2014-06-17 00:20

Last week Oracle released v406 of SampleApp. SampleApp is a one-stop-shop for a demonstration of pretty much any conceivable thing that OBIEE is capable of, and then some more on top of it. It is a VirtualBox appliance with everything on the one machine (Database, OBIEE, Endeca, TimesTen, Essbase, etc), and demonstrates basic analysis building techniques through to dashboarding, Mobile App Designer, analysis visualisations with D3, ADF, JavaScript, backend hackery with undocumented NQS calls (hi Christian!), and much, much more.

So, SampleApp is awesome, but … there’s no such thing as absolute perfection ;-) One of the things that is still missing from it is the automatic start/stop of OBIEE when you bootup/shutdown the machine respectively. Setting it up is easy to do, as I demonstrate below. I’ve also put my cheatsheet for whipping SampleApp into line for my day-to-day use, focussed on the commandline and automation (because respectively that’s where I spend most of my time on a server and because I’m lazy).

The OBIEE init.d service script that I demonstrate here is available for use on any Linux installation of OBIEE. For more information, see the Rittman Mead public scripts github repository here: http://ritt.md/init.d

Before we get started I’m assuming here that you’ve:

  • downloaded the 28GB worth of zip files
  • Unpacked them using 7zip
  • Found 70+GB of disc space free and imported the OVF into VirtualBox
  • Started up the VM

There’s full instructions in the SampleApp_QuickDeploymentGuide–406.pdf, available in the SampleApp v406 Documentation download from the same page as the SampleApp image itself.

So to make OBIEE start automatically, the first thing we need to do is make sure that the database (where the RCU schemas are) is doing the same, by setting it up as a service (init.d). This is based on this article if you want more details about quite how it works, but for know you just need to copy and paste this whole code block into the command prompt on SampleApp to create the necessary files. If you can’t copy & paste between your host and the Virtualbox guest, just go to this blog from Firefox within the SampleApp VM itself.
sa08
(Or if you know what you’re doing, SSH onto the server and paste the text into an SSH client on your host machine.)

Copy and paste this whole code block into the command prompt:

# Create startup/shutdown script files
mkdir -p /home/oracle/scripts
chown oracle.oinstall /home/oracle/scripts
cat>>/home/oracle/scripts/startup.sh<<EEOF
#!/bin/bash

# Start Listener
lsnrctl start

# Start Database
sqlplus / as sysdba << EOF
STARTUP;
ALTER PLUGGABLE DATABASE ALL OPEN;
EXIT;
EOF
EEOF

cat>>/home/oracle/scripts/shutdown.sh<<EEOF
#!/bin/bash

# Stop Database
sqlplus / as sysdba << EOF
ALTER PLUGGABLE DATABASE ALL CLOSE IMMEDIATE;
SHUTDOWN IMMEDIATE;
EXIT;
EOF

# Stop Listener
lsnrctl stop
EEOF

# Make them executable
chmod u+x /home/oracle/scripts/startup.sh /home/oracle/scripts/shutdown.sh
chown oracle.oinstall /home/oracle/scripts/startup.sh /home/oracle/scripts/shutdown.sh

# Create service script
cat>/tmp/dbora<<EOF
#!/bin/sh
# chkconfig: 345 90 25
# description: Oracle auto start-stop script.
#
# Set ORA_OWNER to the user id of the owner of the 
# Oracle database software.

ORA_OWNER=oracle

case "\$1" in
    'start')
        # Start the Oracle databases:
        # The following command assumes that the oracle login 
        # will not prompt the user for any values
        su - \$ORA_OWNER -c "/home/oracle/scripts/startup.sh >> /home/oracle/scripts/startup_shutdown.log 2>&1"
        touch /var/lock/subsys/dbora
        ;;
    'stop')
        # Stop the Oracle databases:
        # The following command assumes that the oracle login 
        # will not prompt the user for any values
        su - \$ORA_OWNER -c "/home/oracle/scripts/shutdown.sh >> /home/oracle/scripts/startup_shutdown.log 2>&1"
        rm -f /var/lock/subsys/dbora
        ;;
esac
EOF
sudo mv /tmp/dbora /etc/init.d/dbora
sudo chown root. /etc/init.d/dbora

# Make the service script executable
sudo chmod 750 /etc/init.d/dbora

# Associate the dbora service with the appropriate run levels and set it to auto-start using the following command.
sudo chkconfig --add dbora

On SampleApp v406 there is an Oracle 12c container database (CDB), with two “pluggable” databases (PDB) within it. Assuming you’ve not started the database yet, trying to connect to one of the PDBs for the RCU schema will fail:

[oracle@demo ~]$ sqlplus BIEE_BIPLATFORM/Oracle123@localhost:1521/pdborcl

SQL*Plus: Release 12.1.0.1.0 Production on Tue Jun 17 03:03:51 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

ERROR:
ORA-12541: TNS:no listener

Now run the service start (which will happen automatically at boot)

sudo service dbora start

And check the status again:

[oracle@demo ~]$ sqlplus BIEE_BIPLATFORM/Oracle123@localhost:1521/pdborcl

SQL*Plus: Release 12.1.0.1.0 Production on Tue Jun 17 03:06:12 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Tue Jun 17 2014 03:02:09 -04:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL>

Now we can set up OBIEE as a service that can start automatically at boot time. This is using a script that I wrote and is shared on the Rittman Mead public scripts github repository here: http://ritt.md/init.d. To install it on SampleApp v406 needs a couple of changes for the environment paths and dependencies etc, which I’ve incorporated in this code block. Again, copy and paste the whole thing into a command prompt on SampleApp.

# lsof is used in the script and isn't installed by default, so let's install it: 
sudo yum -y install lsof

# Now fetch the init.d script itself
sudo wget --no-check-certificate https://raw.githubusercontent.com/RittmanMead/scripts/master/obi/init.d/obiee -O /etc/init.d/obiee

# Update the FMW_HOME path in the script
# If you're doing this manually, you just neeed to change the line 
# "FMW_HOME=" and put in the FMW_HOME path for your installation. 
# In the case of SampleApp v406 it is /app/oracle/biee
sudo sed -i -e 's/FMW_HOME=.*$/FMW_HOME=\/app\/oracle\/biee/g' /etc/init.d/obiee

# Make the script executable
sudo chmod 750 /etc/init.d/obiee

You should now be able to run the following status command to see if OBIEE is running or not:

sudo service obiee status

and if it’s not running, start it up:

sudo service obiee start


To shut it down:

sudo service obiee stop

You can read more about the script here.

There’s one more step to run to make OBIEE start automatically at bootup:

sudo chkconfig --add obiee

Now that the service scripts are in place, restart the machine and check they work:

sudo shutdown -r now

When the VM restarts, it may appear to “hang” on boot – the progress bar will show and “Oracle Linux Server 6.5”, but no desktop. That’s because the services are starting up, and you can switch to the console view to see this. On my Mac I press Shift-Function-Backspace to do this, it may vary on other host operating systems (try ALT + d on Windows):

Once the desktop appears you should be able to launch Firefox and go straight to OBIEE, up and running

There are some additional tweaks I usually make to any new server I work with:

  1. Install screen (don’t know what screen is? Read this!):
    sudo yum -y install screen

    Note I’m using sudo which this version of SampleApp thankfully has configured for the oracle user – previous versions didn’t IIRC.
  2. Configure screen with a nice statusbar:

    cat > ~/.screenrc <

    I’m using the bash “here document” functionality here to embed the contents of a document in a command statement. It means I can cut and paste it from my clipboard (if you’re on a Mac, you really should check out Alfred, which has a clipboard history manager, so I always have this screenrc and many other snippets available to paste within just a couple of key presses). Cut and paste a single command is easier (and thus less error-prone) than explaining what content to edit into which text file with which text editor.

  3. Set up SSH keys. I wrote about SSH keys previously on this blog (in fact, all of the things on this checklist are covered there). Put simply, it makes logging in much faster and removes the problem of fat-fingering the password:

    As with the screen configuration in the previous step, I use a snippet of code from my clipboard, pasted into the shell of any new server I’m working on:

    # setup ssh key
    mkdir ~/.ssh
    cat >> ~/.ssh/authorized_keys <<EOF
    ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDMvNjJ8y1a3+NR5OEZHjAd6HhPFdkOI4mDiIIegbRDG+ABhTQmE80DuDe0vp4tex7qefS1u5FG3Xf+hmKDBzKFyegpC+9oEmrtIPkuW7b1Uf/f7Dr6EjXbyA6x1LGEu3THCrGVfw5Mkakqzr43ZfGEFf8k0RiMroHQbp3vOcR+5y0Q132Tbrr0GPd0RPuj4cVOC7QQ6jXSs7TZWepzrcqEpB4M1+N3xS/jEQ5aiCY2FTDOUVj6Y4c0Ogg93bLos6JltPKznq08KXy0ZW/rWDindAmeL0ZAMuU12Qv1ehQZJsxjUwq3jz4yNMgXs5ba3nSvMAr7ZVRzbK5nLl7+eAN3 zaphod@beeblebrox
    EOF
    #
    chmod -R 700 ~/.ssh

    As a side node, that’s my public SSH key there. One’s public ssh key is just that – public. If you want to use it, go ahead, it just means I’ll be able to login to your server. That’s because I have the other half of the key-pair; the private key, and that is the one that should not be shared, because with it you can access any server that has been configured with the public key. Private SSH keys should be treated just as your most important passwords (and indeed, you should use a passphase when creating your private SSH key, to add a layer of security to it, so that it can only be used if the passphrase is known).

The next step is Installing obi-metrics-agent, Graphite, and collectl, but that’s a matter for another blog post :-)

Categories: BI & Warehousing

Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture Pt2. – Delivering the Data Factory

Rittman Mead Consulting - Mon, 2014-06-16 04:00

In my previous post on our updated Oracle Information Management Reference Architecture, jointly-developed with Oracle’s Enterprise Architecture team, we went through a conceptual and logical view of the information architecture, introducing new concepts like the Raw Data Reservoir, the Data Factory and the Discovery Lab. I said that the Data Factory, Data Reservoir, Enterprise Information Store and Reporting, together with the Discovery Lab, comprised the Information Platform, which could be used to create data applications (raw data plus the data factory), or more wide-ranging Information Solutions, with the Discovery Lab often acting as the place where new ideas and models were prototyped, and algorithms developed that could be productionized in the Execution area of the overall architecture.

NewImage

And this is typically where most information architectures leave you; with a good starting point and an overall schematic, but without any real guidance on how to deliver it. For example, how do you develop small pieces of functionality that may often be throwaway, and how do you take on larger projects? Is it possible to develop this type of architecture in an agile way, and if so, can the data reservoir part be delivered agile too? In the collaboration with Oracle’s Enterprise Architecture team this is where we made our major contribution, taking the ideas behind our ExtremeBI agile development approach and using them to deliver this updated Information Management architecture. Let’s go back a few steps though and think about why an agile, or at least a different, way of developing BI content is needed, and what part of the overall architecture are we developing for?

Within this overall architecture, the bit that developers such as ourselves typically add value for is the “data factory”; the part of the architecture that acts as the conduit, or interface, between the schema-on-read raw data reservoir and the schema-on-write enterprise information store.

NewImage

There’s other code to write too; interface and streaming code to get data into the event engine, for example, and more traditional ETL code to load data in from database and application stores. But what we’re most concerned about is the data models and reports that users request in their user stories and epics. If you take a look back at the logical view of our new information architecture, you can see that information can come in at any level in the architecture – raw data reservoir, foundation or access and performance, with different pros and cons for landing and accessing data at each layer, and the ability to take data from one layer and use it to load layers higher up in the layer hierarchy.

NewImage

And this is where our agile development methodology, “ExtremeBI”, comes in. ExtremeBI, for structured relational sources, typically uses Oracle GoldenGate to trickle-feed data of interest directly into the foundation layer of this information management data architecture, with developers then using that layer plus the access and performance layer, if needed, to close-off user stories as quickly as possible, as shown in the diagram below.

NewImage

In some cases it’s possible to model against just the foundation layer in OBIEE, where that foundation layer contains all the metadata and history columns that you need. One of the main parts of our ExtremeBI methodology is a technical process to make this modelling possible, and more importantly allow the underlying physical model to evolve in response to user stories about performance and analysis needs, without breaking existing reports or creating an unmanageable system.

NewImage

We’ll be documenting this process as part of the series of white papers we’ll be co-authoring with Oracle, and we also bring utilities and accelerators to ExtremeBI customer projects, along with continuous integration and regression testing tools, to help get usable software in the hands of end-users as soon as possible.

But what about the parts of the data factory that source data from the raw data store? Can we use similar techniques, and tools like ODI and OBIEE, to deliver working software to users within a short time span? Well in practice it’s not quite so easy; typically, getting information out of schema-on-read database involves writing code, and its easy to fall back into bad habits where everything is scripted, code is brittle and fragile, and no-one feels confident in evolving the model over time.

We think though it’s possible to use ExtremeBI techniques with schema-on-read sources thought, when you combine them with tools like Flume or GoldenGate for trickle-feed extraction and loading, and new capabilities in ODI and OBIEE where Hadoop data can be accessed via Hive. Flume is the industry-standard mechanism for transporting log entries from source to the Hadoop cluster, and GoldenGate has the ability through its Java API to land data in HDFS or Hive, when the source is a relational database (see MOS Doc.IDs 1586188.1 (Hive) and 1586210.1 (HDFS) for example code).

Hive, though slow for ad-hoc query access against Hadoop data, has a range of Serializer/Deserializer utilities that can translate semi-structured and NoSQL sources into more regular columns and tables, with Cloudera Impala removing the response-time issue and rapidly matching Hive in terms of plugins and SerDes. ODI, as I demonstrated in a five-part series on the blog last week, has a number of Hadoop-native knowledge modules and also leverages Oracle Loader for Hadoop to move data out of the raw data reservoir and into the foundation and access + performance layers.

The important thing to understand though when working with data in the updated Information Management architecture, and reporting against schema-on-read and schema-on-write sources, is that the most effective way to deliver real value for end-users is to have two phases to your development work;

NewImage

  • A discovery phase, typically done in the Discovery Lab sandbox or Rapid Development Sandboxes, that focuses on understanding the data and helping the users see where the value is, with less emphasis on governance and corporate standards, and
  • A commercial exploitation phase, when our understanding of the data has significantly increased and we start to add structure to the data, and where it’s an appropriate time to add governance and address data quality issues. Typically, this would be done in the Execution part of our architecture, taking our prototype design and adding it to the data factory as one of its interface routines.

We’ll be developing these ideas out further over the summer, with some white papers and a joint presentation at Openworld 2014 to formally launch it. In the meantime, keep an eye on the blog as we work through some of the key concepts.

 

Categories: BI & Warehousing

Oracle Tours Africa and the Middle East

Look Smarter Than You Are - Sun, 2014-06-15 10:33
Happy Father's Day, everyone!  I got up early this morning to write about my recent experience traveling the world on Oracle's behalf.  I got to attend the first annual Oracle Technology Network tour of Africa and the Middle East.  It made 2 stops in North Africa (both in Tunisia), 2 stops in Saudi Arabia, and the final stop was in Dubai, UAE.

Tariq Farooq first mentioned the idea of doing a MENA (Middle East & North Africa) tour to me in Beijing last fall.  He asked if I'd be willing to travel half-way around the world to speak to people in English that primarily spoke French and Arabic, and I - of course - said "yes."  Here's Tariq being interviewed by Lillian Buziak at Collaborate 2014 (audio is a bit difficult to hear):


I had two reasons for wanting to go: I do love educating/evangelizing for Oracle EPM, BI, and Business Analytics.  The possibility of reaching new audiences for the first time was exciting. My other reason for going was that I wanted to experience totally different cultures than I ever have before.  I've spoken on 5 continents (now 6 after this tour and I'm anxiously awaiting the OTN Tour to Antarctica) before and have seen presented everywhere from a women's college in Mumbai that was 95F with no air conditioning in the presentation room to a ballroom in the Philippines that had 3 simultaneous English sessions going on (in one room!) all happily observed by smiling Filipinos.  From China to India to Australia to Germany, I have seen some amazing slices of life, but nothing prepared me for the differences I saw on this tour.

In each of the sections below, I have linked the header to a blog from my new best German friend, Bjoern Rost.  He blogged after every stop and unlike me, he actually understood all the Oracle RDBMS sessions on the tour.  Visit http://portrix-systems.de/blog/author/brost/ to see his entertaining blog posts.  (Warning: though I think Bjoern is hilarious, being German, you may find his posts to be 'not funny.'  German humor is an acquired taste.)

I left for the first stop, Tunisia, on Memorial Day (in the USA), May 26, 2014...




Tunisia, May 27To get to Tunisia in time for my session, I left Dallas. Texas at 10AM on Monday, flew to JFK (New York City), flew to Rome, flew to Tunis, and had a nice car waiting for me at the airport compliments of our hosts in Tunisia.  I landed at 10:30AM on Tuesday and considering I was flying to Africa, I felt that the trip went by quickly.  I made it through customs in Tunis in about 15 minutes, walked out the front door of the hotel, and was in an entirely different world.

Speaking in North America, South America, Europe, Asia, and Australia had done absolutely nothing to prepare me for Africa.  The closest thing I could compare it to in my life (but the comparison does not do it justice) was the cities of India: chaotic, dirty, cramped, foreign, and chaotic (worth mentioning again).  Now take all that, remove the cows, and make it Muslim.

My host picked me up in a nice car and we began the hour drive to Beja where the conference was being held.  We passed mounds of trash piled up in the center of the roads though thankfully the heat (high was ~75F when I arrived) didn't make the place smell horribly.  Traffic laws seemed to be non-existent, but it moved fairly quickly being in the middle of the day.  I loved looking out the window at the shops along the road and carts selling watermelon approximately every 100 feet (I was told by my host that it was watermelon season).

We left the city about 20 minutes after I got in the car and I was suddenly in Tuscany.  At least, it looked like Tuscany: fields of amber waves of grass, wide open spaces, wildflowers, lakes, olive groves, country life, gorgeous hills... truly one of the most beautiful countrysides I've seen in my entire life.  Since I had been traveling for over a day, the pleasant scenery soon lulled me to sleep.


My driver woke me up when we got to the technical university in Beja. I walked in to find Tariq trying to explain Oracle Enterprise Manager to a bunch of college students who didn't seem to understand databases let alone Oracle.  I tried to hide in one of the back seats, but Tariq immediately called me up on stage to answer a question about how to develop optimal databases.

Not long after I got there, we broke for lunch.  Our hosts took us to a traditional Tunisian restaurant in downtown Beja.  They were kind enough to make me vegetarian food.  Lunch is apparently a sacred event not-to-be-hurried in Tunisia, so we made it back to the university over 2 hours after we left, fully satiated.

I volunteered to give a session introducing Big Data and Analytics to the college kids, because I felt that it required little technical background.  It seemed to go over well.  My favorite part was when I told a joke about the differences in social media sites and only 10 people laughed.  The people on either side of those 10 then asked them to repeat in Arabic and French what I had said, which caused those people to laugh.  They then shared it around the room and it was like a disease vector of laughter that took 2 minutes to make it around to the 150+ students in the room.  In case you haven't seen it, here's essentially what I said out-loud:
After the sessions were over, I asked one of the professors why everyone listened so intently if they had no background in Oracle.  I mentally wondered if it was because I was an awesome presenter bringing the gospel of Oracle to the future of Africa.  I was told "they didn't understand a lot of what you were saying, but they love listening to people speak English."  So much for my future African disciples.
Our hosts offered to drive us to Dougga, the so-called "best preserved Roman ruins outside of Italy."  It sounded like an exaggeration, but it was actually an understatement. Dougga was once a "small" Roman town on the fringes of the empire... and the miles of town are for the most part still there.  We arrived shortly after they closed the gates for the day.  Our hosts got out of the front car in our 4-car caravan to talk to the guards.  I was in the last car and saw an interesting polite dialog when our hosts started pointing to my car.  I waved back.  The guard smiled and raised the gates for our caravan to enter.  I wondered if bribing had occurred, so I asked how we got in after hours.  Our gracious hosts explained that I was renowned historian, Edward Roske, a visiting professor from the United States of America whose sole purpose for being in Tunisia was to see the Dougga ruins.  They said I should take lots of pictures and walk around looking officially important.  I discovered later that they weren't kidding: they really did tell this to the guard, so I took at least 50 pictures and took some very official selfies to help sell my renowned historian status.




The ruins were truly majestic.  Every time I came around a bend, there was another temple, theatre, circus, road, market, tunnel, column, arch, statue, or something else 2,000 years old to be seen.  It is all in a semi-wild state with no borders separating the ruins from the countryside.  There were even wildflowers growing in the central square:
My favorite moment of the entire trip occurred when I broke away from the rest of our group to go explore some arches on the edge of the ruins.  I went to take a picture of one of the doorways, and I got photobombed:
I went through the doorway to discover a local sheepherder grazing his sheep right in the ruins:

They started on the edge of the ruins but eventually the herder marched his sheep right down the center of the 2,000 year old road leading through Dougga.  Tariq decided to join their herd:
I can't stress enough how amazing this site is.  I would encourage people to visit Tunisia if for no other reason than to see Dougga and Carthage.  You have never felt Roman society like you can wandering around the ancient town with only sheep to keep you company.

Eventually, the guards at the entrance (the only guards in the place, so far as we could tell) came to find our renowned historian group because they wanted to go home.  We stayed the night in the Golden Tulip Hotel in Carthage which was a 4-star hotel for under $200 USD per night.  I recommend it to anyone.  The next morning, the OTN MENA 5 (Tariq, Mike Ault, Bjoern, me, and Jim Czuprynski) headed for the flight to Cairo then on the Riyadh.


Saudi Arabia, May 29-31While the Islam is a part of the culture in Tunisia, Islam is the culture in Saudi Arabia.  I have never seen a country more dominated by a single religion than Saudi Arabia.  It is one of the most difficult places in the world to get a visa (they suspended tourist visas 5 years ago) and it's even harder for a non-Muslim like me.  I spent 4 hours clearing customs which gave me a lot of time to study up on what I was in for.

My guidebook (and several websites) told me about all the things I wasn't allowed to do, say, or maybe even think when I got to Saudi.  Here's what I was told versus what actually happened:
- Muslims everywhere.  Yes, 100% true.  They have calls to prayers everywhere and we had to stop presenting when it was time for prayer.  The whole city stops, for that matter, when it's prayer time.  I was lucky enough to be in a public park for evening prayer one day.  The sounds of the call to prayers across the city were beautiful.

- Traffic fatalities.  90% true.  Saudi apparently has the highest incidence of traffic fatalities in the world because traffic laws are more like traffic vague suggestions only to be followed if everyone has plenty of time and sort of feels like it.  I was prepared to almost die every time I got in a car, and while I saw no deaths, I saw multiple car accidents of the fender bender type each day I was in Saudi.  At one point, we were stopped at a red-light to make a left turn.  With traffic coming from both directions in front of us, a car behind us who wanted to make a left-hand turn felt that our stopping was delaying his day.  He didn't honk or behave rudely in any way: he just drove around us and made the left-turn on the red into oncoming traffic.  No one seemed annoyed at the man for doing it.
- Pornography.  100% nonexistent in Saudi.  They even go through the magazines in the shops and black out with a sharpie anything that's considered too revealing (shoulders, waists, knees, etc.).  They also sharply monitor the web and block out any site deemed inappropriate (including Bing.com with safe search set to anything but strict).
- Pictures of other people.  0% true.  I was told before I went that Muslims do not believe in having images taken of people.  What I actually found was the most selfie-absorbed culture I've ever seen, and I live in America with a teenage child.  I couldn't walk 10 feet at the Riyadh or Jeddah events without someone taking a picture.  I was also told that you couldn't take pictures of public buildings or in public buildings (like the national museum).  Totally untrue: people were taking pictures of just about anything except women.
- Women being covered.  100% true.  All the women wore black abbeyahs at all times.  You're not allowed to film them or talk to them.  That said, I didn't see that many.  Both the events in Riyadh and Jeddah were male-only.  We did see women in the public places particularly near retail outlets and in the city parks & museums.
- Men wearing suits.  25% true.  For the most part, the men wore traditional dress shirts and head coverings.  I wore a suit (no tie) which considering it was 110F+, was quite a sacrifice.  My fellow presenters wore ties in addition to their suits which proves they're more willing to sacrifice for the cause of Oracle.
- No alcohol. 100% true but weird.  The first night I got to Riyadh, I opened my minibar to find... a Budweiser.  Closer examination revealed it was a "Budweiser NA" signifying no alcohol.  Every restaurant we went to offered us "Saudi champagne" or "Saudi wine" which apparently means alcohol-taste without any actual alcohol.  Since I hate the taste of alcohol, I didn't think non-alcoholic alcohol would taste any better, so I avoided it.
- No narcotics.  100% true so far as I was willing to test it.  I actually panicked during customs at the thought that maybe I had some prescription drug in my laptop bag that had a narcotic in it.  Narcotics are a capital crime in Saudi, and I spent most of my 4 hours wondering if Ambien counts as a narcotic.  Luckily for me, I wasn't executed for sleep aid smuggling.

After the worst customs experience of my entire life at the airport in Riyadh, I got to the Marriott hotel for about 4 hours sleep before the day was to begin.  I got to go up to the concierge floor for an elaborate breakfast and a great view of the Riyadh skyline.

The Riyadh venue was spectacular.  The hosts (eSolutions) put on one of the best events I've ever attended on an OTN tour.  From the venue to the signage to the elaborate Lebanese food lunch to the photographer to the videographer to the speaker gifts, it was all top-notch.  Like in Beja, I presented on "Taming Big Data with Analytics" to an enthusiastic audience.  There were about 75 (all male) people there including several male children of the attendees.  They were some of the best behaved children I have ever seen even though some were elementary school age.
I got introduced at one point as "Edward Roske from the United States, a rich man who did not inherit his business from his father."  That was unique and though it left me speechless, the audience seemed very impressed at my ability to become head of a business without my father having to die first.

After the conference, our hosts took us to the National Museum.  Like much of Riyadh, it looks like it was built in the last 5 years, and in the case of a museum, this worked very well.  Since they built it all at once (versus many amazing museums around the world that were created over hundreds of years), it was able to tell a complete story from the beginning of the universe up through today (as opposed to just having rooms of collections).  It was a very Muslim-centric view of history, but I found the educational aspects fascinating.  The sign on the way in said no photography, but since the massive museum apparently only had 4 people working in it (all at the front desk), everyone ignored it.  I found a cube in the first hall that seemed Essbase-like.

My favorite exhibit was a scale model of Mecca and Medina as viewed at night.  Being non-Muslim, I will never see Mecca, so this is as close as I will ever get:
After the museum, our hosts took us to Kingdom Centre, the tallest building in Saudi Arabia (for now).  It has 30+ open stories at the top with a skybridge connecting them.  Here's a view from the ground of the building (notice the necklace like architecture with the bridge on top) and a view from the building of the ground:

Our hosts took us to dinner (which was unfortunately a meat-on-a-stick restaurant, unfortunately for me since I'm a vegetarian).  Since I wasn't able to eat much, I got to talk at length to the CEO of eSolutions.  He was a fascinating man who told me all about how Saudi Arabia and the Gulf Coast Community is ready for analytics and reporting.  He was a former Pakistani military man, but his love of country and family shined through in everything he said.

The following morning, we journeyed to Jeddah over on the Red Sea coast of Saudi near Mecca.  The weather was like Riyadh except with crazy humidity.  It reminded me of Houston on the hottest day of summer.  We were in Jeddah for less than 24 hours so I don't admittedly have a ton to say about it.  I saw nothing more than the airport, the drive to/from the hotel, the venue where we presented, and a fast casual restaurant (more tasty Lebanese food!) where we had dinner.

The venue was small since it was a half-day event and lightly attended.  I did ask if I could present on a different topic since I had gotten a bit bored talking Big Data all the time.  This time, I spoke on "In-Memory Databases" which is a hot topic these days thanks to the SAP guys saying "Hana" at least once every sentence.  After the morning event, the OTN MENA 5-1 (Bjoern went straight to Dubai since there weren't enough speaker slots in Jeddah) headed for the airport for the flight to the United Arab Emirates.  Getting out of Saudi took 2 minutes at Customs which goes to show, I guess, that they're a lot happier to get rid of you than let you in.

Dubai UAE, June 1I honestly don't know where to begin with Dubai.  It is truly one of the most amazing cities on Earth and almost indescribable to anyone who hasn't actually witnessed it.  My flight landed on Saturday evening in the largest airport I think I've ever seen.  Clearing customs took minutes then I headed for the cleanest (and probably newest since it just opened in 2009) train system in the world.  The view from the Dubai Metro was stunning.  If Riyadh looks like it was built in the last 10 years, Dubai looks like it was built in the last 10 minutes.  It reminds me of New York City if they took out all the advertising, 90% of the people, and any building under 1,000 feet tall.


I was lucky enough to stay at a hotel in the Burj Khalifa, the tallest building in the world (by far: the Burj is over 200 stories tall at over 2,500 feet).  The Burj sets all kinds of "tallest" records including the tallest outdoor observation deck.  Here's a view of several buildings below that are all over 1,000 feet tall (with an Edward in it for perspective).

My room had a view of the Dubai Fountains.  No offense, Vegas, but these fountains put the Bellagio fountains to shame. They're more agile, faster, better lit, larger, and frankly, classier. And if you're staying in the hotel in the Burj Khalifa, you can listen to the fountain music through the TV:

At the base of the Burj Khalifa is the Dubai Mall, the largest mall in the world (you hear a lot in Dubai of "that's the largest _____ in the world").  It's 3-4 times the size of Mall of America if that puts it in perspective.  It has over 1,200 shops, over 150 restaurants, and some of the strangest (yet up-scale) stores you've ever seen.  There was one selling full-size metal camels, for instance (these were the only camels I saw on my trip).  I ate at the Dubai Mall every day I was in Dubai and felt I could eat there every day for a year without repeating an entree.  I am normally not a fan of malls, but I loved the Dubai Mall.  It wasn't crowded and the people that were there kept to themselves.  This could actually be said about everywhere in the Middle East: the people are nice, but they keep to themselves.  For an introvert like me, it's heaven.  You can be alone in a crowd.


The first full day I was there, I presented in the morning, burned my feet on the sands of a beautiful beach at lunch, and was skiing in the afternoon.  Yes, skiing.  The Mall of the Emirates (another mall that dwarfs anything we have in the USA) is not only massive, it has its own indoor ski resort inside.  It's not a tiny ski hill either: it's 1,200 feet long with three separate runs (a blue/intermediate run, a green/beginner run, and a small terrain park).  They also have lots of other fun activities to do including a penguin exhibit, ziplining over the ski hill, sledding, and large transparent hamster balls to roll down the hill in.

For around $50 USD, you get a lift ticket, ski pants, ski jacket, boots, skis, and poles for 2 hours.  A full-day pass is only around $15 USD more but I opted for 2 hours.  (You can only ski indoors in Dubai so much, obviously.)  There were at most 15 people skiing, but there were hundreds of people from the Middle East paying their $50 to ride up and down the ski lift basking in the glory of being cold.  The whole place is chilled to around 23F and the snow is glorious: it's soft and velvety because they actually make it snow indoors every night (unlike the ice blowers we use for man-made "snow" in the USA).  I skied for my full 2 hours without a break and I loved that I was so cold by the end that I had to go back into the mall and get a soy hot chocolate at Starbucks.

I had too many amazing experience to recount (and this is already seeming a bit like an advertisement for Dubai) but the most fundamentally changing experience of my trip was a visit to a mosque in Dubai.  Hosted by an eloquent British Muslim woman in a black abbeyah, she spent an hour educating a group of Westerners all about Islam.  I was taken aback by how... peaceful their religion is.  She covered the 5 pillars of Islam in a way that made me understand 1,000,000,000+ Muslims far better than I ever have.  I think that if everyone in the Western world could attend that one hour I did, we would have a level of cultural understanding that would ease a ton of our current fears.  They let us take lots of pictures and even let us video them doing 5 minutes of prayers.  It was moving and if you are ever in Dubai, make sure you visit the Jumeirah Mosque during one of their visitation hours (it's free unlike most everything else in Dubai).

My final presentation was in Dubai and it was a bittersweet end to a whirlwind week.  Every single day was spent traveling, speaking, or both, so it was nice to finally have a break.  That said, I will consider the other members of the OTN MENA 5 to be friends for life and I miss them already.  As I finished my presentation on Big Data (for the third time on the trip), I looked out at the anxious faces in the audience and realized that I would miss the Muslim world far more than I ever expected to.  

If there's ever a 2nd OTN tour of the Middle East, Africa, or both, sign me up.  Until then, thank you for letting me be a part of the most inspirational, educational tour I've ever experienced.

Categories: BI & Warehousing

Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture Pt1. – Information Architecture and the “Data Factory”

Rittman Mead Consulting - Fri, 2014-06-13 05:39

One of the things at Rittman Mead that we’re really interested in, is the architecture of “information management” systems and how these change over time as thinking, and product capabilities, evolve. In fact we often collaborate with the Enterprise Architecture team within Oracle, giving input into the architecture designs they come up with, and more recently working on a full-blown collaboration with them to come up with a next-generation Information Management architecture. I these two posts I wanted to share some of our recent thinking in this area, looking first at our new proposed architecture, and then in the second post talking about how we’d use agile development methods, in-particular our “ExtremeBI” development approach, to deliver it.

But first, some history. Back in 2009 I blogged about a first-generation DW reference architecture which introduced a couple of new concepts, based on new capabilities from tools such as OBIEE plus some thinking we, and the Enterprise Architecture team at Oracle, had been doing over the years. This reference architecture introduced the concept of “Foundation” and “Access and Performance” layers, and recognised the reality that Kimball-type star schemas were great for querying but not so good for long-term, query-neutral storage of data, whilst Inmon-style EDW models were great as a long-term, process-neutral system of record, but not so good for running user queries on. This new architecture included both of these design approaches, with the foundation layer forming the “information management” layer and the access and performance layer being the “information access” layer. Most importantly, tools like OBIEE made it possible for enterprises to create metadata layers that potentially accessed all layers in this model, so users could query the foundation layer if needed as well as the access and performance layer, if the foundation layer was a better source of data for a particular reports.

NewImage

A second revision to this model, a couple of years later, expanded on the original one and introduced another two interesting concepts, brought upon by the introduction of tools like Endeca Information Discovery, and the rise of unstructured and semi-structured data sources. This new architecture added unstructured and semi-structured sources into the model, and also introduced the concept of “sandboxes”, areas of the information management model that allowed more free-form, exploratory BI applications to be built.

NewImage

But in-practice, this idea of “unstructured” and “semi-structured” sources wasn’t all that helpful. What really started to make an impact in the past couple of years is the increasing use of “schema-on-read” databases, where we trade-off the performance and defined structure of traditional relational 3NF and star schemas for the flexibility and “time-to-value” provided by key-value store databases. The Endeca Server is a good example of these types of database, where the Endeca Server allows rapid storage of loosely-associated datasets and tools like Endeca Studio then apply a structure to the data, at the time of analysis. Schema-on-read databases are great for fast, flexible access to datasets, but the cost of ETL is then borne by each system that accesses the data.

Probably the most well-known examples of schema-on-read sources though are Hadoop, and NoSQL databases. Coupled with their ability to store lots of detail-level data at relatively low cost, Hadoop and NoSQL databases have significantly affected the overall landscape for BI, data warehousing and business analytics, and we thought it was about time for a new reference architecture that fully-incorporated the capabilities and latest thinking around this area. Back at the start of 2014 myself, Jon Mead and Stewart Bryson met up with Oracle’s Andrew Bond in his team for a series of workshops, and what came out of it was an updated Information Management Architecture *and* a development methodology for delivering it. Let’s start off then by looking at this updated architecture from a conceptual view.

NewImage

At a conceptual level, we build on this idea of sandbox environment and formally separate things out into the Execution area – business-as-usual, production and development areas – and an Innovation area, where we build on the idea of a sandbox and rename it the “Discovery lab”. The Discovery lab is where, for want of a better word, the “data scientists” work, with fewer constraints on development and whose inputs are events and data, and outputs are the discovery output that can be the prototype and inspiration for designs going into the execution area.

The main “engine” of the Execution area is our enterprise store of data, this time broken down into four areas:

  • A “data reservoir” where we store all incoming events and data at detail-level, typically on HDFS. This blog article by Oracle’s Jean-Pierre Dijcks sets out the concept of a data reservoir well, and I like this blog by Scaleabilities’ Jeff Needham where he makes the case for calling it a “data reservoir” that can ingest, process and analyse data rather than a “data lake”, which implies a passive store.
  • An Enterprise Data Store, analogous to the enterprise data warehouses we use today, and a reporting component, typically in our case OBIEE
  • Most importantly, the new concept of a “data factory”, a conduit between the data reservoir and the enterprise information store

Together, the execution and innovation layers form our “information platform”, with the event engine feeding real-time events into the platform and outputting them into the data reservoir, and traditional ETL routines loading structured data from the enterprise into the enterprise information store.

This conceptual architecture then permits several types of information application. For example, the data reservoir and the data factory together could support what we call “data applications”, applications working on semi-structured, large and low-granularity data sets such as those used for genomic analysis.

NewImage

Other applications might be more traditional BI and data warehousing applications, but with the addition of data from the data reservoir and the analysis capabilities of Hadoop.

NewImage

The discovery lab can be a standalone area, or the insights and discovery it outputs can be used as inputs into the main information platform. More event-based data will typically come in via the event engine, with its output going into the data reservoir and supporting “next-best-decision” applications like Oracle Real-Time Decisions.

Another way of looking at this architecture is from a logical perspective, in particular focusing on the data layers and access/loading processes to load them. The diagram below is our latest version of the two diagrams at the start of this article, and as you can see we’ve kept the data sources and BI element much the same, and kept the concept of the sandbox, in this case refined as the “discovery lab sandbox”.

NewImage

What is different this time though is the middle bit; we’ve lost the staging area and replaced it with the raw data reservoir, added a “Rapid Development Sandbox”, and drawn the main layers as a slanted set of stacked areas. So why?

What we’re trying to show with the slanted data layers is the relative cost of data ingestion (loading), and the relative cost of accessing it (information interpretation). For the raw data reservoir, for example, there’s little cost in ingesting the data – maybe copy some files to HDFS, or use Flume or GoldenGate to capture log or transaction data to HDFS or Hive, but the cost is then borne in accession this typically “schema-on-read” data source. As you go up the stack, there’s a bit more work in landing data into the Foundation layer – ETL routines, GoldenGate routines, some data cleaning and constraint checking, for example – but it’s correspondingly easier to get data out. For the Access and Performance Layer there’s the most cost in getting data in, but then users have very little work to do when getting data out.

Data can move up the stack from Raw Data Reservoir to Foundation, or directly into Access and Performance, or it could be landed at levels above Raw Data Reservoir, for example in our ExtremeBI approach where we use GoldenGate to replicate source system tables directly into Foundation without going through a staging layer. The Rapid Development Sandboxes are there to support agile, iterative development, with the output from them either being the result in itself, or their designs and insights being used to create more formal projects and data structures.

From a more product-centric perspective, you can overlay these types of diagrams with specific schematics for example enterprises. For example, in the diagram below you can see Oracle NoSQL database being see with HDFS and the Oracle Big Data Connectors to capture and store events from Complex Event Processing, and then outputs from CEP being also fed into a more traditional, “high density data” store as well as directly into a decision engine.

NewImage

So this all sounds great, but how do you build it? Do we have to use the (discredited) step-by-step, waterfall method to build this type of architecture, and in particular the key “data factory” element that provides connectivity between the Raw Data Reservoir and the Enterprise Information Store? And can we apply agile methods to big data sources, as well as regular databases and applications? Check back on Monday for our thoughts on how this should be done.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.5 : Bulk Unload to Oracle

Rittman Mead Consulting - Fri, 2014-06-13 04:00

All week I’ve been looking at what’s involved in moving data around Hadoop on the Oracle Big Data Appliance, using ODI12c to orchestrate the end-to-end process. Using web log data from the Rittman Mead website, I first landed the log data on HDFS using Apache Flume, then transformed the data in several stages using Hive transformations built using ODI mappings and knowledge modules. The links below give the previous articles in this series, where you can read how I transformed data using Hive, Sqoop and Python streaming:

At the end of all these ODI12c transformations, I’m left with a Hive table containing summarised web page accesses, enhanced with reference data on the page title and author, and with the entries geocoded with the name of the country associated with the page access. In many cases, this is where I’d leave the data, but quite often it’d be useful to then export the data out of Hadoop and into an Oracle database, so that I could load it into a data warehouse, or analyse it using a richer SQL dialect than I can get with HiveQL. I could use Sqoop to get the data out of Hive and into Oracle, but for larger exports in particular I’d get better unload performance by using Oracle Loader for Hadoop, one of the Oracle Big Data Connectors that typically come with Oracle Big Data Appliance.

There’s actually two Oracle Big Data Connectors that deal with getting data out of Hadoop and into Oracle; Oracle Direct Connector for HDFS (ODCH) gives you the ability to define an Oracle Database External Table over a HDFS file or Hive table, and is useful if you just want to access the data in Hadoop without actually loading it into an Oracle Database. 

NewImage

The way ODCH works is that it adds a special type of adapter to the external table feature in Oracle Database, that gives the database the ability to stream HDFS files in a similar way to how external tables stream regular file data into the database. This makes ODCH very convenient if you just want access to HDFS or Hive data from an Oracle database, but it suffers from the same performance penalty as regular external flat file access, in that you can’t index the table and so forth; so if you want high-performance access to data held in a Hadoop system, and you need this access more than just occasionally, you’re better off copying the data right into the Oracle database, and this is where Oracle Loader for Hadoop comes in.

Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that uses MapReduce technology to sort, partition and organise the data extraction at the Hadoop end, and then fast-unload the data into the Oracle Database typically much faster than a Sqoop data transfer.

NewImage

OLH is, however, a bit fiddly to setup, so what’s useful is that ODI has a knowledge module, IKM File/Hive to Oracle, that can set up both OLH and ODCH jobs. This knowledge module does have its own setup process that itself is a bit involved – mostly around environment variables on your Linux system – but once its done, each OLH data transfer is then very easy to setup, and as long as you don’t then go on to do any more transformation on the Oracle data post-unload, it’s covered by the ODI limited-use license you get when you buy it in conjunction with the BDA.

So the key things to get OLH working with ODI are firstly, to install OLH on the BDA node that’s running the ODI12c agent (or in my case, full ODI Studio), and then set up the Oracle user’s environment variables so OLH works, and ODI knows where to find any additional Hadoop / Hive libraries and configuration files to work with OLH. In my case, this meant adding the following entries to my .bashrc file:

export HIVE_HOME=/usr/lib/hive
export HADOOP_CLASSPATH=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/*:/etc/hive/conf:$HIVE_HOME/lib/hive-metastore-0.12.0-cdh5.0.1.jar:$HIVE_HOME/lib/libthrift.jar:$HIVE_HOME/lib/libfb303-0.9.0.jar:$HIVE_HOME/lib/hive-common-0.12.0-cdh5.0.1.jar:$HIVE_HOME/lib/hive-exec-0.12.0-cdh5.0.1.jar
export OLH_HOME=/home/oracle/oracle/product/oraloader-3.0.0-h2
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/java/jdk1.7.0_60
export ODI_HIVE_SESSION_JARS=/usr/lib/hive/lib/hive-contrib.jar
export ODI_OLH_JARS=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/ojdbc6.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-utility.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-mapping.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-collation.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/oraclepki.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/osdt_cert.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/osdt_core.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/commons-math-2.2.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/jackson-core-asl-1.8.8.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/jackson-mapper-asl-1.8.8.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/avro-1.7.3.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/avro-mapred-1.7.3-hadoop2.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/oraloader.jar,/usr/lib/hive/lib/hive-metastore.jar,/usr/lib/hive/lib/libthrift-0.9.0.cloudera.2.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/hive-common-0.12.0-cdh5.0.1.jar,/usr/lib/hive/lib/hive-exec.jar

where OLH was installed to /home/oracle/oracle/product/oraloader-3.0.0-h2, Hadoop was installed to /usr/lib/hadoop and Hive to /usr/lib/hive. ODI_HIVE_SESSION_JARS is optional, but if you do anything fancy with SerDes in your mappings you’re best adding this entry in, to the hive-contrib.jar file (or technically the symbolic link to it). Getting this right was the hardest part about setting this KM up, but once it’s done, it’s done.

Then, to set the ODI parts up, I first create a table in my Oracle 11.2.0.3 database with matching datatypes for the Hive table i’m going to unload; varchar2 for string, number for int, etc, and then create entries in the ODI topology and Designer navigator. Then, I create a new mapping in ODI12c, where I copy the columns from my Hive table into the Oracle table, like this (note that I could map from an HDFS file instead, if I wanted to):

NewImage

Now it’s a case of configuring the Physical part of the mapping. There’s two steps to this; first, you have to change the default LKM setting for the access table in the physical mapping, the ACCESS_F object in the TARGET_GROUP below:

NewImage

By default this sets itself to LKM SQL to Oracle (as you’ve got an Oracle target), but we need to change this to a special one, LKM SQL Multi-Connect (below), that delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect IKM such as IKM File/Hive to Oracle. 

NewImage

 

Then, and only once you’ve selected this LKM, you can select IKM File/Hive to Oracle for the target table KM.

NewImage

There’s not much you need to enable with this KM to at least get started. Set OLH_OUTPUT_MODE to JDBC (you can enable OCI later on instead, for potentially faster unloading) and set MAPRED_OUTPUT_BASE_DIR to an HDFS directory that the agent’s OS user (typically “oracle”) can read and write to – in my case, I created one specifically for this KM, using Hue’s File System browser tool. Then, if all’s set up correctly you should be able to execute the mapping and view the results in the Operator navigator – note that in my case, it took several attempts before I finally got all the environment variable settings right. Once done, you should then be able to view the Hive table output in your Oracle table, using SQL Developer.

NewImage

So now, we’ve got each step for our end-to-end process. The only thing left to do is to pull all of these steps, including the ODI Procedure to run the sqoop transfer, into an ODI package or load plan, so that we can run the steps in sequence:

NewImage

Then, finally, I can run the whole process as a single execution, moving data from Flume capture through Hadoop ETL steps, to the unload into an Oracle table, all from within ODI12c.

NewImage

So there you have it – Hadoop-native ETL processing, with the metadata-driven design, error handling and heterogenous connectivity provided by ODI12c.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.4 : Transforming Data using Python & Hive Streaming

Rittman Mead Consulting - Thu, 2014-06-12 03:30

This week I’m taking an in-depth look at ETL on the Oracle Big Data Appliance, using Oracle Data Integrator 12c to call the various bits of Hadoop functionality and orchestrate the whole process. So far, I’ve landed web log data into the Hadoop cluster using Flume, created a Hive table over the log data using Hive and the RegEx SerDe, then used further Hive transformations to join this data to other reference data, some of which came from an Oracle database via Sqoop. Here’s a complete listing of the posts so far, and the ones to come:

Our next transformation is a bit trickier though. The data we’re bringing in from our webserver has IP addresses recorded for each HTTP access, and what we want to do is use that IP address to identify the country that the website visitor was located in. Geocoding, as it’s called, is fairly easy to do if your data is stored in a database such as Oracle Database or you’re using a language like Python, with two steps needed to retrieve the IP address’s country:

1. First convert the IP address into a single integer, like this:

address = '174.36.207.186'
( o1, o2, o3, o4 ) = address.split('.')
integer_ip =   ( 16777216 * o1 )
             + (    65536 * o2 )
             + (      256 * o3 ) 
           +              o4

2. Then, using a freely-downloadable database from MaxMind, look to see which range of numbers your converted IP address is in:

SELECT ip_country
FROM geoip
WHERE 2921648058 BETWEEN begin_ip_num AND end_ip_num
LIMIT 1

The only problem is that Hive doesn’t support anything other than equi-joins. In a previous blog post I got around this issue by using Impala rather than Hive, and in another one I used Pig, and a custom Python UDF, to do the geocoding outside of the main Pig data flow. I could use Pig and a Python UDF in this situation, creating an ODI Procedure to call Pig from the command-line, but instead I’ll use another ODI KM, IKM Hive Transform, to call a Python script but entirely controlled from within ODI.

To set this geocoding process up, I first create a new ODI mapping, with my current Hive table as the source, and another Hive table this time with an additional column for the IP address country, as the target. I then map what columns I can into the target, leaving just the country column without a mapping. Notice I’ve put the source table within an ODI12c dataset – this isn’t mandatory, it’s just something I decided to do at the time – it doesn’t impact on the rest of the process.

NewImage

Now in a similar way to how I created the Python UDF to do the country lookup for Pig, I’ll create a variation of that script that expects columns of data as an input, outputs other columns of data, and does the geocoding within the script by calling the Python GeoIP API provided by MaxMind. Most importantly though, I’ll use it in conjunction with IKM Hive Transform’s ability to stream it’s Hive data through an arbitrary script, making use of the Hive TRANSFORM function and a feature called “hive streaming”. To keep things simple let’s first set it up outside of the KM, see how it works, and then configure the KM to use this python script as its transformation function.

Let’s start by looking at the source Hive table I’ll be starting with:

hive> describe access_per_post_categories;
OK
hostname           string             None                
request_date       string             None                
post_id            string             None                
title              string             None                
author             string             None                
category           string             None                
Time taken: 0.112 seconds, Fetched: 6 row(s)
hive> select * from access_per_post_categories limit 5;
OK
137.254.4.12[02/Jun/2014:21:20:41 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
137.254.4.12[02/Jun/2014:21:20:42 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
137.254.4.12[02/Jun/2014:21:20:43 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
137.254.4.12[02/Jun/2014:21:20:45 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
137.254.4.12[02/Jun/2014:21:21:23 +0000]2093Hyperion Planning : Installing the Sample Planning Application in EPM 11.1Mark RittmanHyperion
Time taken: 0.976 seconds, Fetched: 5 row(s)

The target table is exactly the same, except for the additional column for country:

hive> describe access_per_post_full;
OK
hostname           string             None                
request_date       string             None                
post_id            string             None                
title              string             None                
author             string             None                
country            string             None                
category           string             None                
Time taken: 0.098 seconds, Fetched: 7 row(s)

The way IKM Hive Transform works is by streaming each incoming row of Hive data to the script registered with it, with each row’s columns tab-separated. In-turn, it expects the script to output similar rows of tab-separated values, each line terminated with a newline. My script, therefore, needs to accept an arbitrary number of incoming rows, parse each row into its constituent columns, call the MaxMind API to do the geocoding, and then output a tab-separated set of columns for each row it processes (a bit like pipelined table functions in Oracle PL/SQL).

So here’s the Python script that does this:

[root@bdanode1 tmp]# cat add_countries.py
#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')
for line in sys.stdin:
  line = line.rstrip()
  hostname,request_date,post_id,title,author,category = line.split('\t')
  country = gi.country_name_by_addr(hostname)
  print hostname+'\t'+request_date+'\t'+post_id+'\t'+title+'\t'+author+'\t'+country+'\t'+category

A few things to note about the script:

1. “Import sys” brings in the Python libraries that then allows me to read the input data, via sys.stdin.
2. I’ve previously installed the MaxMind GeoIP API, database and libraries, with the GeoIP.dat database stored alongside the script in /tmp. Import pygeoip brings in the Python library to use the API.
3. The bit that does the geocoding is “gi.country_name_by_addr”, which calls the MaxMind API. It’s this bit that replaces the need for a join with a BETWEEN clause in HiveQL.
4. To output data back to the calling Hive transformation, I just “print” it, using “\t” for tab.

One thing we’ll need to do though, similar to how we had to copy the JAR file used by Hive for the RegEx SerDe around all the nodes in the cluster, Is copy the GeoIP.dat file used by the MaxMind geocoding API to the same place on all the cluster nodes. We’ll also have to set the permissions on this file so it’s readable and writeable by the Hive/YARN process on each Hadoop node:

officeimac:~ markrittman$ ssh root@bdanode2.rittmandev.com
The authenticity of host 'bdanode2.rittmandev.com (192.168.2.231)' can't be established.
RSA key fingerprint is 1b:e7:ec:01:57:0c:09:4e:6d:19:08:a4:0d:df:00:e0.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'bdanode2.rittmandev.com' (RSA) to the list of known hosts.
root@bdanode2.rittmandev.com's password: 
Last login: Fri Jun  6 16:55:44 2014 from 192.168.2.201
[root@bdanode2 ~]# ls /tmp/Geo* -l
-rwxrwxrwx 1 oracle oracle 687502 Jun  7 12:31 /tmp/GeoIP.dat

I also need to make sure the MaxMind geocoding Python API is installed into each node too:

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py pip 
install pygeoip

Now I can run a HiveQL command and test out the script. I’ll start up a Hive shell session, and the first thing I need to do is register the script with Hive, distributing across the cluster and putting it in cache. Then I can select against the source table, using the TRANSFORM USING clause:

[oracle@bdanode1 ~]$ hive
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/06/09 20:48:00 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/06/09 20:48:01 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
 
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.12.0-cdh5.0.1.jar!/hive-log4j.properties
hive> add file x-marsedit-filelocal:///tmp/add_countries.py;                                
Added resource: x-marsedit-filelocal:///tmp/add_countries.py
hive> select transform (hostname,request_date,post_id,title,author,category)
    > using 'add_countries.py'                                              
    > as (hostname,request_date,post_id,title,author,category,country)      
    > from access_per_post_categories; 
...
63.73.199.69[02/Jun/2014:21:19:28 +0000] 14726 SmartView as the Replacement for BI Office with OBIEE 11.1.1.7 Mark Rittman United States Oracle EPM
77.125.81.239[02/Jun/2014:21:18:58 +0000] 14476 Upgrading OBIEE to 11.1.1.7 Robin Moffatt Israel Oracle BI Suite EE
77.125.81.239[02/Jun/2014:21:18:59 +0000] 14476 Upgrading OBIEE to 11.1.1.7 Robin Moffatt Israel Oracle BI Suite EE
Time taken: 27.645 seconds, Fetched: 176 row(s)

So it looks like it’s working. Let’s move back to ODI now and reference the script within the IKM Hive Transform KM settings. Note that I enter the name and filesystem location of the script in the KM settings but I don’t use the TRANSFORM_SCRIPT option to directly key the script contents into the KM; this should work, and if you do so the KM will write the script contents out to the agent’s host filesystem at the start of the process, but the KM in this ODI12c release has an issue parsing these keyed-in scripts, so you’re best leaving this setting empty and the KM will just use the file already in the filesystem.

NewImage

After running the mapping, I then go over to the Operator navigator and see that it’s run successfully. Looking at the code the KM generates, I can see it referencing our script in the HiveQL statement, with the script outputting the extra country column that’s then loaded into the target table along with the other columns.

NewImage

Then finally, checking the target table, I can see that each row in my final Hive table has the country name alongside each log file entry, along with the IP address and other details.

So – now we’re at the stage where we’ve finished the processing with the Hadoop cluster, and I’d like to copy the final set of data into an Oracle database, so that I can analyse it using Oracle SQL and other tools. In the final installment in this series we’ll do just that, using IKM File/Hive to Oracle (OLH/ODCH) and Oracle Loader for Hadoop, one of the Oracle Big Data Connectors.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.3 : Enhance with Oracle Reference Data via Sqoop, and CKMs

Rittman Mead Consulting - Wed, 2014-06-11 04:02

In the first two posts in this series, I used the software on the Oracle Big Data Appliance 3.0 to ingest web log data from the Rittman Mead blog server, parse and load that data into a Hive table, and then join that table to another to add details on post and author. Links to the post in this series are below, and I’ll complete them as the series is posted this week:

In this next step, I’ve got some data sitting in an Oracle database that I’d like to use to enhance the data in the Hadoop cluster; as a reminder, the diagram below shows how I’m looking to move data through the system, and we’re currently at step number three:

NewImage

The data in the Oracle tables contains categories that I can join to the page entries that have come through the ETL process, and the join is pretty simple: POST_ID = POST_ID; but how do I get access to the Oracle data – I need to register the Oracle source in the ODI topology and create a model to represent the table, but can I then just join that table to the Hive table and load the results into another Hive table? Let’s try.

The screenshot below shows the mapping that would use this approach, with two Oracle tables joining to the Hive table and another Hive table as the target; on the right-hand side of the screen is the join condition, across both sources:

NewImage

The Execution view shows how ODI intends to run the transformation, with LKM SQL to SQL KMs used to load the Oracle tables and stage their data in Hive, and then IKM Hive Control Append used to load the joined dataset into another Hive table.

NewImage

If you try and execute the mapping though, it fails – because the LKM SQL to SQL KM can’t work with Hive tables yet.

NewImage

In fact. the various IKMs that come with Oracle Data Integrator Application Adapter for Hadoop (IKM file to Hive, IKM Hive Control Append etc) are a bit of a mix of IKMs and LKMs in that they contain extraction code, and integration code in the same KM, and none of the regular LKMs and IKMs will otherwise work with Hadoop sources. In fact, what we need to do with ODI at this point is actually land the Oracle data in Hive first, then do the join, which then begs the second question – how do we do that?

Currently, the only way to get Oracle data in to Hadoop is via IKM File to Hive (LOAD DATA), which involves an unnecessary extra step of exporting the Oracle data to a file, and then loading that file into HDFS and Hive. What we can do though is use sqoop, a tool within Hadoop to extract from, and load into, relational databases, something I covered on the blog a few weeks ago. Sqoop creates data loading and unloading jobs that run in parallel on the Hadoop cluster, and can use native JDBC drivers or even additional plugins such as Oraoop to make the process run faster (though Oracle Loader for Hadoop is considered the fastest way to unload to Oracle, if you’ve licensed the Big Data Adapters).

The only problem is that there’s no official support for sqoop in ODI, so no KMs that make use of it. What you can do though is create a command-line script to run sqoop and include that in an ODI procedure, which is what I’ll now do to bring in my Oracle data into Hive. To do this, I create an ODI procedure and add a single task, using the Operating System (command-shell) technology type, and use it to tell Sqoop to create me a Hive table based on an SQL SELECT statement against my Oracle database:

NewImage

(note that I had to format the sqoop command, in practice, as one line, to get it to run – the above listing is so that you can see all of the code.)

Executing this procedure works OK, and thereafter I’ve got a single Hive table containing the joined dataset from Oracle.

NewImage

Joining this new Hive table to the previous one containing the distinct set of page views is then fairly straightforward, but something I’d also like to do is stop any entries going into the rest of the ETL process where the calling IP address is a test one we use, “63.73.199.69″. The way I can do this is to use the CKM Hive knowledge module and put a constraint on the hostname in the table I’ll be loading from the join, so that I can then use ODI’s flow control feature to divert those rows to an error table.

NewImage

I also need to define a primary key for this table, something that’s mandatory when flow control is used. So let’s put the mapping together, joining the table I just brought in from Sqoop with the latest version of the weblog entries Hive table, loading into the Hive table I’ve just enabled the constraint for:

NewImage

and then I enable flow control, and the CKM Hive check knowledge module, in the Physical mapping settings.

NewImage

This is of course one of the benefits of using ODI to do your Hadoop data loading – you’ve got access to the data quality and error handling features that come with the tool. Then, when I execute the mapping and check with the Operator navigator, I can see the error handling process running, and afterwards in Hue I can see the contents of the new error table, which now contains those log entries where my test IP address was used, removing them from the target Hive table where they’d ordinarily have gone.

NewImage

So that’s the third step in the ODI BDA ETL process complete. The next one’s a bit trickier though – I need to geocode the entries in the log table, assigning country names to each row based on where the IP address is located. More tomorrow.

Categories: BI & Warehousing

The BI Survey 14 – Have Your Voice Heard!

Rittman Mead Consulting - Wed, 2014-06-11 03:20

Long-term readers of this blog will know that we’ve supported for many years the BI Survey, an independent survey of BI tools customers and implementors. Rittman Mead have no (financial or other) interest in the BI Survey or its organisers, but we like the way it gathers in detailed data on which tools work best and when, and it’s been a useful set of data for companies such as Oracle when they prioritise their investment in tools such as OBIEE, Essbase and the BI Applications.

Here’s the invite text and link to the survey:

“We would like to invite you to participate in The BI Survey 14, the world’s largest annual survey of business intelligence (BI) users.

To take part in this year’s survey, visit: https://digiumenterprise.com/answer/?link=1906-PHB5RT7V

As a participant, you will:

  • Receive a summary of the results from the full survey
  • Be entered into a draw to win one of ten $50 Amazon vouchers
  • Ensure that your experiences are included in the final analyses

BARC’s annual survey gathers input from thousands of organizations to analyze their buying decisions, implementation cycles and the benefits they achieve from using BI software.

The BI Survey 14 is strictly vendor-independent: It is not sponsored by any vendor and the results are analyzed and published independently.

You will be asked to answer questions on your usage of a BI product from any vendor. Your answers will be used anonymously and your personal details will not be passed on to software vendors or other third parties.

Business and technical users, as well as vendors and consultants, are all encouraged to participate.

The BI Survey 14 should take about 20 minutes to complete. For further information, please contact Jevgeni Vitsenko at BARC (jvitsenko@barc.de). 

Click below to take part in The BI Survey 14: https://digiumenterprise.com/answer/?link=1906-PHB5RT7V

Categories: BI & Warehousing

Rittman Mead Featured in Oracle In-Memory Option Launch

Rittman Mead Consulting - Tue, 2014-06-10 17:02

Today saw the official launch of the Oracle Database In-Memory Option, with Larry Ellison going through the product features and then reading out some quotes and testimonials from beta testers. Rittman Mead were part of the beta testing program, with several of our team testing out various scenarios where we ETL’d into it, used it with OBIEE and worked out what would be involved in “in-memory-enabling” some of our customer’s BI systems.

In fact, as we said in our quote for the launch, enabling Oracle Database for in-memory analysis was almost “boringly simple” – just enable the option, choose your tables, drop any OLTP indexes and you’re ready to go.

NewImage

Of course, in practice you’ll need to think about which tables you’ll put into memory if RAM is limited, in some scenarios TimesTen might be a better option, and you’ll need to test your particular system and carefully consider whether you’ll keep particular indexes or materialised views, but we’re really excited about the In-Memory Option for Oracle Database as it’s got the potential to significantly improve query response times for users – and from what we’ve seen so far, it “just works”.

We’re still in the NDA period whilst beta testing goes on, but you can read more on the In-Memory Option on the Oracle website, and on the blog post I wrote when the feature was announced last Openworld. Once it goes GA look out for some in-depth articles on the blog around how it works, and details on how we’ll be able to help customers take advantage of this significant new Oracle Database feature.

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.2 : Hive Table Joins, Aggregation and Loading

Rittman Mead Consulting - Tue, 2014-06-10 02:49

In this series of posts, I’m going to be looking at an end-to-end ETL process on Oracle Big Data Appliance, using Hadoop technologies to do the data manipulation and Oracle Data Integrator 12c to orchestrate the process. Over the five posts, I’ll be landing web server log data on the Hadoop cluster using Flume, loading that data into a Hive table, transforming and enhancing the dataset, and then loading the final dataset into an Oracle database using one of the Oracle Big Data Connectors. The first post in the list below has a schematic for the overall process, and over the rest of the week I’ll be adding the links in for the remaining posts in the series.

So in today’s step, I want to take the initial Hive table containing the full set of log data, and then do three things; first extract the page details out of the “request” column in the incoming Hive table, then join that to some reference data also in the BDA that’ll allow me to add details of the post author and title of the post, and finally aggregate and project just certain columns from the the dataset so I’ve got a Hive table of just page accesses by IP address, with the author and title details.

In fact this is the easiest out of the five steps to set up and bring together, as we’re just using basic ODI functionality, albeit with Hive tables rather than regular database tables. In the screenshot below you can see the incoming weblog hive table, with the individual columns split-out in the previous step via the RegEx SerDe, and I just join it to the table containing post and author details, apply a distinct on the join output and then load the results into a third Hive table.

NewImage

The join between the two tables is bit complex, because I need to extract just the URL details out of the request field (which also contains the transport method and other metadata), but its no different that joining two regular database tables. Also, I did need to check the Generate ANSI Syntax checkbox, as Hive expects table joins to be in this format rather than the “Oracle” format.

NewImage

Going over to the Physical part of the mapping, I set the KM to IKM Hive Control Append, turn on the feature to enable table truncation prior to load, and I’m ready to go.

NewImage

Then, when I execute the mapping, I can see its progress in the Operator navigator, and the code view shows me the HiveQL commands that the KM has generated to move data around the BDA.

NewImage

So far so good, and as I said, this was the easy bit. For the next transformation I want to bring in some additional reference data from an Oracle database, so the question for me is whether I need to land that data into the BDA before doing the transformation, or whether I can create a join between the Oracle table and my Hive table? Check back tomorrow for the next installment…

Categories: BI & Warehousing

End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.1 : Flume to Initial Hive Table

Rittman Mead Consulting - Mon, 2014-06-09 07:18

A few months ago I posted an article on the ODI12c examples in the new Oracle Big Data Appliance, and over the past few weeks I’ve been writing about the various components within the Cloudera CDH Hadoop stack, including Hive, Pig, Spark and Flume. Since then I’ve built out a virtualized example of an Oracle Big Data Appliance using the Oracle Big Data 3.0 software set, and I thought it’d be interesting to create an example ETL flow through the system showing how ODI could be used to initiate and control the process. As with any situation where you actually build a demo to do something specific, as opposed to just play around with the technology, you end up uncovering a few quirks and techniques that you wouldn’t have otherwise been aware of, so I’ll spend this week going through the various steps and calling out things others in the same situation might find useful – the steps I’ll go through are below, and I’ll add the links as the articles get published during the week;

As an overview, what I’ll be working with is a six-node Hadoop cluster running Cloudera CDH5, Oracle Big Data Connectors 3.0 and Oracle Data Integrator 12c, as you’d get with the current version of Oracle Big Data Appliance. Obviously BDA goes up to eighteen nodes as a maximum, and under the covers there’s lots more memory and much better networking that I was able to set up on VMWare ESXi, but what I’ve got is enough to prove the process. The diagram below shows the six nodes, and where the software is installed.

NewImage

I took a couple of short-cuts with the setup; obviously each node has a lot less RAM than BDA’s 64GB per node, but the main node in the cluster (bdanode1) running the HDFS NameNode and all the ODI software got bumped-up to 28GB, with the next two allocated 12GB and the others 8GB – enough to work through a demo at least. I also ran ODI Studio on the bdanode1 as well, instead of setting it up on a separate client VM, mainly to avoid having to set up all the Hadoop and Hive libraries on another machine. Other than that though, its the same CDH distribution you get on BDA, the same version of ODI and the Big Data Connectors, and so on, with the following software versions and downloads being used:

Setup of the Hadoop cluster is really out of scope for these articles, except to say that with CDH5, I find it easier to select the (non-default, deprecated) Packages install type rather than the new Parcels type, as this new methods installs all of the Hadoop software on each node in a new place – /opt/cloudera – rather than the usual /usr/lib/hadoop, /usr/lib/hive and so forth, meaning that most configuration examples you’ll read point to the wrong place for your install. Parcels are Cloudera’s way forward for delivering software components (there’s advantages in terms of patching across the cluster) but if you’re getting started with the Hadoop platform, installing in the old location usually makes things easier to follow. Other than that, the two bits of configuration you need to do is firstly to tell ODI Studio where to find that various Hadoop libraries and configuration files; because I’ve installed Studio directly on the main Hadoop node, I could then just add that node’s file locations to the Oracle user’s $HOME/.odi/oracledi/userlib/additional_path.txt file, so that it looked like this:

Also to make use of Oracle Loader for Hadoop, one of the Oracle Big Data Connectors and something we’ll use at the end to bulk-unload data from Hadoop to an Oracle database, the second thing I’ll need to do is set a couple of environment variables in the “oracle” user’s .bashrc profile file pointing to where OLH is installed, and where the Hadoop and Hive libraries and configuration files are:

[oracle@bdanode1 userlib]$ cat $HOME/.bashrc
# .bashrc
 
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
 
# User specific aliases and functions
export HADOOP_CLASSPATH=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/*:/etc/hive/conf
export OLH_HOME=/home/oracle/oracle/product/oraloader-3.0.0-h2

The scenario I’ll be working with is similar to the ones I covered on the blog recently, where I landed data from the Rittman Mead blog webserver into HDFS files using Hive, and then processed the files using Hive, Pig, Spark and so on. In this example though, I’ll use ODI to do the data manipulation where possible, but still use Hive and so forth under-the-covers to do the work. The diagram below shows the data flow that i’ll be looking to set up in the example:

NewImage

So in this scenario incoming data is being landed in the Hadoop cluster by Flume, using the process outlined in this previous blog post. All Flume is is a transport mechanism; it doesn’t in itself have any processing ability (making it analogous to GoldenGate) and all it does it transport log file entries from one place to another, via Flume agents on either end. All we as developers need to be aware of is (a) where the log files will be landed, and (b) that Flume will keep continuously writing to these files until the source log file gets rotated – therefore I can’t assume a log file is completely written to when I read from it.

What this means in practice is that if I want to do incremental loads, I need to consider the fact that a file I’m reading to might have more data in it later on. There’s various solutions to this – principally having Flume write to HBase, rather than raw HDFS files, and then I read from the HBase database noting the last extraction point at the time – but to keep things simple I’ll just do a full-load each time the ETL run takes place, meaning that I don’t need to think about incremental loads throughout the system.

So the first thing I need to do is have ODI take the incoming files and load them into a Hive table. To do this I set up Topology entries for the incoming HFDS files, and here’s the first “gotcha” – to create a connection to HDFS files, you use the regular File technology, but you leave the JDBC driver type blank, and put the address of the HDFS NameNode in to the JDBC URL – which of course is technically invalid and won’t allow you to either test it, or reverse-engineer the file names in, but is a “hack” used by the IKM File to Hive KM to get the address of your HDFS NameNode (if you choose to source the incoming file from HDFS rather than the local filesystem).

NewImage

Then, when you come to register the physical schemas to go with the new File technology data server, you’ll need to be aware that ODI just appends the final file name to the directory name when retrieving the file data – so if you want the connection to point to a directory, rather than just a file, you’ll need to set up the physical schema to be the directory “above” the one you’re interested in, and then set the file name later on to be that directory. In this example , I want the final file reference to point to hdfs://bdanode1.rittmandev.com/user/oracle/weblog_incoming_files/apache_access_combined, a whole directory (HDFS aggregates all the files in a directory if you reference just the directory in an operation) rather than just a single log file. You can see the directory and the log files it contains in the Hue screenshot below:

NewImage

I therefore set the physical schema to be hdfs://bdanode1.rittmandev.com/user/oracle/weblog_incoming_files, and the file reference in the datastore model is set to the final directory name, like this:

NewImage

If it seems a bit confusing, it’ll become clearer in a moment.

Now I need to go back to the Topology navigator and create a connection through to the Hive server on the Big Data Appliance VMs – in fact recent versions of CDH (CDH4, CDH5) swap out the old Hive Server for HiveServer2, so you’ll need to use the correct JDBC drivers (as supplied with CDH4/5) to connect to it, and also create the JDBC URL in the format jdbc:hive2://[machine name:port], as shown inn the screenshot below:

NewImage

A quick note about security at this point; by default, most CDH4/5 clusters are set up as what’s called “unsecured”, which means that whilst you use a username and login to connect to Cloudera Manager, for example, by this default although Hive and Impala request user credentials when you connect, they don’t actually check the password against anything, for example your LDAP server. You can connect these tools to LDAP or Active Directory, and typically you’d combine this with Kerebos authentication between the various components and when you connect via Impala and Hive, and typically you’d also use Apache Sentry to provide role-based access control to the data within individual HDFS files. But by default, Hive will accept more or less anything as a password when you connect, but then you may hit issues later on when your HDFS files are owned by a different user to the one you connect as.

Where this manifests itself is when a Hive table has underlying HDFS files owned by, say, the “admin” user in CDH5 (because that’s how you connected to Hue to upload them), but then you connect as the “oracle” user through Hive to then manipulate the Hive table contents. In practice, what I try to do is create any Hive tables (using Hue) as the user I’m then going to connect in Hive to them as, which means you’ll most probably need to go into Hue and create a new user called “oracle” (if that’s what you’re running ODI as, and connecting through Hive as) before creating the Hive tables you’ll then import into the ODI topology.

So once you’ve done all that, you can go into the Designer navigator and reverse-engineer the definition of your Hive tables into datastore models. In my case, I’ve got a bunch of tables that I’ll be using throughout the whole ETL process.

NewImage

Now it’s time for the first ODI mapping, to take the raw log file directory and load it into a Hive table. As it stands though, these raw log files are rows of just a single “column” of data – the log file entry in Apache CombinedLogFormat format. To make them useful to the rest of the ETL process I’ll somehow need to parse them into the individual log file elements, so I create a target Hive table that contains an entry for the raw log entry, and then columns for the various log file elements:

NewImage

The way that I parse the log file is to use a feature within the IKM File to Hive (LOAD DATA) KM that allows me to specify a regular expressed Serde (Serializer-Deserializer) to parse the log file entry into its individual columns, like this (note that you’ll need to ensure the hive-contrib-* JAR file is available to all of your Hadoop nodes before using this SerDe)

NewImage

In this instance, I want the KM to leave the source files in-place when doing the data load (Hive by default moves incoming source files into the /user/hive/warehouse directory area) as these files most probably haven’t been finished written-to by Flume yet, so I leave the EXTERNAL_TABLE value set to true (Hive external table, not Oracle external table) and make sure FILE_IS_LOCAL is set to FALSE, so that this KM knows to use the HDFS file location hack I set up in the topology earlier. Then, I just run the mapping and check that it’s worked OK:

NewImage

and I can check from the Model pane in the Designer navigator that I’ve now got a Hive table of individually split-up log entry columns to work with, for the rest of the downstream ETL process:

NewImage

So that’s the first stage done – next, I’ll be combining this Hive table with data from another one, using the IKM Hive Control Append KM.

Categories: BI & Warehousing

Rittman Mead at the Oracle Virtual Technology Summit 2014 – “The Architecture of Analytics: Big Time Big Data and Business Intelligence”

Rittman Mead Consulting - Mon, 2014-06-09 05:25

I’m pleased to announce that I’ll be presenting at the Oracle Technology Network (OTN) Virtual Technology Summit, an online event running over three days in early July. I’m speaking as part of the Middleware track, entitled “The Architecture of Analytics: Big Time Big Data and Business Intelligence”, along with fellow Oracle ACE Director Kevin McGinley, US BI Forum keynote speaker Richard Tomlinson, and by Tom Plunkett, Lead Author of the Oracle Big Data Handbook. It’s free to attend, it’s running in three timezones over the three days, and here’s the track details and session abstracts:

The Architecture of Analytics: Big Data and Business Intelligence

“More than just another hyped-up technological buzzword,  Big Data represents a dramatic shift not just in the amount of data to be dealt with, but also in how business intelligence is extracted from that data in order to inform critical business decisions. This OTN Virtual Technology Summit track will present a solution architect’s perspective on how business intelligence products in Oracle’s Fusion Middleware family and beyond fit into an effective big data architecture, and present insight and expertise from Oracle ACE Directors specializing in business Intelligence to help you meet your big data business intelligence challenges. Technologies covered in this track include Oracle Big Data Appliance, Oracle Data Integrator, Oracle Business Intelligence Enterprise Edition, Oracle Endeca, Oracle Data Warehouse, and Oracle Big Data Connectors, along with Hadoop, Hive, NoSQL, and MapReduce.”

Session Abstracts

1.Oracle Big Data Appliance Case Study: Using Big Data to Analyze Cancer-Genome Relationships [30:00]
by Tom Plunkett, Lead Author of the Oracle Big Data Handbook

This presentation takes a deep technical dive into the use of the Oracle Big Data Appliance in a project for the National Cancer Institute’s Frederick National Laboratory for Cancer Research.  The Frederick National Laboratory and the Oracle team won several awards for analyzing relationships between genomes and cancer subtypes with big data, including the 2012 Government Big Data Solutions Award, the 2013 Excellence.Gov Finalist for Innovation, and the 2013 ComputerWorld Honors Laureate for Innovation.

2.Getting Value from Big Data Variety [30:00]
by Richard Tomlinson, Director, Product Management, Oracle

Big data variety implies big data complexity. Performing analytics on diverse data typically involves mashing up structured, semi-structured and unstructured content. So how can we do this effectively to get real value? How do we relate diverse content so we can start to analyze it? This session looks at how we approach this tricky problem using Endeca Information Discovery.

3.How To Leverage Your Investment In Oracle Business Intelligence Enterprise Edition Within A Big Data Architecture [30:00]
By Oracle ACE Director Kevin McGinley

More and more organizations are realizing the value Big Data technologies contribute to the return on investment in Analytics.  But as an increasing variety of data types reside in different data stores, organizations are finding that a unified Analytics layer can help bridge the divide in modern data architectures.  This session will examine how you can enable Oracle Business Intelligence Enterprise Edition (OBIEE) to play a role in a unified Analytics layer and the benefits and use cases for doing so.

4.Oracle Data Integrator 12c As Your Big Data Data Integration Hub [90:00]
by Oracle ACE Director Mark Rittman

Oracle Data Integrator 12c (ODI12c), as well as being able to integrate and transform data from application and database data sources, also has the ability to load, transform and orchestrate data loads to and from Big Data sources. In this session, we’ll look at ODI12c’s ability to load data from Hadoop, Hive, NoSQL and file sources, transform that data using Hive and MapReduce processing across the Hadoop cluster, and then bulk-load that data into an Oracle Data Warehouse using Oracle Big Data Connectors.

We will also look at how ODI12c enables ETL-offloading to a Hadoop cluster, with some tips and techniques on real-time capture into a Hadoop data reservoir and techniques and limitations when performing ETL on big data sources.

Registration is free, and the form for all three events is online here. Look out for a preview of my big data ETL on BDA session over the next five days, and hopefully you’ll be able to make the VTS event and hear me talk about it in-person.

Categories: BI & Warehousing

Goodbye to Stewart…

Rittman Mead Consulting - Mon, 2014-06-09 05:00

We’re sad to announce that Stewart Bryson’s last day with Rittman Mead was last Friday, and he’s moving-on now to a new challenge. Five years ago Stewart co-founded Rittman Mead America with us, and most recently he’s been our Chief Innovation Officer, handing over the day-to-day running of our US business to Charles Elliott, our US Practice Manager.

One of the main contributions Stewart has made over the last few years is helping us develop the Rittman Mead Delivery Framework and the Rittman Mead Way. The Delivery Framework is a set of software, accelerators and processes that determine how Rittman Mead consultants go about projects (the Rittman Mead Way). Stewart has contributed greatly to our work around Extreme BI, version control and release management. These processes are now embedded in our delivery management team and framework, and as such will live on through the DNA of Rittman Mead, but in the meantime we wish him well going into the future.

Categories: BI & Warehousing