Rittman Mead Consulting

Rittman Mead consults, trains, and innovates within the world of Oracle Business Intelligence, data integration, and analytics.

OBIEE 12.2.1.1.0 - New Feature Guide

Thu, 2016-07-28 04:00

Oracle has recently released version 12.2.1.1.0 of OBIEE 12c, which has a variety of exciting new features for Data Visualization, BI Publisher, OBIEE Dashboards, and ODBC connections. The upgrade can be performed in place (using the Upgrade Assistant) or as a fresh install. The upgrade process is explained at length here, and there is also an example Linux in-place upgrade from 12.2.1.0 to 12.2.1.1 which can be found here. Let's take an in-depth look at some of the new features.

Data Visualization (aka Visual Analyzer)

A variety of features have been added to Data Visualization, which brings it closer to its standalone counterpart, Data Visualization Desktop.

Visualize Data from Oracle Applications

One of the most powerful exploratory features of Data Visualization Desktop is the ability to use your OBIEE analyses as direct data sources. This allows users to "revisualize" and explore the data contained within one or more analyses in different ways. Oracle has now added the same feature to Data Visualization. To access it from the DV Home Page, click on Data Sources > Create New Data Source.

You are then presented with three choices: From a File, From Oracle Applications or From Database (more on this later).


Selecting From Oracle Applications opens a new window where you enter your connection information.


Once you have entered the connection information, you can view all of the folders you have access to within the OBIEE catalog. Selecting an analysis and clicking OK brings you to another screen where you can designate each column as a measure or an attribute and change the aggregation applied to your measure columns.

Clicking Add Data Source adds the analysis to the Data Sources page.

Visualize Data from Databases

Another feature added to DV is the ability to connect directly to a database. It allows you to add tables directly from a database schema or write a SQL statement to select the data you want.

To connect to a database as a data source, select Data Sources from the Home Page then Create New Data Source > From Database. This brings up a connection window to enter your connection details.


From here you can also choose from a large selection of database types other than the default Oracle type; clicking on Oracle brings up a list that matches the selections available in the latest release of Data Visualization Desktop.

Choosing to add tables from a database schema allows you to add data sources, one table at a time.


In some situations you might want to create one data source from multiple tables in the database. You can achieve this by clicking on the SQL tab after making your database connection and writing a custom SQL statement.

Clicking OK brings you to a results screen with all of the columns defined in your SQL statement, and also gives you the ability to rename your data source and to change the aggregation type for your measure columns.

Modifying Uploaded Data Sources

Also known as "Data Wrangling", Oracle has added the ability to manipulate a dataset depending on the column data type. By invoking a logical SQL function, Data Visualization does the work for you and can create a new column or edit an existing one. To show an example of this, I created a new VA Project and uploaded a sample Excel file.

OBIEE 12.2.1.1.0 - New Feature Guide

In the top left menu bar, there is an option called Stage. Clicking on it opens the data source and shows all of the columns present and the first 100 records.

Note: If you have more than one file, clicking on the name of the file in the top left will reveal a dropdown menu where you can choose between them.

To edit or add a column, navigate to the right of the column you wish to edit and click on the options icon.

This brings up the options menu, where you can select from a variety of different options for editing or adding a column, depending on the data type.

Using the Concatenate function, I was able to create a new column showing each customer's full name combined with the data in the Age Group column, separated by a colon.

If you choose to create a new column, a Create New Data Element window opens, allowing you to input logical SQL functions similar to the Edit Column Formula feature in OBIEE Analytics.

New and Enhanced Visualization Types

DV now includes new visualization types, including donut charts, text boxes, sunburst, combo, scatter (cat.) and stacked scatter (cat.), which brings it in line with the offerings of Data Visualization Desktop.

Also included (assuming you have installed and configured Oracle's R distribution) is the ability to add clusters, outliers, reference lines, trend lines and forecasts. There are two ways to add these to your visualizations. The first is to select the Analytics option in the horizontal menu bar on the far right side and then choose the desired function; double clicking on a function automatically adds it to your visualization.

Another option is to add them directly from Menu > Properties > Analytics within your visualization. You can add analytic functions and, depending on the function, there are a variety of different options to change how the function is displayed.

You can also add URLs or links to insights within Tiles, Text Boxes and Image visualizations and, if you use Chrome for Windows or Android, there is a dictation option within properties that you can use to add descriptions.


Customize Color Schemes

One of the more frustrating absences from DV up to this point was the ability to customize colors in your visualizations. Oracle has not only added this feature but given you the ability to customize and save these customizations, making color conformance for a group of users or across an entire organization a breeze.

Let's start by looking at an example of applying color to an entire project. To access the project color properties, click on Canvas Settings > Project Properties. By clicking on the color selection in the Color Series section, you can choose from one of four default selections or create your own custom palette.


Notice how each measure is assigned a color and that color is applied consistently across the entire project.

You can also manage colors from inside an individual visualization. To do this, click on Menu > Color > Manage Color Assignments. From here you can assign colors to each measure individually, and the assignment is then applied across the entire project.

Another option is to change a color by right-clicking on an individual data point.

In this example, I right-clicked on Technology and changed its color, which then changed it for every visualization in the project.

Data Blending

Oracle has added the ability to specify which data source takes precedence over the other when blending two data sources together. This can be very useful when you have a project that includes data from two data sources and there are matching values in one data source that are not in the other. Let's look at an example.

I have two data sources: one contains population information and the other contains country names, and they are matched on country code.

Notice in the example that there are countries that have no population facts, and population facts that have no countries.

To see only the countries that have population information, right-click on one of the columns in the report and select Properties > Data Sets. Here you are presented with an option for each source: All Rows or Matching Rows.

Changing the source that contains country information to Matching Rows will keep only the countries that have corresponding population information.

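Outside of DV, the same setting can be understood in join terms: Matching Rows behaves like an inner join on the match column, while All Rows behaves like an outer join. A minimal pandas sketch of that idea (the column names and values below are made up for illustration):

import pandas as pd

# Two sources matched on a common country code column
population = pd.DataFrame({"country_code": ["US", "FR", "XX"], "population": [320, 67, 1]})
countries = pd.DataFrame({"country_code": ["US", "FR", "DE"], "country_name": ["United States", "France", "Germany"]})

# "All Rows" on both sides behaves like a full outer join: unmatched rows appear with gaps
all_rows = countries.merge(population, on="country_code", how="outer")

# Setting the countries source to "Matching Rows" keeps only countries with population facts, like an inner join
matching = countries.merge(population, on="country_code", how="inner")
print(matching)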

Share Reports as Read-Only

It is now possible to share a report with other users as read-only. They will still be able to interact with it and edit filters, but won't see the authoring content controls, and they can still switch in and out of presentation mode.

Other Features 

Two other features introduced for DV in 12.2.1.1 are the ability to upload data files up to 50MB in size and the option to hide the filter panel from view to increase canvas space for your projects.

BI Publisher

Oracle has also updated BI Publisher 12c to version 12.2.1.1. You can see the official document here.

Deliver documents to Oracle Document Cloud Service

You can now deliver BI Publisher reports to the Oracle Documents Cloud Service via a delivery channel in BI Publisher. You can deliver reports to cloud storage from both on-premises and cloud deployments of BI Publisher.

You can set up this feature under Administration > Delivery > Document Cloud Services within BI Publisher 12.2.1.1.


Here, under the Document Cloud Services tab, you can add the connection details to your cloud server.


Use WebCenter Content as a Data Source

You can now read text data files from WebCenter Content. You can create a Content Server data source under Administration > Delivery > Content Server.

After you create your Content Server data source, you can create a Content Server dataset in a BI Publisher data model and retrieve text data files stored in WebCenter Content by Document ID.

Attach PDF to Your Invoices

With BI Publisher 12.2.1.1 you can now send PDF attachments along with invoices, attaching the PDFs while bursting. The process is explained in more detail in the BI Publisher 12.2.1.1 documents here.

Integrate Using RESTful APIs

BI Publisher 12.2.1.1 introduces a set of REST APIs that allow you to view reports in an application. The APIs connect to the BI Publisher server through a URL and use JSON objects to carry data. You can read more about the RESTful APIs in the BI Publisher 12.2.1.1 new features guide here.
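
As a rough illustration of the calling pattern only (the endpoint path, port and credentials below are hypothetical placeholders, not the documented BI Publisher API), a request from Python could look like this sketch:

import requests

# Hypothetical BI Publisher server URL and report path - substitute your own
BIP_URL = "http://bip-server:9502/xmlpserver/services/rest/v1/reports"  # assumed, check the REST API docs
REPORT_PATH = "/Samples/Invoice Report"

# Ask the server for report details; request and response bodies are JSON objects
resp = requests.get(BIP_URL, params={"reportPath": REPORT_PATH}, auth=("weblogic", "password"))
resp.raise_for_status()
print(resp.json())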

Deliver Email Body in HTML Format along with Report as Attachment

You can use standard HTML4 formatting tags to create the email body, include a logo or images, add hyperlinks and more. Reports will be generated as attachments to the email.

Manage Custom Fonts

In BI Publisher 12.2.1.1, you can manage custom fonts from the Administration page, and once uploaded, these custom fonts can be displayed in reports.


Manage PGP Keys for FTP Delivery Encryption

In BI Publisher 12.2.1.1, Administrators can upload and manage PGP Keys from the Administration page. FTP delivery encryption is now a self-service feature. More details can be found in the BI Publisher 12.2.1.1 documents.

Dynamic Memory Guard

You can now separately configure limits for online and offline reports. The limits can be made dynamic by using variables, system-defined functions and operators, and the dynamic memory guard will adjust as the data changes.

You can access the Memory Guard in the Administration page under Runtime Configuration > Properties.


More information can be found in the BI Publisher 12.2.1.1 documents.

Enabling Encryption for File Data Security on Cloud

You can now enable File Data Encryption on the Administration page, and your data files will be encrypted as they are uploaded to the server.

OBIEE Dashboard Subpages

A new feature in OBIEE 12.2.1.1 is the ability to add subpages within dashboards, allowing dashboard designers to create multiple subpages for each dashboard page. To add a subpage, click on the desired dashboard page, click the "Add Dashboard Page" icon and select "Add Subpage."

The Add Subpage window will then be displayed where you can name your subpage and add a description.


Once you click OK, the subpage is added to the dashboard page, and adding content to a subpage is exactly the same process as adding content to a dashboard page.

RPD - Enable Data Driven Fragment Selection

A new feature for RPD developers is the ability to improve the performance of fragmented logical table sources by using the Enable Data Driven Fragment Selection feature. More information on how to enable this feature can be found in the 12.2.1.1 documents here.

Additional Data Sources Supported

New data sources are supported in 12.2.1.1 using DataDirect 7.1.5. These include Amazon Redshift, Oracle Service Cloud, Greenplum, Salesforce, and Teradata. Connections to Apache Spark and MongoDB are also available using DataDirect 8.0.

Conclusion

All in all, the 12.2.1.1 update for OBIEE 12c adds some very powerful features. Data Visualization is now a much more complete product, and its new features and native connectors make it a powerful addition to the new era of data visualization. For Oracle Cloud users, the integration between BI Publisher and Data Visualization makes the "report and share" process easier than ever.

Categories: BI & Warehousing

Stream Analytics and Processing with Kafka and Oracle Stream Analytics

Tue, 2016-07-26 09:00

In my previous post I looked at the latest release of Oracle Stream Analytics (OSA), and saw how it provides a graphical interface to "Fast Data". Users can analyse streaming data as it arrives, based on conditions and rules. They can also transform the stream data, publishing it back out as a stream in its own right. In this article we'll see how OSA can be used with Kafka.

Kafka is one of the foremost streaming technologies nowadays, for very good reasons: it is highly scalable and flexible, and supports multiple concurrent consumers. Oracle Stream Analytics supports Kafka as both a source and a target. To set up an inbound stream from Kafka, first we define the Connection.

Once the Connection is defined, we can create a Stream for a given Kafka topic. If you get an error at this point of Unable to deploy OEP application then check the OSA log - it could be a connectivity issue to Zookeeper:
Exception in thread "SpringOsgiExtenderThread-286" org.springframework.beans.FatalBeanException:  
Error in context lifecycle initialization; nested exception is com.bea.wlevs.ede.api.EventProcessingException:  
org.I0Itec.zkclient.exception.ZkTimeoutException:  
Unable to connect to zookeeper server within timeout: 6000  
Assuming that the Stream is saved with no errors, you can then create an Exploration based on the stream and, all being well, the live tweets are soon shown. Unlike the example at the top of this article, these tweets are coming in via Kafka, rather than the built-in OSA Twitter Stream. This is partly to demonstrate the Kafka capabilities, but also because the built-in OSA Twitter Stream only includes a subset of the available twitter data fields.

Avro? Nope.

Data in Kafka can be serialised in many formats, including Avro - which OSA doesn’t seem to like. No error is thrown to the GUI but the exploration remains blank.

Looking in the OSA log file there’s a whole lot of errors recorded similar to this:
  
line 1:0 no viable alternative at character '?'  
line 1:1 no viable alternative at character '?'  
line 1:2 no viable alternative at character '?'  
line 1:3 no viable alternative at character '?'  
line 1:4 no viable alternative at character '?'  
line 1:5 no viable alternative at character '?'  
line 1:6 no viable alternative at character '?'  
line 1:7 no viable alternative at character '?'  
line 1:8 no viable alternative at character ''  
JSON? Kinda.

One of the challenges that I found working with OSA was defining the "Shape" (data model) of the inbound stream data. JSON is a format used widely as a technology-agnostic data interchange format, including for the twitter data that I was working with. You can see a sample record here. One of the powerful features of JSON is its ability to nest objects in a record, as well as create arrays of them. You can read more about this detail in a recent article I wrote here. Unfortunately it seems that OSA does not support flattening out nested JSON, meaning that only elements in the root of the model are accessible. For twitter, that means we can see the text, and who it was in reply to, but not the user who tweeted it, since the latter is a nested element (along with many other fields, including hashtags, which are also an array):
root
|-- created_at: string (nullable = true)
|-- entities: struct (nullable = true)
|    |-- hashtags: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- indices: array (nullable = true)
|    |    |    |    |-- element: long (containsNull = true)
|    |    |    |-- text: string (nullable = true)
|    |-- user_mentions: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- id: long (nullable = true)
|    |    |    |-- id_str: string (nullable = true)
|    |    |    |-- indices: array (nullable = true)
|    |    |    |    |-- element: long (containsNull = true)
|    |    |    |-- name: string (nullable = true)
|    |    |    |-- screen_name: string (nullable = true)
|-- source: string (nullable = true)
|-- text: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- truncated: boolean (nullable = true)
|-- user: struct (nullable = true)
|    |-- followers_count: long (nullable = true)
|    |-- following: string (nullable = true)
|    |-- friends_count: long (nullable = true)
|    |-- name: string (nullable = true)
|    |-- screen_name: string (nullable = true)
So what to do if the inbound streaming data is in nested-JSON format? It seems to me the only option is to pre-process it to flatten it. There are a variety of tools that could be used here - in the first instance I’d generally reach for Logstash, it being the one I’m most familiar with. To get an idea of the schema of a JSON record you can use jsonschema.net. Funnily enough when I was researching this blog post I came across the exact same problem on a forum posted by … me! Early last year I was working with the same dataset, and had the same issue with embedded arrays. The way to do it in Logstash is with a bit of Ruby code to flatten the arrays, and a standard mutate to bring nested objects up to the root level. Sample code:
mutate {  
    add_field => { "user_name" => "%{[user][name]}" }  
    add_field => { "user_screen_name" => "%{[user][screen_name]}" }  
}
ruby {  
    code => 'event["hashtags_array"] = event["[entities][hashtags]"].collect { |m| m["text"] } unless event["[entities][hashtags]"].nil?  
             event["hashtags_list"] = event["hashtags_array"].join(",")  unless event["[hashtags_array]"].nil?'  
}
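
For comparison only, the same flattening idea can be sketched in plain Python using the kafka-python client (assumed to be installed; the broker address and topic names here are hypothetical), reading the raw tweets, promoting the nested fields to the root, and writing the result to a new topic:

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("twitter_raw", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

for msg in consumer:
    tweet = msg.value
    flat = {
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        # promote nested user fields to the root of the record
        "user_name": tweet.get("user", {}).get("name"),
        "user_screen_name": tweet.get("user", {}).get("screen_name"),
        # collapse the hashtags array into a comma-separated list
        "hashtags_list": ",".join(h["text"] for h in tweet.get("entities", {}).get("hashtags", [])),
    }
    producer.send("twitter_flat", flat)
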
You can find the full Logstash code on gist here. With this Logstash code running, I set up a new OSA Stream pointing to the new Kafka topic that Logstash was writing to, and added the flattened fields to the Shape. We can then see in the Exploration the fields that we wanted to get at - user name, hashtags, and so on.

Other Shape Gotchas

One of the fields in Twitter data is ‘source’ - which unfortunately is a reserved identifier in the CQL language that OSA uses behind the scenes.

Caused By: org.springframework.beans.FatalBeanException: Exception initializing channel; nested exception is com.bea.wlevs.ede.api.ConfigurationException: Event type [sx-10-16-Kafka_Technology_Tweets_JSON-1] of channel [channel] uses invalid or reserved CQL identifier = , source

It’s not clear how to define a shape in which the source data field is named after a reserved identifier.

Further Exploration of Twitter Streams with OSA

Using the flattened Twitter stream coming via Kafka that I demonstrated above, let’s now look at more OSA functionality.

Depending on the source of your data stream, and your purpose for analysing it, you may well want to filter out certain content. This can be done from the Exploration screen:

The Business Rules section of the Exploration enables you to define rules about the data and set field values based on it. These can be static values, or expressions based on data in the stream. There doesn't seem to be a way to add arbitrary fields via this, so I amended the Stream Shape to include a 'spare' field that I then populated.

Kafka Stream Transformation with OSA

Here we’ll see how OSA can be used to ingest one Kafka topic, apply a transformation, and stream it to another Kafka topic.

The OSA exploration screen offers a basic aggregation (‘summary’) function, here showing the number of tweets per language:

Using the Windows icon to the right of the Sources box the time window can be defined, along with the refresh frequency:

This means that the count of tweets per language will be calculated looking at the data for the past 30 seconds, and this will be evaluated every five seconds. More complex functionality, such as pivoting on the group-by column (so as to be able to chart out the number of tweets per language as separate metrics), doesn't seem to be present in this release; arguably this is moving over into analytics territory such as would be found in Oracle's Big Data Discovery.
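
To make the window semantics concrete, here is a rough Python sketch of the same idea, entirely separate from how OSA implements it internally: a 30-second window over incoming (timestamp, language) events, re-evaluated every five seconds.

import time
from collections import Counter, deque

WINDOW_SECS = 30
events = deque()  # (epoch_seconds, language) tuples as they arrive

def add_event(lang):
    events.append((time.time(), lang))

def evaluate():
    # Drop events older than the window, then count tweets per language
    cutoff = time.time() - WINDOW_SECS
    while events and events[0][0] < cutoff:
        events.popleft()
    return Counter(lang for _, lang in events)

# Called on a five-second timer this would print e.g. Counter({'en': 42, 'es': 7})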

Taking the summarised stream (count of tweets, by language) I first Publish the exploration, making it available for use as the input to a subsequent exploration. Then from the Catalog page select a Pattern, which I’m going to use to build a stream showing the most common languages in the past five seconds. With the Top N pattern you specify the event stream (in this case, the summarised stream that I built above), and the metric by which to order the events which here is the count of tweets per language.

For completeness, I'm going to stream the output of this pattern exploration back to a Kafka topic.

Note that I’ve defined a new Shape here based on the columns in the pattern. In the pattern itself I renamed the COUNT column to a clearer one (tweetcount5_sec). Renaming it wasn’t strictly necessary since it’s possible to define the field/shape mapping when you configure the Target:

For the target to take effect, I publish the pattern exploration, and then, using kafka-console-consumer, can see the topic being populated in realtime by OSA.
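
If you prefer to check from Python rather than the console consumer, a quick sketch with the kafka-python client does the same job (broker address and topic name are hypothetical):

from kafka import KafkaConsumer

# Tail the topic that the OSA target writes to
consumer = KafkaConsumer("tweets_by_language_top_n",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="latest")
for msg in consumer:
    print(msg.value.decode("utf-8"))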

Being able to apply transformations to streams in realtime like this and stream the results is pretty useful. There are some limitations to the capabilities of OSA through the front-end GUI. For example, support for nested JSON, and integration with the Kafka Schema Registry to automatically derive Shapes for inbound topics, would both be great. Lower level, the option to specify the consumer group id, as well as the start point for consumption (beginning of topic, or streaming from the end), are both things that would probably be necessary sooner or later when using OSA for full-blown development.

OSA and Spatial

One of the Patterns that OSA provides is a Spatial one, which can be used to analyse source data that includes geo-location data. This could be to simply plot the occurrence of the data point (as we'll see shortly) on a map. It can also be used in a more sophisticated manner, to track a given entity's movements on a map. An example of this could be a fleet of trucks reporting their position back at regular intervals. Areas on a map can be defined and conditions triggered as the entity enters or leaves the area. For now though, we'll keep it simple. Using the flattened Twitter stream from Kafka that I produced from Logstash above, I'm going to plot Tweets in realtime on a map, along with a very simplistic tagging of the broad area in which they came from.

In my source Kafka topic I have two fields, latitude and longitude. I expose these as part of the 'flattening' of the JSON in this Logstash script, since by default they're nested within the coordinates field, and as an array too. When defining the Stream's Shape, make sure you define the data type correctly (Double) - OSA is not very forgiving of stupidity, and I spent a frustrating time trying to work out why "-80.1422195" was coming through as zero - obviously defined as an Integer this was never going to work!

Not strictly necessary, but useful for debug purposes, I set up an exploration based on the flattened Twitter stream, with a filter to only include tweets that had geo-location data in them. This way I knew which tweets I should expect to see in the next step. One of the things that I have found with OSA is that it has a tendency to fail silently; instead of throwing errors you'll just not get any data. By setting up the filter exploration I could at least debug things a bit more easily.


After this I created a new object, a Map. A Map object defines a set of named areas, which could be sourced from a database table, or drawn manually - which is what I did here by setting the Map Type to 'None (Create Manually)'. One thing to note about the maps is that they're sourced online (openstreetmap.org) so you'll need an internet connection to do this. Once the Map is open, click the Polygon Tool icon and click-drag a shape around the area that you want to "geo-fence". Each area is given a name, and this is what is used in the streaming data to label the event's geographical area.

Yorkshire - God's Own County

Having got our source data stream with geo-data in, and a Map on which to plot it and analyse the location of each event, we now use the Spatial General pattern to create an Exploration. The topology looks like this:


The fields in the Spatial General pattern are all pretty obvious. Object key is the field to use to track the same entity across multiple events, if you want to use the enter/exit/stay statuses. For tweets we just use 'Enter', but for people or vehicles, for example, you might get multiple status reports and want to track them on a map. For example, when a person is near a point of interest that you're tracking, or a vehicle has remained in a set area for too long.


If you let the Exploration now run, depending on the rate of event ingest, you'll sooner or later see points appearing on the map and event details underneath. The "status" column is populated (blank if the event is outside of the defined geo-fences), as is the "Place", based on the geo-fence names that you defined.


Summary

I can see OSA being used in two ways. The first as an ‘endpoint’ for streams with users taking actions based on the data, with some of the use cases listed here. The second is for prototyping transformations and analyses on streams prior to productionising them. The visual interface and immediacy of feedback on transformations applied means that users can quickly understand what further processing they may want to apply to the stream using actual streaming data to inform this.

This latter concept - that of prototyping - is similar to that which we see with another of Oracle’s products, Big Data Discovery. With BDD users can analyse data in the organisation’s data reservoir, as well as apply transformations to it (read more). Just as BDD doesn’t replace OBIEE or Visual Analyzer but enables users to understand how they do want to model the data in these tools, OSA wouldn’t replace “production grade” integration done by Oracle Data Integrator. What it would do is allow users to get a clearer idea of the transformations they would want performed in it.

OSA's user interface is easy to use and intuitive, and this is definitely a tool that you would put in front of technically minded business users. There are limitations to what can be achieved technically through the web GUI alone and something like Oracle Data Integrator (ODI) would still be a more appropriate fit for complex streaming work. At Oracle Open World last year it was announced (slides) that a beta would be starting for ODI using Spark Streaming for ETL and stream processing, so it'll be interesting to see this when it comes out.

Further Reading

Categories: BI & Warehousing

An Introduction to Oracle Stream Analytics

Mon, 2016-07-25 09:00

Oracle Stream Analytics (OSA) is a graphical tool that provides "Business Insight into Fast Data". In layman's terms, that translates into an intuitive web-based interface for exploring, analysing, and manipulating streaming data sources in realtime. These sources can include REST and JMS queues, as well as Kafka. The inclusion of Kafka opens OSA up to integration with many new-build data pipelines that use it as a backbone technology.

Previously known as Oracle Stream Explorer, it is part of the SOA component of Fusion Middleware (just as OBIEE and ODI are part of FMW too). In a recent blog it was positioned as "[…] part of Oracle Data Integration And Governance Platform". Its Big Data credentials include support for Kafka as source and target, as well as the option to execute across multiple nodes for scaling performance and capacity using Spark.

I've been exploring OSA from the comfort of my own Mac, courtesy of Docker and a Docker image for OSA created by Guido Schmutz. The benefits of Docker are many and covered elsewhere, but what I loved about it in this instance was that I didn't have to download a VM that was 10s of GB, nor did I have to spend time learning how to install OSA from scratch, which, whilst interesting, wasn't a priority compared to just trying the tool out and seeing what it could do. [Update] It turns out that installation is a piece of cake, and the download is less than 1GB … but in general the principle still stands - Docker is a great way to get up and running quickly with something.

In this article we'll take OSA for a spin, looking at some of the functionality and terminology, and then at real examples of use with live Twitter data. To start with, we sign in to Oracle Stream Analytics. From here, click on the Catalog link, where all of the resources are listed. Some of these resource types include:
  • Streams - definitions of sources of data such as Kafka, JMS, and a dummy data generator (event generator)
  • Connections - Servers etc from which Streams are defined
  • Explorations - front-end for seeing contents of Streams in realtime, as well as applying light transformations
  • Targets - destination for transformed streams
Viewing Realtime Twitter Data with OSA

The first example I'll show is the canonical big data/streaming example everywhere – Twitter. Twitter is even built into OSA as a Stream source. If you go to https://dev.twitter.com you can get yourself a set of credentials enabling you to query the live Twitter firehose for given hashtags or users.

With my twitter dev credentials, I create a new Connection in OSA. Now we have an entry in the Catalog for the Twitter connection, from which we can create a Stream, using the connection and a set of hashtags or users for whom we want to stream tweets. The Shape is basically the schema or data model that is applied for the stream. There is one built-in for Twitter, which we'll use here.

When you click Save, if you get an error Unable to deploy OEP application then check the OSA log file for errors such as unable to reach Twitter, or invalid credentials. Assuming the Stream is created successfully you are then prompted to create an Exploration from where you can see the Stream in realtime. Explorations can have multiple stream sources, and be used to transform the contents, which we'll see later. For now, after clicking Create, we get our Exploration window, which shows the contents of the stream in realtime. At the bottom of the screen there's the option to plot one or more charts showing the value of any numeric values in the stream. I'll leave this example here for now, but finish by using the Publish option from the Actions menu, which makes it available as a source for subsequent analyses.

Adding Lookup Data to Streams

Let's look now at some more of the options available for transforming and 'wrangling' streaming data with OSA. Here I'm going to show how two streams can be joined together (but not crossed) based on a common field, and the resulting stream used as the input for a subsequent process. The data is simulated, using a CSV file (read by OSA on a loop) and OSA's Event Generator.

From the Catalog page I create a new Stream, using Event Generator as the Type. On the second page of the setup I define how frequently I want the dummy events to be generated, and the specification for the dummy data. The last bit of setup for the stream is to define the Shape, which is the schema of data that I'd like generated. The Exploration for this stream shows the dummy data.

The second stream is going to be sourced from a very simple key/value CSV file:
attr_id,attr_value  
1,never  
2,gonna  
3,give  
4,you  
5,up  
The stream type is CSV, and I can configure how often OSA reads from it, as well as telling OSA to loop back to the beginning when it's read to the end, thus simulating a proper stream. The 'shape' is picked up automatically from the file, based on the first row (headers) and then inferred data types. The Exploration for the stream shows the five values repeatedly streamed through (since I ticked the box to 'loop' the CSV file in the stream).

Back on the Catalog page I'm going to create a new Exploration, but this time based on a Pattern. Patterns are pre-built templates for stream manipulation and processing. Here we'll use the pattern for a "left outer join" between streams. The Pattern has a set of pre-defined fields that need to be supplied, including the stream names and the common field with which to join them. Note also that I've increased the Window Range. This is necessary so that a greater range of CSV stream events are used for the lookup. If the Range is left at the default of 1 second then only events from both streams occurring in the same second that match on attr_id would be matched. Unless both streams happen to be in sync on the same attr_id from the outset then this isn't going to happen that often, and certainly wouldn't in a real-life data stream.

So now we have the two joined streams. Within an Exploration it is possible to do light transformation work. By right-clicking on a column you can rename or remove it, which I've done here for the duplicated attr_id (duplicated since it appears in both streams), as well as renamed the attr_value.

Daisy-Chaining, Targets, and Topology

Once an Exploration is Published it can be used as the Source for subsequent Explorations, enabling you to map out a pipeline based on multiple source streams and transformations. Here we're taking the exploration created just above that joined the two streams together, and using the output as the source for a new Exploration:

Since the Exploration is based on a previous one, the same stream data is available, but with the joins and transformations already applied.

From here another transformation could be applied, such as replacing the value of one column conditionally based on that of another.

Whilst OSA enables rapid analysis and transformation of inbound streams, it also lets you stream the transformed results outside of OSA to a Target, as we saw in the Kafka example above. As well as Kafka, other technologies are supported as targets, including a REST endpoint or a simple CSV file.

With a target configured, as well as an Exploration based on the output of another, the Topology comes in handy for visualising the flow of data. You can access this from the Topology icon in an Exploration page, or from the dropdown menu on the Catalog page against a given object.


In the next post I will look at how Oracle Stream Analytics can be used to analyse, enrich, and publish data to and from Kafka. Stay tuned!

Categories: BI & Warehousing

Announcing OBI Remote Training

Wed, 2016-07-20 04:00

Since the release of OBIEE 12c in 2015, we have received countless inquiries about how we would be offering our training. Our customers are familiar with our ability to provide on-site private training for a team and we are well known for hosting training classes in our offices in the UK and the US. But what most people aren’t aware of is that we now offer OBI remote training.

Our public training schedule offers a variety of courses monthly, some of which are offered exclusively as remote classes. And for any one of our public classes that is hosted in our U.S. offices, we also offer a limited number of seats to remote attendees. What does this mean for you? This means you have options!

One of our goals here at Rittman Mead is to provide unhindered access to the great wealth of information our team has accumulated through their extensive real-world experience. Now we've translated this goal into more accessible training. We understand budgets can be tight and travel may not always be an option for you or your team, but we don’t want that to be the reason you can’t attend our training.

In mid-2015 we started testing our ability to deliver remote training. Our main concern as we began testing was whether we’d be able to deliver the same value to our customers in a digital classroom that we've traditionally been able to deliver in a physical classroom. Our fear was that when you lost the face-to-face interaction between the instructor and students, you would also lose some of the rhythm and chemistry of the training, and, consequently, our students would feel less engaged. Other more technical concerns were on our minds, ranging from sound and video quality to connectivity. Much to our surprise and satisfaction, however, our concerns quickly dissolved as, time after time, we were able to deliver the training without issue.

So after plenty of testing, we are pleased to offer remote training as a regular option in our training schedule.

We are aware that remote training (or online training) has been around for some time, and we are not claiming to be innovators in the ways of online learning, but we feel that the platform for online learning has finally reached a level that is in line with the quality we demand for our training.

In fact, we have consistently received high marks from customers who have attended our remote training, solidifying our confidence that it does in fact live up to our standards. We invite you to check out our training options. Whether it be on-site training (public or private) or remote training, rest assured that you will be receiving expert-level training from Rittman Mead’s best.

For a full list of our scheduled trainings, see our US or UK calendars.

Categories: BI & Warehousing

System Metrics Collectors

Tue, 2016-07-19 02:18

The need to monitor and control system performance is not new. What is new is the trend of clever, lightweight, easy-to-setup, open source metric collectors in the market, along with time series databases to store these metrics, and user-friendly front ends through which to display and analyse the data.

In this post I will compare Telegraf, Collectl and Topbeat as lightweight metric collectors. All of them do a great job of collecting a variety of useful system and application statistics with minimal overhead on the servers. Each has the strength of easy configuration and accessible documentation, but there are still some differences in the range of inputs and outputs, how they extract the data, what metrics they collect and where they store them.

  • Telegraf is part of the Influx TICK stack, and works with a vast variety of useful input plugins such as Elasticsearch, nginx, AWS and so on. It also supports a variety of outputs, obviously InfluxDB being the primary one. (Find out more...)
  • Topbeat is a new tool from Elastic, the company behind Elasticsearch, Logstash, and Kibana. The Beats platform is evolving rapidly, and includes topbeat, winlogbeat, and packetbeat. In terms of metric collection its support for detailed metrics such as disk IO is relatively limited currently. (Find out more...)
  • Collectl is a long-standing favourite of systems performance analysts, providing a rich source of data. This depth of complexity comes at a bit of a cost when it comes to the documentation’s accessibility, it being aimed firmly at a systems programmer! (Find out more...)

In this post I have used InfluxDB as the backend for storing the data, and Grafana as the front end visualisation tool. I will explain more about both tools later in this post.

In the screenshot below I have used Grafana dashboards to show "Used CPU", "Used Memory" and "Network Traffic" stats from the mentioned collectors. As you can see, the output of all three is almost the same. What makes them different is:

    • What can your infrastructure support? For example, you cannot install Telegraf on an old version of OS X Server.
    • What input plugins do you need? The current version of Topbeat doesn't support more detailed metrics such as disk IO and network stats.
    • What storage do you want/need to use for the outputs? InfluxDB works as the best match for Telegraf data, whilst Beats pairs naturally with Elasticsearch.
    • What is your visualisation tool, and what does it work best with? In all cases the best front end should natively support time series visualisations.


Next I am going to provide more details on how to download and install each of the mentioned metrics collector services; the example commands are written for a Linux system.

Telegraf "An open source agent written in Go for collecting metrics and data on the system it's running on or from other services. Telegraf writes data it collects to InfluxDB in the correct format."
  1. Download and install InfluxDB: sudo yum install -y https://s3.amazonaws.com/influxdb/influxdb-0.10.0-1.x86_64.rpm
  2. Start the InfluxDB service: sudo service influxdb start
  3. Download Telegraf: wget http://get.influxdb.org/telegraf/telegraf-0.12.0-1.x86_64.rpm
  4. Install Telegraf: sudo yum localinstall telegraf-0.12.0-1.x86_64.rpm
  5. Start the Telegraf service: sudo service telegraf start
  6. Done!

The default configuration file for Telegraf sits in /etc/telegraf/telegraf.conf, or a new config file can be generated using the -sample-config flag in the location of your choice: telegraf -sample-config > telegraf.conf. Update the config file to enable/disable/set up different input or output plugins, e.g. I enabled network inputs: [[inputs.net]]. Finally, to test the config file and verify the output metrics, run: telegraf -config telegraf.conf -test

Once everything is ready and started, a new database called 'telegraf' will be added to the InfluxDB storage, which you can connect to and query. You will read more about InfluxDB later in this post.

 

Collectl Unlike most monitoring tools that either focus on a small set of statistics, format their output in only one way, run either interactively or as a daemon but not both, collectl tries to do it all. You can choose to monitor any of a broad set of subsystems which currently include buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp.
  • Install collectl: sudo yum install collectl
  • Update the Collectl config file at /etc/collectl.conf to turn on/off different switches and also to write the Collectl's output logs to a database, i.e. InfluxDB
  • Restart Collectl service  sudo service collectl restart
  • Collectl will write its log in a new InfluxDB database called “graphite”.

 

Topbeat Topbeat is a lightweight way to gather CPU, memory, and other per-process and system wide data, then ship it to (by default) Elasticsearch to analyze the results.
  • Download Topbeat: wget https://download.elastic.co/beats/topbeat/topbeat-1.2.1-x86_64.rpm
  • Install: sudo yum localinstall topbeat-1.2.1-x86_64.rpm
  • Edit the topbeat.yml configuration file at /etc/topbeat and set the output to elasticsearch or logstash.
  • If choosing elasticsearch as the output, you need to load the index template, which lets Elasticsearch know which fields should be analyzed in which way. The recommended template file is installed by the Topbeat packages. You can either configure Topbeat to load the template automatically, or you can run a shell script to load it: curl -XPUT 'http://localhost:9200/_template/topbeat' -d@/etc/topbeat/topbeat.template.json
  • Run topbeat: sudo /etc/init.d/topbeat start
  • To test your Topbeat Installation try: curl -XGET 'http://localhost:9200/topbeat-*/_search?pretty'
  • Topbeat logs are written to /var/log
  • Reference to output fields 

 

Why write another metrics collector?

From everything that I have covered above, it is obvious that there is no shortage of open source agents for collecting metrics. Still, you may come across a situation where none of the options can be used, e.g. a specific operating system (in this case, MacOS on an Xserve) that can't support any of the options above. The code below is my version of a lightweight metric collector, keeping track of disk IO, network, CPU and memory stats for the host where this simple bash script is run.

The code will run through an indefinite loop until it is forced to quit. Within the loop, I first use a curl request (InfluxDB API Reference) to create a database called OSStat; if the database already exists nothing will happen. I then use a variety of built-in OS tools to extract the data I need. In my example, sar -u for CPU, sar -n for network, vm_stat for memory and iotop for disk IO could return the values I needed; with a quick search you will find many more options. I also used a combination of awk, sed and grep to transform the values from these tools into a structure that was easier to use on the front end. Finally I pushed the results to InfluxDB using curl requests.

#!/bin/bash  
export INFLUX_SERVER=$1  
while [ 1 -eq 1 ];  
do

#######CREATE DATABASE ########
curl -G http://$INFLUX_SERVER:8086/query  -s --data-urlencode "q=CREATE DATABASE OSStat" > /dev/null

####### CPU  #########
sar 1 1 -u | tail -n 1 | awk -v MYHOST=$(hostname)   '{  print "cpu,host="MYHOST"  %usr="$2",%nice="$3",%sys="$4",%idle="$5}' | curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null

####### Memory ##########
# vm_stat reports counts of 4096-byte pages; the sums below convert page counts to MB
FREE_BLOCKS=$(vm_stat | grep free | awk '{ print $3 }' | sed 's/\.//')  
INACTIVE_BLOCKS=$(vm_stat | grep inactive | awk '{ print $3 }' | sed 's/\.//')  
SPECULATIVE_BLOCKS=$(vm_stat | grep speculative | awk '{ print $3 }' | sed 's/\.//')  
WIRED_BLOCKS=$(vm_stat | grep wired | awk '{ print $4 }' | sed 's/\.//')

FREE=$((($FREE_BLOCKS+SPECULATIVE_BLOCKS)*4096/1048576))  
INACTIVE=$(($INACTIVE_BLOCKS*4096/1048576))  
TOTALFREE=$((($FREE+$INACTIVE)))  
WIRED=$(($WIRED_BLOCKS*4096/1048576))  
ACTIVE=$(((4096-($TOTALFREE+$WIRED))))  
TOTAL=$((($INACTIVE+$WIRED+$ACTIVE)))

curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary  "memory,host="$(hostname)" Free="$FREE",Inactive="$INACTIVE",Total-free="$TOTALFREE",Wired="$WIRED",Active="$ACTIVE",total-used="$TOTAL > /dev/null

####### Disk IO ##########
iotop -t 1 1 -P | head -n 2  | grep 201 | awk -v MYHOST=$(hostname) \
  '{ print "diskio,host="MYHOST" io_time="$6",read_bytes="$8*1024",write_bytes="$11*1024}'  | curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null

###### NETWORK ##########
sar -n DEV 1  |grep -v IFACE|grep -v Average|grep -v -E ^$ | awk -v MYHOST="$(hostname)" '{print "net,host="MYHOST",iface="$2" pktin_s="$3",bytesin_s="$4",pktout_s="$5",bytesout_s="$6}'|curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null

sleep 10;  
done
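
The same kind of write can be issued from Python rather than curl - a minimal sketch against InfluxDB's HTTP write endpoint, using the OSStat database created above (host and field values are dummy examples):

import socket
import requests

INFLUX_SERVER = "localhost"  # assumed host

# InfluxDB line protocol: measurement,tag_key=tag_value field_key=value,...
line = "memory,host={h} Free={free},Active={active}".format(h=socket.gethostname(), free=1024, active=2048)

resp = requests.post("http://{s}:8086/write".format(s=INFLUX_SERVER),
                     params={"db": "OSStat"}, data=line)
resp.raise_for_status()  # a 204 response means the point was written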

 

 

InfluxDB Storage "InfluxDB is a time series database built from the ground up to handle high write and query loads. It is the second piece of the TICK stack. InfluxDB is meant to be used as a backing store for any use case involving large amounts of timestamped data, including DevOps monitoring, application metrics, IoT sensor data, and real-time analytics."

InfluxDB's SQL-like query language is called InfluxQL. You can connect to and query InfluxDB via curl requests (mentioned above), the command line, or the browser. The following sample InfluxQL statements cover useful basic command line operations to get you started:

influx -- Connect to the database

SHOW DATABASES  -- Show existing databases; _internal is the embedded database used for internal metrics

USE telegraf -- Make 'telegraf' the current database

SHOW MEASUREMENTS -- show all tables within current database

SHOW FIELD KEYS -- show tables definition within current database
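
Queries can also be issued over HTTP rather than from the influx shell. A small Python sketch against the /query endpoint (the cpu measurement and usage_idle field assume the default Telegraf CPU input):

import requests

# Ask InfluxDB for recent CPU idle readings collected by Telegraf
resp = requests.get("http://localhost:8086/query", params={
    "db": "telegraf",
    "q": 'SELECT mean("usage_idle") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)',
})
print(resp.json()["results"])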

InfluxDB also has a browser admin console that is by default accessible on port 8086 (Official Reference; read more on the Rittman Mead blog).


 

Grafana Visualisation "Grafana provides rich visualisation options best for working with time series data for Internet infrastructure and application performance analytics."

It is best to use InfluxDB as the data source for Grafana, as the Elasticsearch data source doesn't support all of Grafana's features, e.g. the functions behind the panels. Here is a good introductory video on visualisation with Grafana.


Categories: BI & Warehousing

Connecting Oracle Data Visualization Desktop to OBIEE

Mon, 2016-07-18 04:00

Recently at Rittman Mead we have been asked a lot of questions surrounding Oracle’s new Data Visualization Desktop tool and how it integrates with OBIEE. Rather than referring people to the Oracle docs on DVD, I decided to share with you my experience connecting to an OBIEE 12c instance and take you through some of the things I learned through the process.

In a previous blog, I went through database connections with Data Visualization Desktop and how to create reports using data pulled directly from the database. Connecting DVD to OBIEE is largely the same process, but allows the user to pull in data at the pre-existing report level. I decided to use our 12c ChitChat demo server as the OBIEE source and created some sample reports in Answers to test out with DVD.

From the DVD Data Sources page, clicking "Create New Data Source" brings up a selection pane with the option to select “From Oracle Applications.”


Clicking this option brings up a connection screen with options to enter a connection name, URL (the location of the reports you want to pull in as a source), username, and password. This seems like a pretty straightforward process. The Oracle docs on connectivity to OBIEE with DVD say to navigate to the web catalog, select the folder containing the analysis you want to use as a source, and then copy and paste the URL from your browser into the URL field in DVD. However, using this method will cause the connection to fail.

To get Data Visualization Desktop to connect properly, you have to use the URL that you would normally use to log into OBIEE analytics with the proper username and password.


Once connected, the web catalog folders are displayed.


From here, you can navigate to the analyses you want to use for data sources.


Selecting the analysis you want to use as your data source is the same process as selecting schemas and tables from a database source. Once the selection is made, a new screen is displayed with all of the tables and columns that were used for the analysis within OBIEE.


From here you can specify each column as an attribute or measure column and change the aggregation for your measures to something other than what was imported with the analysis.

Clicking "Add to Project" loads all the data into DVD under Data Elements and is displayed on the right hand side just like subject area contents in OBIEE.

Connecting Oracle Data Visualization Desktop to OBIEE

The objective of pulling data in from existing analyses is described by Oracle as revisualization. Keep in mind that Data Visualization Desktop is meant to be a discovery tool and not so much a day-to-day report generator.

The original report was a pivot table with Revenue and Order information for geographical, product and time series dimensions. Let’s say that I just wanted to look at the revenue for all cities located in the Americas by a specific brand for the year 2012.

Dragging in the appropriate columns and adding filters took seconds, and the data loaded almost instantaneously. I changed the view to a horizontal bar chart and added a descending sort on Revenue, and this was my result:

Notice how the revenue for San Francisco is much higher than that of any of the other cities. Let's say I want to get a closer look at all the other cities without seeing the revenue data for San Francisco. I could create a new filter for City and exclude San Francisco from the list, or I could just create a filter range for Revenue. Choosing the latter gave me the option of moving a slider to change my revenue value distribution and showed me the results in real time. Pretty cool, right?

Taking one report and loading it in can open up a wide range of data discovery opportunities, but what if there are multiple reports I want to pull data from? You can do this and combine the data together in DVD, as long as the two reports contain columns with which to join them.

Going back to my OBIEE connection, there are two reports I created on the demo server that both contain customer data.


By pulling in both the Customer Information and Number of Customer Orders Per Year reports, Data Visualization Desktop creates two separate data sources, which show up under Data Elements.

Inspecting one of the data sources shows that the match between the two is made on both the Customer Number and Customer Name columns.

Connecting Oracle Data Visualization Desktop to OBIEE

Note: It is possible to make your own column matches manually using the Add Another Match feature.

By using two data sets from two different reports, you can blend the data together to discover trends, show outliers and view the data together without touching the database or having to create new reports within OBIEE.

Connecting Oracle Data Visualization Desktop to OBIEE
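Conceptually, the blend that DVD performs on those matched columns is just a join. Here's a rough pandas sketch of the same idea - the dataframe and column names are purely illustrative, and this is not what DVD does internally:

import pandas as pd

# two tiny stand-ins for the two OBIEE reports
customer_info = pd.DataFrame({
    'Customer Number': [101, 102],
    'Customer Name': ['A. Smith', 'B. Jones'],
    'Region': ['Americas', 'EMEA']
})
orders_per_year = pd.DataFrame({
    'Customer Number': [101, 102],
    'Customer Name': ['A. Smith', 'B. Jones'],
    'Orders': [12, 7]
})

# blend the two "reports" on the matched columns
blended = customer_info.merge(
    orders_per_year, on=['Customer Number', 'Customer Name'], how='left'
)
print(blended)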

The ability to connect directly to OBIEE with Data Visualization Desktop and pull in data from individual analyses is a very powerful feature that makes DVD that much more valuable. Combining data from multiple analyses and blending them together internally creates some exciting data discovery possibilities for users with existing OBIEE implementations.

Categories: BI & Warehousing

Using R with Jupyter Notebooks and Oracle Big Data Discovery

Thu, 2016-07-14 05:00
Using R with Jupyter Notebooks and Oracle Big Data Discovery

Oracle's Big Data Discovery encompasses a good amount of exploration, transformation, and visualisation capabilities for datasets residing in your organisation’s data reservoir. Even with this though, there may come a time when your data scientists want to unleash their R magic on those same datasets. Perhaps the data domain expert has used BDD to enrich and cleanse the data, and now it's ready for some statistical analysis? Maybe you'd like to use R's excellent forecast package to predict the next six months of a KPI from the BDD dataset? And not only predict it, but write it back into the dataset for subsequent use in BDD? This is possible using BDD Shell and rpy2. It enables advanced analysis and manipulation of datasets already in BDD. These modified datasets can then be pushed back into Hive and then BDD.

BDD Shell provides a native Python environment, and you may opt to use the pandas library to work with BDD datasets as detailed here. In other cases you may simply prefer working with R, or have a particular library in mind that only R offers natively. In this article we’ll see how to do that. The "secret sauce" is rpy2 which enables the native use of R code within a python-kernel Jupyter Notebook.
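For a sense of what rpy2 itself provides beneath those notebook magics, here is a minimal plain-Python sketch. It's my own illustration rather than part of the notebooks that follow, and it simply mirrors the %R example further down:

import readline                      # same libreadline workaround as noted below
import rpy2.robjects as ro

X = ro.FloatVector([1, 4, 5, 7])     # build an R numeric vector from Python values
ro.globalenv['X'] = X                # bind it into the R global environment
print(ro.r('sd(X)')[0])              # evaluate R code and read the result back in Python
print(ro.r('mean(X)')[0])            # 4.25, matching the %R example below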

As with previous articles I’m using a Jupyter Notebook as my environment. I’ll walk through the code here, and finish with a copy of the notebook so you can see the full process.

First we'll see how you can use R in Jupyter Notebooks running a python kernel, and then expand out to integrate with BDD too. You can view and download the first notebook here.

Import the RPY2 environment so that we can call R from Jupyter

import readline is necessary to work around the error: /u01/anaconda2/lib/libreadline.so.6: undefined symbol: PC

import readline
%load_ext rpy2.ipython
Example usage

Single inline command, prefixed with %R
%R X=c(1,4,5,7); sd(X); mean(X)
array([ 4.25])
R code block, marked by %%R
%%R
Y = c(2,4,3,9)
summary(lm(Y~X))
Call:  
lm(formula = Y ~ X)

Residuals:  
    1     2     3     4  
 0.88 -0.24 -2.28  1.64 

Coefficients:  
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.0800     2.3000   0.035    0.975  
X             1.0400     0.4822   2.157    0.164

Residual standard error: 2.088 on 2 degrees of freedom  
Multiple R-squared:  0.6993,    Adjusted R-squared:  0.549  
F-statistic: 4.651 on 1 and 2 DF,  p-value: 0.1638
Graphics plot, output to the notebook
%R plot(X, Y)
Using R with Jupyter Notebooks and Oracle Big Data Discovery

Pass Python variable to R using -i
import numpy as np
Z = np.array([1,4,5,10])
%R -i Z mean(Z)
array([ 5.])
For more information see the documentation.

Working with BDD Datasets from R in Jupyter Notebooks

Now that we've seen calling R in Jupyter Notebooks, let's see how to use it with BDD in order to access datasets. The first step is to instantiate the BDD Shell so that you can access the datasets in BDD, and then to set up the R environment using rpy2:
execfile('ipython/00-bdd-shell-init.py')  
%load_ext rpy2.ipython

I also found that I had to make readline available otherwise I got an error (/u01/anaconda2/lib/libreadline.so.6: undefined symbol: PC)

import readline  

After this, we can import a BDD dataset, convert it to a Spark dataframe and then a pandas dataframe, ready for passing to R

ds = dss.dataset('edp_cli_edp_8d6fd230-8e99-449c-9480-0c2bddc4f6dc')  
spark_df = ds.to_spark()  
import pandas as pd  
pandas_df = spark_df.toPandas()  

Note that there is a lot of passing of the same dataframe into different memory structures here - from BDD dataset context to Spark to Pandas, and that’s before we’ve even hit R. It’s fine for ad-hoc wrangling but might start to be painful with very large datasets.
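One way to keep that hop manageable - a suggestion of mine rather than something from the original notebook - is to check the size in Spark first and only convert a sample if the dataset is large. The threshold and sampling fraction below are arbitrary illustrations:

row_count = spark_df.count()
if row_count > 1000000:
    # work on a 10% sample rather than pulling every row into pandas
    pandas_df = spark_df.sample(withReplacement=False, fraction=0.1, seed=42).toPandas()
else:
    pandas_df = spark_df.toPandas()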

Now we use the rpy2 integration with Jupyter Notebooks and invoke R parsing of the cell’s contents, using the %%R syntax. Optionally, we can pass across variables with the -i parameter, which we’re doing here. Then we assign the dataframe to an R-notation variable (optional, but stylistically nice to do), and then use R's summary function to show a summary of each attribute:

%%R -i pandas_df  
R.df <- pandas_df  
summary(R.df)  
vendorid     tpep_pickup_datetime tpep_dropoff_datetime passenger_count  
 Min.   :1.000   Min.   :1.420e+12    Min.   :1.420e+12     Min.   :0.000  
 1st Qu.:1.000   1st Qu.:1.427e+12    1st Qu.:1.427e+12     1st Qu.:1.000  
 Median :2.000   Median :1.435e+12    Median :1.435e+12     Median :1.000  
 Mean   :1.525   Mean   :1.435e+12    Mean   :1.435e+12     Mean   :1.679  
 3rd Qu.:2.000   3rd Qu.:1.443e+12    3rd Qu.:1.443e+12     3rd Qu.:2.000  
 Max.   :2.000   Max.   :1.452e+12    Max.   :1.452e+12     Max.   :9.000  
 NA's   :12      NA's   :12           NA's   :12            NA's   :12     
 trip_distance      pickup_longitude  pickup_latitude    ratecodeid    
 Min.   :    0.00   Min.   :-121.93   Min.   :-58.43   Min.   : 1.000  
 1st Qu.:    1.00   1st Qu.: -73.99   1st Qu.: 40.74   1st Qu.: 1.000  
 Median :    1.71   Median : -73.98   Median : 40.75   Median : 1.000  
 Mean   :    3.04   Mean   : -72.80   Mean   : 40.10   Mean   : 1.041  
 3rd Qu.:    3.20   3rd Qu.: -73.97   3rd Qu.: 40.77   3rd Qu.: 1.000  
 Max.   :67468.40   Max.   : 133.82   Max.   : 62.77   Max.   :99.000  
 NA's   :12         NA's   :12        NA's   :12       NA's   :12      
 store_and_fwd_flag dropoff_longitude dropoff_latitude  payment_type 
 N   :992336        Min.   :-121.93   Min.   : 0.00    Min.   :1.00  
 None:    12        1st Qu.: -73.99   1st Qu.:40.73    1st Qu.:1.00  
 Y   :  8218        Median : -73.98   Median :40.75    Median :1.00  
                    Mean   : -72.85   Mean   :40.13    Mean   :1.38  
                    3rd Qu.: -73.96   3rd Qu.:40.77    3rd Qu.:2.00  
                    Max.   :   0.00   Max.   :44.56    Max.   :5.00  
                    NA's   :12        NA's   :12       NA's   :12    
  fare_amount          extra            mta_tax          tip_amount     
 Min.   :-170.00   Min.   :-1.0000   Min.   :-1.7000   Min.   :  0.000  
 1st Qu.:   6.50   1st Qu.: 0.0000   1st Qu.: 0.5000   1st Qu.:  0.000  
 Median :   9.50   Median : 0.0000   Median : 0.5000   Median :  1.160  
 Mean   :  12.89   Mean   : 0.3141   Mean   : 0.4977   Mean   :  1.699  
 3rd Qu.:  14.50   3rd Qu.: 0.5000   3rd Qu.: 0.5000   3rd Qu.:  2.300  
 Max.   : 750.00   Max.   :49.6000   Max.   :52.7500   Max.   :360.000  
 NA's   :12        NA's   :12        NA's   :12        NA's   :12       
  tolls_amount      improvement_surcharge  total_amount       PRIMARY_KEY     
 Min.   : -5.5400   Min.   :-0.3000       Min.   :-170.80   0-0-0   :      1  
 1st Qu.:  0.0000   1st Qu.: 0.3000       1st Qu.:   8.75   0-0-1   :      1  
 Median :  0.0000   Median : 0.3000       Median :  11.80   0-0-10  :      1  
 Mean   :  0.3072   Mean   : 0.2983       Mean   :  16.01   0-0-100 :      1  
 3rd Qu.:  0.0000   3rd Qu.: 0.3000       3rd Qu.:  17.80   0-0-1000:      1  
 Max.   :503.0500   Max.   : 0.3000       Max.   : 760.05   0-0-1001:      1  
 NA's   :12         NA's   :12            NA's   :12        (Other) :1000560  

We can use native R code and R libraries including the excellent dplyr to lightly wrangle and then chart the data:

%%R

library(dplyr)  
library(ggplot2)

R.df %>%  
    filter(fare_amount > 0) %>%  
    ggplot(aes(y=fare_amount, x=tip_amount,color=passenger_count)) +  
    geom_point(alpha=0.5 )
Using R with Jupyter Notebooks and Oracle Big Data Discovery

Finally, using the -o flag on the %%R invocation, we can pass variables from the R context back to pandas:

%%R -o R_output  
R_output <-  
    R.df %>%  
    mutate(foo = 'bar')

and from there back to Spark and write the results to Hive:

spark_df2 = sqlContext.createDataFrame(R_output)  
spark_df2.write.mode('Overwrite').saveAsTable('default.updated_dataset')  

and finally ingest the new Hive table to BDD:

from subprocess import call  
call(["/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI","--table default.updated_dataset"])  

You can download the notebook here.

https://gist.github.com/6f7d3138efdbb322d8543d35912c99ab

Categories: BI & Warehousing

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

Wed, 2016-07-13 09:02
Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

Big Data Discovery (BDD) is a great tool for exploring, transforming, and visualising data stored in your organisation’s Data Reservoir. I presented a workshop on it at a recent conference, and got an interesting question from the audience that I thought I’d explore further here. Currently the primary route for getting data into BDD requires that it be (i) in HDFS and (ii) have a Hive table defined on top of it. From there, BDD automagically ingests the Hive table, or the dataprocessingCLI is manually called which prompts the BDD DGraph engine to go and sample (or read in full) the Hive dataset.

This is great, and works well where the dataset is vast (this is Big Data, after all) and needs the sampling that DGraph provides. It’s also simple enough for Hive tables that have already been defined, perhaps by another team. But - and this was the gist of the question that I got - what about where the Hive table doesn’t exist already? Because if it doesn’t, we now need to declare all the columns as well as choose the all-important SerDe in order to read the data.

SerDes are brilliant, in that they enable the application of a schema-on-read to data in many forms, but at the very early stages of a data project there are probably going to be lots of formats of data (such as TSV, CSV, JSON, as well as log files and so on) from varying sources. Choosing the relevant SerDe for each one, and making sure that BDD is also configured with the necessary jar, as well as manually listing each column to be defined in the table, adds overhead to the project. Wouldn’t it be nice if we could side-step this step somehow? In this article we’ll see how!

Importing Datasets through BDD Studio

Before we get into more fancy options, don’t forget that BDD itself offers the facility to upload CSV, TSV, and XLSX files, as well as connect to JDBC datasources. Data imported this way will be stored by BDD in a Hive table and ingested to DGraph.

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

This is great for smaller files held locally. But what about files on your BDD cluster that are too large to upload from a local machine, or in other formats - such as JSON?

Loading a CSV file

As we’ve just seen, CSV files can be imported to Hive/BDD directly through the GUI. But perhaps you’ve got a large CSV file sat local to BDD that you want to import? Or a folder full of varying CSV files that would be too time-consuming to upload through the GUI one-by-one?

For this we can use BDD Shell with the Python Pandas library, and I’m going to do so here through the excellent Jupyter Notebooks interface. You can read more about these here and details of how to configure them on BigDataLite 4.5 here. The great thing about notebooks, whether Jupyter or Zeppelin, is that I don’t need to write any more blog text here - I can simply embed the notebook inline and it is self-documenting:

https://gist.github.com/76b477f69303dd8a9d8ee460a341c445

(gist link)

Note that at the end of this we call dataprocessingCLI to automatically bring the new table into BDD’s DGraph engine for use in BDD Studio. If you’ve got BDD configured to automagically add new Hive tables, or you don’t want to run this step, you can just comment it out.
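Since the embedded notebook isn't reproduced in this text, here's a rough sketch of the approach it takes - the file path and table name are hypothetical, and sqlContext comes from the BDD Shell initialisation:

import pandas as pd

# read the local CSV with pandas - no column list or SerDe needed
pandas_df = pd.read_csv('/home/oracle/incoming/sales_extract.csv')

# hand it to Spark and persist it as a Hive table, ready for BDD ingestion
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.write.mode('Overwrite').saveAsTable('default.sales_extract')

# optionally trigger the ingest, as noted above
from subprocess import call
call(["/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI",
      "--table", "default.sales_extract"])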

Loading simple JSON data

Whilst CSV files are tabular by definition, JSON records can contain nested objects (recursively), as well as arrays. Let’s look at an example of using SparkSQL to import a simple flat JSON file, before then considering how we handle nested and array formats. Note that SparkSQL can read datasets from both local (file://) storage as well as HDFS (hdfs://):

https://gist.github.com/8b7118c230f34f7d57bd9b0aa4e0c34c

(gist link)
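The notebook itself is embedded above, but the core of it is only a couple of lines. A sketch, with a hypothetical path and table name:

# schema (columns and types) is inferred from the JSON itself
df = sqlContext.read.json('hdfs:///user/oracle/incoming/simple_data/')
df.printSchema()

# persist to Hive for BDD to pick up
df.write.mode('Overwrite').saveAsTable('default.simple_data')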

Once loaded into Hive, it can be viewed in Hue:

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

Loading nested JSON data

What’s been great so far, whether loading CSV, XLS, or simple JSON, is that we’ve not had to list out column names. All that needs modifying in the scripts above to import a different file with a different set of columns is to change the filename and the target tablename. Now we’re going to look at an example of a JSON file with nested objects - which is very common in JSON - and we’re going to have to roll our sleeves up a tad and start hardcoding some schema details.

First up, we import the JSON to a SparkSQL dataframe as before (although this time I’m loading it from HDFS, but local works too):

df = sqlContext.read.json('hdfs:///user/oracle/incoming/twitter/2016/07/12/')  

Then I declare this as a temporary table, which enables me to subsequently run queries with SQL against it:

df.registerTempTable("twitter")  

A very simple example of a SQL query would be to look at the record count:

result_df = sqlContext.sql("select count(*) from twitter")  
result_df.show()

+----+  
| _c0|  
+----+  
|3011|  
+----+

The result of a sqlContext.sql invocation is a dataframe, which above I’m assigning to a new variable, but I could as easily run:

sqlContext.sql("select count(*) from twitter").show()  

for the same result.

The sqlContext has inferred the JSON schema automagically, and we can inspect it using

df.printSchema()  

The twitter schema is huge, so I’m just quoting a few choice sections of it here to illustrate subsequent points:

root  
|-- created_at: string (nullable = true)  
|-- entities: struct (nullable = true)  
|    |-- hashtags: array (nullable = true)  
|    |    |-- element: struct (containsNull = true)  
|    |    |    |-- indices: array (nullable = true)  
|    |    |    |    |-- element: long (containsNull = true)  
|    |    |    |-- text: string (nullable = true)  
|    |-- user_mentions: array (nullable = true)  
|    |    |-- element: struct (containsNull = true)  
|    |    |    |-- id: long (nullable = true)  
|    |    |    |-- id_str: string (nullable = true)  
|    |    |    |-- indices: array (nullable = true)  
|    |    |    |    |-- element: long (containsNull = true)  
|    |    |    |-- name: string (nullable = true)  
|    |    |    |-- screen_name: string (nullable = true)  
|-- source: string (nullable = true)  
|-- text: string (nullable = true)  
|-- timestamp_ms: string (nullable = true)  
|-- truncated: boolean (nullable = true)  
|-- user: struct (nullable = true)  
|    |-- followers_count: long (nullable = true)  
|    |-- following: string (nullable = true)  
|    |-- friends_count: long (nullable = true)  
|    |-- name: string (nullable = true)  
|    |-- screen_name: string (nullable = true)

Points to note about the schema:

  • In the root of the schema we have attributes such as text and created_at
  • There are nested elements (“struct”) such as user and, within it, screen_name, followers_count, etc.
  • There are also array objects, where an attribute can occur more than once, such as hashtags and user_mentions.

Accessing root and nested attributes is easy - we just use dot notation:

sqlContext.sql("SELECT created_at, user.screen_name, text FROM twitter").show()

+--------------------+--------------+--------------------+  
|          created_at|   screen_name|                text|  
+--------------------+--------------+--------------------+  
|Tue Jul 12 16:13:...|  Snehalstocks|"Students need to...|  
|Tue Jul 12 16:13:...|   KingMarkT93|Ga caya :( https:...|

We can save this as a dataframe that’s then persisted to Hive, for ingest into BDD:

subset02 = sqlContext.sql("SELECT created_at, user.screen_name, text FROM twitter")  
tablename = 'twitter_user_text'  
qualified_tablename='default.' + tablename  
subset02.write.mode('Overwrite').saveAsTable(qualified_tablename)  

Which in Hue looks like this:

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

Attributes in an array are a bit more tricky. Here’s an example tweet with multiple user_mentions and a hashtag too:

https://twitter.com/flederbine/status/752940179569115136

Here we use the LATERAL VIEW syntax, with the optional OUTER operator since not all tweets have these additional entities, and we want to make sure we show all tweets including those that don’t have these entities. Here’s the SQL formatted for reading:

SELECT id,  
created_at,  
user.screen_name,  
text as tweet_text,  
hashtag.text as hashtag,  
user_mentions.screen_name as mentioned_user  
from twitter  
LATERAL VIEW OUTER explode(entities.user_mentions) user_mentionsTable as user_mentions  
LATERAL VIEW OUTER explode(entities.hashtags) hashtagsTable AS hashtag  

Which, when run from sqlContext.sql(), gives us:

+------------------+--------------------+---------------+--------------------+-------+---------------+  
|                id|          created_at|    screen_name|          tweet_text|hashtag|    screen_name|  
+------------------+--------------------+---------------+--------------------+-------+---------------+  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|      johnnyq72|  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|       orcldoug|  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|          rmoff|  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|    markrittman|  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|     mikedurran|  
+------------------+--------------------+---------------+--------------------+-------+---------------+

and written back to Hive for ingest to BDD:

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

You can use these SQL queries both for simply flattening JSON, as above, or for building summary tables, such as this one showing the most common hashtags in the dataset:

sqlContext.sql("SELECT hashtag.text,count(*) as inst_count from twitter  LATERAL VIEW OUTER explode(entities.hashtags) hashtagsTable AS hashtag GROUP BY hashtag.text order by inst_count desc").show(4)

+-----------+----------+  
|       text|inst_count|  
+-----------+----------+  
|     Hadoop|       165|  
|     Oracle|       151|  
|        job|       128|  
|    BigData|       112|

You can find the full Jupyter Notebook with all these nested/array JSON examples here:

https://gist.github.com/a38e853d3a7dcb48a9df99ce1e3505ff

(gist link)

You may decide after looking at this that you’d rather just go back to Hive and SerDes, and as is frequently the case in ‘data wrangling’ there are multiple ways to achieve the same end. The route you take comes down to personal preference and familiarity with the toolsets. In this particular case I'd still go for SparkSQL for the initial exploration, as it's quicker to 'poke around' the dataset than with defining and re-defining Hive tables -- YMMV. A final point to consider is that SparkSQL importing JSON and saving back to HDFS/Hive is a static process, so if your underlying data is changing (e.g. streaming to HDFS from Flume) then you would probably want a Hive table over the HDFS file so that it is live when queried.
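To illustrate that last point, a Hive external table defined over the HDFS location stays 'live' as new files land. This is a sketch only - the columns are a tiny illustrative subset of the tweet schema, the location is the example path used earlier, and the JSON SerDe shown needs its jar (hive-hcatalog-core) available to Hive and to BDD:

# define the table once; subsequent queries see whatever files are in the location
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS default.twitter_live (
        created_at   STRING,
        text         STRING,
        timestamp_ms STRING
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 'hdfs:///user/oracle/incoming/twitter/2016/07/12/'
""")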

Loading an Excel workbook with many sheets

This was the use-case that led me to researching programmatic import of datasets in the first place. I was doing some work with a dataset of road traffic accident data, which included a single XLS file with over 30 sheets, each a lookup table for a separate set of dimension attributes. Importing each sheet one by one through the BDD GUI was tedious, and being a lazy geek, I looked to automate it.

Using the Pandas read_excel function and a smidge of Python to loop through each sheet, it was easily done. You can see the full notebook here:

https://gist.github.com/rmoff/3fa5d857df8ca5895356c22e420f3b22

(gist link)
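The gist has the full notebook; the heart of it is a loop along these lines. The file path, table prefix, and sheet-name handling here are my own illustrative choices rather than what the notebook necessarily does:

import re
import pandas as pd

xls_path = '/home/oracle/incoming/road_accident_lookups.xls'
xl = pd.ExcelFile(xls_path)

for sheet in xl.sheet_names:
    pandas_df = xl.parse(sheet)                              # one sheet -> one pandas dataframe
    # derive a Hive-friendly table name from the sheet name
    tablename = 'default.lookup_' + re.sub('[^0-9a-zA-Z]+', '_', sheet).lower()
    sqlContext.createDataFrame(pandas_df).write.mode('Overwrite').saveAsTable(tablename)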

Categories: BI & Warehousing

Rittman Mead at KScope16

Tue, 2016-06-21 09:01
Rittman Mead at KScope16

June is the perfect month: summer begins, major football (and futbol) tournaments are in full swing, and of course, KScope16 is on! Rittman Mead have participated in many of the past KScope conferences and will continue that tradition this year with a wide range of presentation and training topics across the Business Intelligence, Data Warehousing, Big Data, and Advanced Analytics subject areas. This year the event is held in Chicago at the Sheraton Grand Chicago, a great location right on the river and a short distance from sights such as the Navy Pier and Magnificent Mile. And, as always, there will be some great networking opportunities throughout the week.

Rittman Mead at KScope16

This year, we're pleased to congratulate our own Becky Wagner (of Becky's BI Apps Corner) on her graduation from the ODTUG Leadership Program. This is the third class to complete the program, which aims to build leadership skills, confidence, and work/life balance skills, amongst many other things, throughout the 9 month class duration. Well done, Becky! You'll be able to find her and the rest of the 2016 Leadership class at the Career Community/Leadership Program Information Breakfast - Monday, June 27th 7:30AM at the Sheraton hotel main dining area. They will be kicking off the application season for the 2017 Leadership Program and providing further information about the new ODTUG career program as well.

The rest of the Rittman Mead attendees will be presenting one or more sessions during the week.

Rittman Mead at KScope16

The week kicks off with my talk Monday on GoldenGate and Kafka integration using the new Oracle GoldenGate for Big Data release 12.2. The topic of this session, real-time data streaming, provides a great solution for the challenging task of ingesting data from multiple different sources, and different structures, and making that data readily available for others in the organization to use. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution.

Rittman Mead at KScope16

Tuesday morning, Charles Elliott, Managing Director of Rittman Mead in the US, will begin the first of his two-part hands on lab sessions on Free Form Data Visualizations. This hands-on introduction to all things visual in OBIEE’s new Visual Analyzer (VA) will allow attendees to see the top reporting features within the current release while also understanding the core differences between OBIEE’s previous reporting platform and what’s available now. In addition to creating some of the most eye-catching visualizations the new tool has to offer, this session will highlight new OBIEE 12c features such as mapping, advanced analytics, and an overview of the brand-new web catalog. Whether you are just interested in what’s new in free-form Oracle BI visualizations or preparing for an upgrade, this session is for you!

During the lunch and learn sessions you can find me moderating the BI and Data Warehousing panel and Mark Rittman will be moderating the Big Data and Advanced Analytics crew. Grab a lunch and join us!

The remainder of Rittman Mead sessions fall on Wednesday, with some interesting topics around Big Data Discovery, OBIEE branding and Essbase integration, and the Kimball ETL Subsystems. Also, the second day of the Free Form Data Visualizations HOL will be held on Wednesday afternoon.

Rittman Mead at KScope16

It all starts with Mark Rittman presenting "OBIEE 12c: What's New for Integration and Reporting Against EPM Sources". He'll dig into interesting new features around OBIEE 12c and Essbase integration, including improvements around using Essbase as a query acceleration layer and general improvements around usability and RPD/Outline integration. Come to this session to see what they are, how they affect OBIEE/Essbase integration, and where this initiative may be going over time.

Following Mark and his dive into OBIEE and Essbase on Wednesday will be my session, “A Walk Through the Kimball ETL Subsystems with Oracle Data Integration", with a focus on how to implement the ETL Subsystems using Oracle Data Integration solutions. As you may know, Ralph Kimball, dimensional modeling and data warehousing expert and founder of The Kimball Group, spent much of his career working to build an enterprise data warehouse methodology that can meet analytical reporting needs. His book, “The Data Warehouse ETL Toolkit,” is a guide for many ETL developers. This session will walk you through his ETL Subsystem categories, describing how the Oracle Data Integration products (Oracle Data Integrator, GoldenGate, & Enterprise Data Quality) are perfectly suited for the Kimball approach.

After lunch, Andy Rocha and Pete Tamisin will dive into "How to Brand and Own Your OBIEE Interface: Past, Present, and Future". In this presentation, they will explore the different techniques in branding and customizing your OBIEE environment in different versions: 11.1.1.7, 11.1.1.9, and 12c. What that really means is they will talk about how to brand the interface in a thoughtful way… not just throwing the company’s colors on the screen, but thinking about what will actually make the application more attractive and useful for the user community.

Rittman Mead at KScope16

Closing out the Rittman Mead Wednesday sessions is Mark Rittman with his unique look at machine learning and advanced visualizations using Oracle Big Data Discovery (BDD). Mark will use the BDD Shell, pySpark, and Jupyter to ingest, process and run machine learning routines on IoT and smart home data, which he has been collecting over the past year. Oracle Big Data Discovery provides a graphical environment for cataloging, searching, and displaying data from Hadoop and other sources along with some basic data wrangling features and display options. In this session, he'll look at how BDD can be extended using the aforementioned APIs to cover more advanced predictive modeling, machine learning, and data science tasks, and how the graphics API can be used to add custom D3 advanced visualizations for your datasets and analyses. I'm definitely looking forward to this one!

We're all looking forward to attending the event in Chicago this year and can't wait for next week! If you'd like to get together to discuss any of the above topics we're presenting, or anything data integration or BI related, feel free to reach out via email (info@rittmanmead.com), twitter (@mRainey) or drop a note in the comments below. Hope to see you there!

Categories: BI & Warehousing

Using Jupyter Notebooks with Big Data Discovery 1.2

Wed, 2016-06-15 09:00
Using Jupyter Notebooks with Big Data Discovery 1.2

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

Whilst BDD Shell is command-line based, there's also the option to run Jupyter Notebooks (previously IPython Notebooks), which is a web-based interactive "Notebook". This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a 'Notebook' enables you to modify and re-run commands, and then once correct retain them as a fully functioning script for future use.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you'd find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2.

For information how on to set up BDD Shell and Jupyter Notebooks, see this previous post. For the purpose of this article I'm running Jupyter on port 18888 so as not to clash with Hue:

  
cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell  
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It's important that you run this from the bdd-shell folder, otherwise the BDD shell won't initialise properly
  • Jupyter by default only listens locally, so you need to use a web browser local to the server, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and from the New menu select a Python 2 notebook:

Using Jupyter Notebooks with Big Data Discovery 1.2

You should then see an empty notebook, ready for use:

Using Jupyter Notebooks with Big Data Discovery 1.2

The 'cell' (grey box after the In [ ]:) is where you enter code to run - type in execfile('ipython/00-bdd-shell-init.py') and press shift-Enter. This will execute it - if you don't press shift you just get a newline. Whilst it's executing you'll notice the line prefix changes from [ ] to [*], and in the terminal window from which you launched Jupyter you'll see some output related to the BDD Shell starting

  
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).  
WARNING: Running spark-class from user-defined location.  
spark.driver.cores is set but does not apply in client mode.  

Now back in the Notebook, enter the following - use Enter, not Shift-enter, between lines:

  
dss = bc.datasets()  
dss.count  

Using Jupyter Notebooks with Big Data Discovery 1.2

Now press shift-enter to execute it. This uses the pre-defined bc BDD context to get the datasets object, and return a count from it.

Using Jupyter Notebooks with Big Data Discovery 1.2

By clicking the + button on the toolbar, using the up and down arrows on the toolbar, and the Code/Markdown dropdown, it's possible to insert "cells" which are not code but instead commentary on what the code is. This way you can produce fully documented, but executable, code objects.

Using Jupyter Notebooks with Big Data Discovery 1.2

From the File menu, give the notebook a name, and then select Close and Halt, which destroys the Jupyter process ('kernel') that was executing the BDD Shell session. Back at the Jupyter main page, you'll note that an ipynb file has been created, which holds the notebook definition and can be downloaded, sent to colleagues, uploaded to blogs to share, saved in source control, and so on. Here's the file for the notebook above - note that it's hosted on gist, which automagically previews it as a Notebook, so click on Raw to see the actual code behind it.

The fantastically powerful thing about the Notebooks is that you can modify and re-run steps as you go -- but you never lose the history of how you got somewhere. Most people will be familiar with learning or exploring a tool and its capabilities and eventually getting it to work - but having no idea how they got there. Even for experienced users of a tool, being able to prove how to replicate a final result is important for (a) showing the evidence for how they got there and (b) enabling others to take that work and build on it.

With an existing notebook file, whether a saved one you created or one that someone sent you, you can reopen it in Jupyter and re-execute it, in order to replicate the results previously seen. This is an important tenet of [data] science in general - show your workings - and it's great that Big Data Discovery supports this option. Obviously, showing the count of datasets is not so interesting or important to replicate. The real point here is being able to take datasets that you've got in BDD and have already done some joining and wrangling on, taking advantage of the GUI, and then dive deep into the data science and analytics world of things like Spark MLlib, Pandas, and so on. As a simple example, I can use a couple of python libraries (installed by default with Anaconda) to plot a correlation matrix for one of my BDD datasets:

Using Jupyter Notebooks with Big Data Discovery 1.2
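The code behind a plot like that is only a few lines. A sketch - the dataset ID is a placeholder for one of the IDs listed by bc.datasets(), and the plotting choices are mine (pandas and matplotlib both ship with Anaconda):

%matplotlib inline
import matplotlib.pyplot as plt

ds = dss.dataset('edp_cli_edp_xxxxxxxx')     # placeholder dataset ID
pandas_df = ds.to_spark().toPandas()         # BDD dataset -> Spark dataframe -> pandas

corr = pandas_df.corr()                      # pairwise correlation of the numeric columns
fig, ax = plt.subplots(figsize=(8, 8))
cax = ax.matshow(corr, vmin=-1, vmax=1)
fig.colorbar(cax)
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticklabels(corr.columns)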

As well as producing visualisations or calculations within BDD shell, the real power comes in being able to push the modified data back into Hive, and thus continue to work with it within BDD.

With Jupyter Notebooks not only can you share the raw notebooks for someone else to execute, you can export the results to HTML, PDF, and so on. Here's the notebook I started above, developed out further and exported to HTML - note how you can see not only the results, but exactly the code that I ran in order to get them. In this I took the dataset from BDD, added a column into it using a pandas windowing function, and then saved it back to a new Hive table:
(you can view the page natively here, and the ipynb here)

https://gist.github.com/rmoff/f1024dd043565cb8a58a0c54e9a782f2
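That kind of step looks roughly like this - a sketch only, with hypothetical column names from the accident dataset, and ds being a BDD dataset pulled in as sketched above:

pandas_df = ds.to_spark().toPandas()                      # BDD dataset into pandas, as above

# add a rolling 7-row average as a new column (column names are hypothetical)
pandas_df = pandas_df.sort_values('accident_date')
pandas_df['casualties_7day_avg'] = pandas_df['number_of_casualties'].rolling(window=7).mean()

# convert back to Spark and write to a new Hive table, ready to ingest into BDD
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.write.mode('Overwrite').saveAsTable('default.updated_accident_data')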

Once the data's been written back to Hive from the Python processing, I ran BDD's data_processing_CLI to add the new table back into BDD:

/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI --table updated_accident_data

And once that's run, I can then continue working with the data in BDD:

Using Jupyter Notebooks with Big Data Discovery 1.2

Using Jupyter Notebooks with Big Data Discovery 1.2

This workflow enables a continual loop of data wrangling, enrichment, advanced processing, and visualisation - all using the most appropriate tools for the job.

You can also use BDD Shell/Jupyter as another route for loading data into BDD. Whilst you can import CSV and XLS files into BDD directly through the web GUI, there are limitations - such as an XLS workbook with multiple sheets having to be imported one sheet at a time. I had an XLS file with over 40 sheets of reference data in it, which was not going to be time-efficient to load one at a time into BDD.

Pandas supports a lot of different input types - including Excel files. So by using Pandas to pull the data in, then converting it to a Spark dataframe, I can write it to Hive, from where it can be imported to BDD. As before, the beauty of the Notebook approach is that I could develop and refine the code, and then simply share the Notebook here:

https://gist.github.com/rmoff/3fa5d857df8ca5895356c22e420f3b22

Categories: BI & Warehousing

Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

Mon, 2016-06-13 09:19

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets, particularly when used in conjunction with Jupyter Notebooks. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you’d find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2. In this article we’ll see how to configure BDD Shell on Big Data Lite 4.5 (along with Jupyter Notebooks), and in a subsequent post dive into how to actually use them.

Setting up BDD Shell on Big Data Lite

You can find the BDD Shell installation document here.

Login to BigDataLite 4.5 (oracle/welcome1) and open a Terminal window. The first step is to download Anaconda, which is a distribution of Python that also includes “[…] over 100 of the most popular Python, R and Scala packages for data science” as well as Jupyter notebook, which we’ll see in a moment.

cd ~/Downloads/
wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh

Then install it: (n.b. bash is part of the command to enter)

bash Anaconda2-4.0.0-Linux-x86_64.sh

Accept the licence when prompted, and then select an install location – I used /u01/anaconda2, where the rest of the BigDataLite installs are

Anaconda2 will now be installed into this location:
/home/oracle/anaconda2

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/oracle/anaconda2] >>> /u01/anaconda2

After a few minutes of installation, you’ll be prompted as to whether you want to prepend Anaconda’s location to the PATH environment variable. I opted not to (which is the default), since Python is used elsewhere on the system and by prepending it, Anaconda’s version would take priority and possibly break things.

Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /home/oracle/.bashrc ? [yes|no]
[no] >>> no

Now edit the BDD Shell configuration file (/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.conf) in your favourite text editor to add/amend the following lines:

SPARK_EXECUTOR_PYTHON=/u01/anaconda2/bin/python
LOCAL_PYTHON_HOME=/u01/anaconda2

Amend the path if you didn’t install Anaconda into /u01

In the same configuration file, add/amend:

SPARK_HOME=/usr/lib/spark/
SPARK_EXTRA_CLASSPATH=/usr/lib/oozie/oozie-sharelib-yarn/lib/spark/spark-avro_2.10-1.1.0-cdh5.7.0.jar

Now run the BDD Shell setup:

/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/setup.sh

This should succeed:

[bigdatalite.localdomain] Validating pre-requisites...
[bigdatalite.localdomain] Validation Success
[bigdatalite.localdomain] Setting up BDD Shell...
[bigdatalite.localdomain] Setup Success
[oracle@bigdatalite Downloads]$

Assuming it does, you can now launch the shell bdd-shell.sh:

[oracle@bigdatalite Downloads]$ /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.sh
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).
WARNING: Running spark-class from user-defined location.
spark.driver.cores is set but does not apply in client mode.
Welcome to
     ___   ___   ___       __   _     ____  _     _
    | |_) | | \ | | \     ( (` | |_| | |_  | |   | |
    |_|_) |_|_/ |_|_/     _)_) |_| | |_|__ |_|__ |_|__

SparkContext available as sc, HiveContext available as sqlContext.
BDD Context available as bc.

>>>

From the BDD Shell you can interact with BDD, for example to list out the datasets currently defined in the Catalog:

>>> bc.datasets().count
17
>>> for ds in bc.datasets():
...     print ds
...

media_demo_customer     edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        Hive    default.media_demo_customer

movie_genre     default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        Hive    default.movie_genre

media_demo_customer     default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        Hive    default.media_demo_customer

Whilst BDD Shell is command-line based, there’s also the option to run Jupyter Notebooks (previously IPython Notebooks), which is a web-based interactive “Notebook”. This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a ‘Notebook’ enables you to modify and re-run commands, and then once correct, retain them as a fully functioning script for future use.

To launch it, run:

cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It’s important that you run this from the bdd-shell folder, otherwise the BDD shell won’t initialise properly
  • By default Jupyter uses port 8888, which is already in use on BigDataLite by Hue, so use a different one by specifying --port
  • Jupyter by default only listens locally, so you need to either be using BigDataLite desktop to run Firefox, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and you should see the default Jupyter screen with a list of files:

In the next article, we’ll see how to use Jupyter Notebooks with Big Data Discovery, and get an idea of just how powerful the combination can be.

Categories: BI & Warehousing

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

Wed, 2016-06-01 18:17

One of the big changes in OBIEE 12c for end users is the ability to upload their own data sets and start analysing them directly, without needing to go through the traditional data provisioning and modelling process and associated leadtimes. The implementation of this is one of the big architectural changes of OBIEE 12c, introducing the concept of the Extended Subject Areas (XSA), and the Data Set Service (DSS).

In this article we’ll see some of how XSA and DSS work behind the scenes, providing an important insight for troubleshooting and performance analysis of this functionality.

What is an XSA?

An Extended Subject Area (XSA) is made up of a dataset and an associated XML data model. It can be used standalone, or “mashed up” in conjunction with a “traditional” subject area on a common field.

How is an XSA Created?

At the moment the following methods are available:

  1. “Add XSA” in Visual Analyzer, to upload an Excel (XLSX) document

  2. CREATE DATASET logical SQL statement, that can be run through any interface to the BI Server, including ‘Issue Raw SQL’, nqcmd, JDBC calls, and so on

  3. Add Data Source in Answers. Whilst this option shouldn’t actually be present according to this doc, it will be for any users of 12.2.1 who have uploaded the SampleAppLite BAR file, so I’m including it here for completeness.

Under the covers, these all use the same REST API calls directly into datasetsvc. Note that these are entirely undocumented, and only for internal OBIEE component use. They are not intended nor supported for direct use.

How does an XSA work?

External Subject Areas (XSA) are managed by the Data Set Service (DSS). This is a Java deployment (datasetsvc) running in the Managed Server (bi_server1), providing a RESTful API for the other OBIEE components that use it.

The consumers of the data, whether Visual Analyzer or the BI Server, send REST web service calls to DSS, storing and querying datasets within it.

Where is the XSA Stored?

By default, the data for XSA is stored on disk in SINGLETON_DATA_DIRECTORY/components/DSS/storage/ssi, e.g. /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi

[oracle@demo ssi]$ ls -lrt /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi|tail -n5
-rw-r----- 1 oracle oinstall    8495 2015-12-02 18:01 7e43a80f-dcf6-4b31-b898-68616a68e7c4.dss
-rw-r----- 1 oracle oinstall  593662 2016-05-27 11:00 1beb5e40-a794-4aa9-8c1d-5a1c59888cb4.dss
-rw-r----- 1 oracle oinstall  131262 2016-05-27 11:12 53f59d34-2037-40f0-af21-45ac611f01d3.dss
-rw-r----- 1 oracle oinstall 1014459 2016-05-27 13:04 a4fc922d-ce0e-479f-97e4-1ddba074f5ac.dss
-rw-r----- 1 oracle oinstall 1014459 2016-05-27 13:06 c93aa2bd-857c-4651-bba2-a4f239115189.dss
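
As an aside, since the uploads accumulate as flat files here, a quick way to see how much disk space they are consuming is to total up the directory. The path below is the default one shown above; adjust it if your SINGLETON_DATA_DIRECTORY differs.

# Total disk usage of the XSA datasets stored by DSS
du -sh /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi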

They’re stored using the format in which they were created, which is XLSX (via VA) or CSV (via CREATE DATASET):

[oracle@demo ssi]$ head 53f59d34-2037-40f0-af21-45ac611f01d3.dss
"7 Megapixel Digital Camera","2010 Week 27",44761.88
"MicroPod 60Gb","2010 Week 27",36460.0
"MP3 Speakers System","2010 Week 27",36988.86
"MPEG4 Camcorder","2010 Week 28",32409.78
"CompCell RX3","2010 Week 28",33005.91

There’s a set of DSS-related tables installed in the RCU schema BIPLATFORM, which hold information including the XML data model for the XSA, along with metadata such as the user that uploaded the file, when they uploaded it, and the name of the file on disk:

How Can the Data Set Service be Configured?

The configuration file, with plenty of inline comments, is at ORACLE_HOME/bi/endpointmanager/jeemap/dss/DSS_REST_SERVICE.properties. From here you can update settings for the data set service including upload limits as detailed here.

XSA Performance

Since XSA are based on flat files stored on disk, we need to be very careful in their use. Whilst a database may hold billions of rows in a table and, with appropriate indexing and partitioning, be able to provide sub-second responses, a flat file can quickly become a serious performance bottleneck. Bear in mind that a flat file is just a bunch of data plopped on disk – there is no concept of indices, blocks, or partitions: all the good stuff that makes databases able to do responsive ad-hoc querying on selections of data.

If you’ve got a 100MB Excel file with thousands of cells, and want to report on just a few of them, you might find it laggy – because whether you want to report on them or not, at some point OBIEE is going to have to read all of them regardless. We can see how OBIEE is handling XSA under the covers by examining the query log. This used to be called nqquery.log in OBIEE 11g (and before), and in OBIEE 12c has been renamed obis1-query.log.
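
A convenient way to keep an eye on this while testing is to watch the query log for the row and byte counts as queries run. This is only a sketch – the log path is an assumption based on the domain layout used elsewhere in this article, so adjust it to your own installation.

# Watch the BI Server query log for the data volumes fetched per query
tail -f /app/oracle/biee/user_projects/domains/bi/servers/obis1/logs/obis1-query.log | grep --line-buffered "retrieved from database"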

In this example here I’m using an Excel worksheet with 140,000 rows and 78 columns. Total filesize of the source XLSX on disk is ~55Mb.

First up, I’ll build a query in Answers with a couple of the columns:

The logical query uses the new XSA syntax:

SELECT
   0 s_0,
   XSA('prodney'.'MOCK_DATA_bigger_55Mb')."Columns"."first_name" s_1,
   XSA('prodney'.'MOCK_DATA_bigger_55Mb')."Columns"."foo" s_2
FROM XSA('prodney'.'MOCK_DATA_bigger_55Mb')
ORDER BY 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

The query log shows:

Rows 144000, bytes 13824000 retrieved from database query
Rows returned to Client 200

So of the 55MB of data, we’re pulling all the rows (144,000) back to the BI Server for it to then perform the aggregation on it, resulting in the 200 rows returned to the client (Presentation Services). Note though that the byte count is lower (13Mb) than the total size of the file (55Mb).

As well as aggregation, filtering on XSA data also gets done by the BI Server. Consider this example here, where we add a predicate:


In the query log we can see that all the data has to come back from DSS to the BI Server, in order for it to filter it:

Rows 144000, bytes 23040000 retrieved from database
Physical query response time 24.195 (seconds),
Rows returned to Client 0

Note the time taken by DSS — nearly 25 seconds. Compare this later on to when we see the XSA data served from a database, via the XSA Cache.

In terms of BI Server (not XSA) caching, the query log shows that a cache entry was written for the above request:

Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736113_56359_0.TBL'

If I refresh the query in Answers, the data is fetched anew (per this changed behaviour in OBIEE 12c), and the cache repopulated. If I clear the Presentation Services cache and re-open the analysis, I get the results from the BI Server cache, and it doesn’t have to refetch the data from the Data Set Service.

Since the cache entry has two columns in it, an attribute and a measure, I wondered if running a query with just the fact rolled up might hit the cache (since it has all the data there that it needs).

Unfortunately it didn’t, and to return a single row of data required the BI Server to fetch all the rows again – although, looking at the byte count, it appears it does prune the columns, since it’s now just over 2Mb of data returned this time:

Rows 144000, bytes 2304000 retrieved from database
Rows returned to Client 1

Interestingly, if I build an analysis with several more of the columns from the file (in this example, ten of a total of 78), the data returned from the DSS to the BI Server (167Mb) is greater than that of the original file (55Mb).

Rows 144000, bytes 175104000
Rows returned to Client 1000

And this data coming back from the DSS to the BI Server has to go somewhere – and if it’s big enough it’ll overflow to disk, as we can see when I run the above:

$ ls -l /app/oracle/biee/user_projects/domains/bi/servers/obis1/tmp/obis_temp
[...]
-rwxrwx--- 1 oracle oinstall 2910404 2016-06-01 14:08 nQS_AG_22345_7503_7c9c000a_50906091.TMP
-rwxrwx--- 1 oracle oinstall   43476 2016-06-01 14:08 nQS_AG_22345_7504_7c9c000a_50906091.TMP
-rw------- 1 oracle oinstall 6912000 2016-06-01 14:08 nQS_AG_22345_7508_7c9c000a_50921949.TMP
-rw------- 1 oracle oinstall  631375 2016-06-01 14:08 nQS_EX_22345_7506_7c9c000a_50921652.TMP
-rw------- 1 oracle oinstall 3670016 2016-06-01 14:08 nQS_EX_22345_7507_7c9c000a_50921673.TMP
[...]

You can read more about BI Server’s use of temporary files and the impact that it can have on system performance and particularly I/O bandwidth in this OTN article here.

So – as the expression goes – “buyer beware”. XSA is an excellent feature, but used in its default configuration with files stored on disk it has the potential to wreak havoc if abused.

XSA Caching

If you’re planning to use XSA seriously, you should set up the database-based XSA Cache. This is described in detail in the PDF document attached to My Oracle Support note OBIEE 12c: How To Configure The External Subject Area (XSA) Cache For Data Blending| Mashup And Performance (Doc ID 2087801.1).

In a proper implementation you would follow in full the document, including provisioning a dedicated schema and tablespace for holding the data (to make it easier to manage and segregate from other data), but here I’m just going to use the existing RCU schema (BIPLATFORM), along with the Physical mapping already in the RPD (10 - System DB (ORCL)):

In NQSConfig.INI, under the XSA_CACHE section, I set:

ENABLE = YES;

# The schema and connection pool where the XSA data will be cached.
PHYSICAL_SCHEMA = "10 - System DB (ORCL)"."Catalog"."dbo";
CONNECTION_POOL = "10 - System DB (ORCL)"."UT Connection Pool";

And restart the BI Server:

/app/oracle/biee/user_projects/domains/bi/bitools/bin/stop.sh -i obis1 && /app/oracle/biee/user_projects/domains/bi/bitools/bin/start.sh -i obis1

Per the document, note that in the BI Server log there’s an entry indicating that the cache has been successfully started:

[101001] External Subject Area cache is started successfully using configuration from the repository with the logical name ssi.
[101017] External Subject Area cache has been initialized. Total number of entries: 0 Used space: 0 bytes Maximum space: 107374182400 bytes Remaining space: 107374182400 bytes. Cache table name prefix is XC2875559987.

Now when I re-run the test XSA analysis from above, returning three columns, the BI Server goes off and populates the XSA cache table:

-- Sending query to database named 10 - System DB (ORCL) (id: <<79879>> XSACache Create table Gateway), connection pool named UT Connection Pool, logical request hash b4de812e, physical request hash 5847f2ef:
CREATE TABLE dbo.XC2875559987_ZPRODNE1926129021 ( id3209243024 DOUBLE PRECISION, first_n[..]

Or rather, it doesn’t, because PHYSICAL_SCHEMA seems to want the literal physical schema, rather than the logical physical one (?!) that the USAGE_TRACKING configuration stanza is happy with in referencing the table.

Properties: description=<<79879>> XSACache Create table Exchange; producerID=0x1561aff8; requestID=0xfffe0034; sessionID=0xfffe0000; userName=prodney;
[nQSError: 17001] Oracle Error code: 1918, message: ORA-01918: user 'DBO' does not exist

I’m trying to piggyback on SA511’s existing configuration, which uses catalog.schema notation:

Instead of the more conventional approach to have the actual physical schema (often used in conjunction with ‘Require fully qualified table names’ in the connection pool):

So now I’ll do it properly, and create a database and schema for the XSA cache – I’m still going to use the BIPLATFORM schema though…

Updated NQSConfig.INI:

[ XSA_CACHE ]

ENABLE = YES;

# The schema and connection pool where the XSA data will be cached.
PHYSICAL_SCHEMA = "XSA Cache"."BIEE_BIPLATFORM";
CONNECTION_POOL = "XSA Cache"."XSA CP";

After refreshing the analysis again, there’s a successful creation of the XSA cache table:

-- Sending query to database named XSA Cache (id: <<65685>> XSACache Create table Gateway), connection pool named XSA CP, logical request hash 9a548c60, physical request hash ccc0a410: [[
CREATE TABLE BIEE_BIPLATFORM.XC2875559987_ZPRODNE1645894381 ( id3209243024 DOUBLE PRECISION, first_name2360035083 VARCHAR2(17 CHAR), [...]

as well as a stats gather:

-- Sending query to database named XSA Cache (id: <<65685>> XSACache Collect statistics Gateway), connection pool named XSA CP, logical request hash 9a548c60, physical request hash d73151bb:
BEGIN DBMS_STATS.GATHER_TABLE_STATS(ownname => 'BIEE_BIPLATFORM', tabname => 'XC2875559987_ZPRODNE1645894381' , estimate_percent => 5 , method_opt => 'FOR ALL COLUMNS SIZE AUTO' ); END;

Although I do note that it uses a fixed estimate_percent instead of the recommended AUTO_SAMPLE_SIZE. The table itself is created with a fixed prefix (as specified in the obis1-diagnostic.log at initialisation), and holds a full copy of the XSA (not just the columns in the query that triggered the cache creation).
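
If you did want statistics gathered with AUTO_SAMPLE_SIZE, you could always re-gather them yourself against the generated cache table – a sketch only, since the table name below is the one from the log output above and will differ on any other system (and the table gets recreated whenever the cache entry is rebuilt):

-- Re-gather stats on the XSA cache table with the default (auto) sample size
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'BIEE_BIPLATFORM',
    tabname          => 'XC2875559987_ZPRODNE1645894381',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
    method_opt       => 'FOR ALL COLUMNS SIZE AUTO');
END;
/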

With the dataset cached, the query is then run and the query log shows an XSA cache hit:

External Subject Area cache hit for 'prodney'.'MOCK_DATA_bigger_55Mb'/Columns :
Cache entry shared_cache_key = 'prodney'.'MOCK_DATA_bigger_55Mb',
table name = BIEE_BIPLATFORM.XC2875559987_ZPRODNE2128899357,
row count = 144000,
entry size = 201326592 bytes,
creation time = 2016-06-01 20:14:26.829,
creation elapsed time = 49779 ms,
descriptor ID = /app/oracle/biee/user_projects/domains/bi/servers/obis1/xsacache/NQSXSA_BIEE_BIPLATFORM.XC2875559987_ZPRODNE2128899357_2.CACHE

with the resulting physical query fired at the XSA cache table (replacing what would have gone against the DSS web service):

-- Sending query to database named XSA Cache (id: <<65357>>), connection pool named XSA CP, logical request hash 9a548c60, physical request hash d3ed281d: [[
WITH
SAWITH0 AS (select T1000001.first_name2360035083 as c1,
     T1000001.last_name3826278858 as c2,
     sum(T1000001.foo2363149668) as c3
from
     BIEE_BIPLATFORM.XC2875559987_ZPRODNE1645894381 T1000001
group by T1000001.first_name2360035083, T1000001.last_name3826278858)
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3, D1.c4 as c4 from ( select 0 as c1,
     D102.c1 as c2,
     D102.c2 as c3,
     D102.c3 as c4
from
     SAWITH0 D102
order by c2, c3 ) D1 where rownum <= 5000001

It’s important to point out the difference in what’s happening here: the aggregation has been pushed down to the database, meaning that the BI Server doesn’t have to do it. In performance terms, this is usually a Very Good Thing.

Rows 988, bytes 165984 retrieved from database query
Rows returned to Client 988

Whilst it doesn’t seem to be recorded in the query log from what I can see, the data returned from the XSA Cache also gets inserted into the BI Server cache, and if you open an XSA-based analysis that’s not in the Presentation Services cache (a third cache to factor in!) you will get a cache hit on the BI Server cache. As discussed earlier in this article though, if an analysis is built against an XSA for which a BI Server cache entry exists that could service it with some manipulation (e.g. pruning columns or rolling up), it doesn’t appear to take advantage of it – but since it’s hitting the XSA cache this time, it’s less of a concern.

If you change the underlying data in the XSA, the BI Server does pick this up and repopulates the XSA Cache.

The XSA cache entry itself is 192Mb in size – generated from a 55Mb upload file. The difference will be down to data types, storage methods, and so on. However, the fact that it is larger in the XSA Cache (database) than held natively (flat file) doesn’t really matter, particularly if the data is being aggregated and/or filtered, since the performance benefit of pushing this work to the database will outweigh the overhead of storage space. Consider this example, where I run an analysis pulling back 44 columns (of the 78 in the spreadsheet) and hit the XSA cache: it runs in just over a second, and transfers a total of 5.3Mb from the database (the data is repeated, so rolls up):

Rows 1000, bytes 5576000 retrieved from database
Rows returned to Client 1000

If I disable the XSA cache and run the same query, we see this:

Rows 144000, bytes 801792000 Retrieved from database
Physical query response time 22.086 (seconds)
Rows returned to Client 1000

That’s 764Mb being sent back for the BI Server to process, which it does by dumping a whole load to disk in temporary work files:

$  ls -l /app/oracle/biee/user_projects/domains/bi/servers/obis1/tmp/obis_temp
[...]
-rwxrwx--- 1 oracle oinstall 10726190 2016-06-01 21:04 nQS_AG_29733_261_ebd70002_75835908.TMP
-rwxrwx--- 1 oracle oinstall   153388 2016-06-01 21:04 nQS_AG_29733_262_ebd70002_75835908.TMP
-rw------- 1 oracle oinstall 24192000 2016-06-01 21:04 nQS_AG_29733_266_ebd70002_75862509.TMP
-rw------- 1 oracle oinstall  4195609 2016-06-01 21:04 nQS_EX_29733_264_ebd70002_75861716.TMP
-rw------- 1 oracle oinstall 21430272 2016-06-01 21:04 nQS_EX_29733_265_ebd70002_75861739.TMP

As a reminder – this isn’t “Bad”, it’s just not optimal (response time of 50 seconds vs 1 second), and if you scale that kind of behaviour by many users with many datasets, things could definitely get hairy for all users of the system. Hence – use the XSA Cache.

As a final point, with the XSA Cache being in the database, the standard range of performance optimisations are open to us – indexing being the obvious one. No indexes are built against the XSA Cache table by default, which is fair enough since OBIEE has no idea what the key columns in the data are, and the point of mashups is less to model and optimise the data and more to just get it up there in front of the user. So you could index the table if you knew the key columns that were going to be filtered against, or you could even put it into memory (assuming you’ve licensed the option).
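
Purely as an illustration, an index to support a filter on the first name column might look like the following. The table and column names are the generated ones from the log output above (they will differ on every system, and the index name is just made up for this example); bear in mind too that the BI Server creates a fresh table when the cache entry is rebuilt, so any index would need re-creating.

-- Index the generated XSA cache table on the column we expect to filter by
CREATE INDEX biee_biplatform.xsa_mock_first_name_ix
  ON biee_biplatform.XC2875559987_ZPRODNE1645894381 (first_name2360035083);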

The MoS document referenced above also includes further performance recommendations for XSA, including the use of RAM disk for the XSA cache metadata files, as well as for the managed server temp folder.

Summary

External Subject Areas are great functionality, but be aware of the performance implications of not being able to push down common operations such as filtering and aggregation. Set up XSA Caching if you are going to be using XSA properly.

If you’re interested in the direction of XSA and the associated Data Set Service, this slide deck from Oracle’s Socs Cappas provides some interesting reading. Uploading Excel files into OBIEE looks like just the beginning of what the Data Set Service is going to enable!

The post OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Changes in BI Server Cache Behaviour in OBIEE 12c : OBIS_REFRESH_CACHE

Tue, 2016-05-31 10:36

The OBIEE BI Server cache can be a great way of providing a performance boost to response times for end users – so long as it’s implemented carefully. Done wrong, and you’re papering over the cracks and heading for doom; done right, and it’s the ‘icing on the cake’. You can read more about how to use it properly here, and watch a video I did about it here. In this article we’ll see how the BI Server cache has changed in OBIEE 12c in a way that could prove somewhat perplexing to developers used to OBIEE 11g.

The BI Server cache works by inspecting queries as they are sent to the BI Server, and deciding if an existing cache entry can be used to provide the data. This can include direct hits (i.e. the same query being run again), or more advanced cases, where a subset or aggregation of an existing cache entry could be used. If a cache entry is used then a trip to the database is avoided and response times will typically be better – particularly if more than one database query would have been involved, or lots of additional post-processing on the BI Server.

When an analysis or dashboard is run, Presentation Services generates the necessary Logical SQL to return the data needed, and sends this to the BI Server. It’s at this point that the cache will, or won’t, kick in. The BI Server will accept Logical SQL from other sources than Presentation Services – in fact, any JDBC or ODBC client. This is useful as it enables us to validate behaviour that we’re observing and see how it can apply elsewhere.

When you build an Analysis in OBIEE 11g (and before), the cache will be used if applicable. Each time you add a column, or hit refresh, you’ll get an entry back from the cache if one exists. This has benefits – speed – but disadvantages too. When the data in the database changes, you will still get a cache hit, regardless. The only way to force OBIEE to show you the latest version of the data is to purge the cache first. You can target cache purges based on databases, tables, or even specific queries – but you do need to purge it.
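
For reference, the BI Server exposes purge procedures at each of these levels. A sketch of the kind of calls involved – the example arguments use the SampleApp Airlines objects that appear later in this article, and the calls can be run via ‘Issue SQL’, nqcmd, or any ODBC/JDBC client:

call SAPurgeAllCache();
call SAPurgeCacheByDatabase( 'X0 - Airlines Demo Dbs (ORCL)' );
call SAPurgeCacheByTable( 'X0 - Airlines Demo Dbs (ORCL)', '', 'BI_AIRLINES', 'UNIQUE_CARRIERS' );
call SAPurgeCacheByQuery( 'SELECT 0 s_0, "X - Airlines Delay"."Carrier"."Carrier" s_1 FROM "X - Airlines Delay"' );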

What’s changed in OBIEE 12c is that when you click “Refresh” on an Analysis or Dashboard, the query is re-run against the source and the cache re-populated. Even if you have an existing cache entry, and even if the underlying data has not changed, if you hit Refresh, the cache will not be used. Which kind of makes sense, since “refresh” probably should indeed mean that.

Digging into OBIEE Cache Behaviour

Let’s prove this out. I’ve got SampleApp v506 (OBIEE 11.1.1.9), and SampleApp v511 (OBIEE 12.2.1). First off, I’ll clear the cache on each, using call saPurgeAllCache();, run via Issue SQL:

Then I can use another BI Server procedure call to view the current cache contents (new in 11.1.1.9), call NQS_GetAllCacheEntries(). For this one in particular, make sure you’ve un-ticked “Use Oracle BI Presentation Services Cache”. This is different from the BI Server cache which is the subject of this article, and as the name implies is a cache that Presentation Services keeps.
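
Both calls can also be scripted from the command line through nqcmd, which is handy if you want to purge and inspect the cache as part of a test loop. This is only a sketch: the nqcmd path, DSN and credentials are those of the SampleApp environment used in this article (adjust them for your own system), and the script file name is made up for the example.

# Write the two BI Server procedure calls to a small logical SQL script
echo "call saPurgeAllCache();"          >  /tmp/cache_check.lsql
echo "call NQS_GetAllCacheEntries();"   >> /tmp/cache_check.lsql
# Run the script against the BI Server
/app/oracle/biee/user_projects/domains/bi/bitools/bin/nqcmd.sh -d AnalyticsWeb -u prodney -p Admin123 -s /tmp/cache_check.lsql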

I’ve confirmed that the BI Server cache is enabled on both servers, in NQSConfig.INI:

###############################################################################
#
#  Query Result Cache Section
#
###############################################################################


[CACHE]

ENABLE = YES;  # This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control

Now I create a very simple analysis in both 11g and 12c, showing a list of Airline Carriers and their Codes:

After clicking Results, a cache entry is inserted on each respective system:

Of particular interest is the create time, last used time, and number of times used:

If I now click Refresh in the Analysis window:

We see this happen to the caches:

In OBIEE 11g the cache entry is used – but in OBIEE 12c it’s not. The CreatedTime is evidently not populated correctly, so instead let’s dive over to the query log (nqquery.log / obis1-query.log in 11g / 12c respectively). In OBIEE 11g we’ve got:

-- SQL Request, logical request hash:
7c365697
SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

-- Cache Hit on query: [[
Matching Query: SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

Whereas 12c is:

-- SQL Request, logical request hash:
d53f813c
SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"
ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

-- Sending query to database named X0 - Airlines Demo Dbs (ORCL) (id: <<320369>>), connection pool named Aggr Connection, logical request hash d53f813c, physical request hash a46c069c: [[
WITH
SAWITH0 AS (select T243.CODE as c1,
     T243.DESCRIPTION as c2
from
     BI_AIRLINES.UNIQUE_CARRIERS T243 /* 30 UNIQUE_CARRIERS */ )
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select 0 as c1,
     D1.c1 as c2,
     D1.c2 as c3
from
     SAWITH0 D1
order by c3, c2 ) D1 where rownum <= 5000001

-- Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736117_52416_27.TBL'.

Looking closely at the 12c output shows three things:

  1. OBIEE has run a database query for this request, and not hit the cache
  2. A cache entry has clearly been created again as a result of this query
  3. The Logical SQL has a request variable set: OBIS_REFRESH_CACHE=1

    This is evidently added by Presentation Services at runtime, since the Advanced tab of the analysis shows no such variable being set:
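
If you did want to force the same behaviour for an individual analysis yourself, the obvious place is the Prefix box under Advanced SQL Clauses on that same Advanced tab (a hedged suggestion – Presentation Services normally injects the variable for you, so you rarely need to):

SET VARIABLE OBIS_REFRESH_CACHE=1;

Anything entered in the Prefix box is prepended to the Logical SQL that gets sent to the BI Server, so an analysis saved with this prefix should bypass and re-seed the BI Server cache on every execution – handy for testing, but think twice before leaving it in place in production.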

Let’s save the analysis, and experiment further. Evidently, the cache is being deliberately bypassed when the Refresh button is clicked when building an analysis – but what about when it is opened from the Catalog? We should see a cache hit here too:

Nope, no hit.

But, in the BI Server query log, there’s no entry either – and the same on 11g. The reason being… the Presentation Services cache. D’oh!

From Administration > Manage Sessions I select Close All Cursors, which forces a purge of the Presentation Services cache. When I reopen the analysis from the Catalog view, I now get a cache hit in both 11g and 12c:

The same happens (a successful cache hit) when a Dashboard containing the analysis is opened, having purged the Presentation Services cache first.

So at this point, we can say that OBIEE 11g and 12c both behave the same with the cache when opening analyses/dashboards, but differ when refreshing the analysis. In OBIEE 12c when an analysis is refreshed the cache is deliberately bypassed. Let’s check on refreshing a dashboard:

Same behaviour as with analyses – in 11g the cache is hit, in 12c the cache is bypassed and repopulated.

To round this off, let’s double-check the behaviour of the new request variable that we’ve found, OBIS_REFRESH_CACHE. Since it appears that Presentation Services is adding it in at runtime, let’s step over to a more basic way of interfacing with the BI Server – nqcmd. Whilst we could probably use Issue SQL (as we did above for querying the cache), I want to avoid any more behind-the-scenes funny business from Presentation Services.

In OBIEE 12c, I run nqcmd:

/app/oracle/biee/user_projects/domains/bi/bitools/bin/nqcmd.sh -d AnalyticsWeb -u prodney -p Admin123

Enter Q to enter a query, as follows:

SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY

In obis1-query.log there’s the cache bypass and populate:

Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736117_53779_29.TBL'.

If I run it again without the OBIS_REFRESH_CACHE variable:

SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY

We get the cache hit as expected:

-------------------- Cache Hit on query: [[
Matching Query: SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY
Created by:     prodney

Out of interest I ran the same two tests on 11g – both resulted in a cache hit, since 11g presumably just ignores the unrecognised variable.

Summary

In OBIEE 12c, if you click “Refresh” on an analysis or dashboard, OBIEE Presentation Services forces a cache-bypass and cache-reseed, ensuring that you really do see the latest version of the data from source. It does this using the OBIS_REFRESH_CACHE request variable, new in OBIEE 12c.
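
One practical upshot, sketched here as a suggestion rather than anything Presentation Services does for you: because the variable is honoured for any Logical SQL the BI Server receives, you could re-seed the cache from a post-ETL shell script by replaying your key queries through nqcmd with OBIS_REFRESH_CACHE=1 set. The script file name below is hypothetical; the path, DSN, credentials and Logical SQL are the ones used earlier in this article.

# Hypothetical post-ETL cache re-seed: replaying a key query with OBIS_REFRESH_CACHE=1
# bypasses any stale cache entry and writes a fresh one in its place.
cat > /tmp/seed.lsql <<'EOF'
SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY
EOF

/app/oracle/biee/user_projects/domains/bi/bitools/bin/nqcmd.sh -d AnalyticsWeb -u prodney -p Admin123 -s /tmp/seed.lsql

Any user who subsequently runs the same Logical SQL should then be served from the freshly seeded entry, just as we saw in the second nqcmd test above.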

Footnote

Courtesy of Steve Fitzgerald:

As per the Presentation Services XSD, you can also revert the behaviour (not sure why one would, but you can) in instanceconfig.xml:

<Cache>
  <Query>
    <RefreshIncludeBIServerCache>false</RefreshIncludeBIServerCache>
  </Query>
</Cache>
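
For orientation, here is a sketch of where that element would sit, assuming the usual WebConfig/ServerInstance structure of instanceconfig.xml – merge it into the existing <Cache><Query> section if your file already has one, and restart Presentation Services for the change to take effect:

<WebConfig>
  <ServerInstance>
    <!-- ...existing Presentation Services configuration... -->
    <Cache>
      <Query>
        <RefreshIncludeBIServerCache>false</RefreshIncludeBIServerCache>
      </Query>
    </Cache>
  </ServerInstance>
</WebConfig>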

The post Changes in BI Server Cache Behaviour in OBIEE 12c : OBIS_REFRESH_CACHE appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

#IHateYourFace, a BizIntel Love Story

Wed, 2016-05-25 13:53
#IHateYourFace, a BizIntel Love Story

Sometimes love isn't fair

A teenager, just emerging from an awkward growing phase, tends to spew nonsensical permutations of varying cringe-worthy affections toward their love target. The data points behind the intent are solid. Let’s go Boolean now:

Intent of good = 1 ;

attracted to target = 1;

ability to provide for target = 1;

time available  = 1;

prior or current committed relationship = 0.

So here we are with all the solid intentions in the world and the variables all align to an optimal outcome.  Except one thing:

love target hates your face = 1;

Exit(0)

Assuming that this is a hierarchical value chain with face hating as the parent, we are at an impasse.

The above is not intended as an affront to teenagers spitting game at their prospective boo. It is, though, an analogy to the fantastic time and effort spent integrating, transforming and presenting Business Intelligence to the enterprise with all of the proper best practices and technology – only to have the user subtly tell us that the information is “great but… I don’t know what I am looking at and I was looking for it to be in 3D” (AKA, I hate your face). There have been some great books out there about visualization and we have read a LOT of them. Best practices aside, visualization is about taste. The love target above may well have a completely different opinion of facial beauty than the Mr. or Ms. hottie-mc-toddy that sits next to you in your advanced Kafka course (seriously, it might happen). Love at Kafka training happens, I swear it does, really, and when it does, it involves unicorns and ends of rainbows.

So now I have to figure out my end user’s tastes in visualization?

Yep, accept and move on.  

You wouldn’t saddle up with a love interest if you hated their face, so why would an end user sign up for something they don’t want to use? Of course, you, being the BI practitioner that you are, know your BI stack up and down and are well aware of all of the available visualization options. These end users don’t know what they are talking about… right?

—they don’t have to.  

Lots of the big BI players have been losing ground in the industry of viz because new competitors are building attractive viz capabilities.  The new guys in BI are going to the end users with something the end users want…pretty and flexible visualizations.  Both the viz provider and end user need not worry themselves with data integrity, governance, or how they got the data to start.

Welcome to Enterprise BI 2.0

What if I could use my existing enterprise platform, allow users to integrate more data sources, mash it all up and then take an almost storytelling approach to BI? You can, and the next releases from the big BI vendors are all focused on that. Clearly at Rittman Mead we have an affinity for OBIEE, and with 12c the dream of viz has become an out-of-the-box reality. Data visualization is core to the new functionality, with more on the way. So my advice is to upgrade to 12c ASAP to get your new face on.

Rittman Mead has less-invasive facelift (versus replace) options for viz on both 12c and 11g. The Visual Plug in Pack (VPP) extends the native charting in OBI into dynamic, attractive, Hollywood-level viz. We also have a User Engagement offering that comes with that visual facelift all the girls at the salon are talking about. Shoot me an email if you are interested in our streamlined OBI 12c upgrade service or VPP: Jason.davis@rittmanmead.com

Categories: BI & Warehousing

Pages