BI & Warehousing
This year Rittman Mead were the Analytics Sponsor for the UKOUG Tech13 conference in Manchester, and those of you who visited the UKOUG stand during the event will have noticed the Rittman Mead-sponsored Analytics Dashboard on display. In this blog post I will cover how it was put together – the "Blue Peter way"!
For those of you not familiar with Blue Peter, it is a long-running children's TV show here in the UK that started way back in 1958; possibly its most infamous moment came in 1969 with Lulu the defecating elephant. As a child, the highlight of the show was the "how to make" section, where they would always use the phrase "sticky-backed plastic" instead of "Sellotape" due to a policy against using commercial terms on air. The presenters would create something from scratch with bits and bobs you could find around the house – cereal boxes, egg cartons, washing-up bottles, Sellotape etc. (sorry, "sticky-backed plastic"). That's exactly what I'm going to be doing here, but instead of making a sock monster I'm going to show you how the analytics dashboard was created from scratch with bits you can find on the internet. So kids, without further delay, let's begin…
(Remember to ask your parents' permission before downloading any of the following items.)
You will need :
- A Linux server.
- A Web server
- The Redis key/value store
- A DataSift account
- The Webdis HTTP server
- The tagcanvas.js jQuery plugin
- The vTicker jQuery plugin
- The flot plotting library
- Some Sticky-backed-Plastic (jQuery)
- Lots of coffee
The Linux server I'm using is running Red Hat Enterprise Linux Server 6.4, and the very first thing we'll need to do is install a web server.
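On RHEL 6 that boils down to something like the following – a sketch, assuming the standard Apache httpd packages from the Red Hat repositories and a root (or sudo) session:

# install Apache, set it to start at boot, and start it now
yum install -y httpd
chkconfig httpd on
service httpd start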
Done. The web server is now installed, configured to start on server boot, and up and running ready to service our HTTP requests.
Next up is the Redis key/value datastore. I'll be using Redis to store all incoming tweets from our datasource (more on that in a bit).
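Installing Redis follows the same pattern; a minimal sketch, assuming the EPEL repository is enabled (building from source works just as well):

# install Redis, set it to start at boot, and start it now
yum install -y redis
chkconfig redis on
service redis start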
Now that we have Redis up and running, let's perform a couple of tests: first, a benchmark using the redis-benchmark command.
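The exact flags used aren't critical; a quiet run that prints just the summary figures looks something like this:

# -q prints one summary line per command - the LRANGE_* lines are the ones to watch
redis-benchmark -q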
This command throws out a lot of output, but it is the LRANGE command we are particularly interested in, as we'll be using it to retrieve our tweets later on. 3,593 requests per second seems reasonable to me; there were around 1,000 registrations for the UKOUG Tech13 conference, and the likelihood of each of them making three concurrent dashboard requests within the same second is slim to say the least – regardless, I'm willing to live with the risk. Now for a quick SET/GET test using the redis-cli command.
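A quick sanity check along these lines – the key and value are purely illustrative – confirms that writes and reads are working:

$ redis-cli SET test:key "hello tech13"
OK
$ redis-cli GET test:key
"hello tech13"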
OK, so that's our datastore sorted, but what about our datasource? Clearly we will be using Twitter as our primary data source, but mere mortals like myself don't have access to Twitter's entire data stream; for that we need to turn to the specialist companies that do. DataSift – amongst other companies like Gnip and Topsy (which, interestingly, was bought by Apple earlier this week) – offers this service and adds extra features into the mix such as sentiment analysis and stream subscription services that push data to you. I'll be using DataSift, as here at Rittman Mead we already have an account with them. The service charges by DPU (Data Processing Unit), the cost of which depends upon the computational complexity of the stream you're running; suffice to say that running the stream to feed the analytics dashboard for a few days was pretty cheap.
To set up a stream you simply express what you want included from the Twitter firehose using their own Curated Stream Definition Language, CSDL. The stream used for the dashboard was a single, very simple rule.
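Reconstructed from the description below, it would have looked something like this – treat the exact CSDL target used as an assumption on my part:

interaction.content contains "ukoug"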
This is simply searching for tweets containing the string “ukoug” from the Twitter stream. DataSift supports all kinds of other social data, for a full list head over to their website.
Now that we have our data stream set up, how do we actually populate our Redis data store with it? Simple – get DataSift to push it to us. Using their PUSH API you can set up a subscription in which you can specify output parameters including "output_type", "host", "port", "delivery_frequency" and "list", amongst many others. The output type was set to "Redis"; the host was, well, our hostname; and the port was set to the port on which Redis is listening, by default 6379. The delivery_frequency was set to 60 seconds, and the list is the name of the Redis list key you want the data pushed to. With all that set up and the subscription active, tweets will automagically start arriving in our Redis datastore in JSON format – happy days!
The next step is to get the data from Redis to the browser. I could install PHP, configure it to work with Apache, install one of the several Redis-PHP libraries and write some backend code to serve data up to the browser, but time is limited and I don't want to faff around with all that. Besides, this is Blue Peter, not a BBC4 documentary. Here's where we use the next cool piece of open source software, Webdis. Webdis acts as an HTTP interface to Redis, which means I can make HTTP calls directly from the browser.
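For example, a request like this (hostname and key purely illustrative) sets a key – Webdis maps the URL path straight onto a Redis command and returns the reply as JSON:

$ curl http://dashboard-host:7379/SET/hello/world
{"SET":[true,"OK"]}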
Yep, you guessed it – we've just set a key in Redis directly from the browser. To retrieve it we simply do this:
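Again with an illustrative hostname and key:

$ curl http://dashboard-host:7379/GET/hello
{"GET":"world"}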
Security can be handled in the Webdis config file by setting IP-based rules to stop any Tom, Dick or Harry from modifying your Redis keys. To install Webdis we do the following:
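A sketch of the build from source, assuming git and the libevent development headers are available:

yum install -y git libevent-devel
git clone https://github.com/nicolasff/webdis.git
cd webdis
make
./webdis webdis.json &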
And that's it – we now have tweet-to-browser in 60 seconds, without writing a single line of code! Here's an overview of the architecture we have:
Would I do this in a production environment with a larger data set? Probably not – it wouldn't scale. I'd instead write server-side code to handle the data processing, test performance, and push only the required data to the client. The right tools for the right job.
The function getData() is called within the setInterval() function; the second parameter to setInterval() sets the frequency of the call, in this case every five minutes (300,000 milliseconds). The getData() function performs an Ajax GET request on the URL that points to our Webdis server listening on port 7379, which in turn performs an LRANGE command on the Redis data store – it's simply asking Redis to return all list items between 0 and 100000, each item being a single tweet. Once the Ajax request has successfully completed, the "success" callback is called, and within this callback I push each tweet into an array, so we end up with an array of tweet objects. We now have all the data in a format we can manipulate to our heart's content.
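Stripped of the jQuery wrapping, the request getData() makes is equivalent to something like this – the list key name is an assumption on my part, as it will be whatever was specified in the DataSift subscription:

$ curl http://dashboard-host:7379/LRANGE/ukoug_tweets/0/100000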
Now onto the graphs and visualisations.

The Globe
The tweet ticker was built by grabbing the latest 30 tweets from the data array and assigning them to HTML <li> tags; the vTicker jQuery plugin was then applied to the containing <ul> tag. Various plugin options allow you to control things like the delay and scroll speed.

Tweet Velocity
This one proved quite popular – these speaker types are a competitive bunch! Having obtained a list of speakers from the UKOUG Tech13 website, I was able to search all the Twitter content for each speaker and aggregate up to get the total Twitter mentions for each; again, the graph was rendered using the flot plotting library. As the graph updated during each day speakers were swapping places, with our own Mark Rittman tweeting out the "scores on the doors" at regular intervals. When the stats can be manipulated, though, there's always someone willing to take advantage!
Tut, tut Mark.

Twitter Avatars
The Twitter avatars used the tagcanvas.js library, but instead of populating it with words from the Twitter content, the tweet avatars were used. A few changes to the plugin options were made to display the results as a horizontally scrolling cylinder instead of a globe.

Twitter Sentiment
The Twitter sentiment graph again used flot. Tweet sentiment was graphed over time for the duration of the conference, with the sentiment score provided by DataSift as part of the Twitter payload. The scores we received across the 3,500 tweets ranged between -15 and 15, each score reflecting either a positive or negative tweet. Asking a computer to infer a human emotion from 140 characters of text is a tough ask. Having looked at the data in detail, a fair few of the tweets that received negative scores weren't negative in nature; for example, tweets with the content "RAC attack" and "Dangerous anti patterns" generated a cluster of negative scores. As we know, computers are not as clever as humans – how can they be aware of the context of a tweet, detect sarcasm, or differentiate between banter and plain old insults? Not all the negatively scored tweets were false negatives; some were genuine – a few regarding the food seemed to ring true.
Perhaps the data you're analysing needs to be taken in context as a whole. You'd expect 1,000 techies running around a tech conference to be happy, and the sentiment analysis seemed to do a better job of ranking how positive tweets were than how negative they were. From a visualisation point of view, a logarithmic scale along with a multiplier to raise the lowest score would perhaps have worked better in this case to reflect how positive, overall, the event was. One thing is clear though: further statistical analysis would be needed over a much larger data set to really gain insight into how positive or negative your data set is.
So that’s it kids, I hope you’ve enjoyed this episode of Blue Peter, until next time….
Edit – 6-Dec-2013 19.07:
Mark has asked me the following question over Twitter:
“any chance you could let us know why these particular tools / components were chosen?”.
I thought I’d give my answer here.
All the tools were open source, easy to install and configure, and well documented, and I had used them all previously – again, a time saver. Redis was chosen for two reasons: it was supported as a subscription destination by DataSift (along with many others), and I'm currently using it in another development project I'm working on, so I was up to speed with it. Although in this solution I'm not really taking advantage of Redis's access speed, it worked well as somewhere to persist the data.
If I were coding a more permanent solution with more development time, I'd add a relational database into the mix and use it to perform all of the data aggregation and analysis; this would make the solution more scalable. I'd then either feed the browser via Ajax calls to the database through backend code, or populate a cache layer from the database and make Ajax calls directly on the cache, similar to what I did in this solution. It would have been nice to use D3 to develop a more elaborate visualisation instead of the canned flot charts, but again this would have taken more time to develop.
Next week Rittman Mead is the analytics sponsor for the UKOUG Tech13 conference up in Manchester.
We've got a great line-up of talks covering BI Apps, Endeca, Hadoop, OBIEE and mobile – Stewart and Charles are both flying in from the US to present alongside Mark, Adam and myself. The full list is as follows:
- Deep dive into Oracle BI Applications using Oracle Data Integrator [Mark Rittman] – Sunday 12.30-2pm
- Leveraging Hadoop / Map Reduce OBIEE 11g and ODI 11g [Mark Rittman] – Sunday 3pm – 4.30pm
- BI Across Any Data Source with OBIEE and Oracle Endeca Discovery [Mark Rittman & Adam Seed] – Monday 5.05-5.50pm
- The Changing World of Business Intelligence [Jon Mead] – Tuesday 11.20 – 12.20
- Oracle BI Multi User Development (MDS XML vs MUDE) [Stewart Bryson] – Tuesday 11.30 -12.30
- Designing for a mobile World using OBIEE [Charles Elliot] – Wednesday 1.45 – 2.45
On Monday night we are teaming up with our friends at Pythian to host some drinks and nibbles at Taps Bar, which is just around the corner from the conference venue. We'll be there from 6.30-9.00pm, so come and join us for a free drink.

Analytics Sponsor
As part of being the analytics sponsor for the event we are looking to collate as much real time, social media and demographic information about the event as possible. We will be displaying an analytics dashboard in the conference venue detailing statistics from this data. To help us, could you:
- Complete the form here to give us some background information about why you are going; and
- Use the official hashtag #ukoug_tech13 when tweeting anything about the event.
It's looking like it will be a great few days, and we look forward to seeing you up there.
More Excel Support
Dodeca has always been strong on Excel version support and this version delivers even more Excel functionality. Internally, we use the SpreadsheetGear control, which does a very good job with Excel compatibility. This version of Dodeca integrates a new version of SpreadsheetGear that now has support for 398 Excel functions including the new SUMIFS, COUNTIFS, and CELL functions.

Excel Page Setup Dialog
The new version of Dodeca includes our implementation of the Excel Page Setup Dialog which makes it easy for users to customize the printing of Dodeca views that are based on Excel templates. Note that for report developers, the Excel Page Setup has also been included in the Dodeca Template Designer.
New PDF View Type
Customers who use PDF files in their environments will like the new PDF View Type. In previous releases of Dodeca, PDF documents displayed in Dodeca opened in an embedded web browser control. Beginning in this version, Dodeca includes a dedicated PDF View type that uses a specialized PDF control.
View Selector Tooltips
Finally, users will like the new View Selector tooltips which optionally display the name and the description of a report as a tooltip.
Performance
Performance is one of those things that users always appreciate, so we have added a new setting that can significantly improve performance in some circumstances. Dodeca has a well-defined set of configuration objects that are stored on the server and we were even awarded a patent recently for the unique aspects of our metadata design. That being said, depending on how you implement reports and templates, there is the possibility of having many queries issued to the server to check for configuration updates. In a few instances, we saw that optimizing the query traffic could be beneficial, so we have implemented the new CheckForMetadataUpdatesFrequencyPolicy property. This property, which is controlled by the Dodeca administrator, tells Dodeca whether we should check the server for updates before any object is used, as was previously the case, only when a view opens, or only when the Dodeca session begins. We believe the latter case will be very useful when Dodeca is deployed in production as objects configured in production often do not change during the workday and, thus, network traffic can be optimized using this setting. The screenshot below shows where the administrator can control the update frequency.
Though users will like these features, we have put a lot of new things in for the people who create Dodeca views and those who administer the system. Let's start with something that we think all Dodeca admins will use frequently.

Metadata Property Search Utility
As our customers continue to expand their use of Dodeca, the number of objects they create in the Dodeca environment continues to grow. In fact, we now have customers who have thousands of different objects that they manage in their Dodeca environments. The Metadata Property Search Utility will help these users tremendously.
This utility allows the administrator to enter a search string and locate every object in our system that contains that string. Once a property is located, there is a hyperlink that will navigate to the given object and automatically select the relevant property. This dialog is modeless, which means you can navigate to any of the located items without closing the dialog.
Note: this version does not search the contents of Excel files in the system.

Essbase Authentication Services
In the past, when administrators wished to use an Essbase Authentication service to validate a login against Essbase and automatically obtain Dodeca roles based on the Essbase user's group memberships, they had to use an Essbase connection where all users had access to the Essbase application and database. The new ValidateCredentialsOnly property on both of the built-in Essbase Authentication services now flags the service to check login credentials at the server level only, eliminating the need for users to have access to a specific Essbase database.

New Template Designer Tools
Prior to Dodeca 6.x, all template editing was performed directly in Excel. Since that time, however, most template design functionality has been replicated in the Dodeca Template Designer, and we think it is preferable due to the speed and ease of use with which users can update templates stored in the Dodeca repository. We have added a couple of new features to the Template Designer in this version. The first tool is the Group/Ungroup tool that allows designers to easily apply Excel grouping to rows and/or columns within the template. The second new tool is the Freeze/Unfreeze tool that is used to freeze rows and/or columns in place for scrolling.

Parameterized SQL Select Statements
Since we introduced the SQLPassthroughDataSet object in the Dodeca 5.x series, we have always supported the idea of tokenized select statements. In other words, the SQL could be written so that point-of-view selections made by users could be used directly in the select statement. In a related fashion, we introduced the concept of parameterized insert, update, and delete statements in the same version. While parameterized statements are similar in concept to tokenized statements, there is one important distinction under the covers. In Dodeca, parameterized statements are parsed and converted into prepared statements that can be used multiple times and result in more efficient use of server resources. The parameterized select statement was introduced in this version of Dodeca in order for customers using certain databases that cache the prepared statement to realize improved server efficiency on their select statements.

Workbook Script Formula Editor Improvements
We have also been working hard to improve extensibility for developers using Workbook Scripts within Dodeca. In this release, our work focused on the Workbook Script Formula Editor. The first thing we added here is color coding that automatically detects and distinguishes Excel functions, Workbook Script functions, and Dodeca tokens. In the new version, Excel functions are displayed in green, Dodeca functions and parentheses are displayed in blue, and tokens are displayed in ochre. Here is an example.
In addition, we have implemented auto-complete for both Excel and Dodeca functions.
New SQLException Event
Version 6.6 of Dodeca introduces a new SQLException event that provides the ability for application developers to customize the behavior when a SQL exception is encountered.

XCopy Release Directory
Beginning in version 6.6, the Dodeca Framework installation includes a pre-configured directory intended for customers who prefer to distribute their client via XCopy deployment instead of using Microsoft ClickOnce distribution. The XCopy deployment directory is also for use by those customers who use Citrix for deployment.

Mac OS X Release Directory
The Dodeca Framework installation now includes a pre-compiled Dodeca.app deployment for customers who wish to run the Dodeca Smart Client on Mac OS X operating systems. What that means is that Dodeca now runs on a Mac without the need for any special Windows emulators. Dodeca does not require Excel to run on the Mac (nor does it require Excel to run on Windows, for that matter), so you can certainly save your company significant licensing fees by choosing Dodeca for your solution.
In short, you can see we continue to work hard to deliver functionality for Dodeca customers. As always, the Dodeca Release Notes provide detailed explanations of all new and updated Dodeca features. As of today, we have decided to make the Release Notes and other technical documents available for download to non-Dodeca customers. If you are curious about all of the things Dodeca can do, and if you aren't afraid to dig into the details, you can now download our 389-page cumulative Release Notes document from the Dodeca Technical Documents section of our website.
How can you conditionally turn cell borders on and off in Publisher's RTF/XSL-FO templates? With a little digging you'll find what appear to be the appropriate attributes to update in your template. You would logically come up with using the various border styling options:
border-top|bottom|left|right-width
border-top|bottom|left|right-style
border-top|bottom|left|right-color
Buuuut, that doesn't work. Updating them individually does not make a difference to the output. I'm not sure why, and I will ask, but for now here's the solution: use the compound border attribute border-top|bottom|left|right. This takes the form border-bottom="0.5pt solid #000000" – you set all three options at once rather than individually. In a BIP template you use:
<?if:DEPT='Accounting'?>
<?attribute@incontext:border-bottom;'3.0pt solid #000000'?>
<?attribute@incontext:border-top;'3.0pt solid #000000'?>
<?attribute@incontext:border-left;'3.0pt solid #000000'?>
<?attribute@incontext:border-right;'3.0pt solid #000000'?>
<?end if?>
3pt borders are a little excessive, but you get the idea. This approach can be used with the if@row option too, to get the complete row borders to update. If your template will need to be run in right-to-left languages, e.g. Arabic or Hebrew, then you will need to use start and end in place of left and right.
For the inquisitive reader, you're maybe wondering: how did this guy know that? And why the heck is this not in the user docs?
Other than my all-knowing BIP guru status ;0), I hit the web for info on XSL-FO cell border attributes and then turned to the Template Builder for Word – particularly the export option; I generated the XSL-FO output from a test RTF template and took a look at the attributes. Then I started trying stuff out – I'm a hacker and proud of it! As for the user doc updates, I'll log a request for an update.
If you’re a user of Oracle’s data integration products, you’re probably aware that ODI12c came out last month, with some of the new features covered in posts on the blog here, here and here. Rittman Mead were actually on the beta program for 12c, with several of our team attending preview events in the UK and USA at the start of the year, and then running preview releases back in our development labs as the product neared GA.
As CTO I sponsored our involvement in the ODI12c beta program, and was invited down to Oracle's offices in Reading to take part in an interview with Kulvinder Hari, Director of Product Management – Fusion Middleware, to talk about our involvement in the program and what we saw as the most interesting new features in this latest release. You can access the full video here; I also talk about the wider data integration market, products such as GoldenGate and EDQ, and the benefits OWB customers will get when migrating to, and interoperating with, Oracle Data Integrator.
The interview was actually part of a wider set of activities around the ODI12c launch, and you can read a recap of the Oracle GoldenGate 12c and Oracle Data Integration 12c webcast on the Oracle website, as well as download resources on ODI12c from an oracle.com microsite. Keep an eye on the blog as well over the next few months, as we post more content on ODI12c along with the other new Oracle Data Integration 12c releases.
In yesterday’s post on running OBIEE in the cloud, I looked at a number of options for hosting the actual OBIEE element; hosting it in a public cloud service such as Amazon EC2, using Oracle’s upcoming BI-as-a-Service offering, or partner offerings such as our upcoming ExtremeBI in the Cloud service. But the more you think about this sort of thing, the more you realise that the OBIEE element is actually the easy part – it’s what you do about data storage, security, LDAP directories and ETL that makes things more complicated.
Take the example I gave yesterday where OBIEE was run in the cloud, with the multi-tenancy option enabled, the main data warehouse in the cloud, and data sourced from cloud and on-premise sources.
In this type of setup, there’s a number of things you need to consider beyond how OBIEE is hosted. For example:
- If your corporate LDAP directory is on-premise, how do we link OBIEE to it? Or does the LDAP server also need to be in the cloud?
- What sort of database do we use if we're hosting it in the cloud? Oracle? If so, self-hosted in a public cloud, or through one of the Oracle DB-in-the-cloud offerings?
- If not Oracle database, what other options are available?
- And how do we ETL data into the cloud-based data warehouse? Do we continue to use a tool like ODI, or use a cloud-based option – or even a service such as Amazon’s AWS Data Pipeline?
What complicates things at this stage in the development of “cloud”, is that most companies won’t move 100% to cloud in one go; more likely, individual application and systems might migrate to the cloud, but for a long time we’ll be left with a “hybrid” architecture where some infrastructure stays on premise, some might sit in a public cloud, others might be hosted on third-party private clouds. So again, what are the options?
Well Oracle’s upcoming BI-as-a-service offering works at one extreme end-of-the-spectrum; the only data source it’ll initially work with is Oracle’s own database-as-a-service, which in its initial incarnation provides a single schema, with no SQL*Net access and with data instead uploaded via a web interface (this may well change when Oracle launch their database instance-as-a-service later in 2014). No SQL*Net access means no ETL tool access though, in practice, as they all use SQL*Net or ODBC to connect to the database, so this offer to my mind is aimed at either (a) small BI applications where it’s practical to upload the data via Excel files etc, or (b) wider Oracle Cloud-based systems that might use database-as-a-service to hold their data, Java-as-a-service for the application and so forth. What this service does promise though is new capabilities within OBIEE where users can upload their own data, again via spreadsheets, to the cloud OBIEE system, and have that “mashed-up” with the existing corporate data – the aim being to avoid data being downloaded into Excel to do this type of work, and with user metrics clearly marked in the catalog so they’re distinct from the corporate ones.
But assuming you're not going for the Oracle cloud offer, what are the other options over data? Well, hosting OBIEE in the cloud is conceptually no different from hosting it anywhere else, in that it can connect to various data sources via the various connection methods, so in principle you've got just the same options open to you as if running on-premise. But the driver for moving OBIEE into the cloud might be that your applications, data etc. are already in the cloud, and you might also be looking to take advantage of cloud features in your database such as dynamic provisioning and scaling, or indeed use one of the new cloud-native databases such as Amazon Redshift.
I covered alternative databases for use with OBIEE a few months ago in a blog post, and Amazon Redshift at the time looked like an interesting option; based on ParAccel, a mature analytic database offering, column-store and tightly integrated with Amazon's other offerings, and a few customers have asked us about it as an option. And these alternatives are certainly interesting – in practice, not all that different in pricing to an Oracle database as a source, but with some interesting analytic features – yet they all suffer from the common issue that they're not officially supported as data sources. Amazon Redshift, for example, uses Postgres-derived ODBC drivers to connect to it, but Postgres itself isn't officially supported as a source, which means you could well get sub-optimal queries and you certainly won't get specific support from Oracle for that source. But if it works for you then this could be an option, along with more left-field data sources such as Hadoop.
But to my mind, it’s the ETL element that’s the most interesting, and most challenging, part of the equation. Going back to Openworld, Oracle made a few mentions of ETL in their general Cloud Analytics talks, including talk about an upcoming data source adapter for the BI Apps that’ll enable loading from Fusion Apps in the cloud, like this:
There were also a number of other deployment options discussed, including hybrid architectures where some sources were in the cloud and some on-premise, but all involved running the ETL elements of the BI Apps – ODI or Informatica – on-premise, the same way as installs work today. And to my mind, this is where the Oracle cloud offering is the weakest: around cloud-based and cloud-native ETL and data integration, the only real option at the moment is to run ODI agents in the cloud and connect them back to an on-premise ODI install, or move the whole thing into the cloud in what could be quite a heavyweight data integration architecture.
Other vendors are, in my opinion, quite a way further forward with their cloud data integration tools strategy than Oracle, who instead seem to be still focused on on-premise (mainly), database-to-database (mostly) ETL. To take two examples: Informatica have an Informatica Cloud service which appears to be a platform-as-a-service play, with customers presumably signing up for the service, designing their ETL flows and then paying for what they use, with a focus on cloud APIs as well as database connectivity, and full services around data quality, MDM and so forth.
Another vendor in this space is SnapLogic, a pure-play cloud ETL vendor selling a component-based product with a big focus on cloud, application and big data sources. What's interesting about this and other similar vendors' approaches, though, is that they appear to be "cloud-first" – written for the cloud, sold as a service, as much focused on APIs as database connectivity – a contrast to Oracle's current data integration tools strategy, which to my mind still assumes an on-premise architecture. What's more concerning is the lack of any announcement around ETL-in-the-cloud at the last Openworld – if you look at the platform-as-a-service products announced at the event, whilst database, BI, documents, BPM and so forth were all announced as services, there was no mention of data integration:
What I’d like to see added to this platform in terms of data integration would be something like:
- On-demand data integration, sold as a service, available as a package along with database, Java and BI
- Support for Oracle and non-Oracle application APIs – for example Salesforce.com, Workday and SAP – see for example what SnapLogic support in this area.
- No need for an install – it’s already installed and it’s a shared platform, as they’re doing with OBIEE
- Good support for big data, unstructured and social data sources
I think it's pretty likely this will happen – whilst products such as the BI Apps can have their ETL in the cloud, via ODI in the BI Apps 11g version for example, these are inherently single-tenant, and I'd fully expect Oracle at some point to offer BI Apps-as-a-service, with a corresponding data integration element designed from the ground up to work cloud-native and integrate with the rest of Oracle's platform-as-a-service offering.
So there we have it – some thoughts on the database and ETL elements in an OBIEE-in-the-cloud offering. Keep an eye on the blog over the next few months as I build out a few examples, and I'll be presenting on the topic at the upcoming BIWA 2014 event in San Francisco in January – watch this space, as they say.
One of the major announcements at this year's Oracle Openworld in San Francisco was around "OBIEE in the cloud". Part of a wide-ranging set of announcements around services and infrastructure in the cloud from Oracle, the idea with this service is that you'll be able to access an Oracle-hosted future version of OBIEE running in Oracle's public cloud service, so that you can create dashboards and reports without having to perpetually license lots of software, or stand up lots of infrastructure in your data centre. But of course "cloud" is a hot topic at the moment, and lots of our customers are asking us about options to run OBIEE in the cloud, what's involved in it, and how we deal with the stuff that surrounds an OBIEE installation – the data, the ETL, the security, and so forth. So I thought I'd put a couple of blog posts together to go through what options are available to OBIEE customers looking to deploy in the cloud, and see what the state of the industry is around this requirement.
I'll start off, then, by looking at the most obvious element – the BI platform itself. If you want to run OBIEE "in the cloud", what are your basic options?
Probably conceptually the simplest is just to run the OBIEE server applications in a cloud-based, hosted environment, for example Amazon EC2. In this instance, OBIEE runs as it would do normally, but instead of the host server being in your datacenter, it's in a public cloud. This type of setup has been available for some time in the form of services such as Oracle On-Demand, but the difference here is that you're using a public cloud service, there are no service contracts to set up, usually no minimum contract size, and everything to do with security, failover, maintenance and so on is down to you. We've been doing this sort of thing for a long time, typically for our cloud-based training environments, or for ad-hoc development work where it makes more sense for a server to be centrally available on the internet, rather than sitting on someone's laptop or on an internal server. One potential complication here is licensing – for products such as the Oracle Database, there are formal schemes such as Amazon RDS where either the license is included, or there are set metrics and arrangements to either bring your own license or buy one based on virtual CPUs. If you're prepared to take care of the licensing, and all of the security and maintenance aspects, this is an interesting approach.
In practice though, any setup like this is going to be what’s termed a “hybrid” cloud deployment; at least the administration side of OBIEE (the BI Administration Tool) is still going to be “on-premise”, as there’s no way you can deploy that “thin-client”, unless you create a Windows VM in the cloud too, and run the BI Administration tool from there. Moreover, your data sources are more than likely to still be on-premise, with just a few at this point hosted in the cloud, so most likely your OBIEE-in-the-cloud architecture will look like this:
There are other things to think about too: how do you do your ETL when your target data warehouse might be in the cloud, or your data sources might be cloud-based; how do you connect to your LDAP server, and so forth – I'll cover these in the next postings in the series. But for now, this is conceptually simple, and its main benefit is avoiding the need to host your own servers, buy your own hardware, and so forth.
So what if you want to use OBIEE, but you don’t want the hassle even of setting up your own servers in a public cloud? This is where Oracle’s new “BI in the Cloud” service comes in – the idea here is that you swipe your credit card, fill in your details, self-provision your system, and then start loading data into it.
Whilst the underlying technology is core OBIEE, it’s designed for departmental, power user-type scenarios where the scope of data is limited, and the focus really is on ease-of-use, self-service and “software-as-a-service”. Looking at Oracle’s slides from Openworld, you can see the focus is quite different to on-premise OBIEE – the competition here for Oracle is the various internet startups, and products like Tableau, that make it easy to get started, provide consumer-class user interfaces, and focus on the single user rather than enterprise-type requirements.
But this focus on simplicity means a limitation in functionality as well. At the start, at least, Oracle’s BI in the Cloud will only offer Answers and Dashboards; no KPIs, BI Publisher, Maps or Delivers. Moreover, at least in its first iteration, it’ll only support Oracle’s own database schema-as-a-service as the single, sole datasource, so no Hybrid cloud/on-premise federated queries, no Essbase, and no ETL tools – hence the focus on single-user, departmental use-cases.
What you do get with this setup though is a cloud-native, fully-managed service where patching is taken care of for you, all the new features appear first, and the administration element has been re-thought to be more appropriate for deployment to the cloud – in particular, a new version of the BI Administration tool that's cloud-native, focused on simple use-cases, and doesn't require desktop client installs or VPN connections through to the OBIEE server. The diagram below shows the architecture for this option, and you can see it's all pretty self-contained; great for simplicity, but maybe a bit limiting, at least in the initial iteration.
The third major variant that I can see around OBIEE in the cloud, is where partners (such as Rittman Mead) offer shared access to their cloud OBIEE installations, either as part of a development package or through some sort of multi-tenant reporting product. For example, we’re about to launch a service we’re calling “ExtremeBI in the Cloud”, where we combine our “ExtremeBI” agile development method with cloud-based development server hosting, with the cloud element there to make it easier for clients to start on a project *today*. Endeca Information Discovery is another product that could benefit from this approach, as the infrastructure behind an Endeca install can be fairly substantial, but users are typically more-focused on the data modelling and report-building element. In our case, the back-end cloud hosting will typically be done on a service like Amazon EC2 or VMWare’s Hybrid Cloud product, but the focus is more on the development piece – cloud is there to provide business agility.
The other third-party approach to this centres around multi-tenancy; typically, a partner or company will provide a reporting or analytics element to a wider product set, and use OBIEE as its embedded dashboarding tool as part of this. In the past, setting up multi-tenancy for OBIEE was fairly difficult, to the point where most companies set up separate installs for each end-customer, but 11.1.1.7 brought a number of multi-tenant features that presumably were put there to support Oracle's own cloud product. The product docs describe the new multi-tenant features well, but the basic premise is that a single RPD and catalog are still set up, but OBIEE can then run in a "multi-tenant" mode where GUIDs demarcate each tenant, a new BIGlobalAdministrator role is created that assumes what were the old BIAdministrator role's privileges, with other administrator roles then set up with limited privileges and no access to Fusion Middleware Control, for example.
What this does give you though is the framework to set up tenant groupings of users, separate areas of the catalog for each tenant, additional identity store attributes to hold tenant details, and a security framework that enables more limited forms of administrator account more suited to tenant-type situations. Again, not all OBIEE features are available when running in multi-tenant mode, and presumably we’ll see this feature evolve over time as more requirements come in from Oracle’s own cloud BI product, but it’s a feature you can use now if you’re looking to set up a similar type of environment.
So that’s the three basic options if you want to run OBIEE in the cloud; host it in a public cloud service like Amazon EC2 but then just run it like regular OBIEE; go for Oracle’s upcoming BI-in-the-cloud service, if the use-case suits you and you can live with the limitations, or consider one of the services from partners such as ourselves where we bundle cloud hosting of OBIEE up with a larger service offering like our “ExtremeBI in the Cloud” offer. But of course OBIEE is only one part of the overall platform – where do you store the data, and how do you get the data from wherever it normally is up into the cloud – in other words, how do we do ETL in the cloud? Check back tomorrow for the follow-up post to this one: Part 2 – Data Warehouse and ETL.
Bit of a corner case this week, but I wanted to park this as much for my reference as yours. Need to be able to test a pure XSL template against some sample data? That's an XSL template that is going to generate HTML, text or XML. The Template Viewer app in the BI Publisher Desktop group does not offer that as an option. It does offer XSL-FO processing though.
A few minutes digging around in the java libraries and I came up with a command line solution that is easy to set up and use.
1. Place your sample XML data and the XSL template in a directory
2. Open the lib directory where the TemplateViewer is installed. On my machine that is d:\Oracle\BIPDesktop\TemplateViewer\lib
3. Copy the xmlparserv2.jar file into the directory created in step 1.
4. Use the following command in a DOS/Shell window to process the XSL template against the XML data.
java -cp ./xmlparserv2.jar oracle.xml.parser.v2.oraxsl fileX.xml fileY.xsl > fileX.xls
The file generated will depend on your XSL. For an Excel output, you would instruct the process to generate fileX.xls in the same folder. You can then test the file with Excel, a browser or a text editor. Now you can test on the desktop until you get it right without the overhead of having to load it to the server each time.
To be completely clear, this approach is for pure XSL templates that are designed to generate text, HTML or XML. It's not for the XSL-FO templates that might be used at runtime to generate PDF, PPT, etc. For those you should use the Template Viewer application; it supports the XSL-FO templates but not the pure XSL templates.
If processing fails even though your template does fall into the pure XSL template category, this will be down to you using some BIP functionality in the template. To get it to work you'll need to add the Publisher libraries that contain the function, e.g. xdo-core.jar, i18nAPI_v3.jar, etc., to the classpath argument (-cp).
So a new command including the required libraries might look like:
java -cp ./xmlparserv2.jar;./xdo-core.jar;./i18nAPI_v3.jar oracle.xml.parser.v2.oraxsl fileX.xml fileY.xsl > fileX.xls
You will need to either move the libraries to the local directory, my assumption above or include the full path to them. More info here on setting the -cp attribute.
There are two steps you have to take to "save big". First, become a full member of ODTUG for $99 and enjoy all of the benefits, including access to a members-only presentations library, throughout the year. Next, register for Kscope14 and you are eligible for the members-only price of $1500 for a savings of $150. While you are registering, simply use the code AOLAP to get an additional $100 discount!
My company, Applied OLAP, is one of the top-tier Platinum Sponsors of Kscope14 and I will be there. I hope to see you at the conference and, if you were able to save some money by using our exclusive AOLAP code, be sure to stop by our booth, say hello, and learn how the Dodeca Spreadsheet Management System can help your company reduce spreadsheet risk, increase spreadsheet accuracy, and reduce costs.
We all love a good commandline utility. It gives us that warm feeling of control and puts hairs on our chests. Either that, or it means we can script the heck out of a system, automate many processes, and concentrate on more important matters.
However, some of the OBIEE commandline utilities can’t be used in Production environments at many sites because they need the credentials for OBIEE stored in a plain-text file. Passwords in plain-text are bad, mmmm’kay?
Two utilities in particular that it's a shame can't be scripted up and deployed in Production because of this limitation are the Presentation Services Catalog Manager and the Presentation Services Replication Agent. Both of these serve very useful purposes, and what I want to share here is a way of invoking them more securely.

Caveat
IANAC : I Am Not A Cryptographer! Nor am I a trained security professional. Always consult a security expert for the final word on security matters.
The rationale behind developing the method described below is that some sites will have a "No Plaintext Passwords" policy which flat out prevents the use of these OBIEE utilities. However, at the same sites the use of SSH keys to enable one server to connect to another automatically is permitted. On that basis, key-based encryption for the OBIEE credentials may therefore be considered an acceptable risk. As per Culp's 9th law of security administration, it's all about striking the balance between enabling functionality and mitigating risk.
The method described below is, I believe, a bit more secure than plaintext credentials, but it is not totally secure. It uses key-based encryption to secure the previously-plaintext credentials that the OBI utility requires. This is one step better than plaintext alone, but is still not perfect. If an attacker gained access to the machine they could still decrypt the file, because the key is held on the machine without a passphrase to protect it. The risk here is that we are using security by obscurity (because the OBIEE credentials are in an encrypted file it appears secure, even though the key is held locally), and like the emperor's new clothes, if someone takes the time to look closely enough there is still a security vulnerability.
My final point on this caveat is that you should always bear in mind that if an attacker gains access to your OBIEE machine then they will almost certainly be able to do whatever they want regardless, including decrypting the weblogic superuser credentials or resetting them to a password of their own choosing.

Overview
Two new shiny tools I've acquired recently and am going to put to use here are GnuPG (gpg) and mkfifo. GPG provides key-based encryption and decryption and is available by default on common Linux distributions including Oracle Linux. mkfifo is also commonly available and is a utility that creates named pipes, enabling two unrelated processes to communicate. For a detailed description and advanced usage of named pipes, see here.
This is a one-time setup activity. We create a key in gpg, and then encrypt the plaintext credentials file using it.
The first step is to create a gpg key, using gpg --gen-key. You need to specify a "Real name" to associate with the key; I just used "obiee". Make sure you don't specify a passphrase (otherwise you'll be back in the position of passing plaintext credentials around when you use this script).
$ gpg --gen-key
gpg (GnuPG) 1.4.5; Copyright (C) 2006 Free Software Foundation, Inc.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions. See the file COPYING for details.

Please select what kind of key you want:
   (1) DSA and Elgamal (default)
   (2) DSA (sign only)
   (5) RSA (sign only)
Your selection?
DSA keypair will have 1024 bits.
ELG-E keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048) 2048
[...]
Real name: obiee
Email address:
Comment:
[...]
You don't want a passphrase - this is probably a *bad* idea!
I will do it anyway. You can change your passphrase at any time,
using this program with the option "--edit-key".
[...]
gpg: key 94DF4ABA marked as ultimately trusted
public and secret key created and signed.
Once this is done, you can encrypt the credentials file you need for the utility. For example, the Catalog Manager credentials file has the format:
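From memory it is a simple two-line properties file along these lines, so double-check the exact property names against the Catalog Manager documentation for your version:

login=weblogic
pwd=Password01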
To encrypt it, use:
gpg --recipient obiee --output saw_creds.gpg --encrypt saw_creds.txt
Now remove the plaintext password file:
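For example (shred is arguably a better choice than rm where available, since rm only unlinks the file rather than overwriting its contents):

rm saw_creds.txt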
Using the secure credentials file
Once we have our encrypted credentials file we need a way of using it with the utility it is intended for. The main thing we’re doing is making sure we don’t expose the plaintext contents. We do this using the named pipes method:
In this example I am going to show how to use the secure credentials file with runcat.sh, the Catalog Manager utility, to purge the Presentation Services cache. However, it should work absolutely fine with any utility that expects credentials passed to it in a file (or on stdin).
There is a three-step process:
1. Create a named pipe with mkfifo. This appears on a disk listing with the p bit to indicate that it is a pipe. Access to it can be controlled by the same chmod process as a regular file. With a pipe, a process can request to consume from it, and anything that is passed to it by another process will go straight to the consuming process, in a FIFO fashion. What we're doing through the use of a named pipe is ensuring that the plaintext credentials are not visible in a plaintext file on the disk.
2. Invoke the OBIEE utility that we want to run. Where it expects the plaintext credentials file, we pass it the named pipe. The important bit here is that the utility will wait until it receives the input from the named pipe – so we call the utility with an ampersand so that it returns control whilst still running in the background.
3. Use gpg to decrypt the credentials file, and pass the decrypted contents to the named pipe. The OBIEE utility is already running and listening on the named pipe, so will receive (and remove from the pipe) the credentials as soon as they are passed from gpg.
The script that will do this is as follows:
# Change folder to where we're invoking the utility from
cd $FMW_HOME/instances/instance1/bifoundation/OracleBIPresentationServicesComponent/coreapplication_obips1/catalogmanager
# Create a named pipe
mkfifo cred_pipe
# Let's make sure only we can access it
chmod 600 cred_pipe
# Invoke Catalog Manager. Because we're using a named pipe, it's actually going to sit and wait
# until it gets input on the pipe, so we need to put the ampersand in there so that it returns
# control to the script
./runcat.sh -cmd clearQueryCache -online http://localhost:9704/analytics/saw.dll -credentials cred_pipe &
# Decrypt the credentials and send them to the named pipe
gpg --quiet --recipient obiee --decrypt saw_creds.gpg > cred_pipe
# Remove the named pipe
rm cred_pipe
Depending on the utility that you are invoking, you may need to customise this script. For example, if the utility reads the credentials file multiple times then using the named pipes method it will fail after the first read. Your options would be to write the credentials into the pipe multiple times (possibly a bit hacky), or land the plaintext credentials to disk and delete them after the utility completes (which could be less secure if the delete doesn't get invoked).

Using a secure credentials file for command line arguments
Whilst the sticking point that triggered this article was utilities requiring whole files with credentials in them, it is also common to see command line utilities that want a password passed as an argument to them. For example, nqcmd:
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s myscript.lsql
Let's assume we've created an encrypted file containing "Password01" (using the gpg --encrypt method shown above) and saved it as password.gpg.
To invoke the utility and pass across the decrypted password, there’s no need for named pipes. Instead we can just use a normal (“unnamed”) pipe to send the output straight from gpg to the target utility (nqcmd in this example), via xargs:
gpg --batch --quiet --recipient obiee --decrypt ~/password.gpg | xargs -I GPGOUT nqcmd -d AnalyticsWeb -u weblogic -p GPGOUT -s input.lsql
xargs has a --interactive option that makes it a lot easier when developing piped commands such as the above.
Because there is no passphrase on the gpg key, a user who obtained access to the server would still be able to decrypt the credentials file. In many ways this is the same situation that would arise if a server was configured to use ssh-key authentication to carry out tasks or transfer files on another server.

Uses
Here are some of the utilities that the above now enables us to run more securely:
- nqcmd is a mainstay of my OBIEE toolkit, being useful for performance testing, regression testing, aggregate building, and more. Using the method above, it’s now easy to avoid storing a plaintext password in a script that calls it.
- Keeping the Presentation Catalog in sync on an OBIEE warm standby server, using Presentation Services Replication
- Purging the Presentation Services Cache from the command line (with Catalog Manager, per the above example)
- SampleApp comes with four excellent utilities that Oracle have provided; however, all but one by default require plaintext credentials. If you've not looked at the utilities closely yet, you should! You can see them in action in SampleApp itself, or get an idea of what they do by looking at the SampleApp User Guide pages 14–17 or watching the YouTube video.
Over the past few months I've been posting a number of articles about Hadoop, and how you can connect to it from ODI and OBIEE. From an ODI perspective, I covered Hadoop as one of a number of new data sources ODI11g could connect to, then looked at how it leveraged Hive to issue SQL-like data extraction commands to Hadoop, and how it used Oracle Hadoop connector tools to transfer Hadoop data into the Oracle Database and work directly with data in HDFS files. For OBIEE, I went through the background to Hadoop, Hive and the other "big data" technologies, stepped through a typical Hive query session, then showed how OBIEE 11.1.1.7 could connect to Hadoop through its newly-added Hive adaptor, then finally built a proof-of-concept OBIEE connection through to Cloudera Impala and extended that to a multi-node Hadoop cluster.
But why all this interest in Hadoop – what's it really got to do with OBIEE and ODI, and why should you as developers be interested in what's probably yet another niche BI/DW datasource? Well, in my opinion Hadoop is the classic disruptive technology – cheap, and starting off with far less functionality than regular, relational databases – but it's improving fast, and for us as BI & DW developers it offers the potential of both massive benefits – significantly lower TCO for basic DW work, and support for lots of modern, internet-scale use-cases – and threats – in that if we don't understand it and see how it can benefit our customers and end-users, we risk being left behind as technology moves on.
To my mind, there are two main ways in which Hadoop, Hive, HDFS and the other big-data ecosystem technologies are used, in the BI/DW context:
1. Standalone, with their own query tools, database tools, query languages and so forth – your typical “data scientist” use case, originating from customers such as Facebook, LinkedIn etc. In this context, there’s typically no Oracle footprint, users are pretty self-sufficient, any output we see is in the form of “insights”, marketing campaigns etc.
2. Alongside more mainstream, for example Oracle, technologies. In this instance, Hadoop, Hive, HDFS, NoSQL etc are used as complementary, and supporting, technologies to enhance existing Oracle-based data warehouses, capture processes, BI systems. In some cases, Hadoop-type technologies can replace more traditional relational ones, but mostly they’re used to make BI&DW systems more scaleable, cheaper to run, able to work with a wider range of data sources and so forth. This is the context in which Hadoop can be relevant to more traditional Oracle BI, ETL and DW developers.
To understand how this happened, let’s go through a bit of a history lesson. Five years ago or so, your typical DW+BI architecture looked like this:
The data warehouse was typically made up of three layers – staging, foundation/ODS and performance/dimensional – with data stored in relational databases and some use made of OLAP servers, or some of the newer in-memory databases like QlikView. But over the intervening years, the scale and types of data sources have increased, with customers now looking to store data from unstructured and semi-structured sources in their data warehouse, take in feeds from social media and other "streaming" sources, and access data in cloud systems typically via APIs, rather than traditional ETL loading. So now we end up with a data warehouse architecture that looks like this:
But this poses challenges for us. From an ETL perspective, how do we access these non-traditional sources, and once we’ve accessed them – how do we efficiently process them? The scale and “velocity” of some of these sources can be challenging for traditional ETL processes that expect to log every transformation in a database with transactional integrity and multi-version concurrency control, whilst in some cases it doesn’t make sense to try and impose a formal data structure on incoming data as you’re capturing it, instead giving it the structure when we finally need it, or when we choose to access it in a query.
And then came “Hadoop”, and its platform and tool ecosystem. At its core, Hadoop is a framework for processing, in a massively-parallel and low-cost fashion, large amounts of data using simple transformation building blocks – filtering (mapping) and aggregating (reducing). Hadoop and MapReduce came out of the US West Coast Internet scene as a way of processing web and behavioural data in the same massively-distributed way that companies provided web search and other web 2.0 activities, and a core part of it was that it was (a) open-source, like Linux and (b) cheap, both in being open-source but also because it was designed from the outset to run on low-cost, commodity hardware that’s expected to fail. Pretty much the opposite of Oracle’s business model, but also obviously very attractive to anyone looking to lower the TCO of their data warehouse system.
So as I said – the Hadoop pioneers went out and built their systems without much reference to vendors such as Oracle, IBM, Microsoft and the like, and being blunt, they won’t have much time for traditional Oracle BI&DW developers like ourselves. But those customers who are largely invested in Oracle technology, but see advantages in deploying Hadoop and big data technologies to make their systems more flexible, scaleable and cheaper to run – that’s where ODI and OBIEE’s connectivity to these technologies becomes interesting.
To take the example of customers who are looking to deploy Hadoop technologies to enhance their Oracle data warehouse – a typical architecture going down this route would look like this:
In this example, we’re using HDFS – Hadoop Distributed File System – as a pre-staging area for the data warehouse, storing incoming files cheaply, and with built-in fault tolerance, to the point where storage is so cheap that you might as well keep stuff you’re not interested in now, but think might be interesting in the future. Using Oracle Direct Connector for HDFS, you can set up Oracle Database external tables that map onto HDFS just like any other file system, so you can extract from and otherwise work with these files without worrying about writing MapReduce jobs; and through the Oracle Data Integration Adaptor for Hadoop, ODI can connect to these table sources as well, and work with them just like any other topology source, as I show in the slide below from my upcoming UKOUG Tech’13 session on ODI, OBIEE and Hadoop that’s running in a couple of weeks’ time in Manchester:
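To give a flavour of what that looks like at the SQL level, the DDL for an HDFS-backed external table is along the lines of the sketch below. This is only illustrative – in practice the connector’s ExternalTable utility generates the table definition and the “location files” for you, and the directory objects, location files and column list shown here are all hypothetical:
-- hypothetical names throughout; the connector normally generates this DDL for you
CREATE TABLE flight_delays_ext (
  flight_date  VARCHAR2(10),
  origin       VARCHAR2(3),
  dest         VARCHAR2(3),
  distance     NUMBER,
  arr_delay    NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR hdfs_bin_dir:'hdfs_stream'
    FIELDS TERMINATED BY ','
  )
  LOCATION ('flight_delays_1.loc','flight_delays_2.loc')
)
REJECT LIMIT UNLIMITED;
-- once created, the HDFS-resident files can be queried like any other external table
SELECT COUNT(*) FROM flight_delays_ext;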
As well as storing data, you can also do simple filtering and transformation on that data, using the Hadoop framework. Most upfront data processing you do as part of an ETL process involves filtering out data you’re not interested in, joining data sets, grouping and aggregating data, and other large-scale data transformation tasks, before you then load it into the foundation/ODS layer and do more complex work. And this simple filtering and transformation is what Hadoop does best, on cheap hardware or even in the cloud – and if your customer is already invested in ODI and runs the rest of their ETL process using it, it’s relatively simple to add Hadoop capabilities to it, using ODI to orchestrate the data processing steps but using Hadoop to do the heavy lifting, as my slide below shows:
Now some customers, and of course Hadoop vendors, say that in reality you don’t even need the Oracle database if you’re going to build a data warehouse, or more realistically a data mart. Now that’s a bigger question and probably one that depends on the particular customer and circumstances, but a typical architecture that takes this approach might look like this:
In this case, ODI again has capabilities to transform data entirely within Hadoop – with ODI acting as the ETL framework and co-ordinator, but Hadoop doing the heavy-lifting – and there’s always the ability to get the data out of Hadoop and into a main Oracle data warehouse, if the Hadoop system is more of a data mart or department-specific analysis. But whichever way – in most cases the customer is going to want to continue to use their existing BI tool, particularly if their BI strategy involves bringing together data from lots of different systems, as you can do with OBIEE’s federated query capability – giving you an overall architecture that looks like this:
So it’s this context that makes OBIEE’s connectivity to Hadoop so important; I’m not saying that someone creating a Hadoop system from scratch is going to go out and buy OBIEE as their query tool – more typically, they’ll use other open-source tools or create models in tools like R; or they might go out and buy a lightweight data visualisation tool like Tableau and use that to connect solely to their Hadoop source. But the customers we work with have typically got much wider requirements for BI, have a need for an enterprise metadata model, recognise the value of data and report governance, and (at least at present) access most of their data from traditional relational and OLAP sources. But they will still be interested in accessing data from Hadoop sources, and OBIEE’s new capability to connect to this type of data, together with closer integration with Endeca and its unstructured and semi-structured sources, addresses this need.
So there you have it – that’s why I think OBIEE and ODI’s ability to connect to Hadoop is a big deal, and it’s why I think developers using those tools should be interested in how it works, and should try setting up their own Hadoop systems to see it in action. As I said, I’ll be covering this topic in some detail at the UKOUG Tech’13 Conference in Manchester in a couple of weeks’ time, so if you’re there on the Sunday come along and I’ll try and explain how I think it all fits together.
The other day I posted an article on the blog about connecting OBIEE 11.1.1.7 to Cloudera Impala, a new “in-memory” SQL engine for Hadoop that’s much faster than Hive for interactive queries. In this example, I connected OBIEE 11.1.1.7 to the Cloudera Quickstart CDH4 VM, which comes with all the Hadoop and Cloudera tools pre-installed and configured, making it easy to get going with the Hadoop platform.
Whilst the example worked though, I couldn’t help thinking that using Impala against a single node Hadoop install isn’t really how it’d be used in real-life; in reality, if you used OBIEE in this way, you’re much more likely to be connecting to a full Hadoop cluster, with multiple server nodes handling the incoming queries and potentially gigabytes, terabytes or petabytes of data being processed. So is it possible to set up a Hadoop cluster that gets a bit nearer to this multi-node architecture, so we can practice connecting to a cluster and not a single server, and we can see Hadoop process our queries across all of the nodes – as we’d see in real life, given that this low-cost MPP processing is the key benefit of Hadoop as a whole?
Hadoop, as you’re probably aware, was designed from the ground-up to run across multiple nodes, with those nodes typically either being small, low-cost servers, or in many cases servers running in the “cloud”. As such, you’re as likely to see Hadoop running on a cluster of Amazon EC2 servers as running on physical servers in a datacenter, and in most cases the underlying OS running on those servers is Linux – most usually, Ubuntu 64-bit. So if we want to set up our own Hadoop cluster, there’s a few options open to us:
1. Get hold of a bunch of physical servers (maybe, old PCs or blade servers), install Linux and Hadoop on them, and then do the configuration and setup manually.
2. Buy a preconfigured solution – Oracle’s Big Data Appliance, for example, which has all the software pre-installed along with connectivity to ODI, Oracle Database etc
3. Spin-up a bunch of servers in the Cloud, or
4. Spin-up a bunch of Linux VMs, for example using VirtualBox or VMWare Fusion/Workstation
In the past I’ve done the Hadoop setup myself, manually, using the Hadoop distribution files available on the Hadoop website, but more recently vendors such as Hortonworks, MapR and Cloudera have put together their own Hadoop + added value tools distributions, and it’s Cloudera I’ve been looking at in most detail recently (if only because there’s a lot of ex-Oracle people there who I know, and it’s the bundled Hadoop distribution that comes with Oracle Big Data Appliance). What’s particularly good about Cloudera’s Hadoop offering is their “Cloudera Manager” utility – of which there’s a free version – and which simplifies the whole process of setting up a cluster by automating most of the process.
What’s also particularly interesting about Cloudera and Cloudera Manager is that there are a number of solutions available out there that automate the process of spinning-up clusters. One of them, described in this blog post on Cloudera’s website, involves using a built-in feature in Cloudera Manager to automatically create, provision and configure X number of Amazon EC2 virtual servers, with these servers then able to handle your queries in parallel and you just paying by the hour for the compute resource you need. So let’s give it a try.
I won’t go into a tutorial or explanation of Amazon Web Services and their EC2 (“Elastic Compute Cloud”) service here, suffice to say that you can create on-demand cloud-based VMs, paying by the hour and with pricing based on the size of instance, amount of memory needed, OS needed and so forth. We use Amazon AWS and EC2 extensively within Rittman Mead for development work, training environments and so forth, and what’s particularly interesting about AWS is the fact it’s all scriptable, there’s a public API and so forth. It’s this public API that Cloudera Manager uses to provision and spin-up the cluster VMs, something Cloudera Manager will automatically offer to do if it detects it’s running on Amazon EC2. Once you’ve provisioned the basic VMs, Cloudera Manager will automatically install the required Hadoop software on each of the VMs, meaning you can just sit back and watch the install, and then at the end, log in and check it’s all working.
And it did work – compared with the problems I’d had with earlier versions of Cloudera Manager, where I’d set up the servers and installed Linux and Cloudera Manager myself, it all went amazingly well – to the point where I could upload some data into Impala itself, run some queries, and make use of my cloud-based Hadoop cluster.
And for one-off tasks, particularly where you need a very large amount of compute resource for a relatively small amount of time, Amazon AWS is great, but the cost soon starts to mount up if you leave it running for too long – reckon on a cost of around $200-$400 for a reasonably high-specced instance for a month, multiplied by the number of servers in your cluster.
So what are the alternatives? Well as I said before, you could set up a cluster yourself, installing the various bits of software, and potentially creating lots of VMs on a test server to host the Hadoop nodes. But another route you could take is to use one of the various “devops” tools out there to automate the build of a number of Hadoop nodes, using virtualisation tools such as VMWare or VirtualBox and a reasonably well-specced desktop or laptop. Back on the Cloudera website again, I saw a blog post and how-to just along these lines – one that used a devops tool called Vagrant to script and automate the build of the cluster, including setting up Ubuntu Linux on the VMs, and downloading and installing Cloudera Manager, just leaving the job of configuring the Hadoop cluster to us afterwards. This is the approach I finally went with in the end, and to do the same you’d need the following bits of kit and software:
- A desktop or server with a lot of RAM – I’ve used my iMac which has 32GB RAM; you could use a 16GB server or laptop but you’ll need to alter the Vagrant scripts to reflect this
- Either VirtualBox (supported by default by Vagrant) or VMWare Fusion / Workstation (which require an extra-cost plugin for Vagrant)
- Vagrant itself – a free download with install instructions here
Vagrant itself is an easy install and is available for OS X, Windows, Linux etc, and you can download the Vagrant configuration file for setting everything up from the Cloudera blog post. Then, it was a case of running the Vagrant script, and watching it create my VMs.
In the terminal screenshot above, you can see the VMs being created (I started off using VirtualBox, later on I switched to VMWare Fusion), and in the screenshot below, you can see the various Cloudera packages being downloaded and installed.
Once the Vagrant set-up of the VMs was complete, I then logged into the Cloudera Manager website, and similar to how I’d done it with the EC2-based install, I just selected the other VMs to configure, chose the software components, and let the configuration complete.
At the end of the install process, I had six VMs running to provide my Hadoop cluster, each one using about 4GB of memory, and playing the following roles:
To be honest – you still need to know a bit about Hadoop, what the various bits do and so on, to get it working – but then you need the same to get Oracle, SQL Server and so on installed. I guess the real barrier is having a machine big enough to run multiple Hadoop server nodes – too few and you don’t really see how the query processing works – so I guess this is why the cloud / EC2 route is so popular. But for me, I’ve got the six nodes working now, along with an OBIEE Windows VM with 11.1.1.7 installed to test out the connectivity. The screenshot below shows Cloudera Manager listing out the nodes in the cluster:
whilst the screenshot below this shows the various Hadoop platform elements listed out alongside the cluster nodes (a.k.a. VirtualBox/VMWare VMs) they’re running on.
and with OS X’s Activity Monitor showing they’re (just about) comfortably running within the overall 32GB RAM in the iMac.
So – the moment of truth – let’s try out some queries. I’ll start with Hive first of all, as Hue (the web-based UI for Hadoop) has some nice tools for uploading files and creating Hive tables out of them – or of course you can use ODI and its Hadoop Adapter to upload data to the cluster as part of an ETL process. To use a more meaningfully-large dataset, I unloaded some of the tables from the full Airline Delays dataset to CSV files (around 180m rows of flight leg data), and then created Hive tables out of those – the screenshot below shows data from the main flight leg fact table.
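If you’d rather do it by hand than through Hue, the HiveQL to expose a CSV file as a table is along the lines of the sketch below – the table and column names here are hypothetical stand-ins for the real flight leg fact table:
-- hypothetical table and column names, standing in for the real flight leg fact table
CREATE TABLE flight_leg_fact (
  flight_date  STRING,
  origin       STRING,
  dest         STRING,
  distance     INT,
  arr_delay    INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- load the uploaded CSV file into the table (this moves the file into Hive's warehouse directory in HDFS)
LOAD DATA INPATH '/user/cloudera/flight_legs.csv' INTO TABLE flight_leg_fact;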
In the background, two things happen when you upload new data into Hive: first, the file containing the data is stored in Hadoop’s filesystem, called HDFS (Hadoop Distributed File System), a unix-like distributed filesystem that breaks data down into blocks, and stores the blocks redundantly across the nodes in the cluster; second, Hive records the table definition, and the location of this file, in its metastore, so that the data can then be queried using HiveQL. If we take a look at the file I uploaded with the flight delays fact table data in it, you can see that it’s been broken down into blocks as shown at the bottom of the page:
If you click on an individual block, you can also see that the block is stored primarily on one node, and then redundantly on three other nodes in the cluster.
HDFS does this for two reasons; first, by spreading the data file over multiple servers, it can take advantage of the parallel processing provided by the Hadoop framework. Second, though, this redundancy means that if any node goes down, there’s copies of the data blocks elsewhere in the cluster, giving you the ability to use low-cost, commodity hardware (or cloud-based servers) whilst still protecting uptime, and your data.
So let’s run a query via the Hue UI, using Hive first of all. I put together a simple query that sums up flights, and averages distances, for all flights with California as the destination. As you can see from the Hue screenshot below, the query triggered two MapReduce jobs, one to find all flights with Californian destinations (the “map”), and one to aggregate the results (the “reduce”).
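The query itself was nothing exotic – something along the lines of the HiveQL sketched below, although the table and column names are again hypothetical stand-ins for the real schema:
-- hypothetical schema: a flight leg fact table joined to a destination geography dimension
select   g.dest_state,
         count(*)        as num_flights,
         avg(f.distance) as avg_distance
from     flight_leg_fact f
         join dest_geog_dim g on f.dest = g.airport_code
where    g.dest_state = 'California'
group by g.dest_state;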
Looking at the MapReduce jobs being spun-up, run and then results gathered in, you can see that the MapReduce element (i.e., query time) took just under three minutes.
Going over the Job Tracker / MapReduce admin pages in Cloudera Manager, you can see the MapReduce jobs that were triggered by the Hive query – see how it handles the join, and how the filtering (mapping) is handled separately to the aggregating (reducing).
You can also bring up the Hadoop task tracker page, to see how the task tracker gave out chunks of the work to the various nodes in the cluster, and then got the results back in the end.
So – what about the Impala equivalent of the same query, then? Let’s give it a go. Well the query is more or less the same, but this time the results come back in around ten seconds, as we’d expect with Impala.
Looking inside Cloudera Manager, you can see the various Impala server processes working in the background, bypassing the need to generate MapReduce code and instead, using their own in-memory MPP framework to query the HDFS files and return the results.
And finally – the $64,000 question – can we connect OBIEE to the cluster? To do this, you’ll need to download the Cloudera Impala ODBC drivers, as I outlined in my previous blog post on the subject, but once you do, it should work – see the screenshot below where I’m querying the flight delays data using OBIEE 11.1.1.7.
So there you have it – a couple of ways you can spin-up your own multi-node Hadoop cluster, and confirmation that it should all still work with OBIEE once it’s put together.
Many of you know Mike Riley. If you don't, here's a little history. He's the past president of ODTUG (for like 37 years or something) and for the last two years, he's served as Conference Chair for Kscope. Yeah, that doesn't really follow, but you know I'm a bit...scattered.
Did you read the link above? OK, well, here's the skinny. Mike has rectal cancer. Stage III. If it weren't for the stupid cancer part, the jokes would abound. Oh wait, they do anyway. Mike was diagnosed shortly after #kscope13, right around his 50th birthday (Happy Birthday Mike, Love, Cancer!). Ugh. (I want to say, "are you shittin' me?" see what I mean about the jokes? I can't help myself, I'm 14). Needless to say, cancer isn't really a joke. We all know someone affected by it. It is...well, it's not fun.
Go read his post if you haven't already. I'm going to give my version of that story. I'll wait...
So, Sunday morning, Game 3 of the World Series went to the Cardinals in a very bizarre way. I was watching highlights that morning as I had missed the end of the game (doesn't everyone know that I'm old and can't stay up that late to watch baseball?). Highlights. Mike lives in St. Louis. He's a Cardinals fan. Wouldn't it be cool if he and his family could go to the game (mostly just his family, I don't like Mike that much)? So I make some phone calls to see what people think of my idea. My idea is met with resistance. OK, I'll skip the people. Let's call Lisa (Mike's wife).
Apparently Sundays are technology-free days in the Riley household, no response. I go for a bike ride, but I take my phone, just in case Lisa calls me back. After the halfway point, my phone rings, I jump off the bike to answer.
So I talked to Lisa about my idea, can Mike handle the chaos of a World Series game?
We hang up and she goes to work. BTW, I asked her to keep my name out of it, but she didn't. We'll have words about that in the future.
She calls back (I think, it may have been over text, 2 weeks is an eternity to me). "He doesn't think he can do it."
So I call Mike directly (Lisa had already spoiled the surprise.)
"What about Box seats? You know, where the people with top hats and monocles sit? Away from the rift-raft, much more comfortable and free food and beer."
Backstory. Mike had finished his first round of chemo less than a week before Sunday. To make things worse, he decided it was a good time to throw out his back. He wasn't in the best of shape.
Mike said he thought he could do it.
OK, nay-sayers aside, let's see what we can do. I emailed approximately 50 people, mostly ODTUG people; board members, content leads, anyone I had in my address book. "Hey, wouldn't it be great to send Mike and his family to Game 5 of the World Series? We need to do this quick, tickets will probably double in price tonight especially if the Cardinals win." (that would mean Game 5 would be a clincher for the Cardinals, at home, muy expensive).
Within about 20 minutes, a couple of people pledged $600.
At the prices I had seen, I was hoping to get between $50 and $100 from 50 people, hoping. I had $600 already. Game starts. Now it's up to $1100 in pledges. Holy shit, Part II. This might just be possible. Another 30 minutes and we're about an hour into Game 4. Ticket prices have already gone up by $250 a ticket. Given that maybe 4 people have responded and I have $1600 in pledges, I pull the trigger. I bought 4 box seat tickets for the Riley family. (I had to have a couple of beers because I was about to drop a significant chunk of change without actual cash in hand, I could be out a lot of money, liquid courage is awesome).
Tickets sent to the Riley family. Pretty good feeling.
Like I said, I was confident, but I was scared. Before the end of the night though, there was over $5K pledged to get Mike and family to Game 5. Holy shit, Part III.
By midday Monday, pledges were well over $7K. I'll refer you back to Mike's post for more details. Shorter: jerseys for the family and a limo to the game.
Here's the breakdown: 35 people pledged, and paid, $8,080. Holy shit, Part IV. Average donation was $230.86. Median was $200. Low was $30 and high was $1000. Six people gave $500 or more. Nineteen people gave $200 or more. The list is a veritable Who's Who in the Oracle community.
Tickets + Jerseys + Limo = $6027.76
Riley family memory = Priceless.
So, what happened to the rest? Well, they have bills. Lots of bills. With the remainder, $2052.24, we paid off some hospital bills of $1220.63. There is currently $831.61 that will be sent shortly. It doesn't stop there though. Cancer treatment is effing expensive. Mike has surgery in December. He'll be on bed rest for some time. His bed is 17 years old. He needs a new one. After that, more chemo and more bills.
"Hey Chet, I'd love to help the Riley family out, can I give you my money for them?"
Yes, absolutely. Help me help them. I started a GoFundMe campaign. Goal is $10K. Any and all donations are welcome. Gifts include a thank you card from the Riley family and the knowledge that you helped out a fellow Oracle (nerd, definitely a nerd) in need. You can find the campaign here.
If you can't donate money, I've also created a hashtag so that we can show support for Mike and his family. It's #fmcuta (I'll let you figure out what it means). Words of encouragement are welcome and appreciated.
Thank you to the 35 who have already so generously given. Thank you to the rest of you who will donate or send out (rude) tweets.
A few months ago I posted a series of articles about connecting OBIEE 11.1.1.7, Exalytics and ODI to Apache Hadoop through Hive, an SQL-interface layer for Hadoop. Hadoop/Hive connectivity is a cool new feature in OBIEE 11g but suffers from the problem common to Hive – it’s actually quite slow, particularly when you’re used to split-second response times from your BI tool. The reason it’s slow is because Hive, in the background, generates MapReduce jobs which in turn query Hadoop, batch processing-style, for each of your queries; each one of these MapReduce jobs requires a Java program to be written, and then submitted to the Hadoop job control framework and run within a Java JVM, which adds latency to your query. It’s not uncommon for a Hive query to take a minute or so to retrieve even a small set of “rows” from the Hadoop cluster, which isn’t really what it’s for – Hive and Hadoop are typically used with very large datasets spread over multiple servers – but you can’t help thinking there must be a better way to query Hadoop data.
And there is a better way – several organisations and companies have come up with improvements to Hive, the idea being to take the central idea of SQL-access-over-Hadoop, but remove the dependency on writing MapReduce jobs on the fly, and instead create separate, typically in-memory, server processes that provide similar functionality but with much improved response times. Probably the most well-known product like this is Cloudera’s “Impala”, an open-source but commercially-sponsored Hive replacement that’s available as part of Cloudera’s “Quickstart” demo VM, downloadable for free from the Cloudera website. The architecture image below is from the Cloudera Impala datasheet, which nicely explains the differences between Hive and Impala, and where it fits in as part of the overall Hadoop framework.
Now whilst Impala isn’t officially supported by Oracle as a data source, Hive is, so I thought it’d be interesting to see if we could swap-out Hive for Impala and connect more efficiently to a Hadoop datasource. I managed to get it working, with a couple of workarounds, so I thought I’d share it here – note that in a real-world installation, where the server is on Linux/Unix, or where your server isn’t on the same machine as your BI Administration client, it’s a bit more involved as you’ll need both server-side and client-side ODBC driver install and configuration.
Keeping it simple for now though, to get this working you’ll need:
- OBIEE 11.1.1.7, for either Windows or Linux – in my case, I’ve used Windows. Oracle’s recommendation is you use Linux for Hadoop access, but Windows seems to work OK.
- The Cloudera Quickstart CDH4 VM – you’ll need to add some data to Impala, I’ll leave this to you – if this article makes sense to you, I’m sure you can add some sample data ;-)
- The Cloudera ODBC Driver for Impala – the Windows 64-bit ones are here, and the rest of the drivers are on this page.
Once you’ve downloaded the Quickstart VM and got Impala up and running, and set up OBIEE 11.1.1.7 on a separate server, start by installing the ODBC drivers so you’re ready to configure them. In my case, my Impala tables were held in the standard “default” schema, and my Quickstart VM was running on the hostname cdh4.rittmandev.com, so my ODBC configuration settings looked like this:
“cloudera” is the default username on the Quickstart VM, with a password also of “cloudera”, so when I press the Test… button and put in the password, I see the success message:
So far so good. So now over to the BI Administration tool, where the process to import the Impala table metadata is the same as with Hive. First, select the new Impala DSN from the list of ODBC connections, then bring in the Impala tables that you want to include in the RPD – in this case, two tables called “product” and “product_sales” that I added myself to Impala.
Next, double-click on the new physical database entry that the import just created, and set the Database type from ODBC Basic to Apache Hadoop, like this:
When you’re prompted to change the connection pool settings as well – ignore this and press No, and leave them as they are.
Then, create your business model and presentation layer subject area as you would do normally – in my case, I add a primary key to the products table, join it in the physical layer to the product_sales fact table, and then create corresponding BMM and Presentation Layer models so that it’s then ready to report on.
Running a quick test on the datasource, displaying some sample rows from the Impala tables, indicates it’s working OK.
So over to the dashboard. I run a simple query that sums up sales by product, and … it doesn’t work.
If you take a look at the logical and physical SQL that the BI Server is generating for the query, it all looks OK …
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-0]  [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] ############################################## [[
-------------------- SQL Request, logical request hash:
SET VARIABLE QUERY_SRC_CD='Report';SELECT
"Impala Sample"."products"."prod_desc" s_1,
"Impala Sample"."products"."prod_id" s_2,
"Impala Sample"."product_sales"."amt_sold" s_3
FROM "Impala Sample"
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 65001 ROWS ONLY
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-23]  [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- General Query Info: [[
Repository: Star, Subject Area: Impala Sample, Presentation: Impala Sample
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-18]  [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- Sending query to database named ClouderaCDH4 (id: <<10894>>), connection pool named Connection Pool, logical request hash 2f2b87c8, physical request hash ee7aff05: [[
select 0 as c1,
     D1.c2 as c2,
     D1.c3 as c3,
     D1.c1 as c4
from
     (select sum(T43766.amt_sold) as c1,
          T43769.prod_desc as c2,
          T43769.prod_id as c3
     from
          products T43769 inner join
          product_sales T43766 On (T43766.prod_id = T43769.prod_id)
     group by T43769.prod_id, T43769.prod_desc
     order by c3, c2
     ) D1
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-34]  [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- Query Status: [nQSError: 16015] SQL statement execution failed. [[
[nQSError: 16001] ODBC error state: S1000 code: 110 message: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : NotImplementedException: ORDER BY without LIMIT currently not supported.
[nQSError: 43119] Query Failed:
But the problem is that Impala doesn’t like ORDER BY clauses without a corresponding LIMIT clause – something the product insists on, presumably because of the potential size of Impala/Hadoop datasets and the need to contain the rows returned in memory. If you’re running Impala queries directly within the Impala shell, you can issue the command set DEFAULT_ORDER_BY_LIMIT = x; to provide a default LIMIT value when one isn’t specified, but I can’t as yet see how to provide that setting over an ODBC connection, so the workaround I used was to uncheck the ORDERBY_SUPPORTED database setting in the physical database properties dialog, so that the BI Server leaves off the ORDER BY clause entirely and does the result ordering itself after Impala returns the (now un-ordered) result set.
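For comparison, here’s roughly what the same aggregate looks like when issued directly in impala-shell – either with an explicit LIMIT alongside the ORDER BY, or after setting the session-level default limit mentioned above (the LIMIT value here is just an example):
-- option 1: supply an explicit LIMIT with the ORDER BY
select   p.prod_id, p.prod_desc, sum(s.amt_sold) as amt_sold
from     product_sales s inner join products p on (s.prod_id = p.prod_id)
group by p.prod_id, p.prod_desc
order by p.prod_id
limit    65001;
-- option 2: set a session-level default LIMIT, after which ORDER BY works without one
set DEFAULT_ORDER_BY_LIMIT = 65001;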
Saving the RPD again now, and refreshing the results, brings back the query as expected – and in under a second or so, rather than the 30, 60 seconds etc that Hive would have taken.
So – there you go. As I said, Impala’s not officially supported which means it may work, but Oracle haven’t tested it properly and you won’t be able to raise SRs etc – but it’s an interesting alternative to Hive if you’re serious about connecting OBIEE 11g to your Hadoop datasources.
Oracle Endeca Information Discovery (OEID) 3.1 was released yesterday, 7th of November. The new version is a good step forward in satisfying business users, providing much better self-service discovery capabilities. OEID 3.1 now enables non-technical users to securely run agile business intelligence analysis on a variety of data sources, in a much easier way and without needing IT. At the same time, the integration with Oracle Business Intelligence is now even tighter, to the extent that Oracle has announced OEID 3.1 as “the only complete data discovery platform for the enterprise”. The product data sheet is accessible here and, as with older versions of OEID, the software is downloadable via the Oracle e-Delivery website (the Oracle Software Delivery Cloud), where you can find downloads for all licensed Oracle products.
In the first release of the Provisioning Service, in OEID 3.0 back in March 2013, users could upload only one file and it had to be in Excel format. The issue was proving that having data on Endeca Server was a better solution compared to Excel itself: not being able to join data sources together, and having no text-enrichment analysis on the data, was enough to keep the Provisioning Service very limited. The good news is that the new version has answers to all the requirements of a business intelligence provisioning service tool.
To get us started quickly, here is a list of new features I came across at first glance:
- Users can now join information from multiple sources such as files, databases or other pre-built Endeca Servers. Files can be in Excel format for structured data, or JSON for semi-structured data mostly coming from social media interactions, which makes it much easier to combine social media with the other data available in the enterprise.
- If you have geocode data in your source, it should be a matter of a click to add a Map component to the dashboard, where it will automatically find the information it requires from the available data sources. The new Map component is not only more powerful in what it does, such as being able to display data as a Heat Layer, but also looks much better!
- Business users can perform enrichment on their own unstructured text data to identify hidden entities, sentiment and so on, without support from IT. The new application settings include the capability to add text enrichment and text taggers to data sources.
- Managing data sets – reloading resources, adding records, managing attributes or deleting data sets – is quite straightforward within the application settings.
Look out for my next post, where I’ll go into more detail on the exciting new features of OEID 3.1.
Over the past few weeks myself and some of my colleagues have been posting articles on the blog about monitoring OBIEE using Enterprise Manager 12cR3’s BI Management Pack. In the various articles we’ve looked at managing individual OBIEE installations and monitoring various aspects of the product’s performance, using features like metrics, events and thresholds, integration with usage tracking, service beacons, and metric extensions.
But in each case we’ve looked at an OBIEE installation in isolation, and always from the perspective of the “system” – which makes sense if where you’re coming from is OBIEE’s Fusion Middleware Control, and you’re looking for a better way of working with OBIEE’s built-in instrumentation. But a typical BI system consists of more than just OBIEE – the database providing data for OBIEE is going to play a major part in the performance of your system, and in most cases there’ll be an ETL server loading data into it, such as Oracle Data Integrator or Informatica PowerCenter. In some cases Essbase may be providing subsets of the data or acting as an aggregation layer, and of course all of these infrastructure components run on host servers, either physical or virtualised. Wouldn’t it make more sense to look at this platform as a whole, measuring performance across it and considering all aspects of it when determining if it’s “available”?
Moreover, whilst it makes sense for you to consider just the indicators and metrics coming out of OBIEE when judging the performance of your system, for your end-users, they don’t think in terms of disk throughput or cache hits when considering system performance – what they talk about when they call you with a problem is the time it takes to log in; or the time it takes to bring up their dashboard page; or, indeed, whether they can log into the system at all. In fact, it’s not unknown for users to ring up and say the system is performing terribly, when in fact all the indicators on your DBA dashboard are showing green, and as far as you’re concerned, all is fine. So how can you align your view of the status of your system with what your users are experiencing, and indeed consider all of the BI platform when making this call? In fact, there are two features in Enterprise Manager and the BI Management Pack that make this possible – “systems” and “services” – and whilst they’re not all that well-known, they can make a massive impact on how holistically you view your system, when you put them in-place. Let’s take a look at what’s involved, based on something similar I put in place for a customer this week.
As I mentioned before, most people’s use of Enterprise Manager involves looking at an individual infrastructure component – for example, OBIEE – and setting up one or more metric thresholds and alerts to help monitor its performance.
But in reality OBIEE is just part of the overall BI platform that you need to monitor, if you’re going to understand end-to-end performance of your system. In Enterprise Manager terms, this is your “system”, and you can define a specific object called a “system” within your EM metadata, which aggregates all of these components together, giving you your “IT” view of your BI platform.
In the screenshot below, I’ve got an EM12cR3 instance set-up, and in the bottom right-hand corner you can see a list of systems managed by EM, including an Exalytics system, a BI Apps system, one running EPM Suite and another running an Oracle database. In fact, the OBIEE system relies on the Database system for its source data, but you wouldn’t be able to tell that from the default way they’re listed, as they’re all shown as independent, separate from each other.
What I can do though is aggregate these two installs together as a “system”, along with any other components – the Essbase server in the EPM stack, for example – that play a part in the overall platform. To create this system, I select Targets > Systems from the top-most menu, and then press Add > Generic System when the Systems overview page is displayed.
Note the other types of systems available – all of them except for Generic System add particular capabilities for that particular type of setup, but Generic System is just a container into which we can add any random infrastructure components, so we’ll use this to create the system to bring together our BI components.
Once the page comes up to create the new system, when you add components to it, notice how each part of each constituent “product” is available to include at different levels of granularity. For example, you can add OBIEE to the system “container” either at the whole BI Instance level – all the BI servers, BI Presentation servers and so on for a full deployment – or you can add individual system components, Essbase servers, DAC servers if that’s more relevant. In my case, I’ve got a couple of options – as it’s actually an Exalytics server, I could add that as a top-level component (complete with TimesTen, Essbase and so on), or I could just add the BI Instance, which is what I’ll do in this case by selecting that target type and then choosing the BI Instance from the list that’ll then be displayed.
In total I add in four targets – the OBIEE and database instances, and the hosts they run on. Later on, once I register my ODI servers using the new DI Management Pack, I can bring those in as well.
The next step is to define the associations, or dependencies, in the system. The wizard automatically adds the association between the BI instance and its host, and the database instance and its host, but I can then manually add in the dependency that the BI instance has on the database, so that later on, I can say that BI being down is directly related to its database being down (something called “root cause analysis”, in EM terminology).
On the next page of the wizard, I can specify which parts of the system have to be up in order for the whole system to be considered “available”. In this case, all parts, or “targets”, need to be running for the system to be OK, but if I had an ETL element, for example, then that could possibly be down but the overall system still be “available” for use, albeit in degraded form.
Next I can select a set of charts that will be displayed along with the system overview details, from the charts and metrics available for each constituent product. By default a set of database and host charts are pre-selected, but I can add in ones specific to OBIEE – for example, total number of active sessions – from the BI Instance list.
Once that final step is completed, the system is then created and I can see the overall status of it, along with any incidents, warnings, alerts and so on, across the platform.
If I had multiple systems to manage here, I’d see them all listed in the same place, with their overall status, and a high-level view of their alert status. Drilling into this particular system, I can then see a “single pane of glass” overview of the whole system, including the status of the constituent components.
So far, so good. But this is only part of the story. Whilst this is great for the IT department, the terms it uses – “systems”, “metrics”, “system tests” and so forth – aren’t the terms that the end-users use. They’re thinking about OBIEE as a “service” – a service providing dashboards, reports, a dashboard login and so on – and so EM has another concept, called a “service”, that builds on the system we’ve just put together, but adds a layer of business-focus to the setup.
Adam Seed touched on the concept of a “service” in his post on service beacons the other week, but they’re much more than an enabler of browser-based tests. Creating a service along with our system gives us the ability to add an extra layer of end-user focus to our EM setup, so that when our users call up and say – I can’t log in, or – It takes ages to bring up my dashboard page – we’ve got a set of metrics and tests aligned with their experience, and we’re immediately aware of the issues they’re hitting.
To create a service, we first need a system on which it will be delivered. As we’ve now got this, let’s go back to the EM menu and select Targets > Services, and then select Create > Generic Service. On the next page, I name the service – for example, “Production Dashboards” – and then select the system I just created as the one that provides it.
Now the key thing about a service, is how we test for its “availability”. With EM’s services, availability can either be determined by the status of the underlying system, or more usefully, we can define one or more “service” tests that checks things from a more end-user perspective. We’ll select “Service Test” in this instance, and then move onto the next page of the wizard.
Now there are lots of service test types you can use, and Adam Seed’s post went through the most useful of them, one that records a set of browser actions and replays them to a schedule, simulating a user logging in, navigating around the OBIEE website and then logging out. Unfortunately, this requires Internet Explorer to record the browser session, so I’ll cheat and just set up a host ping, which isn’t really something you’d want in real life but it gets me onto the next stage (Robin Moffatt also covered using JMeter to do a similar thing, in his post the other day on the blog).
Next, I say where this test will run from. Again, for simplicity’s sake I just select the main EM server, but in reality you’d want to run this test from where the users are located, by setting up what’s called a “service beacon”, a feature within the EM management agent that can run tests like these geographically close to where the end-users are. That way, you can measure the service they’re actually receiving from their office (potentially, in a different country to where OBIEE is installed), giving you a more realistic measurement of response time.
I then go on to say what response times are considered OK, warning and critical, and then I can also associate system-level metrics with this service as well. In this case I add in the average query response time, so that service availability in the end will be determined by the contactability of the OBIEE server (a substitute in this case for a full browser login simulation), and response time being within a certain threshold.
I then save the service definition, and then go and view it within EM. In the screenshot below, I’ve left EM overnight so that the various performance metrics and the service test can run for a while, and you can see that as of now, everything seems to be running OK.
Clicking on the Test Performance tab shows me the output of each of my service tests (in this case, just the host ping), whilst the Charts page shows me the output of the system performance metrics that I selected when creating the service. Clicking on Topology, moreover, shows me a graphical view of the service and its underlying system, so I can understand and visualise the relationships between the various components within it.
Another important part of services’ end-user-level focus is the ability to create service-level agreements. These are more formal versions of metric thresholds, this time based on service tests rather than system tests, and allow you to define service level indicators based on the tests you’ve created before, and then measure performance against agreed tolerances over a period of time. If you’ve got an SLA agreed with your customer that, for example, 95% of reports render within five seconds, or that the main dashboard is available 97% of the time during working hours, you can capture that SLA here and then automatically report against it over time. More importantly, if you’re starting to fall outside of your SLA, you can use EM to raise events and incidents in the meantime so you’re aware of the issue, and you can work to rectify it before it becomes an issue in your monthly customer meeting.
Finally – and this is something I find really neat – the system overview page for the system I created earlier now references the service that it supports, so I can see, at a glance, not only the status of the infrastructure components that I’m managing, but also the status of the end-user service that it’s supporting. Not bad, and a lot better than trying to manage all of these infrastructure components in isolation, and trying to work out myself what their performance means in terms of the end-users.
So there you have it – systems and services in EM and the BI Management Pack – a good example of what you get when you move from Fusion Middleware Control to the full version of Oracle’s Enterprise Systems Management platform.
This is the third article in my two-article set of posts (h/t) on extending the monitoring of OBIEE within EM12c. It comes after a brief interlude discussing Metric Extensions as an alternative to using Service Tests to look at Usage Tracking data.
Moving on from the rich source of monitoring data that is Usage Tracking, we will now cast our attention to a favourite tool of mine: JMeter. I’ve written in detail before about this tool when I showed how to use it to build performance tests for OBIEE. Now I’m going to illustrate how easy it can be to take existing OBIEE JMeter scripts and incorporate them into EM12c.
Whilst JMeter can be used to build big load tests, it can also be used as a single user. Whichever way you use it the basis remains the same. It fires a bunch of web requests (HTTP POSTs and GETs) at the target server and looks at the responses. It can measure the response time alone, or it can check the data returned matches what’s expected (and doesn’t match what it shouldn’t, such as error messages).
In the context of monitoring OBIEE we can create simple JMeter scripts which do simple actions such as
- Login to OBIEE, check for errors
- Run a dashboard, check for errors
If we choose an execution frequency (“Collection Schedule” in EM12c parlance) that is not too intensive (otherwise we risk impacting the performance/availability of OBIEE!) we can easily use the execution profile of this script as indicative of both the kind of performance that the end user is going to see, as well as a pass/fail of whether user logins and dashboard refreshes in OBIEE are working.
EM12c offers the ability to run “Custom Scripts” as data collection methods in Service Tests (which I explain in my previous post), and JMeter can be invoked “Headless” (that is, without a GUI) so lends itself well to this. In addition, we are going to look at EM12c’s Beacon functionality that enables us to test our JMeter users from multiple locations. In an OBIEE deployment in which users may be geographically separated from the servers themselves this is particularly useful to check that the response times seen from one site are consistent with those from another.
Note that what we’re building here is an alternative version to the Web Transaction Test Type that Adam Seed wrote about here, but with pretty much the same net effect – a Service Test that enables you to say whether OBIEE is up or down from an end user point of view, and what the response time is. The difference between what Adam wrote about and what I describe here is the way in which the user is simulated:
- Web Transaction (or the similar ATS Transaction) Test Types are built in to EM12c and as such can be seen as the native, supported option. However, you need to record and refine the transaction that is used, which has its own overhead.
- If you already have JMeter skills at your site, and quite possibly existing JMeter OBIEE scripts, it is very easy to make use of them within EM12c to achieve the same as the aforementioned Web Transaction but utilising a single user replay technology (i.e. JMeter rather than EM12c’s Web Transaction).
So, if you are looking for a vanilla EM12c implementation, Web/ATS transactions are probably more suitable. However, if you already use JMeter then it’s certainly worth considering making use of it within EM12c too.
The JMeter test script
- Go to dashboard
The important bit to check is the Thread Group – it needs to run a single user just once. If you leave in settings from an actual load test and start running hundreds of users in this script called from EM12c on a regular basis then the effect on your OBIEE performance will be interesting to say the least
Test the script and make sure you see a single user running and successfully returning a dashboard
Running JMeter from the command line
Before we get anywhere near EM12c, let us check that the JMeter script runs successfully from the command line. This also gives us the opportunity to refine the command-line syntax without confounding any issues with its use in EM12c.
The basic syntax for calling JMeter is:
./jmeter --nongui -t /home/oracle/obi_jmeter.jmx
--nongui being the flag that tells JMeter not to run the GUI (i.e. run headless), and -t passing the absolute path to the JMX JMeter script. JMeter runs under Java, so you may also need to set the PATH environment variable so that the correct JVM is used.
To run this from EM12c we need a short little script that is going to call JMeter, and will also set a return code depending on whether an error was encountered when the user script was run (for example, an assertion failed because the login page or dashboard did not load correctly). A simple way to do this is to set the View Results in Table listener to write to file only if an error is encountered, and then parse this file post-execution to check for any error entries.
We can then do a simple grep against the file and check for errors. In this script I’m setting the PATH, and using a temporary file /tmp/jmeter.err to capture and check for any errors. I also send any JMeter console output to /dev/null:
# set the PATH so the correct JVM is picked up
export PATH=/u01/OracleHomes/Middleware/jdk16/jdk/bin:$PATH
# remove any error file left over from a previous run
rm /tmp/jmeter.err
# run the JMeter test headless, discarding console output
/home/oracle/apache-jmeter-2.10/bin/jmeter --nongui -t /home/oracle/obi_jmeter.jmx 1>/dev/null 2>&1
# check the error file for failed assertions, and set the return code accordingly
grep --silent "<failure>true" /tmp/jmeter.err
if [ $? -eq 0 ]; then
    exit 1
else
    exit 0
fi
Note that I am using absolute paths throughout, so that there is no ambiguity or dependency on the folder from which this is executed.
Test the above script that you’ll be running from EM12c, and check the return code that is set:
$ ./run_jmeter.sh ; echo $?
The return code should be 0 if everything’s worked (check in Usage Tracking for a corresponding entry) and 1 if there was a failure (check in nqquery.log to confirm that there was a failure).
To start with we’ll be looking at getting EM12c to run this script locally. Afterwards we’ll see how it can be run on multiple servers, possibly geographically separated.
So that the script can be run on EM12c, copy across your run_jmeter.sh script, JMeter user test script, and the JMeter binary folder. Check that the script still runs after copying it across.
So now we’ve got a JMeter test script, and a little shell script harness with which to call it. We hook it into EM12c using a Service Test.
From Targets -> Services, create a new Generic Service (or, if you already have one in which it makes sense to include this, use that).
Give the service a name and associate it with the appropriate System
Set the Service’s availability as being based on a Service Test. On the Service Test screen set the Test Type to Custom Script. Give the Service Test a sensible Name and then the full path to the script that you built above. At the moment, we’re assuming it’s all local to the EM12c server. Put in the OS credentials too, and click Next
On the Beacons page, click Add and select the default EM Management Beacon. Click Next and you should be on the Performance Metrics screen. The default metric of Total Time is what we want here. The other metric we are interested in is availability, and this is defined by the Status metric which gets its value from the return code that is set by our script (anything other than zero is a failure).
Click Next through the Usage Metrics screen and then Finish on the Review screen
From the Services home page, you should see your service listed. Click on its name and then Monitoring Configuration -> Service Tests and Beacons. Locate your Service Test and click on Verify Service Test
Click on Perform Test and if all has gone well you should see the Status as a green arrow and a Total Time recorded. As data is recorded it will be shown on the front page of the service you have defined:
One thing to bear in mind with this test that we’ve built is that we’re measuring the total time it takes to invoke JMeter, run the user login, run the dashboard and log out – so this is not going to be directly comparable with what a user may see in timing the execution of a dashboard alone. However, as a relative measure of performance against itself, it is still useful.
Measuring response times from additional locations
One of the very cool things that EM12c can do is run tests such as the one we’ve defined, but from multiple locations. It’s one thing checking the response time of OBIEE from a machine local to the EM12c server in London, but how realistically will this reflect what users based in the New York office see? We do this through the concept of Beacons, which are bound to existing EM12c Agents and can be set as execution points for Service Tests.
To create a Beacon, go to the Services page and click on Services Features and then Beacons:
You will see the default EM Management Beacon listed. Click on Create…, and give the Beacon a name (e.g. New York) and select the Agent with which it is associated. Hopefully it is self-evident that a Beacon called New York needs to be associated with an Agent that is physically located in New York and not Norwich…
After clicking Create you should get a confirmation message and then see your new Beacon listed:
Before we can configure the Service Test to use the Beacon we need to make sure that the JMeter test rig that we put in place on the EM12c server above is available on the server on which the new Beacon’s agent runs, with the same paths. As before, run it locally on the server of the new Beacon first to make sure the script is doing what it should.
To get the Service Test to run on the new Beacon, go back to the Services page and as before go to Monitoring Configuration -> Service Tests and Beacons. Under the Beacons heading, click on Add.
Select both Beacons in the list and click Select
When returned to the Service Tests and Beacons page you should now see both beacons listed. To check that the new one is working, use the Verify Service Test button and set the Beacon to New York and click on Perform Test.
To see the performance of the multiple beacons, use the Test Performance page:
As stated at the beginning of this article, the use of JMeter in this way and within EM12c is not necessarily the “purest” design choice. However, if you have already invested time in JMeter then this is a quick way to make use of those scripts and get up and running with some kind of visibility within EM12c of the response times that your users are seeing.
Someone mentioned to me last night that this wouldn't happen again for 990 years ("this" being a date like 11/12/13, where the month, day and year run in sequence). I was thinking, "wow, I'm super special now (along with the other 1/365 * 6 billion people)!" Or am I? I had to do the math. Since date math is hard, and math is hard, and I'm good at neither, SQL to the rescue.
select
  to_number( to_char( sysdate + ( rownum - 1 ), 'mm' ) ) month_of,
  to_number( to_char( sysdate + ( rownum - 1 ), 'dd' ) ) day_of,
  to_number( to_char( sysdate + ( rownum - 1 ), 'yy' ) ) year_of,
  sysdate + ( rownum - 1 ) actual
from dual
connect by level <= 100000
(In case you were wondering, 100,000 days is just shy of 274 years. 273.972602739726027397260273972602739726 to be more precise.)
That query gives me this:
MONTH_OF DAY_OF YEAR_OF ACTUAL
-------- ------ ------- ----------
      11     06      13 2013/11/06
      11     07      13 2013/11/07
      11     08      13 2013/11/08
      11     09      13 2013/11/09
      11     10      13 2013/11/10
      11     11      13 2013/11/11
So how can I figure out where DAY_OF is equal to MONTH_OF + 1 and YEAR_OF is equal to DAY_OF + 1? In my head, I thought it would be far more complicated, but it's not.
select *
from
(
  select
    to_number( to_char( sysdate + ( rownum - 1 ), 'mm' ) ) month_of,
    to_number( to_char( sysdate + ( rownum - 1 ), 'dd' ) ) day_of,
    to_number( to_char( sysdate + ( rownum - 1 ), 'yy' ) ) year_of,
    sysdate + ( rownum - 1 ) actual
  from dual
  connect by level <= 100000
)
where month_of + 1 = day_of
  and day_of + 1 = year_of
order by actual asc
Which gives me:
MONTH_OF DAY_OF YEAR_OF ACTUAL
-------- ------ ------- ----------
      11     12      13 2013/11/12
      12     13      14 2014/12/13
      01     02      03 2103/01/02
      02     03      04 2104/02/03
      03     04      05 2105/03/04
      04     05      06 2106/04/05
      05     06      07 2107/05/06
OK, so it looks closer to 100 years, not 990. Let's subtract. LAG to the rescue.
select
  actual,
  lag( actual, 1 ) over ( partition by 1 order by actual ) previous_actual,
  actual - ( lag( actual, 1 ) over ( partition by 1 order by actual ) ) time_between
from
(
  select
    to_number( to_char( sysdate + ( rownum - 1 ), 'mm' ) ) month_of,
    to_number( to_char( sysdate + ( rownum - 1 ), 'dd' ) ) day_of,
    to_number( to_char( sysdate + ( rownum - 1 ), 'yy' ) ) year_of,
    sysdate + ( rownum - 1 ) actual
  from dual
  connect by level <= 100000
)
where month_of + 1 = day_of
  and day_of + 1 = year_of
order by actual asc
Which gives me:
ACTUAL     PREVIOUS_ACTUAL TIME_BETWEEN
---------- --------------- ------------
So, it looks like every 88 years it occurs and is followed by 11 consecutive years of matching numbers. The next time 11/12/13 and 12/13/14 will appear is in 2113 and 2114. Yay for SQL!
In my previous post I demonstrated how OBIEE’s Usage Tracking data could be monitored by EM12c through a Service Test. It was pointed out to me that an alternative for collecting the same data would be the use of EM12c’s Metric Extensions.
A Metric Extension is a metric definition associated with a type of target, which can optionally be deployed to any agent that collects data from that type of target. The point is that, unlike the Service Test we defined, a Metric Extension is define-once-use-many, and is more “lightweight” as it doesn’t require the definition of a Service. The value of the metric can be obtained from sources including shell scripts, JMX, and SQL queries.
The first step in using a Metric Extension is to create it. Once it has been created, it can be deployed and utilised.
Creating a Metric Extension
Let us see now how to create a Metric Extension. First, access the screen under Enterprise -> Monitoring -> Metric Extensions.
To create a new Metric Extension click on Create…. From the Target Type list choose Database Instance. We need to use this target type because it enables us to use the SQL Adapter to retrieve the metric data. Give the metric a name, and choose the SQL Adapter.
Leave the other options as default, and click on Next.
In a Metric Extension, the values of the columns (one or more) of data returned are mapped to individual metrics. In this simple example I am going to return a count of the number of failed analyses in the last 15 minutes (which matches the collection interval).
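The exact SQL isn’t reproduced here, but as an illustration, a minimal sketch of the sort of query the SQL Adapter could run is shown below. It assumes the standard Usage Tracking columns SUCCESS_FLG (non-zero meaning the analysis failed) and START_TS; adjust it to whatever your S_NQ_ACCT table actually holds.
-- Sketch: count of failed analyses in the last 15 minutes (column names assumed)
SELECT COUNT(*) AS failed_analyses
FROM   s_nq_acct
WHERE  success_flg <> 0                                -- non-zero flag = failed analysis
AND    start_ts   >= SYSDATE - ( 15 / ( 24 * 60 ) )    -- only the last 15 minutes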
On the next page you define the metric columns, matching those specified in the adaptor. Here, we just have a single column defined:
Click Next and you will be prompted to define the Database Credentials, which for now leave set to the default.
Now, importantly, you can test the metric adaptor to make sure that it is going to work. Click on Add to create a Test Target. Select the Database Instance target on which your RCU resides. Click Run Test
What you’ll almost certainly see now is an error:
Failed to get test Metric Extension metric result.: ORA-00942: table or view does not exist
The reason? The SQL is being executed by the “Default Monitoring Credential” on the Database Instance, which is usually DBSNMP. In our SQL we didn’t specify the owner of the Usage Tracking table S_NQ_ACCT, and nor is DBSNMP going to have permission on the table. We could create a new set of monitoring credentials that connect as the RCU table owner, or we could enable DBSNMP to access the table. Depending on your organisation’s policies and the scale of your EM12c deployment, you may choose one over the other (manageability vs simplicity). For the sake of ease I am going to take the shortest (not best) option, running as SYS the following on my RCU database to create a synonym in the DBSNMP schema and give DBSNMP access to the table.
GRANT SELECT ON DEV_BIPLATFORM.S_NQ_ACCT TO DBSNMP;
CREATE SYNONYM DBSNMP.S_NQ_ACCT FOR DEV_BIPLATFORM.S_NQ_ACCT;
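As a quick sanity check (a hypothetical step, not part of the original walkthrough), you could connect to the RCU database as DBSNMP and confirm the synonym now resolves before going back to EM12c:
-- Connected as DBSNMP: should return a count rather than ORA-00942
SELECT COUNT(*) FROM s_nq_acct;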
Now retest the Metric Extension and all should be good:
Click Next and review the new Metric Extension
When you click on Finish you return to the main Metric Extension page, where your new Metric Extension will be listed.
A note about performance
When building Metric Extensions bear in mind the impact that your data extraction is going to have on the target. If you are running a beast of a SQL query that is horrendously inefficient on a collection schedule of every minute, you can expect to cause problems. The metrics that are shipped with EM12c by default have been designed by Oracle to be as lightweight in collection as possible, so in adding your own Metric Extensions you are responsible for testing and ensuring yours are too.
Deploying a Metric Extension for testing
Once you have built a Metric Extension as shown above, it will be listed in the Metric Extension page of EM12c. Select the Metric Extension and from the Actions menu select Save As Deployable Draft.
You will notice that the Status is now Deployable and on the Actions menu the Edit option has been greyed out. Now, click on the Actions menu again and choose Deploy To Targets…, and specify your RCU Database Instance as the target
Return to the main Metric Extension page and click refresh, and you should see that the Deployed Targets number is now showing 1. You can click on this to confirm to which target(s) the Metric Extension is deployed.
Viewing Metric Extension data
Metric Extensions are defined against target types, and we have created the example against the Database Instance target type in order to get the SQL Adaptor available to us. Having deployed it to the target, we can now go and look at the new data being collected. From the target itself, click on All Metrics and scroll down to the Metric Extension itself, which will be in amongst the predefined metrics for the target:
After deployment, thresholds for Metric Extension data can be set in the same way they are for existing metrics:
Thresholds can also be predefined as part of a Metric Extension so that they are already defined when it is deployed to a target.
Amending a Metric Extension
Once a Metric Extension has been deployed, it cannot be edited in its current state. You first create a new version using the Create Next Version… option, which creates a new version of the Metric Extension based on the previous one, and with a Status of Editable. Make the changes required, and then go through the same Save As Deployable Draft and Deploy to Target route as before, except you will want to Undeploy the original version.
Publishing a Metric Extension
The final stage of producing a Metric Extension is publishing it, which moves it on beyond the test/draft “Deployable” phase and marks it as ready for use in anger. Select Publish Metric Extension from the Actions menu to do this.
A published Metric Extension can be included in a Monitoring Template, and also supports the nice functionality of managed upgrades of deployed Metric Extension versions. In this example I have three versions of the Metric Extension: version 2 is Published and deployed to a target, while version 3 is new and has just been published:
Clicking on Deployed Targets brings up the Manage Target Deployments page, and from here I can select my target on which v2 is deployed, and click on Upgrade
After the confirmation message “Metric Extension ME$USAGE_TRACKING upgrade operation successfully submitted.”, return to the Metric Extension page and you should see that v3 is now deployed to a target and v2 is not.
Finally, you can export Metric Extensions from one EM12c deployment for import and use on another EM12c deployment:
So that wraps up this brief interlude in my planned two-part set of blogs about EM12c. Next I plan to return to the promised JMeter/EM12c integration … unless something else shiny catches my eye in between …
Being somewhat follically challenged, and to my wife's utter relief, the comb over is not something I have ever considered. The title is a tenuous reference to a formatting feature that Adobe offers in their PDF documents.
The comb provides the ability to equally space a string of characters on a pre-defined form layout so that it fits neatly in the area. See how the numbers above are spaced correctly. It's not a function of the font but a property of the form field.
For the first time in a long time, I had the chance to build a PDF template today to help out a colleague. I spotted the property and thought, hey, let's give it a whirl and see if Publisher supports it. Lo and behold, Publisher handles the comb spacing in its PDF outputs. Exciting, eh? OK, maybe not that exciting, but I was very pleasantly surprised to see it working.
I am reliably informed by Leslie, BIP Evangelist and Tech Writer, that this feature was introduced from version 10.1.3.4.2 onwards.
The official docs are here, and there's no mention of comb overs in them.