BI & Warehousing
This is a great overview of the Financial Statement Generator (FSG) engine from GL in EBS and how Publisher fits into the picture. Thanks to Helle Hellings on the Financials PM team.
In a few weeks’ time I’m presenting at the BIWA Summit 2014 on running OBIEE in the Cloud, with the idea being that I’ll go through the various options for deploying OBIEE in public clouds, including Amazon EC2 and Oracle’s own cloud offering. But the more I thought through the various architecture and product options, the more I began to think that ETL – getting data into the BI system, from wherever it is now – is the key enabling technology. The vast majority of BI systems use data that’s refreshed on a daily, weekly or even real-time basis, and on premise we tend to use tools such as Oracle Data Integrator, Oracle Warehouse Builder, Informatica PowerCenter and the like to move data from source to target, transforming and cleansing it en route.
When you deploy a BI platform such as OBIEE into the cloud, you’ve got a couple of options around where you hold its data, and how you do the ETL:
- You can keep the data warehouse database on-premise, along with the ETL tool, and just run the BI part in the cloud
- You can move the database into the cloud as well (using a service such as Amazon RDS), but keep the ETL tool on-premise
- You can try and move the ETL into the cloud too – either hosting a tool like ODI in its own cloud VM, or making use of ETL-as-a-service such as that provided by Informatica Cloud.
Things get more interesting when part of your IT infrastructure sits on-premise, and some sits in the cloud. It gets even more interesting when your data sources are also software-as-a-service (SaaS), for example Salesforce.com and Workday, where data access is via APIs rather than direct SQL*Net connections.
So where do we start with ETL-in-the-cloud? One place to start is Oracle’s own slide-deck on OBIEE in the cloud, where they set out what they see as the key differences between traditional on-premise software and software delivered via SaaS:
In this SaaS world, the expectation is that:
- The infrastructure, software etc. is already in place, is patched regularly for you in the background, and new features emerge regularly
- You’re on a shared platform, but you won’t be aware of other tenants on a day-to-day basis
- The “innards” and workings are hidden from you – you just consume a service
- It’s cloud-native – from connectivity to cloud apps through to common social sign-ons, commenting and collaboration
- Everything is thin-client, web-based
- Everything is self-service, self-provisioning, right from creation of your user account
- Sign-up is on-demand, paid by the hour/day etc, and with no long setup process
What this means for data integration (and ODI) then, in my opinion, is:
- It must be available as a service – log in, create data source connections, define mappings, run code
- The internals typically would be hidden – so no need to know about agents, GoldenGate vs. ODI, choices of knowledge modules etc
- Web-based data flow modellers as well as admin tools
- Connectors through to Salesforce.com and other cloud apps
- Connectors through to Hadoop, Hive, JSON compatibility etc
- Instant sign-up and charging by time/use
With bonus points for:
- Low-cost or free option to get you started
- A consistent story that runs from messaging and application integration through to large-scale data movement
- Easy clustering, scale-out, deployment on services such as Amazon EC2
- Options to deploy on-premise or cloud, and run integration tasks on-premise or cloud
Looking at it architecturally, ETL-in-the-cloud would sit between sources and targets, both on-premise and in the cloud, providing the vital data movement capability between source systems and the BI Platform itself – most probably to its own database-in-the-cloud, running the data warehouse.
So what are the options then, if you want to use cloud-based ETL to load a cloud-based data warehouse and BI system? To my mind, there are four main options:
- If you’re using a cloud-based database service such as Amazon’s Redshift, or Oracle’s own public cloud “schema-as-a-service” database, you can use the ETL tools provided with the service
- You can try and put ODI in the cloud, maybe using an Amazon EC2 instance running WebLogic and a JEE agent, with another instance providing the metadata repository database
- or as an alternative, do the same for one of the open-source ETL tools
- You can use one of the new “cloud-native” ETL-as-a-service products, such as Informatica Cloud or SnapLogic
- Or – you can try and abstract away ETL altogether – more on this later on.
The first option really applies if (a) you’re using a service such as Amazon Web Services’ EC2, (b) your data also most probably sits in AWS cloud storage, (c) you want to move data from your source application or export files into the main data warehouse database, and (d) you don’t really need to integrate data from different sources. Amazon AWS provides a number of options for loading data into EC2 and the various database and analytic services they provide, including Amazon Data Pipeline (shown in the screenshot below), the most “ETL-like” of their loading services, along with a sneakernet service, and the new Amazon Kinesis service for real-time streaming data.
Oracle’s Cloud Database service at the moment restricts you to a single schema, and more importantly there’s no SQL*Net access, so uploading data is either through SQL Developer (which has a special, custom Oracle Cloud connector), or through utilities provided via APEX.
Clearly this sort of web-based data loading isn’t designed for data warehouse scenarios, and in this initial iteration Oracle’s cloud database is probably designed to support cloud applications running on Oracle’s cloud Java service, with OBIEE-in-the-cloud mainly designed for reporting against this data, and personal “sandbox” scenarios. What’s not immediately obvious when you use these cloud ETL tools is that each one has its own admin console, its own API, its own scheduler, its own metadata – which is usually the point at which you decide you need an enterprise ETL tool.
So what about the idea of running Oracle Data Integrator in the cloud? There are a few options here that I can think of:
- Creating a cloud server instance that runs ODI, along with all of its metadata repositories, WebLogic instances and agents in a single location
- Run the repository database either on-premise or on another cloud server (or even Amazon’s RDS service), with agents running either in their own cloud servers, or on the cloud servers holding the target database
- Build ODI into the target platform, as Oracle have done with the BI Apps, along with management tools to hide the complexity and workings of ODI
The first option sounds the “neatest” in terms of a wholly-cloud deployment, as all the parts of ODI are held in one place, and you can think of it as an appliance or service. But ODI’s Extract-Load-Transform approach ends up complicating things in a cloud-only deployment; the target platform (for example one of the Postgres-compatible cloud databases) might not support all of the target-layer transformations you want, and the integration part can’t really sit in the ODI cloud instance, unless you run it hub-and-spoke style and either use a local database for transformations, or use the in-memory feature within the agents. In fact where ODI makes most sense is in a hybrid on-premise/cloud setup, where most of your data resides on-premise, as does your ETL process, with the cloud being gradually migrated to and in the meantime used alongside on-premise applications.
Oracle’s white paper on cloud data integration majors on this type of scenario, with Oracle GoldenGate also used to replicate data between the two environments. At this year’s Oracle OpenWorld the BI Apps team also announced cloud adapters for the BI Applications, initially used to extract data from Fusion Apps in the Cloud back to the on-premise BI Apps data warehouse, with other Oracle cloud applications data sources following on.
Where things get interesting with Oracle’s approach is when it’s non-Oracle cloud applications that we’re looking to integrate with. “Cloud Integration – A Comprehensive Solution”, another Oracle white paper, describes techniques to integrate with Salesforce.com, Workday and the Fusion Apps, but all of this takes place through Oracle SOA Suite and Oracle JDeveloper, products that are way too technical for the average ETL developer (or customer implementor).
In fact there are two obvious things missing here, that are present in the Cloud Database and Cloud OBIEE offerings that Oracle recently launched:
- It’s not a “service” – you can’t just create an account, design some mappings, load a target – the assumption is that each system is single tenant, perpetual license, self-install
- It’s not consumer-level in terms of simplicity – there’s bits of ODI, bits of GoldenGate, bits of SOA, bits of JDeveloper – all good bits, but not a single integration-platform-as-a-service
But it’s probably fine if you’re running a hybrid on-premise/cloud strategy, and as I’ll talk about later on, the ELT approach does still have some distinct advantages over cloud-native ETL tools.
What about the more recent, “cloud-native” ETL tools such as SnapLogic, and Informatica Cloud? Informatica Cloud is probably the easiest product to understand if, like me, you’re from an ODI background. What Informatica have done here in terms of architecture is move the metadata and application-layer parts of Informatica PowerCenter into the cloud, add a bunch of cloud application adapters (including a third-party marketplace for them), but still do the actual data integration on-premise, using a cut-down version of the main PowerCenter engine.
Some of the nicer features in this setup are its multi-tenant and “integration-as-a-service” nature, the way it deals with firewall issues (do the integration on-premise), and the interoperability with traditional Informatica PowerCenter, where you can publish custom maplets from on-premise Informatica and push them into the cloud version. If Oracle came out with an ODI-in-the-cloud service, I think it’d look a lot like this.
To my mind though, the most impressive of the cloud integration vendors is SnapLogic. Their SnapLogic Integration Cloud product looks like it was designed first and foremost to run “in the cloud”, it’s available as a service, and the data integration designer focuses on data paths and application integration rather than the low-level database-centric approach traditional ETL tools take.
What’s particularly impressive to me is the way that they’ve taken concepts used in tools like ODI – agents that perform the data integration tasks, running as standalone servers – and built on them to create the concept of “snaplexes”, collections of JVMs that can sit in-the-cloud, or on-premise, elastically scale up to handle larger workloads, and use Amazon’s EC2 and S3 compute and storage clouds to perform any data transformations en route. Data being transformed in SnapLogic streams through the system with low latency and using web protocols, and the whole thing is just neatly designed to run natively on the cloud.
Where tools like SnapLogic do fall short though is on the “last mile” into the data warehouse, where tools like ODI come with lots of templates, database integration features and so forth to handle slowly-changing dimensions, dimension lookups, analytic functions and the like, but cloud-native tools really focus on basic transformations and getting the data from source to target. Where this of course gets interesting is when SnapLogic is used to load databases such as Amazon Redshift, for example, in combination with Tableau, as some of the ETL tool features we’re used to using with tools such as ODI and Informatica just aren’t there yet. What this means in practice is that if you’re looking to move OBIEE into the cloud, along with an Oracle database, you’ll probably still want to use a tool like ODI to do your data load, as these tools are just so mature when it comes to loading relational data warehouses.
So finally, then, onto what I think will probably be the future of ETL – “invisible ETL”, or ETL so abstracted away that you don’t even think of it as a separate product in itself. If you think about it, ETL is a necessary evil when it comes to BI – customers just want their data loaded, they don’t want to worry about how it’s transported, which data loading tool you use and so on. Oracle have had a number of initiatives over the past few years to automate the creation of ETL code from logical table source mappings in the OBIEE RPD, and one of the major drivers in the BI Apps product roadmap is to reduce the amount of manual work propagating application data model changes through the various layers of ETL and BI tool metadata.
A really nice example of this approach of hiding the complexity of ETL can be found with a company called GoodData, who sell a BI-in-the-cloud product that comes with data storage (via Vertica) and ETL all wrapped up in a consumer-like service that’s focused on analytics and visualisations but supports data loading from a number of database and cloud application sources. The screenshot below from the GoodData CloudConnect LDM Modeler Guide shows GoodData’s logical data model development environment, with this part of the product handling the facts, dimensions, attributes and other user-facing data elements, and a more traditional ETL tool and server (based on CloverETL, no less) doing the actual data heavy-lifting.
GoodData splits its data models into these logical and physical elements, which of course is exactly what ODI does – and what OBIEE does too. In fact, the more you look at GoodData, the more you think that all of the elements are already in place at Oracle to create a very similar product, with the added benefit of access to tools such as GoldenGate, EDQ and SOA Suite. Even SnapLogic’s Snaplex concept is conceptually very similar to ODI’s agents, but what they’ve done, and what GoodData, and Informatica and others have done, is wrap the products up into a service, made it all consistent (at least on paper), architected it for cloud-only, and hybrid on-premise/cloud environments, and built-out all of the third-party adapters. It’ll be interesting to see what Oracle’s long-term response to this will be – honestly I’ve no special insight into this part of the product roadmap so I’m just as interested as anyone as to how it will turn out.
So – is any approach better, is Oracle’s approach the worst, are vendors like SnapLogic the future, and what should we use if deploying OBIEE in the cloud? Well to my mind there’s no black-and-white answer, and the choice comes down to a number of factors including:
- Where are your sources and targets; the more that still reside on-premise, the more a traditional on-premise tool such as ODI makes sense
- To what extent are you interacting with non-Oracle targets and sources; the more non-Oracle ones you’re using, the more a non-Oracle tool will probably make sense
- How complex will your end data transformations be – if you’re expecting to do lots of SCD2 transformations, analytic queries, clever DW-stuff, a tool like ODI will make sense
- How much upfront investment do you want to make? Services like Informatica Cloud are far easier to get provisioned on, and involve far less up-front license costs, than an on-premise install
- Or are you just loading data from a single source into a managed cloud database – if so, one of the vendor-supplied utilities will probably do the job
Anyway – I’d be interested in others’ opinions, and whether I’ve missed anything out. But for now – that was some thoughts from me on running ETL “in the cloud”.
I will be in Seattle on Thursday, January 9th for the Meet the Oracle ACE Directors Panel. It is at the Sheraton Seattle from 4 - 6 pm and will feature several other ACE Directors including Martin D'Souza, Kellyn Pot'Vin, Tim Gorman, and my longtime friend and collaborator, Cameron Lackpour.
Come see the panel and stay for the Happy Hour; the beer will be on me!
As well as offering consulting and training around Oracle’s BI, DW and EPM products, Rittman Mead also provide managed services and support around these products for customers around the world. We’re just about to move over to a new service desk system based on Atlassian Jira, and as part of our final “soak test” we’re going to open up the service publicly for the next 24 hours, taking on public OBIEE support requests on a “best endeavour” basis.
From mid-day today (Wednesday, 18th December 2013, UK time) through to mid-day tomorrow, you can register here and then raise a support ticket, and we’ll endeavour to answer your query over the next few days. As you’d expect, there are a few conditions; you must register with your work email address, and not one from Gmail, Yahoo, Outlook.com etc, and whilst we’ll try and answer your query, there are no promises and we may have to limit replies and accepted tickets. Full details of the offer, and of our global managed services, are on our support page here, including terms and conditions, how to register and how to raise a ticket.
UPDATE: Thanks for your interest in our support system. We have now closed registration…
Keep checking back here for future support offerings! Thanks and have a great Christmas and New Year.
It’s that time of year again when we start planning out next year’s BI Forum, which like this year’s event will be running in May 2014 in Brighton and Atlanta. This will be our sixth annual event, and as with previous years, the most important part is the content – and as such I’m pleased to announce that the Call for Papers for BI Forum 2014 is now open, running through to January 31st 2014.
If you’ve not been to one of our BI Forum events in past years, the Rittman Mead BI Forum is all about Oracle Business Intelligence, and the technologies and techniques that surround it – data warehousing, data analysis, big data, unstructured data analysis, OLAP analysis and this year – in-memory analytics. Each year we select around ten speakers for Brighton, and ten for Atlanta, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead and Stewart Bryson.
Last year we had sessions on OBIEE internals and new features, OBIEE visualisations and data analysis, OBIEE and “big data”, along with sessions on Endeca, Exalytics, Exadata, Essbase and anything else that starts with an “E”. This year we’re continuing the theme, but are particularly looking for sessions on what’s hot this year and next – integration with unstructured and big data sources, use of engineered systems and in-memory analysis, advanced and innovative data visualisations, cloud deployment and analytics, and anything that “pushes the envelope” around Oracle BI, data warehousing and analytics.
The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues. We’re also looking for presenters for ten-minute “TED”-style sessions, and any ideas you might have for keynote speakers, send them directly to me at email@example.com. Other than that – have a think about abstract ideas now, and make sure you get them in by January 31st 2014.
But first, a mea culpa:
In 2008, I Was An Idiot

Back in early 2008, I wrote a blog entry comparing Collaborate, Kaleidoscope, and OpenWorld. In this entry, I said that Collaborate was the obvious successor to the Hyperion Solutions conference and I wasn't terribly nice to Kaleidoscope. Here's me answering which of the three conferences I think the Hyperion community should attend (I dare you to hold in the laughter):
Now which one would I attend if I could only go to one? Collaborate. Without reservation. If I'm going to a conference, it's primarily to learn. As such, content is key.

I actually got asked a very similar question on Network 54's Essbase discussion board just yesterday (apparently, it's a popular question these days). To parrot what I said there, OpenWorld was very, very marketing-oriented. 80% of the fewer than 100 presentations in the Hyperion track were delivered by Oracle (in some cases, with clients/partners as co-speakers). COLLABORATE is supposed to have 100-150 presentations with 100+ of those delivered by clients and partners.

In the interest of full disclosure, my company, interRel, is paying to be a 4-star partner of COLLABORATE. Why? Because we're hoping that COLLABORATE becomes the successor to the Solutions conference. Solutions was a great opportunity to learn (partying was always secondary) and I refuse to believe it's dead with nothing to take its mantle. We're investing a great deal of money with the assumption that something has to take the place of the Hyperion Solutions conference, and it certainly isn't OpenWorld.

Is OpenWorld completely bad? Absolutely not. In addition to the great bribes, it's a much larger conference than COLLABORATE or ODTUG's Kaleidoscope, so if your thing is networking, by all means, go to OpenWorld. OpenWorld is the best place to get the official Oracle party line on upcoming releases and whatnot. OpenWorld is also the place to hear better keynotes (well, at least by More Famous People like Larry Ellison, himself). OpenWorld has better parties too. OpenWorld is also in San Francisco, which is just a generally cooler town. In short, OpenWorld was very well organized, but since it's being put on by Oracle, it's about them getting out their message to their existing and prospective client base.

So why aren't I recommending Kaleidoscope (since I haven't been to that either)? Size, mostly.
Their entire conference will have around 100 presentations, so their Hyperion track will most likely be fewer than 10 presentations. I've been to regional Hyperion User Group meetings that have more than that (well, the one interRel hosted in August of 2007 had 9, but close enough). While Kaleidoscope may one day grow their Hyperion track, it's going to be a long time until they equal the 100-150 presentations that COLLABORATE is supposed to have on Hyperion alone.

If you're only going to one Hyperion-oriented conference this year, register for COLLABORATE. If you've got money in the budget for two conferences, also go to OpenWorld. If you're a developer that finds both COLLABORATE and OpenWorld to be too much high-level fluff, then go to Kaleidoscope.
In 2008, Mike Riley Luckily Wasn't An Idiot
It’s true that ODTUG is a smaller conference, however that is by choice. At every ODTUG conference, the majority of the content is by a user, not by Oracle or even another vendor. And even though Collaborate might seem like the better buy because of its scale, for developers and true technologists ODTUG offers a much more targeted and efficient conference experience. Relevant tracks in your experience level are typically consecutive, rather than side-by-side so you don’t miss sessions you want to attend. The networking is also one of the most valuable pieces. The people that come to ODTUG are the doers, so everyone you meet will be a valuable contact in the future.
It’s true, COLLABORATE will have many presentations with a number of those delivered by clients and partners, but what difference does that make? You can’t attend all of them. ODTUG’s Kaleidoscope will have 17 Hyperion sessions that are all technical.
In the interest of full disclosure, I have been a member of ODTUG for eight years and this is my second year as a board member. What attracted me to ODTUG from the start was the quality of the content delivered, and the networking opportunities. This remains true today.
I won’t censor or disparage any of the other conferences. We are lucky to have so many choices available to us. My personal choice and my highest recommendation goes to Kaleidoscope for all the reasons I mentioned above (and I have attended all three of the above mentioned conferences).
One last thing: New Orleans holds its own against San Francisco or Denver. All of the cities are wonderful, but when it comes to food, fun, and great entertainment there’s nothing like the Big Easy.

Mike was only in his second year as a board member of ODTUG, but he was willing to put himself out there, so I wrote him an e-mail back. In that e-mail, dated February 10, 2008, I said that for Kaleidoscope to become a conference that Hyperion users would love, it would require a few key components: keynote(s) by headliner(s), panels of experts, high-quality presentations, a narrow focus that wasn't all things to all people, and a critical mass of attendees.
Mike Helped Us, Let's Help Him
Note that the file was saved in the Excel 2007 and higher format (with an .xlsx file extension). Did you know that the xlsx format is really just a specialized zip file? Seriously. It is a zip file containing various files that are primarily in xml format. I saved the workbook, added the .zip extension to the filename, and opened it in WinRAR. Here is what I found.
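As a quick illustration, you can poke at the zip structure programmatically as well as with WinRAR. The sketch below builds a minimal stand-in archive in memory (the part names mirror the real xlsx layout, but the contents are placeholders) and lists its entries:

```python
import io
import zipfile

# An .xlsx workbook is just a zip archive of (mostly XML) parts.
# Build a minimal stand-in in memory to show the layout you would
# see when opening a real workbook with a zip tool.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("xl/workbook.xml", "<workbook/>")
    z.writestr("xl/sharedStrings.xml", "<sst/>")
    z.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# Reopen it read-only and list the parts, as a zip utility would
with zipfile.ZipFile(buf) as xlsx:
    for name in xlsx.namelist():
        print(name)
```

The same few lines work against a real workbook if you pass the .xlsx file name straight to ZipFile; no renaming to .zip is actually needed.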
I opened the xl folder to find a series of files and folders.
Next, I opened the worksheets folder to see what was in there. Of course, it is a directory of xml files containing the contents of the worksheets.
My Essbase retrieves were on the sheet named Sheet1, so let’s take a look at what is in the sheet1.xml file. The xml is quite large, so I can’t show all of it here, but needless to say, there is a bunch of information in the file. The cell contents are only one of the things in the file. Here is an excerpt that shows the contents of row 5 of the spreadsheet.
This is interesting as it shows the numbers but not the member name. What is the deal with that? I noticed there is an attribute, ‘t’, on that node. I am guessing that the attribute t=”s” means the cell type is a string. I had noticed that in one of the zip file screenshots, there was a file named sharedStrings.xml. Hmm... I took a look at that file and guess what I found?
That’s right! The 5th item, assuming you start counting at zero like all good programmers do, is Profit. That number corresponds perfectly with the value specified in the xml for cell B5, which was five (circled in blue in the xml file above). OK, so when are we going to get to Smart View stuff? The answer is pretty quick. I continued looking at sheet1.xml and found these nodes near the bottom.
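The shared-string indirection just described can be sketched in a few lines. The XML snippets below are simplified stand-ins (namespaces and most attributes omitted, member names invented to put "Profit" at index 5); in the real file, a cell's v element holds a zero-based index into sharedStrings.xml whenever the cell carries t="s":

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for a worksheet row and the shared strings part;
# a real workbook wraps these in the spreadsheetml namespace.
sheet_xml = '<row r="5"><c r="B5" t="s"><v>5</v></c><c r="C5"><v>105522.0</v></c></row>'
shared_xml = ('<sst><si><t>Year</t></si><si><t>Measures</t></si>'
              '<si><t>Product</t></si><si><t>Market</t></si>'
              '<si><t>Scenario</t></si><si><t>Profit</t></si></sst>')

shared = [si.findtext("t") for si in ET.fromstring(shared_xml)]
for cell in ET.fromstring(sheet_xml):
    value = cell.findtext("v")
    if cell.get("t") == "s":        # a string cell stores an index...
        value = shared[int(value)]  # ...into the zero-based shared list
    print(cell.get("r"), value)    # B5 resolves to "Profit"
```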
Hmm, custom properties that contain the name Hyperion? Bingo! There were a number of custom property files in the xml file. Let’s focus on those.
Custom property #1 is identified by the name CellIDs. The corresponding file, customProperty1.bin, contained only the empty xml node <root />. Apparently there aren’t any CellIDs in this workbook.
Custom property #2 is identified by the name ConnName. The file customProperty2.bin contains the string ‘Sample Basic’ which is the name of my connection.
Custom property #3 is named ConnPOV but it appears to contain the connection details in xml format. Here is an excerpt of the xml.
Custom property #4 is named HyperionPOVXML and the corresponding file contains xml which lines up with the page fields I have in my worksheet.
What is interesting about the POV xml is that I have two different retrieves that both have working POV selectors which are both implemented as list-type data validations in Excel. I don’t know what happens internally if I save different values for the POV.
Custom property #5 is labeled HyperionXML. It appears to contain the information about the Essbase retrieval, but it doesn't appear to be the actual retrieval xml because it doesn't contain the numeric data. My guess is that this xml is used to track what is on the worksheet from a Hyperion standpoint.
There is a lot of information in this simple xml stream, but the most interesting information is contained in the slice element. Below is a close-up of contents in the slice.
The slice covers 6 rows and 7 columns for a total of 42 cells. It is interesting that the Smart View team chose to serialize their XML in this manner for a couple of reasons. First, the pipe delimited format means that every cell must be represented regardless of whether it has a value or not. This really isn’t too much of a problem unless your spreadsheet range is pretty sparse. The second thing about this format is that the xml itself is easy and fast to parse, but the resulting strings need to be parsed again to be usable. For example, the vals node will get split into an array containing 42 elements. The code must then loop the 42 elements and process them individually. The other nodes, such as the status, contain other pieces of information about the grid. The status codes appear to be cell attributes returned by Essbase; these attributes are used to apply formatting to cells in the same way the Excel add-in UseStyles would apply formatting. There are a couple of things to take away:
- In addition to the data on the worksheet itself, there is potentially *a lot* of information stored under the covers in a Smart View file.
- String parsing is a computation-intensive operation and can hurt performance. Multiply that workload by 8 because, depending on the operation and perhaps the provider, all 8 xml nodes above may need to be parsed.
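The two-pass parsing described above can be sketched like this; the pipe-delimited string here is invented sample data, but the shape matches the 6-row by 7-column slice:

```python
# A slice serializes every cell, blank or not, into one pipe-delimited
# string. The sample values here are invented; only the shape matters.
rows, cols = 6, 7
vals = "|".join(str(i) if i % 3 else "" for i in range(rows * cols))

# First pass: the xml node itself is cheap to read, one split gets the cells
cells = vals.split("|")
assert len(cells) == rows * cols  # 42 entries, blanks included

# Second pass: every element must still be walked individually
grid = [cells[r * cols:(r + 1) * cols] for r in range(rows)]
for row in grid:
    print(row)
```

Multiply that second pass across the eight parallel nodes (vals, status and so on) and the string-handling cost adds up quickly on a sparse or large grid.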
In addition, the number of rows and columns shown in the slice may be important when you are looking at performance. Smart View must look at the worksheet to determine the size of the range to read in order to send it to Essbase. In the case of a non-multi-grid retrieve, the range may not be known and, as a result, the grid may be sized based on the UsedRange of the worksheet. In our work with Dodeca, we have found that workbooks converted from the older xls format to the newer xlsx format, which supports a larger number of cells, may have the UsedRange flagged internally to be 65,536 rows by 256 columns. One culprit appears to be formatting applied to the sheet in a haphazard fashion. In Dodeca, this manifested as a minor issue that caused a larger memory allocation on the server. Based on the format of the Smart View xml, as compared to the more efficient design of the Dodeca xml format, if this were to happen in Smart View it may cause a larger issue due to the number of cells that would need to be parsed and processed. Disclaimer: I did not attempt to replicate this issue in Smart View; rather, this is an educated guess based on my experience with spreadsheet behavior.
Note: The Dodeca xml format does not need to contain information for cells that are blank. This format reduces the size and the processing cycles necessary to complete the task. In addition, when we originally designed Dodeca, we tested a format similar to the one used today by Smart View and found it to be slower and less efficient.
Considering all of this information, I believe the xml format would be difficult for the Smart View team to change at this point as it would cause compatibility issues with previously created workbooks. Further, this discussion should give some visibility to the fact that the Smart View team faces an on-going challenge to maintain compatibility between different versions of Smart View considering that different versions distributed on desktops and different versions of the internal formats that customers may have stored in their existing Excel files. I don’t envy their job there.
After looking at all of this, I was curious to see what the xml string would look like on a large retrieve, so I opened up Smart View, connected to Sample Basic and drilled to the bottom of the 4 largest dimensions. The resulting sheet contained nearly 159,000 rows of data. Interestingly enough, when I looked at the contents of customProperty5.bin inside that xlsx file, the contents were compressed. That seemed a bit strange to me, as the xlsx file format is already compressed, but after thinking about it for a minute it makes sense: the old xls file format probably did not automatically compress content, so the compression was there primarily to shrink the content when saved in the xls file format.
Custom property #6 is labeled NameConnectionMap. The corresponding property file contains xml that appears to map the range names in the workbook to the actual grid and the connection.
Custom property #7 is labeled POVPosition. The file customProperty7.bin contains the number 4 followed by a NUL character. Frankly, I have no idea what position 4 means.
Moving on, custom property #8 is labeled SheetHasParityContent. This file contains the number 1 followed by a NUL character. This is obviously a boolean flag that tells the Smart View code that new features, such as support for multiple grids, are present in this file.
Custom property #9 is labeled SheetOptions. The corresponding file, customProperty9.bin, contains an xml stream that (obviously) contains the Hyperion options for the sheet.
Custom property #10 is labeled ShowPOV and appears to contain a simple Boolean flag much like that in custom property #8.
Finally, custom property #11 is labeled USER_FORMATTING and may not be related to Smart View.
I did look through some of the other files in the .zip and found a few other references to Smart View, but I did not see anything significant.
So, now that we have completed an overview of what is contained in one, very simple, multi-grid file, what have we learned?
- There is a bunch of stuff stored under the covers when you save a Smart View retrieve as an Excel file.
- With the reported performance issues in certain situations with Smart View, you should now have an idea of where to look to resolve Smart View issues in your environment.
There are a number of files I did not cover in this overview that could also cause performance issues. For example, Oracle support handled one case where they found over 60,000 Excel styles in the file. Smart View uses Excel Styles when it applies automatic formatting to member and data cells. When there are that many styles in the workbook, however, it is logical that Excel would have a lot of overhead searching through its internal list of Style objects to find the right one. Accordingly, there is a styles.xml file that contains custom styles. If you have a bunch of Style objects, you could delete the internal styles.xml file.
Note: Be sure to make a copy of your original workbook before you mess with the internal structures. There is a possibility that you may mess it up and lose everything you have in the workbook. Further, Oracle does not support people going under-the-covers and messing with the workbook, so don’t even bring it up to support if you mess something up.
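Before resorting to deleting styles.xml, it can be worth simply counting how many styles a workbook carries. The sketch below is illustrative Python using only the standard library; it assumes the styles live in the usual xl/styles.xml part and that cell formats are recorded as children of the `<cellXfs>` element, giving a quick bloat check without modifying the file:

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used inside xl/styles.xml
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def count_cell_styles(xlsx_path):
    """Count the <xf> format records under <cellXfs> in xl/styles.xml --
    a quick way to spot a workbook bloated with thousands of styles."""
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("xl/styles.xml"))
    cell_xfs = root.find(f"{NS}cellXfs")
    return len(cell_xfs) if cell_xfs is not None else 0
```

A count in the tens of thousands, as in the support case above, is a strong hint that style cleanup is the real fix.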
Wow, that should give you some idea of what may be going on behind the scenes with Smart View. Even with the experience I have designing and writing the Dodeca web services that talk to Essbase, I wouldn't say that I have a deep understanding of how the information in a Smart View workbook really works. However, one thing is for certain; Dodeca does not put stuff like this in your Excel files. It may be interesting to hear what you find when you explore the internals of your workbooks.
This year, Rittman Mead were the Analytics Sponsor for the UKOUG Tech13 conference in Manchester, and those who visited the UKOUG stand during their time there will have noticed the Rittman Mead-sponsored Analytics Dashboard on display. In this blog post I will cover how it was put together, the “Blue Peter Way”!
For those of you not familiar with Blue Peter, it is a long-running children’s TV show here in the UK that started way back in 1958; possibly its most infamous moment came in 1969 with Lulu the defecating elephant. As a child, the highlight of the show was the “How to make” section, where they would always use the phrase “sticky-backed plastic” instead of “Sellotape” due to a policy against using commercial terms on air. The presenters would create something from scratch with bits and bobs you could find around the house: cereal boxes, egg cartons, washing-up bottles, Sellotape etc. (sorry, “sticky-backed plastic”). That’s exactly what I’m going to be doing here, but instead of making a sock monster I’m going to show you how the analytics dashboard was created from scratch with bits you can find on the internet. So kids, without further delay, let’s begin…
(remember to ask your parents permission before downloading any of the following items)
You will need :
- A Linux server.
- A Web server
- The Redis key/value store
- A DataSift account
- The Webdis HTTP server
- The tagcanvas.js jQuery plugin
- The vTicker jQuery plugin
- The flot plotting library
- Some Sticky-backed-Plastic (jQuery)
- Lots of coffee
The Linux server I’m using is Red Hat Enterprise Server 6.4 and the very first thing we’ll need to do is install a web server.
Done. The Web server is now installed, configured to start on server boot, and up and running ready to service our HTTP requests.
Next up is the Redis key/value datastore. I’ll be using Redis to store all incoming tweets from our datasource (more on that in a bit).
Now that we have Redis up and running, let’s perform a couple of tests: first a benchmark using the redis-benchmark command.
This command throws out a lot more output than this, but it is the LRANGE command we are particularly interested in, as we’ll be using it to retrieve our tweets later on. 3,593 requests per second seems reasonable to me: there were around 1,000 registrations for the UKOUG Tech13 conference, and the likelihood of each of them making 3 concurrent dashboard requests within the same second is slim to say the least – regardless, I’m willing to live with the risk. Now for a quick SET/GET test using the redis-cli command.
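One detail worth knowing about LRANGE, since we lean on it later: unlike most slicing conventions, both of its indices are inclusive, so LRANGE tweets 0 100000 can return up to 100,001 items. A quick Python emulation of the semantics (a sketch for illustration, not anything from the Redis codebase):

```python
def lrange(lst, start, stop):
    """Emulate Redis LRANGE semantics on a Python list: start and stop are
    zero-based and INCLUSIVE, and negative indices count from the tail."""
    n = len(lst)
    if start < 0:
        start = max(n + start, 0)
    if stop < 0:
        stop = n + stop
    return lst[start:stop + 1]  # +1 because Redis' stop index is inclusive
```

So asking for 0 to 100000 on a list of a few thousand tweets simply returns the whole list.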
Ok, so that’s our datastore sorted, but what about our datasource? Clearly we will be using Twitter as our primary data source, but mere mortals like myself don’t have access to Twitter’s entire data stream; for that we need to turn to specialist companies that do. DataSift, amongst other companies like Gnip and Topsy (which, interestingly, was bought by Apple earlier this week), can offer this service and will add extra features into the mix, such as sentiment analysis and stream subscription services that push data to you. I’ll be using DataSift, as here at Rittman Mead we already have an account with them. The service charges you by DPU (Data Processing Unit), the cost of which depends upon the computational complexity of the stream you’re running; suffice to say that running the stream to feed the analytics dashboard for a few days was pretty cheap.
To set up a stream you simply express what you want included from the Twitter firehose using their own Curated Stream Definition Language, CSDL. The stream used for the dashboard was very simple and the CSDL looked like this :-
This is simply searching for tweets containing the string “ukoug” from the Twitter stream. DataSift supports all kinds of other social data, for a full list head over to their website.
Now that we have our data stream set up, how do we actually populate our Redis datastore with it? Simple – get DataSift to push it to us. Using their Push API you can set up a subscription in which you can specify output parameters including “output_type”, “host”, “port”, “delivery_frequency” and “list”, amongst many others. The output type was set to “Redis”; the host, well, our hostname; and the port set to the port upon which Redis is listening, by default 6379. The delivery_frequency was set to 60 seconds, and the list is the name of the Redis list key you want the data pushed to. With all that set up and the subscription active, tweets will automagically start arriving in our Redis datastore in JSON format – happy days!
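As a rough illustration of what such a subscription might look like, here are the settings collected together. The parameter names are the ones mentioned above; the hostname and list key are made-up placeholders, and the exact payload shape DataSift expects may well differ, so treat this as a sketch rather than a working API call:

```python
# Illustrative only: output parameter names as described in the post;
# the hostname and list key below are hypothetical placeholders.
subscription = {
    "output_type": "redis",
    "output_params.host": "dashboard.example.com",  # hypothetical host
    "output_params.port": 6379,                     # default Redis port
    "output_params.list": "ukoug_tweets",           # Redis list key to push to
    "output_params.delivery_frequency": 60,         # push every 60 seconds
}
```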
The next step is to get the data from Redis to the browser. I could install PHP, configure it to work with Apache, then install one of the several Redis-PHP libraries and write some backend code to serve data up to the browser, but time is limited and I don’t want to faff around with all that. Besides, this is Blue Peter, not a BBC4 documentary. Here’s where we use the next cool piece of open source software, Webdis. Webdis acts as an HTTP interface to Redis, which means I can make HTTP calls directly from the browser like this :-
Yep, you guessed it, we’ve just set a key in Redis directly from the browser. To retrieve it we simply do this.
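The mapping is pleasingly mechanical: the Redis command and its arguments simply become segments of the URL path, and Webdis hands back a JSON reply. A small helper to build such URLs – note that webdis_url is my own hypothetical function for illustration, not part of Webdis itself:

```python
from urllib.parse import quote

def webdis_url(base, command, *args):
    """Build a Webdis-style URL: the Redis command and its arguments map
    onto the URL path as /COMMAND/arg1/arg2/..."""
    parts = [command.upper()] + [quote(str(a), safe="") for a in args]
    return base.rstrip("/") + "/" + "/".join(parts)
```

For example, webdis_url("http://host:7379", "lrange", "ukoug_tweets", 0, 100000) gives the kind of URL the dashboard’s Ajax calls hit later on.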
Security can be handled in the Webdis config file by setting IP based rules to stop any Tom, Dick or Harry from modifying your Redis keys. So to install Webdis we do the following :-
And that’s it, we now have tweet-to-browser in 60 seconds and without writing a single line of code! Here’s an overview of the architecture we have :-
Would I do this in a production environment with a larger data set? Probably not – it wouldn’t scale. I’d instead write server-side code to handle the data processing, test performance and push only the required data to the client. The right tools for the right job.
The function getData() is called within setInterval(); the second parameter to setInterval() sets the frequency of the call, in this case every 30 seconds (30,000 milliseconds). The getData() function performs an Ajax GET request on the URL that points to our Webdis server listening on port 7379, which in turn performs an LRANGE command on the Redis datastore – it’s simply asking Redis to return all list items between 0 and 100000, each item being a single tweet. Once the Ajax request has successfully completed, the “success” callback fires, and within this callback I push each tweet into an array, so we end up with an array of tweet objects. We now have all the data in a format we can manipulate to our heart’s content.
Now onto the Graphs and Visualisations.

The Globe
The tweet ticker was built by grabbing the latest 30 tweets from the data array and assigning them to html <li> tags, the vTicker jQuery plugin was then applied to the containing <ul> html tag. Various plugin options allow you to control things like the delay and scroll speed.

Tweet Velocity
This one proved quite popular – these speaker types are a competitive bunch! Having obtained a list of speakers from the UKOUG Tech13 website, I was able to search all the Twitter content for each speaker and aggregate up to get the total Twitter mentions for each, with the graph again rendered using the flot plotting library. As the graph updated during each day, speakers were swapping places, with our own Mark Rittman tweeting out “the scores on the doors” at regular intervals. When the stats can be manipulated, though, there’s always someone willing to take advantage!
Tut, tut Mark.

Twitter Avatars
The Twitter avatars used the tagcanvas.js library, but instead of populating it with words from the Twitter content, the tweet avatars were used. A few changes to the plugin options were made to display the results as a horizontally scrolling cylinder instead of a globe.

Twitter Sentiment
The Twitter sentiment graph again used flot. Tweet sentiment was graphed over time for the duration of the conference, with the sentiment score provided by DataSift as part of the Twitter payload. The scores we received across the roughly 3,500 tweets ranged between -15 and 15, each score reflecting either a positive or negative tweet. Asking a computer to infer a human emotion from 140 characters of text is a tough ask. Having looked at the data in detail, a fair few of the tweets that received negative scores weren’t negative in nature; for example, tweets with the content “RAC attack” and “Dangerous anti patterns” generated a cluster of negative scores. As we know, computers are not as clever as humans – how can they be aware of the context of a tweet, detect sarcasm, or differentiate between banter and plain old insults? Not all the negatively scored tweets were false negatives, though; some were genuine, and a few regarding the food seemed to ring true.
Perhaps the data you’re analysing needs to be taken in context as a whole. You’d expect 1,000 techies running around a tech conference to be happy, and the sentiment analysis seemed to do a better job of ranking how positive tweets were than how negative they were. From a visualisation point of view, a logarithmic scale along with a multiplier to raise the lowest score might have worked better in this case to reflect how positive overall the event was. One thing is clear, though: further statistical analysis would be needed over a much larger data set to really gain insight into how positive or negative your data is.
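To make the “logarithmic scale with a multiplier” idea concrete, here is one possible transform, sketched in Python: shift each score so the assumed minimum of -15 maps above zero, then compress with a log. The -15 floor and the exact shape are my assumptions for illustration, not anything DataSift prescribes:

```python
import math

def rescale_sentiment(score, floor=-15):
    """Shift a sentiment score (assumed here to span floor..-floor) above
    zero, then compress it logarithmically so strongly positive scores
    don't dwarf the rest of the curve."""
    shifted = score - floor + 1   # with floor=-15: -15 -> 1, 15 -> 31
    return math.log(shifted)
```

Plotting the transformed series would keep the ordering of the scores while flattering the mostly-positive shape of the conference data.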
So that’s it kids, I hope you’ve enjoyed this episode of Blue Peter, until next time….
Edit – 6-Dec-2013 19.07:
Mark has asked me the following question over twitter:
“any chance you could let us know why these particular tools / components were chosen?”.
I thought I’d give my answer here.
All the tools were open source, easy to install and configure, and well documented, and I had used them all previously – again, a time saver. Redis was chosen for two reasons: it was supported as a subscription destination by DataSift (along with many others), and I’m currently using it in another development project, so I was up to speed with it. Although in this solution I’m not really taking advantage of Redis’s access speed, it worked well as somewhere to persist the data.
If I was coding a more permanent solution with more development time, I’d add a relational database into the mix and use it to perform all of the data aggregation and analysis, which would make the solution more scalable. I’d then either feed the browser via Ajax calls to backend code sitting over the database, or populate a cache layer from the database and make Ajax calls directly on the cache, similar to what I did in this solution. It would have been nice to use D3 to develop a more elaborate visualisation instead of the canned flot charts, but again this would have taken more time to develop.
Next week Rittman Mead is the analytics sponsor for the UKOUG Tech13 conference up in Manchester.
We’ve got a great line up of talks, covering everything from the BI Apps and Endeca to Hadoop, OBIEE and mobile – Stewart and Charles are both flying in from the US to present alongside Mark, Adam and myself. The full list is as follows:
- Deep dive into Oracle BI Applications using Oracle Data Integrator [Mark Rittman] – Sunday 12.30-2pm
- Leveraging Hadoop / Map Reduce OBIEE 11g and ODI 11g [Mark Rittman] – Sunday 3pm – 4.30pm
- BI Across Any Data Source with OBIEE and Oracle Endeca Discovery [Mark Rittman & Adam Seed] – Monday 5.05-5.50pm
- The Changing World of Business Intelligence [Jon Mead] – Tuesday 11.20 – 12.20
- Oracle BI Multi User Development (MDS XML vs MUDE) [Stewart Bryson] – Tuesday 11.30 -12.30
- Designing for a mobile World using OBIEE [Charles Elliot] – Wednesday 1.45 – 2.45
On Monday night we are teaming up with our friends at Pythian to host some drinks and nibbles at Taps Bar, which is just around the corner from the conference venue. We’ll be there from 6.30-9.00pm, so come and join us for a free drink.

Analytics Sponsor
As part of being the analytics sponsor for the event we are looking to collate as much real time, social media and demographic information about the event as possible. We will be displaying an analytics dashboard in the conference venue detailing statistics from this data. To help us, could you:
- Complete the form here to give us some background information about why you are going; and
- Use the official hashtag #ukoug_tech13 when tweeting anything about the event.
It’s looking like it will be a great few days, so we look forward to seeing you up there.
More Excel Support

Dodeca has always been strong on Excel version support, and this version delivers even more Excel functionality. Internally, we use the SpreadsheetGear control, which does a very good job with Excel compatibility. This version of Dodeca integrates a new version of SpreadsheetGear that now has support for 398 Excel functions, including the new SUMIFS, COUNTIFS, and CELL functions.

Excel Page Setup Dialog

The new version of Dodeca includes our implementation of the Excel Page Setup Dialog, which makes it easy for users to customize the printing of Dodeca views that are based on Excel templates. Note that for report developers, the Excel Page Setup has also been included in the Dodeca Template Designer.
New PDF View Type

Customers who use PDF files in their environments will like the new PDF View Type. In previous releases of Dodeca, PDF documents displayed in Dodeca opened in an embedded web browser control. Beginning in this version, Dodeca includes a dedicated PDF View type that uses a specialized PDF control.
View Selector Tooltips

Finally, users will like the new View Selector tooltips, which optionally display the name and the description of a report as a tooltip.
Performance

Performance is one of those things that users always appreciate, so we have added a new setting that can significantly improve performance in some circumstances. Dodeca has a well-defined set of configuration objects that are stored on the server, and we were even awarded a patent recently for the unique aspects of our metadata design. That being said, depending on how you implement reports and templates, there is the possibility of many queries being issued to the server to check for configuration updates. In a few instances, we saw that optimizing this query traffic could be beneficial, so we have implemented the new CheckForMetadataUpdatesFrequencyPolicy property. This property, which is controlled by the Dodeca administrator, tells Dodeca whether to check the server for updates before any object is used, as was previously the case, only when a view opens, or only when the Dodeca session begins. We believe the last option will be very useful when Dodeca is deployed in production, as objects configured in production often do not change during the workday and, thus, network traffic can be reduced using this setting. The screenshot below shows where the administrator can control the update frequency.
Though users will like these features, we have put a lot of new things in for the people who create Dodeca views and those who administer the system. Let’s start with something that we think all Dodeca admins will use frequently.

Metadata Property Search Utility

As our customers continue to expand their use of Dodeca, the number of objects they create in the Dodeca environment continues to grow. In fact, we now have customers who have thousands of different objects that they manage in their Dodeca environments. The Metadata Property Search Utility will help these users tremendously.
This utility allows the administrator to enter a search string and locate every object in our system that contains that string. Once a property is located, there is a hyperlink that will navigate to the given object and automatically select the relevant property. This dialog is modeless, which means you can navigate to any of the located items without closing the dialog.
Note: this version does not search the contents of Excel files in the system.

Essbase Authentication Services

In the past, when administrators wished to use an Essbase Authentication service to validate a login against Essbase and automatically obtain Dodeca roles based on the Essbase user’s group memberships, they had to use an Essbase connection where all users had access to the Essbase application and database. The new ValidateCredentialsOnly property on both of the built-in Essbase Authentication services now flags the service to check login credentials at the server level only, eliminating the need for users to have access to a specific Essbase database.

New Template Designer Tools

Prior to Dodeca 6.x, all template editing was performed directly in Excel. Since that time, however, most template design functionality has been replicated in the Dodeca Template Designer, and we think it is preferable due to the speed and ease with which users can update templates stored in the Dodeca repository. We have added a couple of new features to the Template Designer in this version. The first tool is the Group/Ungroup tool, which allows designers to easily apply Excel grouping to rows and/or columns within the template. The second new tool is the Freeze/Unfreeze tool, which is used to freeze rows and/or columns in place for scrolling.

Parameterized SQL Select Statements

Since we introduced the SQLPassthroughDataSet object in the Dodeca 5.x series, we have always supported the idea of tokenized select statements. In other words, the SQL could be written so that point-of-view selections made by users could be used directly in the select statement. In a related fashion, we introduced the concept of parameterized insert, update, and delete statements in the same version. While parameterized statements are similar in concept to tokenized statements, there is one important distinction under the covers.
In Dodeca, parameterized statements are parsed and converted into prepared statements that can be used multiple times, resulting in more efficient use of server resources. The parameterized select statement was introduced in this version of Dodeca so that customers using certain databases that cache the prepared statement can realize improved server efficiency on their select statements.

Workbook Script Formula Editor Improvements

We have also been working hard to improve extensibility for developers using Workbook Scripts within Dodeca. In this release, our work focused on the Workbook Script Formula Editor. The first thing we added here is color coding that automatically detects and distinguishes Excel functions, Workbook Script functions, and Dodeca tokens. In the new version, Excel functions are displayed in green, Dodeca functions and parentheses are displayed in blue, and tokens are displayed in ochre. Here is an example.
In addition, we have implemented auto-complete for both Excel and Dodeca functions.
New SQLException Event

Version 6.6 of Dodeca introduces a new SQLException event that provides the ability for application developers to customize the behavior when a SQL exception is encountered.

XCopy Release Directory

Beginning in version 6.6, the Dodeca Framework installation includes a pre-configured directory intended for customers who prefer to distribute their client via XCopy deployment instead of using Microsoft ClickOnce distribution. The XCopy deployment directory is also for use by those customers who use Citrix for deployment.

Mac OS X Release Directory

The Dodeca Framework installation now includes a pre-compiled Dodeca.app deployment for customers who wish to run the Dodeca Smart Client on Mac OS X operating systems. What that means is that Dodeca now runs on a Mac without the need for any special Windows emulators. Dodeca does not require Excel to run on the Mac (nor does it require Excel to run on Windows, for that matter), so you can certainly save your company significant licensing fees by choosing Dodeca for your solution.
In short, you can see we continue to work hard to deliver functionality for Dodeca customers. As always, the Dodeca Release Notes provide detailed explanations of all new and updated Dodeca features. As of today, we have decided to make the Release Notes and other technical documents available for download to non-Dodeca customers. If you are curious about all of the things Dodeca can do, and if you aren't afraid to dig into the details, you can now download our 389-page cumulative Release Notes document from the Dodeca Technical Documents section of our website.
How can you conditionally turn cell borders on and off in Publisher’s RTF/XSLFO templates? With a little digging you'll find what appear to be the appropriate attributes to update in your template. You would logically come up with using the various border styling options:
border-top|bottom|left|right-width
border-top|bottom|left|right-style
border-top|bottom|left|right-color
Buuuut, that doesn't work: updating them individually does not make a difference to the output. I'm not sure why, and I will ask, but for now here's the solution. Use the compound border attribute border-top|bottom|left|right, which takes the form border-bottom="0.5pt solid #000000" – you set all three options at once rather than individually. In a BIP template you use:
<?if:DEPT='Accounting'?>
<?attribute@incontext:border-bottom;'3.0pt solid #000000'?>
<?attribute@incontext:border-top;'3.0pt solid #000000'?>
<?attribute@incontext:border-left;'3.0pt solid #000000'?>
<?attribute@incontext:border-right;'3.0pt solid #000000'?>
<?end if?>
3pt borders are a little excessive, but you get the idea. This approach can be used with the if@row option too, to get the complete row borders to update. If your template will need to run in right-to-left languages, e.g. Arabic or Hebrew, then you will need to use start and end in place of left and right.
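Putting the pieces together, the row-level variant might look like the following – the same hypothetical DEPT condition as above, with if@row wrapping the border attributes so the whole row gets formatted. This is an untested sketch, so check it against your own template:

```
<?if@row:DEPT='Accounting'?>
<?attribute@incontext:border-bottom;'0.5pt solid #000000'?>
<?attribute@incontext:border-top;'0.5pt solid #000000'?>
<?end if?>
```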
For the inquisitive reader, you're maybe wondering: how did this guy know that? And why the heck is this not in the user docs?
Other than my all-knowing BIP guru status ;0), I hit the web for info on XSLFO cell border attributes and then turned to the Template Builder for Word – particularly the export option: I generated the XSLFO output from a test RTF template and took a look at the attributes. Then I started trying stuff out – I'm a hacker and proud of it! As for the user doc updates, I'll log a request for an update.
If you’re a user of Oracle’s data integration products, you’re probably aware that ODI12c came out last month, with some of the new features covered in posts on the blog here, here and here. Rittman Mead were actually on the beta program for 12c, with several of our team attending preview events in the UK and USA at the start of the year, and then running preview releases back in our development labs as the product neared GA.
As CTO I sponsored our involvement in the ODI12c beta program, and was invited down to Oracle’s offices in Reading to take part in an interview with Kulvinder Hari, Director of Product Management – Fusion Middleware, to talk about our involvement in the program and what we saw as the most interesting new features in this latest release. You can access the full video here, and I also talk about the wider data integration market, products such as GoldenGate and EDQ, and the benefits OWB customers will get when migrating to, and interoperating with, Oracle Data Integrator.
The interview was actually part of a wider set of activities around the ODI12c launch; you can read a recap of the Oracle GoldenGate 12c and Oracle Data Integration 12c webcast on the Oracle website, as well as download resources on ODI12c from an oracle.com microsite. Keep an eye on the blog over the next few months, as we post more content on ODI12c along with the other new Oracle Data Integration 12c releases.
In yesterday’s post on running OBIEE in the cloud, I looked at a number of options for hosting the actual OBIEE element; hosting it in a public cloud service such as Amazon EC2, using Oracle’s upcoming BI-as-a-Service offering, or partner offerings such as our upcoming ExtremeBI in the Cloud service. But the more you think about this sort of thing, the more you realise that the OBIEE element is actually the easy part – it’s what you do about data storage, security, LDAP directories and ETL that makes things more complicated.
Take the example I gave yesterday where OBIEE was run in the cloud, with the multi-tenancy option enabled, the main data warehouse in the cloud, and data sourced from cloud and on-premise sources.
In this type of setup, there’s a number of things you need to consider beyond how OBIEE is hosted. For example:
- If your corporate LDAP directory is on-premise, how do we link OBIEE to it? Or does the LDAP server also need to be in the cloud?
- What sort of database do we use if we’re hosting it in the cloud? Oracle? If so, self-hosted in a public cloud, or through one of the Oracle DB-in-the-cloud offerings?
- If not Oracle database, what other options are available?
- And how do we ETL data into the cloud-based data warehouse? Do we continue to use a tool like ODI, or use a cloud-based option – or even a service such as Amazon’s AWS Data Pipeline?
What complicates things at this stage in the development of “cloud” is that most companies won’t move 100% to cloud in one go; more likely, individual applications and systems will migrate to the cloud, but for a long time we’ll be left with a “hybrid” architecture where some infrastructure stays on-premise, some might sit in a public cloud, and other parts might be hosted on third-party private clouds. So again, what are the options?
Well, Oracle’s upcoming BI-as-a-service offering sits at one extreme end of the spectrum; the only data source it’ll initially work with is Oracle’s own database-as-a-service, which in its initial incarnation provides a single schema, with no SQL*Net access and with data instead uploaded via a web interface (this may well change when Oracle launch their database instance-as-a-service later in 2014). No SQL*Net access means no ETL tool access in practice, though, as ETL tools all use SQL*Net or ODBC to connect to the database, so this offering to my mind is aimed at either (a) small BI applications where it’s practical to upload the data via Excel files etc., or (b) wider Oracle Cloud-based systems that might use database-as-a-service to hold their data, Java-as-a-service for the application and so forth. What this service does promise, though, is new capabilities within OBIEE where users can upload their own data, again via spreadsheets, to the cloud OBIEE system and have that “mashed-up” with the existing corporate data – the aim being to avoid data being downloaded into Excel to do this type of work, with user metrics clearly marked in the catalog so they’re distinct from the corporate ones.
But assuming you’re not going for the Oracle cloud offering, what are the other options over data? Well, hosting OBIEE in the cloud is conceptually no different from hosting it anywhere else, in that it can connect to various data sources via the various connection methods, so in principle you’ve got the same options open to you as when running on-premise. But the driver for moving OBIEE into the cloud might be that your applications, data etc. are already in the cloud, and you might also be looking to take advantage of cloud features in your database, such as dynamic provisioning and scaling, or indeed use one of the new cloud-native databases such as Amazon Redshift.
I covered alternative databases for use with OBIEE a few months ago in a blog post, and Amazon Redshift looked like an interesting option at the time; based on ParAccel, a mature analytic database offering, column-store and tightly integrated with Amazon’s other offerings, it is something a few customers have asked us about. And these databases are certainly interesting – in practice, not all that different in pricing to Oracle database as a source, but with some interesting analytic features – yet they all suffer from the common issue that they’re not officially supported as data sources. Amazon Redshift, for example, uses Postgres-derived ODBC drivers to connect to it, but Postgres itself isn’t officially supported as a source, which means you could well get sub-optimal queries and you certainly won’t get specific support from Oracle for that source. But if it works for you, then this could be an option, along with more left-field data sources such as Hadoop.
But to my mind, it’s the ETL element that’s the most interesting, and most challenging, part of the equation. Going back to Openworld, Oracle made a few mentions of ETL in their general Cloud Analytics talks, including talk about an upcoming data source adapter for the BI Apps that’ll enable loading from Fusion Apps in the cloud, like this:
There were also a number of other deployment options discussed, including hybrid architectures where some sources were in the cloud and some on-premise, but all involved running the ETL elements of the BI Apps – ODI or Informatica – on-premise, the same way as installs run today. And to my mind, this is where the Oracle cloud offering is weakest: around cloud-based and cloud-native ETL and data integration, the only real option at the moment is to run ODI agents in the cloud and connect them back to an on-premise ODI install, or move the whole thing into the cloud in what could be quite a heavyweight data integration architecture.
Other vendors are, in my opinion, quite a way further forward with their cloud data integration tools strategy than Oracle, who instead seem to be still focused on on-premise (mainly), database-to-database (mostly) ETL. To take two examples: Informatica have an Informatica Cloud service which appears to be a platform-as-a-service play, with customers presumably signing up for the service, designing their ETL flows and then paying for what they use, with a focus on cloud APIs as well as database connectivity, and full services around data quality, MDM and so forth.
Another vendor in this space is SnapLogic, a pure-play cloud ETL vendor selling a component-based product with a big focus on cloud, application and big data sources. What’s interesting about this and other similar vendors’ approaches, though, is that they appear to be “cloud-first” – written for the cloud, sold as a service, as much focused on APIs as database connectivity – a contrast to Oracle’s current data integration tools strategy, which to my mind still assumes an on-premise architecture. What’s more concerning is the lack of any announcement around ETL-in-the-cloud at the last Openworld – if you look at the platform-as-a-service products announced at the event, whilst database, BI, documents, BPM and so forth were announced as services, there was no mention of data integration:
What I’d like to see added to this platform in terms of data integration would be something like:
- On-demand data integration, sold as a service, available as a package along with database, Java and BI
- Support for Oracle and non-Oracle application APIs – for example Salesforce.com, Workday and SAP – see for example what SnapLogic support in this area.
- No need for an install – it’s already installed and it’s a shared platform, as they’re doing with OBIEE
- Good support for big data, unstructured and social data sources
I think it’s pretty likely this will happen – whilst products such as the BI Apps can have their ETL in the cloud, via ODI in the BI Apps 11g version for example, these are inherently single-tenant, and I’d fully expect Oracle to offer BI Apps-as-a-service at some point, with a corresponding data integration element designed from the ground up to work cloud-native and integrate with the rest of Oracle’s platform-as-a-service offering.
So there we have it – some thoughts on the database and ETL elements in an OBIEE-in-the-cloud offering. Keep an eye on the blog over the next few months as I build out a few examples, and I’ll be presenting on the topic at the upcoming BIWA 2014 event in San Francisco in January – watch this space, as they say.
One of the major announcements at this year’s Oracle Openworld in San Francisco was around “OBIEE in the cloud”. Part of a wide-ranging set of announcements around services and infrastructure in the cloud from Oracle, the idea with this service is that you’ll be able to access an Oracle-hosted future version of OBIEE running in Oracle’s public cloud service, so that you can create dashboards and reports without having to perpetually-license lots of software, or stand-up lots of infrastructure in your data centre. But of course “cloud” is a hot topic at the moment, and lots of our customers are asking us about options to run OBIEE in the cloud, what’s involved in it, and how we deal with the stuff that surrounds an OBIEE installation – the data, the ETL, the security, and so forth. So I thought I’d put a couple of blog posts together to go through what options are available to OBIEE customers looking to deploy in the cloud, and see what the state of the industry is around this requirement.
I’ll start-off then by looking at the most obvious element – the BI platform itself. If you want to run OBIEE “in the cloud”, what are your basic options?
The conceptually simplest option is just to run the OBIEE server applications in a cloud-based, hosted environment, for example Amazon EC2. In this instance, OBIEE runs as it would do normally, but instead of the host server being in your datacenter, it’s in a public cloud. This type of setup has been available for some time in the form of services such as Oracle On-Demand, but the difference here is that you’re using a public cloud service: there are no service contracts to set up, usually no minimum contract size, and everything to do with security, failover, maintenance and so on is down to you. We’ve been doing this sort of thing for a long time, typically for our cloud-based training environments, or for ad-hoc development work where it makes more sense for a server to be centrally available on the internet, rather than sitting on someone’s laptop or on an internal server. One potential complication here is licensing – for products such as the Oracle Database, there are formal schemes such as Amazon RDS where either the license is included, or there are set metrics or arrangements to either bring your own license or buy them based on virtual CPUs. If you’re prepared to take care of the licensing, and all of the security and maintenance aspects, this is an interesting approach.
In practice though, any setup like this is going to be what’s termed a “hybrid” cloud deployment; at least the administration side of OBIEE (the BI Administration Tool) is still going to be “on-premise”, as there’s no way you can deploy that “thin-client”, unless you create a Windows VM in the cloud too, and run the BI Administration tool from there. Moreover, your data sources are more than likely to still be on-premise, with just a few at this point hosted in the cloud, so most likely your OBIEE-in-the-cloud architecture will look like this:
There are other things to think about too: how do you do your ETL when your target data warehouse might be in the cloud, or your data sources might be cloud-based; how do you connect to your LDAP server; and so forth – I’ll cover these in the next postings in the series. But for now, this is conceptually simple, and its main benefit is avoiding the need to host your own servers, buy your own hardware, and so forth.
So what if you want to use OBIEE, but you don’t want the hassle even of setting up your own servers in a public cloud? This is where Oracle’s new “BI in the Cloud” service comes in – the idea here is that you swipe your credit card, fill in your details, self-provision your system, and then start loading data into it.
Whilst the underlying technology is core OBIEE, it’s designed for departmental, power user-type scenarios where the scope of data is limited, and the focus really is on ease-of-use, self-service and “software-as-a-service”. Looking at Oracle’s slides from Openworld, you can see the focus is quite different to on-premise OBIEE – the competition here for Oracle is the various internet startups, and products like Tableau, that make it easy to get started, provide consumer-class user interfaces, and focus on the single user rather than enterprise-type requirements.
But this focus on simplicity means a limitation in functionality as well. At the start, at least, Oracle’s BI in the Cloud will only offer Answers and Dashboards; no KPIs, BI Publisher, Maps or Delivers. Moreover, at least in its first iteration, it’ll only support Oracle’s own database schema-as-a-service as the sole data source, so no hybrid cloud/on-premise federated queries, no Essbase, and no ETL tools – hence the focus on single-user, departmental use-cases.
What you do get with this setup though is a cloud-native, fully-managed service where patching is taken care of for you, all the new features appear first, and the administration element has been re-thought to be more appropriate for deployment to the cloud – in particular, a new version of the BI Administration tool that’s cloud-native, focused on simple use-cases, and doesn’t require desktop client installs or VPN connections through to the OBIEE server. The diagram below shows the architecture for this option, and you can see it’s all pretty self-contained; great for simplicity, but maybe a bit limiting at least in the initial iteration.
The third major variant that I can see around OBIEE in the cloud is where partners (such as Rittman Mead) offer shared access to their cloud OBIEE installations, either as part of a development package or through some sort of multi-tenant reporting product. For example, we’re about to launch a service we’re calling “ExtremeBI in the Cloud”, where we combine our “ExtremeBI” agile development method with cloud-based development server hosting, with the cloud element there to make it easier for clients to start on a project *today*. Endeca Information Discovery is another product that could benefit from this approach, as the infrastructure behind an Endeca install can be fairly substantial, but users are typically more focused on the data modelling and report-building element. In our case, the back-end cloud hosting will typically be done on a service like Amazon EC2 or VMWare’s Hybrid Cloud product, but the focus is more on the development piece – cloud is there to provide business agility.
The other third-party approach to this centres around multi-tenancy; typically, a partner or company will provide a reporting or analytics element to a wider product set, and use OBIEE as its embedded dashboarding tool as part of this. In the past, setting up multi-tenancy for OBIEE was fairly difficult, to the point where most companies set up separate installs for each end-customer, but recent 11g releases brought a number of multi-tenant features that presumably were put there to support Oracle’s own cloud product. The product docs describe the new multi-tenant features well, but the basic premise is that a single RPD and catalog are still set up, but OBIEE can then run in a “multi-tenant” mode where GUIDs demarcate each tenant, a new BIGlobalAdministrator role is created that assumes what were the old BIAdministrator role’s privileges, with other administrator roles then set up with limited privileges and no access to Fusion Middleware Control, for example.
What this does give you though is the framework to set up tenant groupings of users, separate areas of the catalog for each tenant, additional identity store attributes to hold tenant details, and a security framework that enables more limited forms of administrator account more suited to tenant-type situations. Again, not all OBIEE features are available when running in multi-tenant mode, and presumably we’ll see this feature evolve over time as more requirements come in from Oracle’s own cloud BI product, but it’s a feature you can use now if you’re looking to set up a similar type of environment.
So that’s the three basic options if you want to run OBIEE in the cloud; host it in a public cloud service like Amazon EC2 but then just run it like regular OBIEE; go for Oracle’s upcoming BI-in-the-cloud service, if the use-case suits you and you can live with the limitations, or consider one of the services from partners such as ourselves where we bundle cloud hosting of OBIEE up with a larger service offering like our “ExtremeBI in the Cloud” offer. But of course OBIEE is only one part of the overall platform – where do you store the data, and how do you get the data from wherever it normally is up into the cloud – in other words, how do we do ETL in the cloud? Check back tomorrow for the follow-up post to this one: Part 2 – Data Warehouse and ETL.
Bit of a corner case this week, but I wanted to park this as much for my reference as yours. Need to be able to test a pure XSL template against some sample data? That’s an XSL template that is going to generate HTML, text or XML. The Template Viewer app in the BI Publisher Desktop group does not offer that as an option. It does offer XSL-FO processing though.
A few minutes digging around in the java libraries and I came up with a command line solution that is easy to set up and use.
1. Place your sample XML data and the XSL template in a directory
2. Open the lib directory where the TemplateViewer is installed. On my machine that is d:\Oracle\BIPDesktop\TemplateViewer\lib
3. Copy the xmlparserv2.jar file into the directory created in step 1.
4. Use the following command in a DOS/Shell window to process the XSL template against the XML data.
java -cp ./xmlparserv2.jar oracle.xml.parser.v2.oraxsl fileX.xml fileY.xsl > fileX.xls
The file generated will depend on your XSL. For an Excel output, you would instruct the process to generate fileX.xls in the same folder. You can then test the file with Excel, a browser or a text editor. Now you can test on the desktop until you get it right without the overhead of having to load it to the server each time.
To be completely clear, this approach is for pure XSL templates that are designed to generate text, HTML or XML. It’s not for the XSL-FO templates that might be used at runtime to generate PDF, PPT, etc. For those you should use the Template Viewer application, which supports the XSL-FO templates but not the pure XSL templates.
If the transform fails even though your template falls into the pure XSL category, this will be down to you using some BIP functionality in the template. To get it to work you'll need to add the Publisher libraries that contain the functions, e.g. xdo-core.jar, i18nAPI_v3.jar, etc., to the classpath argument (-cp).
So a new command including the required libraries might look like:
java -cp ./xmlparserv2.jar;./xdo-core.jar;./i18nAPI_v3.jar oracle.xml.parser.v2.oraxsl fileX.xml fileY.xsl > fileX.xls
You will need to either move the libraries to the local directory, as I assume above, or include the full path to them when setting the -cp argument.
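To save retyping the command for each sample file, the steps above can be wrapped in a small shell helper. A sketch, under my own assumptions: the function names and the `.out` extension are invented here, and xmlparserv2.jar is taken to be in the current directory as per step 3.

```shell
# out_name: derive an output file name from a sample XML file (pure string work).
out_name() { printf '%s\n' "${1%.xml}.out"; }

# run_all: apply one XSL template to every sample .xml file in the current
# directory, writing the transformed output alongside each sample.
run_all() {
  template="$1"
  for xml in *.xml; do
    [ -e "$xml" ] || continue   # no samples present: nothing to do
    java -cp ./xmlparserv2.jar oracle.xml.parser.v2.oraxsl \
      "$xml" "$template" > "$(out_name "$xml")"
  done
}
```

You would then invoke it as `run_all fileY.xsl`; if you need the extra Publisher jars, extend the `-cp` value as in the command above (remembering `;` as the separator on Windows, `:` on Linux).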
There are two steps you have to take to "save big". First, become a full member of ODTUG for $99 and enjoy all of the benefits, including access to a members-only presentations library, throughout the year. Next, register for Kscope14 and you are eligible for the members-only price of $1500 for a savings of $150. While you are registering, simply use the code AOLAP to get an additional $100 discount!
My company, Applied OLAP, is one of the top-tier Platinum Sponsors of Kscope14 and I will be there. I hope to see you at the conference and, if you were able to save some money by using our exclusive AOLAP code, be sure to stop by our booth, say hello, and learn how the Dodeca Spreadsheet Management System can help your company reduce spreadsheet risk, increase spreadsheet accuracy, and reduce costs.
We all love a good commandline utility. It gives us that warm feeling of control and puts hairs on our chests. Either that, or it means we can script the heck out of a system, automate many processes, and concentrate on more important matters.
However, some of the OBIEE commandline utilities can’t be used in Production environments at many sites because they need the credentials for OBIEE stored in a plain-text file. Passwords in plain-text are bad, mmmm’kay?
Two utilities in particular that it is a shame can’t be scripted up and deployed in Production because of this limitation are the Presentation Services Catalog Manager and the Presentation Services Replication Agent. Both perform very useful purposes, and what I want to share here is a way of invoking them more securely.

Caveat
IANAC : I Am Not A Cryptographer! Nor am I a trained security professional. Always consult a security expert for the final word on security matters.
The rationale behind developing the method described below is that some sites will have a “No Plaintext Passwords” policy which flat out prevents the use of these OBIEE utilities. However, at the same sites the use of SSH keys to enable one server to connect to another automatically is permitted. On that basis, the key-based encryption for the OBIEE credentials may therefore be considered an acceptable risk. As per Culp’s 9th law of security administration, it’s all about striking the balance between enabling functionality and mitigating risk.
The method described below is, I believe, a bit more secure than plaintext credentials, but it is not totally secure. It uses key-based encryption to secure the previously-plaintext credentials that the OBI utility requires. This is one step better than plaintext alone, but is still not perfect. If an attacker gained access to the machine they could still decrypt the file, because the key is held on the machine without a passphrase to protect it. The risk here is that we are using security by obscurity (because the OBIEE credentials are in an encrypted file it appears secure, even though the key is held locally), and like the emperor’s new clothes, if someone takes the time to look closely enough there is still a security vulnerability.
My final point on this caveat is that you should always bear in mind that if an attacker gains access to your OBIEE machine then they will almost certainly be able to do whatever they want regardless, including decrypting the weblogic superuser credentials or resetting them to a password of their own choosing.

Overview
Two new shiny tools I’ve acquired recently and am going to put to use here are GnuPG (gpg) and mkfifo. GPG provides key-based encryption and decryption and is available by default on common Linux distributions, including Oracle Linux. mkfifo is also commonly available, and is a utility that creates named pipes, enabling two unrelated processes to communicate.
This is a one-time set up activity. We create a key in gpg, and then encrypt the plain text credentials file using it.
The first step is to create a gpg key, using gpg --gen-key. You need to specify a “Real name” to associate with the key; I just used “obiee”. Make sure you don’t specify a passphrase (otherwise you’ll be back in the position of passing plain text credentials around when you use this script).
$ gpg --gen-key
gpg (GnuPG) 1.4.5; Copyright (C) 2006 Free Software Foundation, Inc.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions. See the file COPYING for details.

Please select what kind of key you want:
   (1) DSA and Elgamal (default)
   (2) DSA (sign only)
   (5) RSA (sign only)
Your selection?
DSA keypair will have 1024 bits.
ELG-E keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048) 2048
[...]
Real name: obiee
Email address:
Comment:
[...]
You don't want a passphrase - this is probably a *bad* idea!
I will do it anyway.  You can change your passphrase at any time,
using this program with the option "--edit-key".
[...]
gpg: key 94DF4ABA marked as ultimately trusted
public and secret key created and signed.
Once this is done, you can encrypt the credentials file you need for the utility. For example, the Catalog Manager credentials file has the format:
To encrypt it use
gpg --recipient obiee --output saw_creds.gpg --encrypt saw_creds.txt
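Before deleting the plaintext file, it’s worth a quick check that the encrypted copy actually decrypts back to the same contents. A sketch; the function name is my own, and it assumes the “obiee” key created above:

```shell
# verify_creds: decrypt the encrypted credentials and compare them
# byte-for-byte with the original plaintext file. Returns zero on a match.
verify_creds() {
  plain="$1" enc="$2"
  gpg --quiet --recipient obiee --decrypt "$enc" | cmp -s - "$plain"
}

# e.g. only remove the plaintext once the round trip checks out:
# verify_creds saw_creds.txt saw_creds.gpg && rm saw_creds.txt
```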
Now remove the plaintext password file.
Using the secure credentials file
Once we have our encrypted credentials file we need a way of using it with the utility it is intended for. The main thing we’re doing is making sure we don’t expose the plaintext contents. We do this using the named pipes method:
In this example I am going to show how to use the secure credentials file with runcat.sh, the Catalog Manager utility, to purge the Presentation Services cache. However, it should work absolutely fine with any utility that expects credentials passed to it in a file (or stdin).
There is a three step process:
1. Create a named pipe with mkfifo. This appears in a disk listing with the p bit set to indicate that it is a pipe. Access to it can be controlled by the same chmod process as a regular file. With a pipe, a process can request to consume from it, and anything that is passed to it by another process will go straight to the consuming process, in a FIFO fashion. What we’re doing through the use of a named pipe is ensuring that the plain text credentials are not visible in a plain text file on the disk.
2. Invoke the OBIEE utility that we want to run. Where it expects the plaintext credentials file, we pass it the named pipe. The important bit here is that the utility will wait until it receives the input from the named pipe – so we call the utility with an ampersand so that it returns control whilst still running in the background.
3. Use gpg to decrypt the credentials file, and pass the decrypted contents to the named pipe. The OBIEE utility is already running and listening on the named pipe, so will receive (and remove from the pipe) the credentials as soon as they are passed from gpg.
The script that will do this is as follows:
# Change folder to where we're invoking the utility from
cd $FMW_HOME/instances/instance1/bifoundation/OracleBIPresentationServicesComponent/coreapplication_obips1/catalogmanager

# Create a named pipe
mkfifo cred_pipe

# Let's make sure only we can access it
chmod 600 cred_pipe

# Invoke Catalog Manager. Because we're using a named pipe, it's actually
# going to sit and wait until it gets input on the pipe, so we need the
# ampersand in there so that it returns control to the script
./runcat.sh -cmd clearQueryCache -online http://localhost:9704/analytics/saw.dll -credentials cred_pipe &

# Decrypt the credentials and send them to the named pipe
gpg --quiet --recipient obiee --decrypt saw_creds.gpg > cred_pipe

# Remove the named pipe
rm cred_pipe
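One wrinkle with named pipes is that a reader consumes their contents, so if the utility you’re calling opens the credentials file more than once, a single decrypt into the pipe won’t be enough. A sketch of a workaround that writes the decrypted credentials to the pipe once per expected read; the function name and the reads parameter are my own invention:

```shell
# feed_pipe: decrypt the credentials file into the named pipe once per
# expected read by the consuming utility.
feed_pipe() {
  pipe="$1" creds="$2" reads="${3:-1}"
  i=0
  while [ "$i" -lt "$reads" ]; do
    gpg --quiet --recipient obiee --decrypt "$creds" > "$pipe"
    i=$((i + 1))
  done
}
```

You would call, say, `feed_pipe cred_pipe saw_creds.gpg 2` in place of the single gpg line in the script, if the utility reads its credentials file twice.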
Depending on the utility that you are invoking, you may need to customise this script. For example, if the utility reads the credentials file multiple times then the named pipes method will fail after the first read. Your options would be to write the credentials into the pipe multiple times (possibly a bit hacky), or to land the plaintext credentials on disk and delete them after the utility completes (which could be less secure if the delete doesn’t get invoked).

Using a secure credentials file for command line arguments
Whilst the sticking point that triggered this article was around utilities requiring whole files with credentials in, it is also common to see command line utilities that want a password passed as an argument to them. For example, nqcmd :
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s myscript.lsql
Let’s assume we’ve created an encrypted file containing “Password01” (using the gpg --encrypt method shown above) and saved it as password.gpg.
To invoke the utility and pass across the decrypted password, there’s no need for named pipes. Instead we can just use a normal (“unnamed”) pipe to send the output straight from gpg to the target utility (nqcmd in this example), via xargs:
gpg --batch --quiet --recipient obiee --decrypt ~/password.gpg | xargs -I GPGOUT nqcmd -d AnalyticsWeb -u weblogic -p GPGOUT -s input.lsql
xargs has a --interactive option that makes it a lot easier when developing piped commands such as the above.
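If you’d rather avoid xargs, plain command substitution achieves the same thing. Note that either way the decrypted password is briefly visible in the process list while nqcmd runs, so this is a convenience, not extra security; the wrapper name below is my own:

```shell
# run_nqcmd: decrypt the stored password and hand it to nqcmd on the fly,
# so no plaintext password ever sits in the script or on disk.
run_nqcmd() {
  pw="$(gpg --batch --quiet --recipient obiee --decrypt ~/password.gpg)"
  nqcmd -d AnalyticsWeb -u weblogic -p "$pw" -s input.lsql
}
```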
Because there is no passphrase on the gpg key, a user who obtained access to the server would still be able to decrypt the credentials file. In many ways this is the same situation that would arise if a server was configured to use ssh-key authentication to carry out tasks or transfer files on another server.

Uses
Here are some of the utilities that the above now enables us to run more securely:
- nqcmd is a mainstay of my OBIEE toolkit, being useful for performance testing, regression testing, aggregate building, and more. Using the method above, it’s now easy to avoid storing a plaintext password in a script that calls it.
- Keeping the Presentation Catalog in sync on an OBIEE warm standby server, using Presentation Services Replication
- Purging the Presentation Services Cache from the command line (with Catalog Manager, per the above example)
- SampleApp comes with four excellent utilities that Oracle have provided; however, all but one of them requires plaintext credentials by default. If you’ve not looked at the utilities closely yet, you should! You can see them in action in SampleApp itself, or get an idea of what they do by looking at the SampleApp User Guide pages 14–17 or watching the YouTube video.