BI & Warehousing
My new article at Oracle Magazine is on Oracle BI Mobile App Designer, the new HTML5-based mobile BI tool for OBIEE 11.1.1.7.1 built on Oracle BI Publisher technology. In the article, I walk the reader through creating a simple Mobile App Designer app, then publishing it to the Apps Library for use with iOS, Android, BlackBerry and other HTML5-compatible mobile devices.
You can also read my “first look” post on BI Mobile App Designer from our blog, written when the feature first came out, and we’re also running a promotion where we’ll implement your first Mobile App Designer app within a week, including patching your OBIEE 11.1.1.7 installation up to the required 11.1.1.7.1 version. More details on the offer, and on BI Mobile App Designer in general, are on this QuickStart Mobile Analytic Apps for OBIEE 11g with Rittman Mead data sheet.
This is a great overview of the Financial Statement Generator (FSG) engine from GL in EBS and how Publisher fits into the picture. Thanks to Helle Hellings on the Financials PM team.
In a few weeks’ time I’m presenting at the BIWA Summit 2014 on running OBIEE in the Cloud, with the idea being that I’ll go through the various options for deploying OBIEE in public clouds, including Amazon EC2 and Oracle’s own cloud offering. But the more I thought through the various architecture and product options, the more I began to think that ETL – getting data into the BI system, from wherever it is now – is the key enabling technology. The vast majority of BI systems use data that’s refreshed on a daily, weekly or even real-time basis, and on-premise we tend to use tools such as Oracle Data Integrator, Oracle Warehouse Builder, Informatica PowerCenter and the like to move data from source to target, transforming and cleansing it en route.
When you deploy a BI platform such as OBIEE into the cloud, you’ve got a couple of options around where you hold its data, and how you do the ETL:
- You can keep the data warehouse database on-premise, along with the ETL tool, and just run the BI part in the cloud
- You can move the database into the cloud as well (using a service such as Amazon RDS), but keep the ETL tool on-premise
- You can try and move the ETL into the cloud too – either hosting a tool like ODI in its own cloud VM, or make use of ETL-as-a-service such as that provided by Informatica Cloud.
Things get more interesting when part of your IT infrastructure sits on-premise, and some sits in the cloud. It gets even more interesting when your data sources are also software-as-a-service (SaaS), for example Salesforce.com and Workday, where data access is via APIs rather than direct SQL*Net connections.
So where do we start with ETL-in-the-cloud? One place to start is Oracle’s own slide-deck on OBIEE in the cloud, where they set out what they see as the key differences between traditional on-premise software and software delivered via SaaS:
In this SaaS world, the expectation is that:
- The infrastructure, software and so on is already in place, is patched regularly for you in the background, and new features emerge regularly
- You’re on a shared platform, but you won’t be aware of other tenants on a day-to-day basis
- The “innards” and workings are hidden from you – you just consume a service
- It’s cloud-native – from connectivity to cloud apps through to common social sign-ons, commenting and collaboration
- Everything is thin-client, web-based
- Everything is self-service, self-provisioning, right from creation of your user account
- Sign-up is on-demand, paid by the hour/day etc, and with no long setup process
What this means for data integration (and ODI) then, in my opinion, is:
- It must be available as a service – log in, create data source connections, define mappings, run code
- The internals typically would be hidden – so no need to know about agents, GoldenGate vs. ODI, choices of knowledge modules etc
- Web-based data flow modellers as well as admin tools
- Connectors through to Salesforce.com and other cloud apps
- Connectors through to Hadoop, Hive, JSON compatibility etc
- Instant sign-up and charging by time/use
With bonus points for:
- Low-cost or free option to get you started
- A consistent story that goes through from messaging, application integration through to large-scale data movement
- Easy clustering, scale-out, deployment on services such as Amazon EC2
- Options to deploy on-premise or cloud, and run integration tasks on-premise or cloud
Looking at it architecturally, ETL-in-the-cloud would sit between sources and targets, both on-premise and in the cloud, providing the vital data movement capability between source systems and the BI Platform itself – most probably to its own database-in-the-cloud, running the data warehouse.
So what are the options then, if you want to use cloud-based ETL to load a cloud-based data warehouse and BI system? To my mind, there are four main options:
- If you’re using a cloud-based database service such as Amazon’s Redshift, or Oracle’s own public cloud “schema-as-a-service” database, you can use the ETL tools provided with the service
- You can try and put ODI in the cloud, maybe using an Amazon EC2 instance running WebLogic and a JEE agent, with another instance providing the metadata repository database
- or as an alternative, do the same for one of the open-source ETL tools
- You can use one of the new “cloud-native” ETL-as-a-service products, such as Informatica Cloud or SnapLogic
- Or – you can try and abstract away ETL altogether – more on this later on.
The first option really applies if (a) you’re using a service such as Amazon Web Services’ EC2, (b) your data also most probably sits in AWS cloud storage, (c) you want to move data from your source application, or an export from it, into the main data warehouse database, and (d) you don’t really need to integrate data from different sources. Amazon AWS provides a number of options for loading data into EC2 and the various database and analytic services they provide, including Amazon Data Pipeline (shown in the screenshot below), the most “ETL-like” of their loading services, along with a sneakernet service, and the new Amazon Kinesis service for real-time streaming data.
Oracle’s Cloud Database service at the moment restricts you to a single schema, and more importantly there’s no SQL*Net access, so uploading data is either through SQL Developer (which has a special, custom Oracle Cloud connector), or through utilities provided via APEX.
Clearly this sort of web-based data loading isn’t designed for data warehouse scenarios, and in this initial iteration Oracle’s cloud database is probably designed to support cloud applications running on Oracle’s cloud Java service, with OBIEE-in-the-cloud mainly designed for reporting against this data, and for personal “sandbox” scenarios. What’s not immediately obvious when you use these cloud ETL tools is that each one has its own admin console, its own API, its own scheduler and its own metadata – which is usually the point at which you decide you need an enterprise ETL tool.
So what about the idea of running Oracle Data Integrator in the cloud? There are a few options here that I can think of:
- Creating a cloud server instance that runs ODI, along with all of its metadata repositories, WebLogic instances and agents in a single location
- Run the repository database either on-premise or on another cloud server (or even Amazon’s RDS service), with agents running either in their own cloud servers, or on the cloud servers holding the target database
- Build ODI into the target platform, as Oracle have done with BI Apps 11.1.1.7.1, along with management tools to hide the complexity and workings of ODI
The first option sounds the “neatest” in terms of a wholly-cloud deployment, as all the parts of ODI are held in one place, and you can think of it as an appliance or service. But ODI’s Extract-Load-Transform (E-LT) approach ends up complicating things in a cloud-only deployment; the target platform (for example one of the Postgres-compatible cloud databases) might not support all of the target-layer transformations you want, and the integration part can’t really sit in the ODI cloud instance, unless you run it hub-and-spoke style and either use a local database for transformations, or use the in-memory feature within the agents. In fact, where ODI makes most sense is in a hybrid on-premise/cloud setup, where most of your data resides on-premise, as does your ETL process, with the cloud being gradually migrated to and, in the meantime, used alongside on-premise applications.
Oracle’s white paper on cloud data integration majors on this type of scenario, with Oracle GoldenGate also used to replicate data between the two environments. At this year’s Oracle OpenWorld the BI Apps team also announced cloud adapters for the BI Applications, initially used to extract data from Fusion Apps in the Cloud back to the on-premise BI Apps data warehouse, with other Oracle cloud application data sources following on.
Where things get interesting with Oracle’s approach is when it’s non-Oracle cloud applications that we’re looking to integrate with. “Cloud Integration – A Comprehensive Solution”, another Oracle white paper, describes techniques to integrate with Salesforce.com, Workday and the Fusion Apps, but all of this takes place through Oracle SOA Suite and Oracle JDeveloper, products that are way too technical for the average ETL developer (or customer implementor).
In fact there are two obvious things missing here that are present in the Cloud Database and Cloud OBIEE offerings Oracle recently launched:
- It’s not a “service” – you can’t just create an account, design some mappings, load a target – the assumption is that each system is single tenant, perpetual license, self-install
- It’s not consumer-level in terms of simplicity – there are bits of ODI, bits of GoldenGate, bits of SOA, bits of JDeveloper – all good bits, but not a single integration-platform-as-a-service
But it’s probably fine if you’re running a hybrid on-premise/cloud strategy, and as I’ll talk about later on, the ELT approach does still have some distinct advantages over cloud-native ETL tools.
What about the more recent, “cloud-native” ETL tools such as SnapLogic, and Informatica Cloud? Informatica Cloud is probably the easiest product to understand if, like me, you’re from an ODI background. What Informatica have done here in terms of architecture is move the metadata and application-layer parts of Informatica PowerCenter into the cloud, add a bunch of cloud application adapters (including a third-party marketplace for them), but still do the actual data integration on-premise, using a cut-down version of the main PowerCenter engine.
Some of the nicer features in this setup are its multi-tenant and “integration-as-a-service” nature, the way it deals with firewall issues (do the integration on-premise), and the interoperability with traditional Informatica PowerCenter, where you can publish custom maplets from on-premise Informatica and push them into the cloud version. If Oracle came out with an ODI-in-the-cloud service, I think it’d look a lot like this.
To my mind though, the most impressive of the cloud integration vendors is SnapLogic. Their SnapLogic Integration Cloud product looks like it was designed first and foremost to run “in the cloud”, it’s available as a service, and the data integration designer focuses on data paths and application integration rather than the low-level, database-centric approach that traditional ETL tools take.
What’s particularly impressive to me is the way that they’ve taken concepts used in tools like ODI – agents that perform the data integration tasks, running as standalone servers – and built on them to create the concept of “snaplexes”: collections of JVMs that can sit in the cloud or on-premise, elastically scale up to handle larger workloads, and use Amazon’s EC2 and S3 compute and storage clouds to perform any data transformations en route. Data being transformed in SnapLogic streams through the system with low latency and using web protocols, and the whole thing is just neatly designed to run natively on the cloud.
Where tools like SnapLogic do fall short, though, is on the “last mile” into the data warehouse: tools like ODI come with lots of templates, database integration features and so forth to handle slowly-changing dimensions, dimension lookups, analytic functions and the like, whereas cloud-native tools really focus on basic transformations and getting the data from source to target. This gets interesting when SnapLogic is used to load databases such as Amazon Redshift, for example in combination with Tableau, as some of the ETL tool features we’re used to with tools such as ODI and Informatica just aren’t there yet. What this means in practice is that if you’re looking to move OBIEE into the cloud, along with an Oracle database, you’ll probably still want to use a tool like ODI to do your data load, as these tools are just so mature when it comes to loading relational data warehouses.
So finally, then, onto what I probably think will be the future of ETL – “invisible ETL”, or ETL so abstracted away that you don’t even think of it as a separate product in itself. If you think about it, ETL is a necessary evil when it comes to BI – customers just want their data loaded, they don’t want to worry about how it’s transported, which data loading tool you use and so on. Oracle have had a number of initiatives over the past few years to automate the creation of ETL code from logical table source mappings in the OBIEE RPD, and one of the major drivers in the BI Apps product roadmap is to reduce the amount of manual work propagating application data model changes through the various layers of ETL and BI tool metadata.
A really nice example of this approach of hiding the complexity of ETL can be found with a company called GoodData, who sell a BI-in-the-cloud product that comes with data storage (via Vertica) and ETL all wrapped up in a consumer-like service that’s focused on analytics and visualisations but supports data loading from a number of database and cloud application sources. The screenshot below from the GoodData CloudConnect LDM Modeler Guide shows GoodData’s logical data model development environment, with this part of the product handling the facts, dimensions, attributes and other user-facing data elements, and a more traditional ETL tool and server (based on CloverETL, no less) doing the actual data heavy-lifting.
GoodData splits its data models into these logical and physical elements, which of course is exactly what ODI does – and what OBIEE does too. In fact, the more you look at GoodData, the more you think that all of the elements are already in place at Oracle to create a very similar product, with the added benefit of access to tools such as GoldenGate, EDQ and SOA Suite. Even SnapLogic’s Snaplex concept is conceptually very similar to ODI’s agents, but what they’ve done, and what GoodData, and Informatica and others have done, is wrap the products up into a service, made it all consistent (at least on paper), architected it for cloud-only, and hybrid on-premise/cloud environments, and built-out all of the third-party adapters. It’ll be interesting to see what Oracle’s long-term response to this will be – honestly I’ve no special insight into this part of the product roadmap so I’m just as interested as anyone as to how it will turn out.
So – is any approach better, is Oracle’s approach the worst, are vendors like SnapLogic the future, and what should we use if deploying OBIEE in the cloud? Well to my mind there’s no black-and-white answer, and the choice comes down to a number of factors including:
- Where are your sources and targets; the more that still reside on-premise, the more a traditional on-premise tool such as ODI makes sense
- To what extent are you interacting with non-Oracle targets and sources; the more non-Oracle ones you’re using, the more a non-Oracle tool will probably make sense
- How complex will your end data transformations be – if you’re expecting to do lots of SCD2 transformations, analytic queries, clever DW-stuff, a tool like ODI will make sense
- How much upfront investment do you want to make? Services like Informatica Cloud are far easier to get provisioned on, and involve far less up-front license costs, than an on-premise install
- Or are you just loading data from a single source into a managed cloud database – if so, one of the vendor-supplied utilities will probably do the job
Anyway – I’d be interested in others’ opinions, and whether I’ve missed anything out. But for now, those were some thoughts from me on running ETL “in the cloud”.
I will be in Seattle on Thursday, January 9th for the Meet the Oracle ACE Directors Panel. It is at the Sheraton Seattle from 4 - 6 pm and will feature several other ACE Directors including Martin D'Souza, Kellyn Pot'Vin, Tim Gorman, and my longtime friend and collaborator, Cameron Lackpour.
Come see the panel and stay for the Happy Hour; the beer will be on me!
As well as offering consulting and training around Oracle’s BI, DW and EPM products, Rittman Mead also provide managed services and support around these products for customers around the world. We’re just about to move over to a new service desk system based on Atlassian Jira, and as part of our final “soak test” we’re going to open up the service publicly for the next 24 hours, taking on public OBIEE support requests on a “best endeavour” basis.
From mid-day today (Wednesday, 18th December 2013, UK time) through to mid-day tomorrow, you can register here and then raise a support ticket, and we’ll endeavour to answer your query over the next few days. As you’d expect, there are a few conditions: you must register with your work email address, and not one from Gmail, Yahoo, Outlook.com etc, and whilst we’ll try to answer your query, there are no promises and we may have to limit replies and accepted tickets. Full details of the offer, and of our global managed services, are on our support page here, including terms and conditions, how to register and how to raise a ticket.
UPDATE: Thanks for your interest in our support system. We have now closed registration…
Keep checking back here for future support offerings! Thanks and have a great Christmas and New Year.
It’s that time of year again when we start planning next year’s BI Forum, which like this year’s event will be running in May 2014 in Brighton and Atlanta. This will be our sixth annual event, and as with previous years the most important part is the content – and as such I’m pleased to announce that the Call for Papers for BI Forum 2014 is now open, running through to January 31st 2014.
If you’ve not been to one of our BI Forum events in past years, the Rittman Mead BI Forum is all about Oracle Business Intelligence, and the technologies and techniques that surround it – data warehousing, data analysis, big data, unstructured data analysis, OLAP analysis and this year – in-memory analytics. Each year we select around ten speakers for Brighton, and ten for Atlanta, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead and Stewart Bryson.
Last year we had sessions on OBIEE internals and new features, OBIEE visualisations and data analysis, OBIEE and “big data”, along with sessions on Endeca, Exalytics, Exadata, Essbase and anything else that starts with an “E”. This year we’re continuing the theme, but are particularly looking for sessions on what’s hot this year and next – integration with unstructured and big data sources, use of engineered systems and in-memory analysis, advanced and innovative data visualisations, cloud deployment and analytics, and anything that “pushes the envelope” around Oracle BI, data warehousing and analytics.
The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues. We’re also looking for presenters for ten-minute “TED”-style sessions, and if you have any ideas for keynote speakers, send them directly to me at firstname.lastname@example.org. Other than that – have a think about abstract ideas now, and make sure you get them in by January 31st 2014.
But first, a mea culpa:
In 2008, I Was An Idiot

Back in early 2008, I wrote a blog entry comparing Collaborate, Kaleidoscope, and OpenWorld. In this entry, I said that Collaborate was the obvious successor to the Hyperion Solutions conference and I wasn't terribly nice to Kaleidoscope. Here's me answering which of the three conferences I think the Hyperion community should attend (I dare you to hold in the laughter):
Now which one would I attend if I could only go to one? Collaborate. Without reservation. If I'm going to a conference, it's primarily to learn. As such, content is key.

I actually got asked a very similar question on Network 54's Essbase discussion board just yesterday (apparently, it's a popular question these days). To parrot what I said there, OpenWorld was very, very marketing-oriented. 80% of the fewer than 100 presentations in the Hyperion track were delivered by Oracle (in some cases, with clients/partners as co-speakers). COLLABORATE is supposed to have 100-150 presentations with 100+ of those delivered by clients and partners.

In the interest of full disclosure, my company, interRel, is paying to be a 4-star partner of COLLABORATE. Why? Because we're hoping that COLLABORATE becomes the successor to the Solutions conference. Solutions was a great opportunity to learn (partying was always secondary) and I refuse to believe it's dead with nothing to take its mantle. We're investing a great deal of money with the assumption that something has to take the place of the Hyperion Solutions conference, and it certainly isn't OpenWorld.

Is OpenWorld completely bad? Absolutely not. In addition to the great bribes, it's a much larger conference than COLLABORATE or ODTUG's Kaleidoscope, so if your thing is networking, by all means, go to OpenWorld. OpenWorld is the best place to get the official Oracle party line on upcoming releases and what not. OpenWorld is also the place to hear better keynotes (well, at least by More Famous People like Larry Ellison, himself). OpenWorld has better parties too. OpenWorld is also in San Francisco which is just a generally cooler town. In short, OpenWorld was very well organized, but since it's being put on by Oracle, it's about them getting out their message to their existing and prospective client base.

So why aren't I recommending Kaleidoscope (since I haven't been to that either)? Size, mostly. Their entire conference will have around 100 presentations, so their Hyperion track will most likely be fewer than 10 presentations. I've been to regional Hyperion User Group meetings that have more than that (well, the one interRel hosted in August of 2007 had 9, but close enough). While Kaleidoscope may one day grow their Hyperion track, it's going to be a long time until they equal the 100-150 presentations that COLLABORATE is supposed to have on Hyperion alone.

If you're only going to one Hyperion-oriented conference this year, register for COLLABORATE. If you've got money in the budget for two conferences, also go to OpenWorld. If you're a developer that finds both COLLABORATE and OpenWorld to be too much high-level fluff, then go to Kaleidoscope.
In 2008, Mike Riley Luckily Wasn't An Idiot
It’s true that ODTUG is a smaller conference, however that is by choice. At every ODTUG conference, the majority of the content is by a user, not by Oracle or even another vendor. And even though Collaborate might seem like the better buy because of its scale, for developers and true technologists ODTUG offers a much more targeted and efficient conference experience. Relevant tracks in your experience level are typically consecutive, rather than side-by-side so you don’t miss sessions you want to attend. The networking is also one of the most valuable pieces. The people that come to ODTUG are the doers, so everyone you meet will be a valuable contact in the future.
It’s true, COLLABORATE will have many presentations with a number of those delivered by clients and partners, but what difference does that make? You can’t attend all of them. ODTUG’s Kaleidoscope will have 17 Hyperion sessions that are all technical.
In the interest of full disclosure, I have been a member of ODTUG for eight years and this is my second year as a board member. What attracted me to ODTUG from the start was the quality of the content delivered, and the networking opportunities. This remains true today.
I won’t censor or disparage any of the other conferences. We are lucky to have so many choices available to us. My personal choice and my highest recommendation goes to Kaleidoscope for all the reasons I mentioned above (and I have attended all three of the above mentioned conferences).
One last thing; New Orleans holds its own against San Francisco or Denver. All of the cities are wonderful, but when it comes to food, fun, and great entertainment there’s nothing like the Big Easy.

Mike was only in his second year as a board member of ODTUG, but he was willing to put himself out there, so I wrote him an e-mail back. In that e-mail, dated February 10, 2008, I said that for Kaleidoscope to become a conference that Hyperion users would love, it would require a few key components: keynote(s) by headliner(s), panels of experts, high-quality presentations, a narrow focus that wasn't all things to all people, and a critical mass of attendees.
Mike Helped Us, Let's Help Him
Note that the file was saved in the Excel 2007 and higher format (with an .xlsx file extension). Did you know that the xlsx format is really just a specialized zip file? Seriously. It is a zip file containing various files that are primarily in xml format. I saved the workbook, added the .zip extension to the filename, and opened it in WinRar. Here is what I found.
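If you'd rather poke around programmatically than rename the file and open it in WinRar, a few lines of Python will list the same package contents; the file name here is just a placeholder.

```python
# List the parts inside an .xlsx package - it is simply a zip archive of XML files.
# "workbook.xlsx" is a placeholder for whatever workbook you saved from Smart View.
import zipfile

with zipfile.ZipFile("workbook.xlsx") as xlsx:
    for name in xlsx.namelist():
        print(name)   # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml, xl/sharedStrings.xml
```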
I opened the xl folder to find a series of files and folders.
Next, I opened the worksheets folder to see what was in there. Of course, it is a directory of xml files containing the contents of the worksheets.
My Essbase retrieves were on the sheet named Sheet1, so let’s take a look at what is in the sheet1.xml file. The xml is quite large, so I can’t show all of it here, but needless to say, there is a bunch of information in the file. The cell contents are only one of the things in the file. Here is an excerpt that shows the contents of row 5 of the spreadsheet.
This is interesting as it shows the numbers but not the member name. What is the deal with that? I noticed there is an attribute, ‘t’, on that node. I am guessing that the attribute t=”s” means the cell type is a string. I had noticed that in one of the zip file screenshots, there was a file named sharedStrings.xml. Hmm... I took a look at that file and guess what I found?
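If you want to check this in code rather than by eyeballing the XML, a short sketch along these lines resolves string-typed cells against sharedStrings.xml. File and sheet names are placeholders, and it assumes simple shared-string items with a single text element rather than rich-text runs.

```python
# Resolve cells whose type attribute is t="s" (shared string) back to their text.
import zipfile
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

with zipfile.ZipFile("workbook.xlsx") as xlsx:
    shared = ET.fromstring(xlsx.read("xl/sharedStrings.xml"))
    sheet = ET.fromstring(xlsx.read("xl/worksheets/sheet1.xml"))

# Zero-based list of shared strings.
strings = [si.findtext("m:t", default="", namespaces=NS) for si in shared.findall("m:si", NS)]

# Print each string-typed cell with its resolved text, e.g. "B5 -> Profit".
for cell in sheet.iter("{http://schemas.openxmlformats.org/spreadsheetml/2006/main}c"):
    if cell.get("t") == "s":
        idx = int(cell.findtext("m:v", namespaces=NS))
        print(cell.get("r"), "->", strings[idx])
```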
That’s right! The 5th item, assuming you start counting at zero like all good programmers do, is Profit. That number corresponds perfectly with the value specified in the xml for cell B5, which was five (circled in blue in the xml file above). OK, so when are we going to get to Smart View stuff? The answer is pretty quick. I continued looking at sheet1.xml and found these nodes near the bottom.
Hmm, custom properties that contain the name Hyperion? Bingo! There were a number of custom property files in the xml file. Let’s focus on those.
Custom property #1 is identified by the name CellIDs. The corresponding file, customProperty1.bin, contained only the empty xml node <root />. Apparently there aren’t any CellIDs in this workbook.
Custom property #2 is identified by the name ConnName. The file customProperty2.bin contains the string ‘Sample Basic’ which is the name of my connection.
Custom property #3 is named ConnPOV but it appears to contain the connection details in xml format. Here is an excerpt of the xml.
Custom property #4 is named HyperionPOVXML and the corresponding file contains xml which lines up with the page fields I have in my worksheet.
What is interesting about the POV xml is that I have two different retrieves that both have working POV selectors which are both implemented as list-type data validations in Excel. I don’t know what happens internally if I save different values for the POV.
Custom property #5 is labeled HyperionXML. It appears to contain the information about the Essbase retrieval, but it doesn't appear to be the actual retrieval xml because it doesn't contain the numeric data. My guess is that this xml is used to track what is on the worksheet from a Hyperion standpoint.
There is a lot of information in this simple xml stream, but the most interesting information is contained in the slice element. Below is a close-up of contents in the slice.
The slice covers 6 rows and 7 columns for a total of 42 cells. It is interesting that the Smart View team chose to serialize their XML in this manner for a couple of reasons. First, the pipe delimited format means that every cell must be represented regardless of whether it has a value or not. This really isn’t too much of a problem unless your spreadsheet range is pretty sparse. The second thing about this format is that the xml itself is easy and fast to parse, but the resulting strings need to be parsed again to be usable. For example, the vals node will get split into an array containing 42 elements. The code must then loop the 42 elements and process them individually. The other nodes, such as the status, contain other pieces of information about the grid. The status codes appear to be cell attributes returned by Essbase; these attributes are used to apply formatting to cells in the same way the Excel add-in UseStyles would apply formatting. There are a couple of things to take away:
- In addition to the data on the worksheet itself, there is potentially *a lot* of information stored under the covers in a Smart View file.
- String parsing is a computation-intensive operation and can hurt performance. Multiply that workload by 8 because, depending on the operation and perhaps the provider, all 8 xml nodes above may need to be parsed.
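To make that double parsing concrete, here is a rough illustration, not Oracle's actual code, of what splitting one pipe-delimited node into the 42 declared cells looks like; the vals string is invented for the example.

```python
# The slice declares a 6 x 7 grid, so the vals node must carry 42 entries,
# blank cells included. Splitting is cheap; the second pass over every entry
# (repeated for each of the 8 nodes) is where the work adds up.
rows, cols = 6, 7

# An invented string in the same style: 41 pipes delimit 42 cell slots.
vals = "|" * 10 + "Profit|100|110|105|98|102" + "|" * 26

cells = vals.split("|")
assert len(cells) == rows * cols

for i, value in enumerate(cells):      # every slot gets visited, value or not
    r, c = divmod(i, cols)
    # ...apply the value (and the matching status/format entry) to cell (r, c)...
```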
In addition, the number of rows and columns shown in the slice may be important when you are looking at performance. Smart View must look at the worksheet to determine the size of the range to read in order to send it to Essbase. In the case of a non-multi-grid retrieve, the range may not be known and, as a result, the grid may be sized based on the UsedRange of the worksheet. In our work with Dodeca, we have found that workbooks converted from the older xls format to the newer xlsx format, which supports a larger number of cells, may have the UsedRange flagged internally as 65,536 rows by 256 columns. One culprit appears to be formatting applied to the sheet in a haphazard fashion. In Dodeca, this resulted in a minor issue that caused a larger memory allocation on the server. Based on the format of the Smart View xml, as compared to the more efficient design of the Dodeca xml format, if this were to happen in Smart View it could cause a larger issue due to the number of cells that would need to be parsed and processed. Disclaimer: I did not attempt to replicate this issue in Smart View; this is an educated guess based on my experience with spreadsheet behavior.
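One way to spot this kind of used-range bloat without opening Excel is to read the dimension element near the top of each sheet's XML, which records the range the sheet considers used; a converted sheet reporting something like A1:IV65536 would be a warning sign. Again, the file name is a placeholder.

```python
# Print the declared dimension (used range) of every worksheet part in the package.
import re
import zipfile

with zipfile.ZipFile("workbook.xlsx") as xlsx:
    for name in xlsx.namelist():
        if name.startswith("xl/worksheets/") and name.endswith(".xml"):
            xml = xlsx.read(name).decode("utf-8")
            match = re.search(r'<dimension ref="([^"]+)"', xml)
            print(name, "->", match.group(1) if match else "no dimension element")
```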
Note: The Dodeca xml format does not need to contain information for cells that are blank. This format reduces the size and the processing cycles necessary to complete the task. In addition, when we originally designed Dodeca, we tested a format similar to the one used today by Smart View and found it to be slower and less efficient.
Considering all of this information, I believe the xml format would be difficult for the Smart View team to change at this point, as it would cause compatibility issues with previously created workbooks. Further, this discussion should give some visibility to the fact that the Smart View team faces an ongoing challenge to maintain compatibility between the different Smart View versions distributed on desktops and the different versions of the internal formats that customers may have stored in their existing Excel files. I don’t envy their job there.
After looking at all of this, I was curious to see what the xml string would look like on a large retrieve, so I opened up Smart View, connected to Sample Basic and drilled to the bottom of the 4 largest dimensions. The resulting sheet contained nearly 159,000 rows of data. Interestingly enough, when I looked at the contents of customProperty5.bin inside that xlsx file, the contents were compressed. It struck me as a bit strange at first, as the xlsx file format is already compressed, but after thinking about it for a minute it makes sense: the old xls file format probably did not automatically compress content, so the compression was there primarily to shrink the content when saved in the xls file format.
Custom property #6 is labeled NameConnectionMap. The corresponding property file contains xml that appears to map the range names in the workbook to the actual grid and the connection.
Custom property #7 is labeled POVPosition. The file customProperty7.bin contains the number 4 followed by a NUL character. Frankly, I have no idea what position 4 means.
Moving on to custom property #8 which is labeled SheetHasParityContent. This file contains the number 1 followed by a NUL character. This is obviously a boolean flag that tells the Smart View code that new features, such as support for multiple grids, are present in this file.
Custom property #9 is labeled SheetOptions. The corresponding file, customProperty9.bin, contains an xml stream that (obviously) contains the Hyperion options for the sheet.
Custom property #10 is labeled ShowPOV and appears to contain a simple Boolean flag much like that in custom property #8.
Finally, custom property #11 is labeled USER_FORMATTING and may not be related to Smart View.
I did look through some of the other files in the .zip and found a few other references to Smart View, but I did not see anything significant.
So, now that we have completed an overview of what is contained in one, very simple, multi-grid file, what have we learned?
- There is a bunch of stuff stored under the covers when you save a Smart View retrieve as an Excel file.
- With the reported performance issues in certain situations with Smart View, you should now have an idea of where to look to resolve Smart View issues in your environment.
There are a number of files I did not cover in this overview that could also cause performance issues. For example, Oracle support handled one case where they found over 60,000 Excel styles in the file. Smart View uses Excel Styles when it applies automatic formatting to member and data cells. When there are that many styles in the workbook, however, it is logical that Excel would have a lot of overhead searching through its internal list of Style objects to find the right one. Accordingly, there is a styles.xml file that contains custom styles. If you have a bunch of Style objects, you could delete the internal styles.xml file.
Note: Be sure to make a copy of your original workbook before you mess with the internal structures. There is a possibility that you may mess it up and lose everything you have in the workbook. Further, Oracle does not support people going under-the-covers and messing with the workbook, so don’t even bring it up to support if you mess something up.
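With that warning in mind, a cautious way to try the styles.xml experiment described above is to write a brand-new copy of the package that simply omits that part, leaving the original untouched. Whether the resulting workbook still opens cleanly is something you would need to test yourself; this is strictly an unsupported experiment, and the file names are placeholders.

```python
# Copy an .xlsx package, leaving out xl/styles.xml. Unsupported and experimental:
# always work on a copy, never the original workbook.
import zipfile

SRC, DST = "workbook.xlsx", "workbook_nostyles.xlsx"

with zipfile.ZipFile(SRC) as src, \
     zipfile.ZipFile(DST, "w", zipfile.ZIP_DEFLATED) as dst:
    for item in src.infolist():
        if item.filename != "xl/styles.xml":
            dst.writestr(item, src.read(item.filename))
```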
Wow, that should give you some idea of what may be going on behind the scenes with Smart View. Even with the experience I have designing and writing the Dodeca web services that talk to Essbase, I wouldn't say that I have a deep understanding of how the information in a Smart View workbook really works. However, one thing is for certain; Dodeca does not put stuff like this in your Excel files. It may be interesting to hear what you find when you explore the internals of your workbooks.
This year, Rittman Mead were the Analytics Sponsor for the UKOUG Tech13 conference in Manchester, and those who visited the UKOUG stand during their time there will have noticed the Rittman Mead-sponsored Analytics Dashboard on display. In this blog post I will cover how it was put together, the “Blue Peter Way”!
For those of you not familiar with Blue Peter, it is a long-running children’s TV show here in the UK that started way back in 1958; possibly its most infamous moment came in 1969 with Lulu the defecating elephant. As a child, the highlight of the show was the “How to make” something section, where they would always use the phrase “sticky-backed plastic” instead of “Sellotape” due to a policy against using commercial terms on air. The presenters would create something from scratch with bits and bobs you could find around the house: cereal boxes, egg cartons, washing-up bottles, Sellotape etc ( sorry, “sticky-backed plastic” ). That’s exactly what I’m going to be doing here, but instead of making a sock monster I’m going to show you how the analytics dashboard was created from scratch with bits you can find on the internet. So kids, without further delay, let’s begin…
(remember to ask your parents permission before downloading any of the following items)
You will need:
- A Linux server.
- A Web server
- The Redis key/value store
- A DataSift account
- The Webdis HTTP server
- The tagcanvas.js jQuery plugin
- The vTicker jQuery plugin
- The flot plotting library
- Some Sticky-backed-Plastic (jQuery)
- Lots of coffee
The Linux server I’m using is Red Hat Enterprise Server 6.4 and the very first thing we’ll need to do is install a web server.
Done. The Web server is now installed, configured to start on server-boot and up and running ready to service our http requests.
Next up is the Redis key/value datastore. I’ll be using Redis to store all incoming tweets from our datasource ( more on that in a bit )
Now that we have Redis up and running, let’s perform a couple of tests: first a benchmark using the redis-benchmark command
This command throws out a lot more output than this, but it is the LRANGE command we are particularly interested in, as we’ll be using it to retrieve our tweets later on. 3,593 requests per second seems reasonable to me; there were around 1,000 registrations for the UKOUG Tech13 conference, and the likelihood of each of them making 3 concurrent dashboard requests all within the same second is slim to say the least – regardless, I’m willing to live with the risk. Now for a quick SET/GET test using the redis-cli command.
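The post used redis-benchmark and redis-cli directly; if you prefer to sanity-check the same commands from Python, the redis-py client does the job. The host, port and key names below are just examples, assuming Redis is on its default port on the same box.

```python
# Quick SET/GET and LRANGE sanity check against a local Redis on the default port.
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("test:key", "hello ukoug")
print(r.get("test:key"))                      # b'hello ukoug'

r.rpush("tweets:test", '{"content": "dummy tweet"}')
print(r.lrange("tweets:test", 0, 100000))     # same LRANGE pattern the dashboard uses
```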
Ok so that’s our datastore sorted, but what about our datasource? Clearly we will be using Twitter as our primary data source, but mere mortals like myself don’t have access to Twitter’s entire data stream; for that we need to turn to specialist companies that do. DataSift, amongst other companies like Gnip and Topsy ( which, interestingly, was bought by Apple earlier this week ), can offer this service and will add extra features into the mix such as sentiment analysis and stream subscription services that push data to you. I’ll be using DataSift, as here at Rittman Mead we already have an account with them. The service itself charges you by DPU ( Data Processing Units ), the cost of which depends upon the computational complexity of the stream you’re running; suffice to say that running the stream to feed the analytics dashboard for a few days was pretty cheap.
To set up a stream you simply express what you want included from the Twitter firehose using their own Curated Stream Definition Language, CSDL. The stream used for the dashboard was very simple and the CSDL looked like this :-
This is simply searching for tweets containing the string “ukoug” from the Twitter stream. DataSift supports all kinds of other social data, for a full list head over to their website.
Now that we have our data stream set up, how do we actually populate our Redis data store with it? Simple – get DataSift to push it to us. Using their PUSH API you can set up a subscription in which you can specify the following output parameters: “output_type”, “host”, “port”, “delivery_frequency” and “list”, amongst many others. The output type was set to “Redis”; the host, well, our hostname; and the port set to the port on which Redis is listening, by default 6379. The delivery_frequency was set to 60 seconds and the list is the name of the Redis list key you want the data pushed to. With all that set up and the subscription active, tweets will automagically start arriving in our Redis datastore in JSON format – happy days!
The next step is to get the data from Redis to the browser. I could install PHP and configure it to work with Apache, then install one of the several Redis-PHP libraries and write some backend code to serve data up to the browser, but time is limited and I don’t want to faff around with all that. Besides, this is Blue Peter, not a BBC4 documentary. Here’s where we use the next cool piece of open source software, Webdis. Webdis acts as an HTTP interface to Redis, which means I can make HTTP calls directly from the browser like this :-
Yep, you guessed it, we’ve just set a key in Redis directly from the browser. To retrieve it we simply do this.
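For reference, the same calls can be made from any HTTP client, not just the browser: Webdis maps URLs of the form /COMMAND/arg1/arg2 onto Redis commands and returns JSON. A minimal Python sketch, assuming Webdis is listening on its default port 7379 on localhost:

```python
# SET and GET a key through Webdis over plain HTTP.
import json
import urllib.request

BASE = "http://localhost:7379"

with urllib.request.urlopen(BASE + "/SET/hello/world") as resp:
    print(json.load(resp))                    # acknowledgement of the SET

with urllib.request.urlopen(BASE + "/GET/hello") as resp:
    print(json.load(resp))                    # e.g. {"GET": "world"}
```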
Security can be handled in the Webdis config file by setting IP based rules to stop any Tom, Dick or Harry from modifying your Redis keys. So to install Webdis we do the following :-
And that’s it, we now have tweet-to-browser in 60 seconds and without writing a single line of code! Here’s an overview of the architecture we have :-
Would I do this in a production environment with a larger data set? Probably not; it wouldn’t scale. I’d instead write server-side code to handle the data processing, test performance and push only the required data to the client. The right tools for the right job.
The function getData() is called within the setInterval() function; the second parameter to setInterval() sets the frequency of the call, in this case 30,000 milliseconds (30 seconds). The getData function performs an Ajax GET request on the URL that points to our Webdis server listening on port 7379, which then performs an LRANGE command on the Redis data store; it’s simply asking Redis to return all list items between 0 and 100000, each item being a single tweet. Once the Ajax request has successfully completed, the “success” callback is called, and within this callback I am pushing each tweet into an array so we end up with an array of tweet objects. We now have all the data in a format we can manipulate to our hearts’ content.
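For anyone who would rather see the polling pattern outside the browser, here is a rough Python analogue of getData(); the Webdis address and the Redis list key are placeholders for whatever your DataSift subscription pushes to.

```python
# Poll Webdis for the tweet list and parse each JSON item, mirroring the
# browser-side getData()/setInterval loop described above.
import json
import time
import urllib.request

WEBDIS = "http://localhost:7379"
LIST_KEY = "ukoug.tweets"          # placeholder list key

def get_data():
    url = f"{WEBDIS}/LRANGE/{LIST_KEY}/0/100000"
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    # Webdis wraps the reply under the command name; each list item is a JSON tweet.
    return [json.loads(item) for item in payload.get("LRANGE", [])]

while True:
    tweets = get_data()
    print(f"{len(tweets)} tweets currently held in Redis")
    time.sleep(30)                 # poll every 30 seconds, like the dashboard
```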
Now onto the Graphs and Visualisations.

The Globe
The tweet ticker was built by grabbing the latest 30 tweets from the data array and assigning them to html <li> tags, the vTicker jQuery plugin was then applied to the containing <ul> html tag. Various plugin options allow you to control things like the delay and scroll speed.

Tweet Velocity
This one proved quite popular – these speaker types are a competitive bunch! Having obtained a list of speakers from the UKOUG Tech13 website, I was able to search all the Twitter content for each speaker and aggregate up to get the total Twitter mentions for each speaker; again the graph was rendered using the flot plotting library. As the graph updated during each day, speakers were swapping places, with our own Mark Rittman tweeting out “The Scores on the doors” at regular intervals. When the stats can be manipulated though, there’s always someone willing to take advantage!
tut, tut Mark.

Twitter Avatars
The Twitter avatars used the tagcanvas.js library, but instead of populating it with words from the Twitter content, the tweet avatars were used. A few changes to the plugin options were made to display the results as a horizontally scrolling cylinder instead of a globe.

Twitter Sentiment
The Twitter sentiment graph again used flot. Tweet sentiment was graphed over time for the duration of the conference. The sentiment score was provided by DataSift as part of the Twitter payload. The scores we received across the 3,500 tweets ranged between -15 and 15, with each score reflecting either a positive or negative tweet. Asking a computer to infer a human emotion from 140 characters of text is a tough ask. Having looked at the data in detail, a fair few of the tweets that received negative scores weren’t negative in nature; for example, tweets with the content “RAC attack” and “Dangerous anti patterns” generated a cluster of negative scores. As we know, computers are not as clever as humans: how can they be aware of the context of a tweet, detect sarcasm, or differentiate between banter and plain old insults? Not all the negatively scored tweets were false negatives; some were genuine, and a few regarding the food seemed to ring true.
Perhaps the data you’re analyzing needs to be taken in context as a whole. You’d expect 1,000 techies running around a tech conference to be happy, and the sentiment analysis seemed to do a better job of ranking how positive tweets were than how negative they were. Perhaps, from a visualisation point of view, a logarithmic scale along with a multiplier to raise the lowest score would have worked better in this case to reflect how positive the event was overall. One thing is clear though: further statistical analysis would be needed over a much larger data set to really gain insight into how positive or negative your data set is.
So that’s it kids, I hope you’ve enjoyed this episode of Blue Peter, until next time….
Edit – 6-Dec-2013 19.07:
Mark has asked me the following question over twitter:
“any chance you could let us know why these particular tools / components were chosen?”.
I thought I’d give my answer here.
All the tools are open source, easy to install/configure and well documented, and I had used them all previously – again, a time saver. Redis was chosen for 2 reasons: it was supported as a subscription destination by DataSift ( along with many others ) and I’m currently using it in another development project I’m working on, so I was up to speed with it. Although in this solution I’m not really taking advantage of Redis’s access speed, it worked well as somewhere to persist the data.
If I was coding a more permanent solution with more development time, then I’d add a relational database into the mix and use it to perform all of the data aggregation and analysis; this would make the solution more scalable. I’d then either feed the browser via Ajax calls to the database through backend code, or populate a cache layer from the database and make Ajax calls directly on the cache, similar to what I did in this solution. It would have been nice to use D3 to develop a more elaborate visualisation instead of the canned flot charts, but again this would have taken more time to develop.
Next week Rittman Mead is the analytics sponsor for the UKOUG Tech13 conference up in Manchester.
We’ve got a great line-up of talks, covering everything from BI Apps, Endeca and Hadoop to OBIEE and mobile – Stewart and Charles are both flying in from the US to present alongside Mark, Adam and myself. The full list is as follows:
- Deep dive into Oracle BI Applications using Oracle Data Integrator [Mark Rittman] – Sunday 12.30-2pm
- Leveraging Hadoop / Map Reduce OBIEE 11g and ODI 11g [Mark Rittman] – Sunday 3pm – 4.30pm
- BI Across Any Data Source with OBIEE and Oracle Endeca Discovery [Mark Rittman & Adam Seed] – Monday 5.05-5.50pm
- The Changing World of Business Intelligence [Jon Mead] – Tuesday 11.20 – 12.20
- Oracle BI Multi User Development (MDS XML vs MUDE) [Stewart Bryson] – Tuesday 11.30 -12.30
- Designing for a mobile World using OBIEE [Charles Elliot] – Wednesday 1.45 – 2.45
On Monday night we are teaming up with our friends at Pythian to host some drinks and nibbles at Taps Bar, which is just around the corner from the conference venue. We’ll be there from 6.30-9.00pm, so come and join us for a free drink.

Analytics Sponsor
As part of being the analytics sponsor for the event we are looking to collate as much real time, social media and demographic information about the event as possible. We will be displaying an analytics dashboard in the conference venue detailing statistics from this data. To help us, could you:
- Complete the form here to give us some background information about why you are going; and
- Use the official hashtag ukoug_tech13 when tweeting anything about the event.
It’s looking like it will be a great few days, so look forward to seeing you up there.
More Excel Support

Dodeca has always been strong on Excel version support and this version delivers even more Excel functionality. Internally, we use the SpreadsheetGear control, which does a very good job with Excel compatibility. This version of Dodeca integrates a new version of SpreadsheetGear that now has support for 398 Excel functions including the new SUMIFS, COUNTIFS, and CELL functions.

Excel Page Setup Dialog

The new version of Dodeca includes our implementation of the Excel Page Setup Dialog which makes it easy for users to customize the printing of Dodeca views that are based on Excel templates. Note that for report developers, the Excel Page Setup has also been included in the Dodeca Template Designer.
New PDF View Type

Customers who use PDF files in their environments will like the new PDF View Type. In previous releases of Dodeca, PDF documents displayed in Dodeca opened in an embedded web browser control. Beginning in this version, Dodeca includes a dedicated PDF View type that uses a specialized PDF control.
View Selector Tooltips

Finally, users will like the new View Selector tooltips which optionally display the name and the description of a report as a tooltip.
Performance

Performance is one of those things that users always appreciate, so we have added a new setting that can significantly improve performance in some circumstances. Dodeca has a well-defined set of configuration objects that are stored on the server, and we were even awarded a patent recently for the unique aspects of our metadata design. That being said, depending on how you implement reports and templates, there is the possibility of having many queries issued to the server to check for configuration updates. In a few instances, we saw that optimizing the query traffic could be beneficial, so we have implemented the new CheckForMetadataUpdatesFrequencyPolicy property. This property, which is controlled by the Dodeca administrator, tells Dodeca whether it should check the server for updates before any object is used, as was previously the case, only when a view opens, or only when the Dodeca session begins. We believe the latter case will be very useful when Dodeca is deployed in production, as objects configured in production often do not change during the workday and, thus, network traffic can be optimized using this setting. The screenshot below shows where the administrator can control the update frequency.
Though users will like these features, we have put a lot of new things in for the people who create Dodeca views and those who administer the system. Let’s start with something that we think all Dodeca admins will use frequently.

Metadata Property Search Utility

As our customers continue to expand their use of Dodeca, the number of objects they create in the Dodeca environment continues to grow. In fact, we now have customers who have thousands of different objects that they manage in their Dodeca environments. The Metadata Property Search Utility will help these users tremendously.
This utility allows the administrator to enter a search string and locate every object in our system that contains that string. Once a property is located, there is a hyperlink that will navigate to the given object and automatically select the relevant property. This dialog is modeless, which means you can navigate to any of the located items without closing the dialog.
Note: this version does not search the contents of Excel files in the system.

Essbase Authentication Services

In the past, when administrators wished to use an Essbase Authentication service to validate a login against Essbase and automatically obtain Dodeca roles based on the Essbase user’s group memberships, they had to use an Essbase connection where all users had access to the Essbase application and database. The new ValidateCredentialsOnly property on both of the built-in Essbase Authentication services now flags the service to check login credentials at the server level only, eliminating the need for users to have access to a specific Essbase database.

New Template Designer Tools

Prior to Dodeca 6.x, all template editing was performed directly in Excel. Since that time, however, most template design functionality has been replicated in the Dodeca Template Designer, and we think it is preferable due to the speed and ease of use with which users can update templates stored in the Dodeca repository. We have added a couple of new features to the Template Designer in this version. The first tool is the Group/Ungroup tool that allows designers to easily apply Excel grouping to rows and/or columns within the template. The second new tool is the Freeze/Unfreeze tool that is used to freeze rows and/or columns in place for scrolling.

Parameterized SQL Select Statements

Since we introduced the SQLPassthroughDataSet object in the Dodeca 5.x series, we have always supported the idea of tokenized select statements. In other words, the SQL could be written so that point-of-view selections made by users could be used directly in the select statement. In a related fashion, we introduced the concept of parameterized insert, update, and delete statements in the same version. While parameterized statements are similar in concept to tokenized statements, there is one important distinction under the covers. In Dodeca, parameterized statements are parsed and converted into prepared statements that can be used multiple times, which results in more efficient use of server resources. The parameterized select statement was introduced in this version of Dodeca so that customers using certain databases that cache the prepared statement can realize improved server efficiency on their select statements.

Workbook Script Formula Editor Improvements

We have also been working hard to improve extensibility for developers using Workbook Scripts within Dodeca. In this release, our work focused on the Workbook Script Formula Editor. The first thing we added here is color coding that automatically detects and distinguishes Excel functions, Workbook Script functions, and Dodeca tokens. In the new version, Excel functions are displayed in green, Dodeca functions and parentheses are displayed in blue, and tokens are displayed in ochre. Here is an example.
In addition, we have implemented auto-complete for both Excel and Dodeca functions.
New SQLException Event

Version 6.6 of Dodeca introduces a new SQLException event that provides the ability for application developers to customize the behavior when a SQL exception is encountered.

XCopy Release Directory

Beginning in version 6.6, the Dodeca Framework installation includes a pre-configured directory intended for customers who prefer to distribute their client via XCopy deployment instead of using Microsoft ClickOnce distribution. The XCopy deployment directory is also for use by those customers who use Citrix for deployment.

Mac OS X Release Directory

The Dodeca Framework installation now includes a pre-compiled Dodeca.app deployment for customers who wish to run the Dodeca Smart Client on Mac OS X operating systems. What that means is that Dodeca now runs on a Mac without the need for any special Windows emulators. Dodeca does not require Excel to run on the Mac (nor does it require Excel to run on Windows, for that matter), so you can certainly save your company significant licensing fees by choosing Dodeca for your solution.
In short, you can see we continue to work hard to deliver functionality for Dodeca customers. As always, the Dodeca Release Notes provide detailed explanations of all new and updated Dodeca features. As of today, we have decided to make the Release Notes and other technical documents available for download to non-Dodeca customers. If you are curious about all of the things Dodeca can do, and if you aren't afraid to dig into the details, you can now download our 389-page cumulative Release Notes document from the Dodeca Technical Documents section of our website.
How can you conditionally turn cell borders on and off in Publisher's RTF/XSLFO templates? With a little digging you'll find what appear to be the appropriate attributes to update in your template. You would logically come up with using the various border styling options:
- border-top|bottom|left|right-width
- border-top|bottom|left|right-style
- border-top|bottom|left|right-color
Buuuut, that doesn't work. Updating them individually does not make a difference to the output. I'm not sure why, and I will ask, but for now here's the solution: use the compound border formatter border-top|bottom|left|right. This takes the form border-bottom="0.5pt solid #000000"; you set all three options (width, style and color) at once rather than individually. In a BIP template you use:
<?if:DEPT='Accounting'?>
<?attribute@incontext:border-bottom;'3.0pt solid #000000'?>
<?attribute@incontext:border-top;'3.0pt solid #000000'?>
<?attribute@incontext:border-left;'3.0pt solid #000000'?>
<?attribute@incontext:border-right;'3.0pt solid #000000'?>
<?end if?>
3pt borders are a little excessive, but you get the idea. This approach can be used with the if@row option too, to get the complete row borders to update. If your template will need to run in right-to-left languages, e.g. Arabic or Hebrew, then you will need to use start and end in place of left and right.
For the inquisitive reader, you're maybe wondering: how did this guy know that? And why the heck is this not in the user docs?
Other than my all-knowing BIP guru status ;0) I hit the web for info on XSLFO cell border attributes and then turned to the Template Builder for Word, particularly its export option; I generated the XSLFO output from a test RTF template and took a look at the attributes. Then I started trying stuff out – I'm a hacker and proud of it! As for the user docs, I'll log a request for an update.