BI & Warehousing

Getting Smarter in Renting with Tableau 10

Rittman Mead Consulting - Thu, 2017-06-22 03:26
Preface

Not long ago a friend of mine spent a significant amount of time trying to find a flat to rent, and according to him it wasn't an easy task. It took considerable time and effort to find something that was big enough (but not too big), not too far from his workplace, had the required features and was affordable at the same time. As a specialist in data analysis, I prefer to think of this as a data discovery task (yes, when you have a hammer, everything looks like a nail), so I decided to see whether a data analysis tool could help me understand the rental market better. You've already read the title of this post, so I can't pretend to keep up the intrigue: that tool is Tableau 10.3.

The Data

The friend I mentioned was looking for a flat in Moscow, but that market is probably unfamiliar to most readers, and I'd also have to spend half my time translating everything into English. So for this exercise I took Brighton and Hove data from http://rightmove.co.uk and ended up with a nice JSON Lines file. A JSON Lines file is basically the same JSON we all know, except that the file contains multiple JSON documents delimited by newlines.

{json line #1}
{json line #2}
...
{json line #n}

That could have been a real problem, but luckily Tableau introduced JSON support in version 10.1, which means I don't have to transform my data into a set of flat tables. Thanks to the Tableau developers, we can simply open JSON Lines files without any transformation.

A typical property description looks like this:

[Screenshot: a typical Rightmove property listing]

It has a few major blocks:

  • Property name - 2 bedroom apartment to rent
  • Monthly price - £1,250
  • Description tab:
    • Letting information - this part is more or less standard and has only a small number of possible values. It has a Property name: Property value structure (e.g. 'Date available': 'Now').
    • Key features - this part is an unstandardised set of features. Every property may have its own unique features, and unlike Letting information it is not a key-value list, just a simple list of features.
    • Full description - simply a block of unstructured text.
  • Nearest stations - shows the three nearest train stations (there could be underground stations too, if Brighton had an underground).
  • School checker - this shows the 10 closest primary and 10 closest secondary schools. For these, I found a kind of API which gave me a detailed description of every school.

And finally, here is what the JSON for one property looks like. In reality it is a single line, but I have formatted it here to make it human readable. I also deleted most of the schools' info, as it is more bulky than it is important.


Property JSON

{  
   "furnish":"Unfurnished",
   "key_features":[  
      "LARGE BRIGHT SPACIOUS LOUNGE WITH PATIO DOORS",
      "FULLY FITTED KITCHEN",
      "TWO DOUBLE BEDROOMS WITH WARDROBES",
      "A FURTHER SINGLE BEDROOM/OFFICE/STUDY",
      "A GOOD SIZED SHOWER ROOM ",
      "SINGLE GARAGE AND ON STREET PARKING",
      "EASY ACCESS TO THE CITY CENTRE OF CHICHESTER AND COMMUTER ROUTES. ",
      "TO ARRANGE A VIEWING PLEASE CONTACT US ON 01243 839149"
   ],
   "property_price_week":"£254 pw",
   "nearest_stations":[  
      {  
         "station_name":"Fishbourne",
         "station_dist":"(0.4 mi)"
      },
      {  
         "station_name":"Chichester",
         "station_dist":"(1.2 mi)"
      },
      {  
         "station_name":"Bosham",
         "station_dist":"(1.7 mi)"
      }
   ],
   "letting_type":"Long term",
   "secondary_schools":{  
      "schools":[  
         {  
            "distance":"0.6 miles",
            "ukCountryCode":"ENG",
            "name":"Bishop Luffa School, Chichester",
           ...
         }]
    },
   "url":"http://www.rightmove.co.uk/property-to-rent/property-66941567.html",
   "date_available":"Now",
   "date_reduced":"",
   "agent":"On The Move, South",
   "full_description":"<p itemprop=\"description\">We are delighted to bring to market, this fabulous semi detached bungalow ... </p>",
   "primary_schools":{  
      "schools":[  
         {  
            "distance":"0.3 miles",
            "ukCountryCode":"ENG",
            "name":"Fishbourne CofE Primary School",
         }]
    }
   },
   "property_address":[ "Mill Close, Chichester, West Sussex, PO19"],
   "property_name":"3 bedroom bungalow to rent",
   "date_added":"08 June 2017 (18 hours ago)",
   "property_price_month":"£1,100 pcm",
   "let_agreed":null,
   "unknownown_values":"",
   "deposit":"£1384"
}

The full version is here: 6391 lines, I warned you. My dataset is relatively small, with 1114 such records and 117 MB in total.

Just a few things I'd like to highlight. Letting information has only a small number of fixed options, so I managed to parse them into fields like furnish, letting_type, etc. The Key features list became just an array: there are thousands of different features and I can't put them into separate fields. The Nearest stations list became an array of name/value pairs. My first version of the scraper put them into a key-value list, like this:

"nearest_stations":[  
      "Fishbourne": "(0.4 mi)",
      "Chichester": "(1.2 mi)",
      "Bosham": "(1.7 mi)"
      ]

but this didn't work as intended: I got around one hundred measures named Fishbourne, Chichester, Bosham, etc. Not what I need. That approach could work well if I had only a small number of important POIs (airports, for example) and wanted to know the distances to those points. So I changed it to the following, and it worked well:

"nearest_stations":[  
      {  
         "station_name":"Fishbourne",
         "station_dist":"(0.4 mi)"
      },
      {  
         "station_name":"Chichester",
         "station_dist":"(1.2 mi)"
      },
      {  
         "station_name":"Bosham",
         "station_dist":"(1.7 mi)"
      }
   ]
Connect to the Data

When I started this study, my knowledge of the UK property rental market was close to this:

[Image]

It's possible, or even likely, that some of my conclusions will be obvious to anyone who knows the market well. In this blog, I show how a complete newbie (me) can use Tableau to become less ignorant.

So my very first task was to understand what kinds of properties are available for rent, what their prices are, and so on. That is a typical task for any new subject area.

As I said before, Tableau 10 can work with JSON files natively, but the question was whether it could work with a JSON as complex as mine. I started a new project and opened my JSON file.

[Screenshot: opening the JSON file in Tableau]

I expected that I would have to simplify it somehow. But in reality, after a few seconds of waiting, Tableau displayed the full structure of my JSON and all I had to do was select the branches I needed.

[Screenshot: selecting the JSON schema levels]

After a few more seconds I got a normal Tableau data source.

[Screenshot: the resulting Tableau data source]

And this is how it looked in analysis mode:

[Screenshot: the data source in analysis mode]

First Look at the Data

OK, let's get started. The first question is obvious: "What types of property are available for rent?" It seems that the property name ('2 bedroom apartment to rent') is what I need, so I created a table report for this field.

[Screenshot: table of property names]

This gives me a first impression of what is on offer and what my next step should be. First of all, the names all end with "to rent", which just makes the strings longer without adding any value. The word "bedroom" also doesn't look important. Ideally, I'd like to parse these strings into two fields: # of bedrooms and Property type. The most obvious thing to try is the Split function.

[Screenshot: result of the Split function]

It partially worked: the function is smart enough to remove the 'to rent' part, but beyond that it gave me nothing. On other datasets (other cities) it gave much better results, but it still wasn't able to read my mind and do exactly what I wanted:

[Screenshot: Split results on another dataset]

But it took me only 15 seconds and I lost nothing, and if it had worked it would have saved a lot of time. Anyway, I'm too old to believe in magic, so this almost didn't hurt my feelings.

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Yes, this string is literally asking for some regular expression wizardry.

[Image]

I can easily use REGEXP_EXTRACT_NTH and get what I want. Group 1 is the number of bedrooms and Group 3 is the property type. Groups 2 and 4 are just constant words.

[Screenshot: parsing the property names with REGEXP_EXTRACT_NTH]

Explanation of my regular expression: I can describe most of the names as "digit bedroom property type to rent", and the rest as "property type to rent". So digit and bedroom are optional, while property type and to rent are mandatory. The expression is easy and obvious: ([0-9]*)( bedroom )*(.*)( to rent)
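For reference, the two calculated fields might look roughly like this ([Property Name] is my assumed name for the flattened property_name field):

// # of Bedrooms (group 1); empty where the name has no leading number
REGEXP_EXTRACT_NTH([Property Name], '([0-9]*)( bedroom )*(.*)( to rent)', 1)

// Property Type (group 3)
REGEXP_EXTRACT_NTH([Property Name], '([0-9]*)( bedroom )*(.*)( to rent)', 3)

Wrapping the first expression in INT() should give a proper numeric field, with Null where no number is present.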

Regular expressions are one of my favourite hammers and helped me a lot in this analysis. After all these manipulations, I got a much better view of the data (I skipped some obvious steps like creating a crosstab or a count distinct measure to save space for more interesting things).

[Screenshot: parsed number of bedrooms and property type]

And while this result looks pretty simple, it gives me the first insight I can't get simply by browsing the site. Most of the offers are 1 and 2 bedroom properties, especially flats and apartments. If a family needs something bigger, with 4 or 5 bedrooms, well, I wish them good luck: there are not many offers to choose from. Also, if we are talking about residential property only, we should filter out things like GARAGE, PARKING or LAND.

[Charts: property counts by type and by number of bedrooms]

I think both charts work pretty well. The first one presents a nice view of how flats and apartments outnumber all other types, and the second one gives a much better understanding of how many 2 bedroom properties are offered compared to all the others.

And while I'm not a big fan of fancy visualisations, if you need something less formal and more eye-catching, try a bubble chart. It's not something I'd recommend for analysis, but it may work well in a presentation. Every bubble represents a particular property type, colour shows the number of bedrooms and size shows the number of properties.

[Bubble chart: property types by number of bedrooms and count]

Going Deeper

The next obvious question is the price. How much do different properties cost? Is a particular property more or less expensive than average, and what influences the price?

As a baseline, I'd like to know the average property price, and obviously I don't want just one city-wide figure; that would be meaningless. Let's start with a bar chart and see the range of prices.
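The price charts in this and the following sections assume a numeric price, while the source stores it as a string like '£1,100 pcm'. A minimal sketch of the conversion, with [Property Price Month] as my assumed name for the flattened property_price_month field:

// Price per Month: '£1,100 pcm' -> 1100
FLOAT(REGEXP_EXTRACT(REPLACE([Property Price Month], ',', ''), '([0-9]+)'))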

[Bar chart: the range of prices by property type]

Well, we have a lot of options: a flat share costs less than £700, or we may choose a barn for more than £3600. Again, a very simple result, but one I can't get directly from the site.

The next obvious question is how the number of bedrooms affects the price. Does the price skyrocket with every additional bedroom, or do more bedrooms mean smaller rooms and a price that increases more slowly?

[Chart: price by number of bedrooms and property type]

Well, this chart gives me the answer, but it looks bad, mostly because a lot of property types don't have enough variance in the number of bedrooms. Studio flats have only one bedroom by definition, and the only converted barn has 7 bedrooms. I'd like to remove types which don't have at least 3 bedroom options and see how the price changes. For this, I created a new calculated field using the FIXED keyword. It counts the number of bedroom options by property type.
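A minimal sketch of that calculation, assuming the parsed fields are named [Property Type] and [# of Bedrooms]:

// Bedroom # variance: how many distinct bedroom counts exist for each property type
{ FIXED [Property Type] : COUNTD([# of Bedrooms]) }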

[Screenshot: the calculated field]

I then use it in a filter: 'Bedroom # variance' at least 3. Now I have a much cleaner view, and I can see that more bedrooms typically mean a significantly higher price, with a few exceptions. In fact, these are not real exceptions, just an artefact of a small dataset. I can say that an increase in the number of bedrooms certainly means a significant increase in price. And one more insight: going above 7 bedrooms may actually double the price.

[Chart: price by number of bedrooms, filtered]

Averages are good, but they hide important information about how prices are distributed. For example, six properties priced at £1K and one at £200 give an average of £885, and looking at the average alone may make you think that with £900 you can choose any of the 7 options. It's very easy to build a chart to check this: just create a new calculation called Bins and use it in a chart.
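Tableau can also build bins from the right-click menu, but as a calculation it is a one-liner. A sketch assuming the numeric monthly price field from the earlier conversion and £100-wide bins:

// Bins: round the price down to the nearest £100, so £1,070 falls into the £1000 bin
FLOOR([Price per Month] / 100) * 100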

[Screenshots: the Bins calculation and the resulting view]

With £100 bins I got the following chart. It shows how many properties have a price falling into a particular range. For example, the £1000 bin shows the number of properties priced £1000-£1100.

[Histogram: number of properties per £100 price bin]

The distribution looks more or less as expected, but the most interesting thing here is that the £1000-£1100 interval seems to be very unpopular. Why? Let's add the number of bedrooms to the chart.

[Histogram: price bins split by number of bedrooms]

£1000 is too expensive for 1 bedroom properties and studios, but too cheap for two bedrooms. Simple. What else can we do here before moving on? Converting this chart to a running total gives a cool view.
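The running total itself is a standard Tableau table calculation; expressed as a calculated field (computed along the Bins axis) it might look roughly like this:

// Running count of properties up to and including the current price bin
RUNNING_SUM(COUNT([Number of Records]))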

[Chart: running total of properties by price bin and number of bedrooms]

What can this chart tell us? For example, if we look at the orange line (2 bedrooms), we find that with £1200 we may choose among 277 of 624 properties, and with a £1400 budget we have 486 of 624. A further £200 increase in budget won't add nearly as many options: while the change from £1200 to £1400 almost doubled the number of possibilities, the next £200 gives only 63 new ones. I don't have a ready-to-use insight here, but I got a way to estimate a budget for a particular type of property: with budget £X I will be able to choose from N properties.

Why It Costs What It Costs

OK, now I know a lot of statistics about prices, and my next question is about the factors affecting the price. I'd like to understand whether a particular property is worth what it costs. Of course, I won't be able to determine an exact price, but even hints may be useful.

The first hypothesis I want to check is whether a nearby train station raises the price or doesn't matter at all. I made a chart very similar to the previous one, and it seems that the Pareto principle works perfectly here: 80% of properties are closer than 20% of the maximum distance to a station.
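To plot anything against distance, the station distance string needs converting to a number first; the raw value looks like "(0.4 mi)". A minimal sketch of the calculation, with [Station Dist] as my assumed name for the flattened station_dist field:

// Distance to station in miles: '(0.4 mi)' -> 0.4
FLOAT(REGEXP_EXTRACT([Station Dist], '([0-9.]+)'))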

[Chart: distribution of distance to the nearest station]

But this chart doesn't say anything about the price; it just gives me an understanding of how densely train stations are placed. I'd say that most of the properties have a station within 10-15 minutes' walk, and therefore this should not significantly affect the price. My next chart is a scatter plot of price and distance: every point is a property, its coordinates determined by its price and distance to the nearest station, and colour shows the number of bedrooms.

[Scatter plot: price vs distance to the nearest station]

I'd say that this chart shows no clear correlation between price and distance, and a more classical line chart confirms it.

[Line chart: price vs distance]

The maximum price slightly decreases with distance, the minimum price on the contrary increases, and the average price is more or less constant. I think the hypothesis is busted: there is no clear correlation between the distance a tenant has to walk to a station and the price he has to pay. If you want to rent something and the landlord says the price is high because of a nearby train station, tell him that there are stations all around and he should come up with something more interesting.

What about furnishings? Is it cheaper to get an unfurnished property, or will a landlord be happy to meet someone who shares his taste?

[Chart: price by furnishing]

Unfurnished property is definitely cheaper, and it's interesting that in some cases partly furnished is even cheaper than completely unfurnished. But at least for furnished vs unfurnished we can see a clear correlation, so when you see a furnished property for the price of an unfurnished one, it may be a good bargain.

Another thing I'd like to check: can we expect a lower price for a property that is not available immediately? Or, on the contrary, is the best price offered for already vacant properties?

As always, let's start with a general picture: what is the average time until availability by property type?

[Chart: average time until availability by property type]

For the most popular types it is about one month, and if you have a house you typically publish it two or three months in advance. And what about the price? Here is one more chart that I like in Tableau. In a nutshell, it is a normal line chart showing the average price by days until the property becomes available, but the thickness of the lines shows the number of properties at the same time. So I can see not only the price but also how reliable it is: a thick line is formed by many properties, while a thin line may be formed by only a few and can move up or down significantly when something changes. It would be very interesting to get historical data and see how long properties stay free, or how long it takes before the price is reduced, but unfortunately I don't have that data.

[Line chart: average price by days until availability]

Looking at this chart, I'd say that there is no statistically significant dependency between price and availability date. Renting a property available in the distant future won't save you money* (*statistically).

And the last thing I'd like to investigate is the Key features. What do landlords list as the key features of their properties? How do they affect the price?

The list of popular Key features surprised me.

[Screenshot: the most popular key features]

'Unfurnished' looks good to me; it is a really significant part of the deal. But 'Brighton'? For properties in Brighton? '1 Bedroom'? How many bedrooms can a '1 bedroom flat to rent' have? Oh, there is a key feature saying '1 bedroom', now I know. But jokes aside, I had to do a lot of cleaning on this data before I could use it. There are six ways to write 'Modern kitchen'. Make everything upper case, then remove quotes, strip spaces and tabs, remove noisy features like 'stylish 1 bedroom apartment', and so on. After this, I got a slightly better list with approximately 3500 features instead of 4500. Note how all the variants of 'GAS CENTRAL HEATING' are now combined into the single most popular feature. But there are still too many features; I'm sure there should be no more than a hundred of them. Even in this screenshot you can see both 'Unfurnished' and 'Unfurnished property' features.
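Most of that cleaning fits in a single calculated field. A minimal sketch, with [Key Features] as my assumed name for the flattened key_features values; removing the noisy phrases would follow the same REPLACE pattern:

// Key Feature (cleaned): normalise case, drop quotes, trim surrounding whitespace
TRIM(UPPER(REPLACE([Key Features], '"', '')))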

[Screenshot: key features after cleaning]

When I need a visualisation for this many points, bar charts or tables won't play well; my weapon of choice is the scatter plot. Every point is a particular feature, the axes are its minimum and average prices, the size is determined by the number of properties declaring this feature, and the colour is the maximum price. So if a feature is located high on the plot, it means that on average it is expensive to have; if the same feature is also located close to the left side, even cheap properties may have it. For example, if you want a swimming pool, be ready to pay at least £3000, and £7000 on average. And the minimum price for a tumble dryer is £3250 but the average is £3965. The cheapest property with a dryer is more expensive than the cheapest with a pool, but on average pools are more expensive. That is how this chart works.

[Scatter plot: key features by minimum and average price]

The problems with this chart are obvious: it is littered with unique features. Only one property has 4 acres (the point in the top right corner), and actually not many swimming pools are available for rent in Brighton. I filtered it by "# of properties > 25", and here is how prices for the most popular features are distributed.

[Scatter plot: the most popular features, filtered]

A central location will cost you at least £100 and £1195 on average, while for a great location be ready to pay at least £445 and £1013 on average. A great location seems to be less valuable than a central one.

Now I can see how a particular feature impacts prices, for example 'GAS HEATING'. I made a set with all the variants of heating I could find ('GAS CENTRAL HEATING', 'GAS HEAT' and so on), so I can analyse how this feature affects properties. Here is how it affects the price of flats: blue circles are properties with gas heating and orange ones are without.
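I built the grouping as a Tableau set; a calculated-field alternative that flags a property when any of its key features mentions gas heating might look like this (assuming [Url] uniquely identifies a property and [Key Features] holds the cleaned, upper-case feature text):

// Has Gas Heating: 1 if any key feature of this property contains both 'GAS' and 'HEAT'
{ FIXED [Url] : MAX(IIF(CONTAINS([Key Features], 'GAS') AND CONTAINS([Key Features], 'HEAT'), 1, 0)) }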

[Chart: flat prices with and without gas heating]

Very interesting, in my opinion. The minimum price of properties with gas heating (blue circles) is higher than without, which is expected. But the average price of properties without gas heating is higher.

And here are kitchen appliances. For 1 bedroom flats, they increase both the minimum and average prices significantly. But for bigger flats the minimum price with appliances is higher while the average price is lower. Possibly this option is important for relatively cheap properties, but its weight is not that big for bigger ones.

[Chart: flat prices with and without kitchen appliances]

Summary

[Image: summary]

Categories: BI & Warehousing

Rittman Mead at Kscope 2017

Rittman Mead Consulting - Wed, 2017-06-21 07:45
Rittman Mead at Kscope 2017

Rittman Mead will be well represented in San Antonio, Texas next week for Kscope 17 with some of our best from both sides of the Atlantic! Our very own Francesco Tisiot and Jordan Meyer will present various topics as well as participate in the conference events. Also, the newly named ODTUG BI Community Lead, Rittman Mead's Becky Wagner, will be on hand and leading a lot of activities throughout. See details below and we hope to see you in Texas.

Jordan

Oracle Big Data Spatial and Graph enables the analysis of data sets beyond what is possible with commonly used relational analytics. Through graph technology, relationships can be identified that might not otherwise have been found. This has practical uses, including product recommendations, social network analysis, and fraud detection.

In this presentation we will see a practical demonstration of Oracle Big Data Spatial and Graph to load and analyze the "Panama Papers" data set. Graph algorithms will be utilized to identify key actors and organizations within the data, and patterns of relationships shown. This practical example of using the tool will give attendees a clear idea of the functionality of the tool and how it could be used within their own organization.

When: Jun 27, 2017, Tuesday Session 7, 11:15 am - 12:15 pm
Room: Magnolia

Francesco

OBIEE 12c is the latest generation of Oracle's Enterprise analytics and reporting tool, bringing with it many powerful new features. Many users are still on earlier releases of OBIEE 11g or even 10g, and are looking to understand how they can move to OBIEE 12c to benefit from its new capabilities.

Liberty Global is a global telecommunications company, with a long history with OBIEE going back to 10g. They wanted to move to OBIEE 12c in order to use the new Advanced Analytics options, and used Rittman Mead to support them with the full scope of the upgrade.

In this presentation, we will see what a highly successful OBIEE 12c migration looks like. We will cover clear details of all the steps required, and discuss some of the problems encountered. Regression testing is a crucial step in any upgrade and we will show how we did this efficiently and accurately with the provided Baseline Validation Tool. This presentation will assist all attendees who are considering, or in the process of, an OBIEE 12c upgrade.

When: Jun 26, 2017, Monday Session 5, 4:45 pm - 5:45 pm
Room: Wisteria/Sunflower

And

As a DBA or sysadmin responsible for OBIEE, how do you really dig into the guts of OBIEE, look at intra-component communication between the system components and examine the apparently un-examinable? What do you do when you need to trace activity beyond what is in the log files? How do you work with log files in order to get precise but low-level information? What information can be gleaned, by hook or by crook, from OBIEE?

OBIEE provides a set of systems management and diagnostic tools, but these only take you so far. Join me in this presentation to dive deeper with OBIEE. We will take a look at a bag of tricks including undocumented configuration options, flame graphs, system call tracing, discovering undocumented REST APIs, and more! This is not just a geek-out - these are real-life examples of where client OBIEE projects have required that next level of diagnostic techniques and tools. Don your beanie hat and beard as we go deep!

When: Jun 28, 2017, Wednesday Session 12, 9:45 am - 10:45 am
Room: Wisteria/Sunflower

Becky

Becky Wagner is the new ODTUG BI Community Lead. You will find her at:

Monday Community Lunch | 12:45 – 2:00 PM | Grand Oaks K-S

Monday evening BI Community Night | 8:00 - 10:00 PM | Grand Oaks H http://kscope17.com/events/community-nigh-events

She will be doing the 5K Fun Run http://kscope17.com/events/kscope17-5k on Tuesday morning

Women in Technology Lunch | 12:15– 1:45 PM | Cibolo Canyon 6 on Wednesday https://form.jotformpro.com/71134693041955

Navigating the Oracle Business Analytics Frontier Panel
9:00 AM - 11:00 AM, Cibolo Canyon 8/9/10
http://kscope17.com/content/thursday-deep-dive-sessions

Categories: BI & Warehousing

Unify: Could it be any easier?

Rittman Mead Consulting - Mon, 2017-06-19 09:00

Rittman Mead’s Unify is the easiest and most efficient method to pull your OBIEE reporting data directly into your local Tableau environment. No longer will you have to worry about database connection credentials, Excel exports, or any other roundabout way to get your data where you need it to be.

Unify leverages OBIEE’s existing metadata layer to provide quick access to your curated data through a standard Tableau Web Data Connector. After a short installation and configuration process, you can be building Tableau workbooks from your OBIEE data in minutes.

This blog post will demonstrate how intuitive and easy it is to use the Unify application. We will only cover using Unify and its features, as once the data gets into Tableau it can be used in the same way as any other Tableau data source. The environment shown already has Unify installed and configured, so we can jump right in and start using the tool immediately.

To start pulling data from OBIEE using Unify, we need to create a new Web Data Connector Data Source in Tableau. This data source will prompt us for a URL to access Unify. In this instance, Unify is installed as a desktop application, so the URL is http://localhost:8080/unify.

Once we put in the URL, we’re shown an authentication screen. This screen will allow us to authenticate against OBIEE using the same credentials. In this case, I will authenticate as the weblogic user.

Once authenticated, we are welcomed by a window where we can construct an OBIEE query visually. On the left hand side of the application, I can select the Subject Area I wish to query, and users are shown a list of tables and columns in the selected Subject Area. There are additional options along the top of the window, and I can see all saved queries on the right hand side of the window.

The center of the window is where we can see the current query, as well as a preview of the query results. Since I have not started building a query yet, this area is blank.

Unify allows us to either build a new query from scratch, or select an existing OBIEE report. First, let’s build our own query. The lefthand side of the screen displays the Subject Areas and Columns which I have access to in OBIEE. With a Subject Area selected, I can drag columns, or double click them, to add them to the current query. In the screenshot above, I have added three columns to my current query, “P1 Product”, “P2 Product Type”, and “1 - Revenue”.

If we wanted to, we could also create new columns by defining a Column Name and Column Formula. We even have the ability to modify existing column formulas for our query. We can do this by clicking the gear icon for a specific column, or by double-clicking the grey bar at the top of the query window.

It’s also possible to add filters to our data set. By clicking the Filter icon at the top of the window, we can view the current filters for the query. We can then add filters the same way we would add columns, by double-clicking or dragging the specific column. In the example shown, I have a filter on the column “D2 Department” where the column value equals “Local Plants Dept.”.

Filters can be configured using any of the familiar methods, such as checking if a value exists in a list of values, numerical comparisons, or even using repository or session variables.

Now that we have our columns selected and our filters defined, we can execute this query and see a preview of the result set. By clicking the “Table” icon in the top header of the window, we can preview the result.

Once we are comfortable with the results of the query, we can export the results to Tableau. It is important to understand that the preview data is trimmed down to 500 rows by default, so don’t worry if you think something is missing! This value, and the export row limit, can be configured, but for now we can export the results using the green “Unify” button at the top right hand corner of the window.

When this button is clicked, the Unify window will close and the query will execute. You will then be taken to a new Tableau Workbook with the results of the query as a Data Source. We can now use this query as a data source in Tableau, just as we would with any other data source.

But what if we have existing reports we want to use? Do we have to rebuild the report from scratch in the web data connector? Of course not! With Unify, you can select existing reports and pull them directly into Tableau.

Instead of adding columns from the lefthand pane, we can instead select the “Open” icon, which will let us select an existing report. We can then export this report to Tableau, just as before.

Now let’s try to do something a little more complicated. OBIEE doesn’t have the capability to execute queries across Subject Areas without common tables in the business model; Tableau, however, can perform joins between two data sources (as long as we select the correct join conditions). We can use Unify to pull two queries from OBIEE from different Subject Areas and perform a data mashup of the two Subject Areas in Tableau.

Here I’ve created a query with “Product Number” and “Revenue”, both from the Subject Area “A - Sample Sales”. I’ve saved this query as “Sales”. I can then click the “New” icon in the header to create a new query.

This second query is using the “C - Sample Costs” Subject Area, and is saved as “Costs”. This query contains the columns “Product Number”, “Variable Costs”, and “Fixed Costs”.

When I click the Unify button, both of these queries will be pulled into Tableau as two separate data sources. Since both of the queries contain the “Product Number” column, I can join these data sources on the “Product Number” column. In fact, Tableau is smart enough to do this for us:

We now have two data sets, each from a different OBIEE subject area, joined and available for visualization in Tableau. Wow, that was easy!

What about refreshing the data? Good question! The exported data sources are published as data extracts, so all you need to do to refresh the data is select the data source and hit the refresh button. If you are not authenticated with OBIEE, or your session has expired, you will simply be prompted to re-authenticate.

Using Tableau to consume OBIEE data has never been easier. Rittman Mead’s Unify allows users to connect to OBIEE as a data source within a Tableau environment in an intuitive and efficient method. If only everything was this easy!

Interested in getting OBIEE data into Tableau? Contact us to see how we can help, or head over to https://unify.ritt.md to get a free Unify trial version.

Categories: BI & Warehousing

Unify - An Insight Into the Product

Rittman Mead Consulting - Thu, 2017-06-15 06:00
Unify - An Insight Into the Product

Monday, 12 Jun saw the official release of Unify, Rittman Mead's very own connector between Tableau and OBIEE. It provides a simple but powerful integration between the two applications that allows you to execute queries through OBIEE and manipulate and render the datasets using Tableau.

[Image]

Why We Made It

One of the first questions of course would be why we would want to do this in the first place. The excellent thing about OBI is that it acts as an abstraction layer on top of a database, allowing analysts to write efficient and secure reports without going into the detail of writing queries. As with any abstraction, it is a trade of simplicity for capability. Products like Tableau and Data Visualiser seek to reverse this trade, putting the power back in the hands of the report builder. However, without quoting Spiderman, care should be taken when doing this.

The result can be that users write inefficient queries, or worse still, incorrect ones. We know there will be some out there that use self service tools as purely a visualisation engine, simply dropping pre-made datasets into it. If you are looking to produce sustainable, scalable and accessible reporting systems, you need to tackle the problem both at the data acquisition stage as well as the communication stage at the end.

If you are already meeting both requirements, perhaps by using OBI with Data Visualiser (formerly Visual Analyser) or by other means, then that's perfectly good. However, we know from experience that many of you out there have already invested heavily in both OBI and Tableau as separate solutions. Rather than have them linger in a state of conflict, we'd rather nurse them into a state of symbiosis.

The idea behind Unify is that it bridges this gap, allowing you to use your OBIEE system as an efficient data acquisition platform and Tableau as an intuitive playground for users who want to do a bit more with their data. Unify works by using the Tableau Web Data Connector as a data source, with our customised software acting as an interface for creating OBIEE queries and then exporting them into Tableau.

How It Works

Unify uses Tableau's latest Web Data Connector data source to allow us to dynamically query OBIEE and extract data into Tableau. Once a dataset is extracted into Tableau, it can be used with Tableau as normal, taking advantage of all of Tableau's powerful features. This native integration means you can add OBIEE data sources just as you would add any others - Excel files, SQL results etc. Then you can join the data sources using Tableau itself, even if the data sources don't join up together in the background.

First you open up Tableau and add a Web Data Connector source:

[Screenshot: adding a Web Data Connector source in Tableau]

Then give the link to the Unify application, e.g. http://localhost:8080/unify. This will open up Unify and prompt you to login with your OBIEE credentials. This is important as Unify operates through the OBIEE server layer in order to maintain all security permissions that you've already defined.

[Screenshot: the Unify login prompt]

Now that the application is open, you can make OBIEE queries using the interface provided. This is a bit like Answers and allows you to query from any of your available subject areas and presentation columns. The interface also allows you to use filtering, column formulae and OBIEE variables much in the same way as Answers does.

Alternatively, you can open up an existing report that you've made in OBIEE and then edit it at your leisure. Unify will display a preview of the dataset so you can tweak it until you are happy that is what you want to bring into Tableau.

[Screenshot: building an OBIEE query in Unify]

Once you're happy with your dataset, click the Unify button in the top right and it will export the data into Tableau. From this point, it behaves exactly as Tableau does with any other data set. This means you can join your OBIEE dataset to external sources, or bring in queries from multiple subject areas from OBIEE and join them in Tableau. Then of course, take advantage of Tableau's powerful and interactive visualisation engine.

[Screenshot: the exported dataset in Tableau]

Unify Server

Unify comes in desktop and server flavours. The main difference between the two is that the server version allows you to upload Tableau workbooks with OBIEE data to Tableau Server and refresh them. With the desktop version, you will only be able to upload static workbooks that you've created; with the server version of Unify, you can tell Tableau Server to refresh data from OBIEE in accordance with a schedule. This lets you produce production quality dashboards for your users, sourcing data from OBIEE as well as any other source you choose.

Unify Your Data

In a nutshell, Unify allows you to combine the best aspects of two very powerful BI tools and will prevent the need for building all of your reporting artefacts from scratch if you already have a good, working system.

I hope you've found this brief introduction to Unify informative and if you have OBIEE and would like to try it with Tableau, I encourage you to register for a free desktop trial. If you have any questions, please don't hesitate to get in touch.

Categories: BI & Warehousing

Unify: See Your Data From Every Perspective

Rittman Mead Consulting - Mon, 2017-06-12 09:09
See Your Data From Every Perspective

[Image]

Ad hoc access to accurate and secured data has always been the goal of business intelligence platforms. Yet, most fall short of balancing the needs of business users with the concerns of IT.

Rittman Mead has worked with hundreds of organizations representing all points on the spectrum between agility and governance. Today we're excited to announce our new product, Unify, which allows Tableau users to directly connect to OBIEE, providing the best of both worlds.

Governed Data Discovery

Business users get Tableau's intuitive data discovery features and the agility they need to easily blend their departmental data without waiting on IT to incorporate it into a warehouse. IT gets peace of mind, knowing their mission-critical data is protected by OBIEE's semantic layer and row-level security.

Unify Essentials

Unify runs as a desktop app, making it easy for departmental Tableau users to connect to a central OBIEE server. Unify also has a server option that runs alongside OBIEE, for organizations with a large Tableau user base or those using Tableau Server.

Desktop installation and configuration is simple. Once installed, users can query OBIEE from within Tableau with just a few clicks. Have a look at these short videos demonstrating setup and use of Unify.

Available Today

Download your free 7-day trial of Unify Desktop here.

No Tableau Desktop license? No problem. Unify is compatible with Tableau Public.

Categories: BI & Warehousing

OAC: Essbase and DVCS

Rittman Mead Consulting - Wed, 2017-06-07 09:00

Finally managed to get around to having a proper look at Essbase within Oracle Analytics Cloud Service (OAC) after a busy couple of months. This post focusses mainly on initial impressions of the ‘out of the box’ Essbase side of this - which we will explore in more detail in future posts - as well as more detail on the use of Essbase with DVCS.

Using Essbase with DVCS

One of the features we are keen to explore more in this context is the integration of Essbase and the Data Visualisation Cloud Service (DVCS). One point we found, which we do not think is expressed clearly anywhere else we have seen, is how to configure this: in setting up our OAC instance, we had difficulty coming up with a combination of configuration selections that enables Essbase and DV to work at the same time.

Oracle documentation (such as the price list) suggests that both should be available within Standard Edition OAC:

But Doc ID 2265410.1 on MoS suggests, by needing to add a security rule to the Essbase OAC, that two OAC instances are required. We could not find any reference to this requirement in Oracle documentation or blogs on the subject, but it transpires after checking with Oracle that this is indeed the case – Essbase and DV need to be on separate OAC instances.

Essbase

Looking purely at Essbase, my initial reaction is very positive…whilst the interface is different (I am sure tears will be shed for EAS & Studio in the foreseeable future…although given the way some stalwarts are still clinging on to the last surviving copies of the Excel Add-In, maybe not too imminently), once the surface of the new interface is scratched, more...ahem…’seasoned’ developers will take comfort from being able to do a lot of the same things as they currently can. I am also confident it will fulfil one of the stated objectives of making it easier for non-experts to quickly and easily deploy cubes for analysis purposes.

Whilst the manual application and cube maintenance tools through the OAC front-end seem resilient and work effectively, I think some aspects will be difficult to use as the primary maintenance method in a production system - the ‘breadcrumb’ method afforded to dimension maintenance in particular will start to get fiddly to use with a dimension of any sort of volume. The application and cube Import (from a formatted Excel spreadsheet) facility is great - to my mind, a bit like a supercharged and easier-to-use Outline Load Utility in Hyperion Planning - and the ability to refresh the spreadsheet from a deployed cube is a good feature that shouldn’t have been taken for granted. I know Excel is regarded as the Devil’s work in some BI quarters…I personally don’t feel that way until it is being used as a database (or as some form of primary data storage)…but in this context, it is quick & easy to use, on most people’s desktops straightway, and is intuitive.

Still in the Excel corner, on the Smartview side, the addition of the Cube Designer extension (requiring Smartview 11.1.2.5.700) to be able to consider & change the more generic aspects (not members) of the ‘cube maintenance’ spreadsheets is a nice touch that makes this more straightforward and removes the need to pay strict attention to the spreadsheet layout. The ‘treeview’ style hierarchy viewer also helps make sense of the parent-child members that need to be detailed on the individual dimension tabs.

One issue that has flitted across my mind at this early stage is that of rules files. Whilst the Import facility creates these for you (as with creating a cube from Essbase Studio) which is welcome, and rules files created in an on-prem system can be uploaded (again, welcome), the on-board rules file editor is text based:

I’m not too sure how many people have created or edited rules files like this before (although I’d hazard a guess), but whilst the presence of any means to create, amend, or even tweak a file is good, it remains to be seen how usable this approach is. The alternative is to resubmit from the maintenance spreadsheet, thus getting it created or amended for you, or to maintain it in an on-prem system…but seeing as this platform is an alternative to (rather than an augmentation of) on-prem for a lot of people, I’m not sure how practical this is.

Whilst the existing tools look really promising, I can’t help but think there will be occasions going forwards where it might be advantageous to be able to create a rules file to run an uploaded file outside of them: time will tell.

The Command Line Tool (downloadable from OAC-Essbase / Utilities) is a little limited at the moment, but goes some way towards filling the potential gap left by the absence of client-side EssMsh and can only grow with further releases: from the Oracle OAC documentation...

In conclusion, first impressions are very favourable. There are changes (eg Security), new features (eg Sandboxing), and I am sure there will be gaps for those considering moving from existing on-prem applications - for example, as I have seen someone else reference, there does not seem to be any reference to partitions in the front end or the import spreadsheet layout - so whilst there is a lot with which we will quite quickly feel familiar, there are also going to be new areas and new practices for us to get into step with: as above, we will look to explore some of these in future posts.

Categories: BI & Warehousing

Overview of the new Cloudera Data Science Workbench

Rittman Mead Consulting - Fri, 2017-06-02 09:07

Recently Cloudera released a new product called Cloudera Data Science Workbench (CDSW).

Being a Cloudera Partner, we at Rittman Mead are always excited when something new comes along.

The CDSW is positioned as a collaborative platform for data scientists/engineers and analysts, enabling larger teams to work in a self-service manner through a web browser. This browser application is effectively an IDE for R, Python and Scala - all your favorite toys!

The CDSW is deployed onto edge nodes of your CDH cluster, providing easy access to your HDFS data and the Spark2 and Impala engines. This means that team members can immediately start working on their projects, accessing full datasets and sharing analysis and results. A CDSW project can include reusable code, snippets, libraries etc., helping your teams to collaborate. Oh, and these projects can be linked with GitHub repos to help keep version history.

The workbench is used to fire up user sessions with R, Python or Scala inside dedicated Docker engines. These engines can be customised, or extended, like any other Docker images to include all your favourite R packages and Python libraries. Using HDFS, Hive, Spark2 or Impala, the workload can then be distributed over the CDH cluster by your preferred methods, without having to configure anything. The engine (a virtual machine, really) runs for as long as the analysis does. Any logs or output files need to be saved in the project folder, which is mounted inside the engine and saved on the CDSW master node. The master node is a gateway node to the CDH cluster and can scale out to many worker nodes to distribute the Docker engines.

(C) Cloudera.com

And under the hood we also have Kubernetes to schedule user workloads across the worker nodes and provide CPU and memory isolation.

So far I find the IDE to be a bit too simple and lacking features compared to, for example, RStudio Server. But the ease of use and the fact that everything is automatically configured make the CDSW an absolute must for any Cloudera customer with data science teams. Also, I'm convinced that future releases will add loads of cool functionality.

I spent only about two days building a new cluster on AWS and installing the Cloudera Data Science Workbench - just an indication of how easy it is to get up and running. By the way, it also runs in the cloud (IaaS) ;)

Want to know more or see a live demo? Contact us at info@rittmanmead.com

Categories: BI & Warehousing

First Steps with Oracle Analytics Cloud

Rittman Mead Consulting - Thu, 2017-06-01 07:43
Preface

Not long ago Oracle added a new offer to their Cloud - an OBIEE in the Cloud with full access. Francesco Tisiot wrote an overview of it, and now it's time to go a bit deeper and see how you can poke it with a sharp stick yourself. In this blog, I'll show how to get your own OAC instance as quickly and easily as possible.

Before you start

The very first step is to register a cloud account. Oracle gives a trial which allows testing of all the features. I won't show the registration here as it is a more or less standard process, but I just want to highlight a few things:

  • You will need to verify your phone number by receiving an SMS. It seems that this mechanism may be a bit overloaded, and I had to make more than one attempt: I pressed the Request code button and nothing happened; I waited and pressed it again, and again, and eventually I got the code. I can't say for sure, and possibly it was just my bad luck, but if you face the same problem just keep pushing (but not too much: requesting a code every second won't help you).
  • Even for the trial you'll be asked for credit card details. I haven't found good diagnostics on how much has already been spent, and the documentation is not really helpful here.

Architecture

OAC instances are not self-contained and require some additional services. The absolute minimum configuration is the following:

  • Oracle Cloud Storage (OCS) - is used for backups, log files, etc.
  • Oracle Cloud Database Instance (DBC) - is used for RCU schemas.
  • Oracle Analytics Cloud Instance (OAC) - is our ultimate target.

From the Cloud services point of view, the architecture is the following. This picture doesn't show the virtual disks mounted to instances; these disks consume Cloud Storage quota but aren't created separately as services.

Architecture

We need at least one Oracle Database Cloud instance to store the RCU schemas. This database may or may not have a separate Cloud Storage area for backups. Every OAC instance requires a Cloud Storage area for logs. Multiple OAC instances may share one Cloud Storage area, but I can't see any advantage of this approach over a separate area for every instance.

Create Resources

We create these resources in the order they are listed above: start with Storage, then the DB, and finally OAC. Actually, we don't have to create Cloud Storage containers separately, as they may be created automatically, but I show it here to make things clearer, without too much "it works by itself" magic.

Create Cloud Storage

The easiest part of all is the Oracle Cloud Storage container. We don't need to specify its size or lots of parameters: the only ones are a name, a storage class (Standard/Archive) and encryption.

20-create_ocs.gif

I spent some time here trying to figure out how to reference this storage later. There is a hint saying "Use the format: <storage service>-<identity domain>/<container>. For example: mystorage1-myid999/mybackupcontainer." While identity domain and container are pretty obvious, storage service puzzled me for some time. The answer is that storage service = Storage. You can see this at the top of the page.

30-OCS_naming.png

It seems that Storage is a fixed keyword, rurittmanm is the domain name created during the registration process and demo is the actual container name. So in this sample when I need to reference my demo OCS I should write Storage-rurittmanm/demo.

Create Cloud DB

Now that we are somewhat experienced with Oracle Cloud, we may move to a more complicated task and create a Cloud DB instance. It is harder than a Cloud Storage container, but not by much. If you have ever created an on-premise database using DBCA, a cloud DB should be a piece of cake for you.

At the first step, we set the name of the instance and select the most general options. These options are:

  • Service Level. Specifies how this instance will be managed. Options are:

    • Oracle Database Cloud Service: Oracle Database software pre-installed on Oracle Cloud Virtual Machine. Database instances are created for you using configuration options provided in this wizard. Additional cloud tooling is available for backup, recovery and patching.
    • Oracle Database Cloud Service - Virtual Image: Oracle Database software pre-installed on an Oracle Cloud Virtual Machine. Database instances are created by you manually or using DBCA. No additional cloud tooling is available.
  • Metering Frequency - defines how this instance will be paid: by months or by hours.

  • Software Release - if the Service Level is Oracle Database Cloud Service, we may choose 11.2, 12.1 or 12.2; for Virtual Image only 11.2 and 12.1 are available. Note that even the cloud does no magic, and with DB 12.2 you may expect the same problems as on-premise.

  • Software Edition - Values are:

    • Standard Edition
    • Enterprise Edition
    • Enterprise Edition - High Performance
    • Enterprise Edition - Extreme Performance
  • Database Type - defines High Availability and Disaster Recovery options:

    • Single Instance
    • Database Clustering with RAC
    • Single Instance with Data Guard Standby
    • Database Clustering with RAC and Data Guard Standby

The Database Clustering with RAC and Database Clustering with RAC and Data Guard Standby types are available only for the Enterprise Edition - Extreme Performance edition.

40-create_obdc-1.gif

The second step is also quite intuitive. It has a lot of options but they should be pretty simple and well-known for anyone working with Oracle Database.

60-create-odbc-dc.png

The first block of parameters is about basic database configuration. Parameters like DB name (sid) or Administration Password are obvious.

Usable DataFile Storage (GB) is less obvious; actually, in the beginning it puzzled me completely. In this sample, I ask for 25 GB of space, but this doesn't mean that my instance will take 25 GB of my disk quota: in fact, this particular instance took 150 GB of disk space. Here we specify only the guaranteed user disk space, but an instance also needs space for the OS, DB software, temp, swap, and so on.

65-db-disk.png

A trial account is limited to a 500 GB quota, which means we can create at most 3 Oracle DB Cloud instances. Every instance uses around 125 GB of, let's say, "technical" disk space that we can't reduce. From a practical point of view, this means it may be preferable to have one "big" instance (in terms of disk space) rather than multiple "small" ones.

  • Compute shape specifies how powerful our VM should be. Options are the following:
    • OC3 - 1.0 OCPU, 7.5 GB RAM
    • OC4 - 2.0 OCPU, 15.0 GB RAM
    • OC5 - 4.0 OCPU, 30.0 GB RAM
    • OC6 - 8.0 OCPU, 60.0 GB RAM
    • OC7 - 16.0 OCPU, 120.0 GB RAM
    • OC1m - 1.0 OCPU, 15.0 GB RAM
    • OC2m - 2.0 OCPU, 30.0 GB RAM
    • OC3m - 4.0 OCPU, 60.0 GB RAM
    • OC4m - 8.0 OCPU, 120.0 GB RAM
    • OC5m - 16.0 OCPU, 240.0 GB RAM

We may increase or decrease this value later.

  • SSH Public Key - Oracle gives us the ability to connect directly to the instance, with authentication done by a username + private key pair. Here we specify a public key which will be added to the instance; obviously, we should have the private key matching this public one. The options are either to provide a key we generated ourselves or to let Oracle create the keys for us. The most non-obvious thing here is the username for SSH: you can't change it and it isn't shown anywhere in the interface (at least I haven't found it), but you can find it in the documentation - it is opc.

The second block of parameters is about backup and restore. The meaning of these options is obvious, but exact values aren't (at least in the beginning).

70-create-odbc-brc.png

  • Cloud Storage Container - that's the Cloud Storage container I described earlier. The value for this field will be something like Storage-rurittmanm/demo. In fact, I don't have to create this container in advance: it's possible to specify a non-existent container here (but still in the form of Storage-<domain>/<name>) and tick the Create Cloud Storage Container check-box. This will create a new container for us.

  • Username and Password are credentials of a user who can access this container.

The last block is Advanced settings and I believe it's quite simple and obvious. Most of the time we don't need to change anything in this block.

80-create-odbc-ac.png

When we have filled in all the parameters and pressed the Next button, we get a Summary screen and the actual process starts. It takes about 25-30 minutes to finish.

When I first started my experiments, I was constantly getting a message saying that no sites were available and my request could not be completed.

It is possible that this was the same kind of "luck" as with the phone number verification, but the problem resolved itself a few hours later.

Create OAC Instance

At last, we have all we need for our very first OAC instance. The process of an OAC instance setup is almost the same as for an Oracle DB Cloud Instance. We start the process, define some parameters and wait for the result.

At the first step, we give a name to our instance, provide an SSH public key, and select an edition. We have two options here, Enterprise Edition or Standard Edition, and we will select more options later. Standard Edition allows us to specify either Data Visualisation or Essbase instances, while Enterprise Edition adds classical Business Intelligence to this list. The rest of the parameters here are exactly the same as for the Database instance.

90-oacs-1st-step.png

At the second step, we have four blocks of parameters.

100-oacs-2nd-step.png

  • Service Administrator - the most obvious one. Here we specify an administrator user. This user will be a system administrator.

  • Database - select a database for RCU schemas. That's why we needed a database.

  • Options - specify which options our instance will have.

    • Self-Service Data Visualisation, Preparation and Smart Discovery - this option means Oracle Data Visualisation and is available for both Standard and Enterprise Editions.
    • Enterprise Data Models - this option gives us classical BI and is available only for Enterprise Edition. It may be combined with the first one, giving us both classical BI and modern data discovery on one instance.
    • Collaborative Data Collection, Scenarios and What-if Analysis - this one stands for Essbase and is available for Standard and Enterprise Editions. It can't be combined with the other options.
  • Size is the same thing that is called Compute Shape for the Database. Options are exactly the same.
  • Usable Storage Size on Disk GB also has the same meaning as for the DB. The minimum size we may specify here is 25 GB, which gives us a total of 170 GB of used disk space.

Here is a picture showing all possible combinations of services:

110-oacs-editions.png

And here is the virtual disk configuration; the data disk is the one we specify.
130-oacs-storage.png

The last block - Cloud Storage Configuration - was the hardest one, especially the first field, Cloud Storage Base URL. The documentation says "Use the format: https://example.storage.oraclecloud.com/v1" and nothing more. When you know the answer it seems easy, but the first time I saw it I had plenty of questions. Should I put any unique URL here, just like an identifier? Should it end with v1, and what would the value be for a second instance - v2? Maybe I should use the URL of my current datacenter (https://dbcs.emea.oraclecloud.com)? The answer is https://<domain>.storage.oraclecloud.com/v1; in my case it is https://rurittmanm.storage.oraclecloud.com/v1, and it stays the same for all instances.

All other parameters are the same as they were for the DBCS instance: we either specify an existing Cloud Storage container or create one here.
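
As a side note, nothing stops us from creating or checking the container outside of the wizard. The classic Storage Cloud Service exposes a Swift-style REST API, so a short Java sketch like the one below is enough to request a token and PUT the container. Treat it purely as an illustration: the /auth/v1.0 endpoint and the X-Storage-* headers follow the Swift convention, and the cloud.admin username is just a placeholder for any user with access to the storage service.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class StorageContainerSketch {
        public static void main(String[] args) throws IOException {
            String domain = "rurittmanm";                              // identity domain used in this post
            String base   = "https://" + domain + ".storage.oraclecloud.com";

            // 1. Authenticate: the Swift-style endpoint answers with an X-Auth-Token header
            HttpURLConnection auth = (HttpURLConnection) new URL(base + "/auth/v1.0").openConnection();
            auth.setRequestProperty("X-Storage-User", "Storage-" + domain + ":cloud.admin"); // placeholder user
            auth.setRequestProperty("X-Storage-Pass", System.getenv("STORAGE_PASSWORD"));
            String token = auth.getHeaderField("X-Auth-Token");

            // 2. Create (or simply verify) the container the instance will use for backups
            HttpURLConnection put = (HttpURLConnection) new URL(base + "/v1/Storage-" + domain + "/demo").openConnection();
            put.setRequestMethod("PUT");
            put.setRequestProperty("X-Auth-Token", token);
            System.out.println("Container create/verify returned HTTP " + put.getResponseCode());
        }
    }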

120-oacs-cloud-storage.png

The rest of the process is obvious. We get a Summary and then wait. It takes about 40 minutes to create a new instance.

Note: the diagnostics here are a bit poor, and when it says the instance start process has completed that may not actually be true. Sometimes it makes sense to wait a while before starting to panic.

Now we may access our instance as usual. The only difference is that the port is 80 rather than 9502 (or 443 for SSL). For Data Visualisation the link is http(s)://<ip address>/va, for BIEE it is http(s)://<ip address>/analytics, and for Essbase http(s)://<ip address>/essbase. Enterprise Manager and the WebLogic Server Console are available on port 7001, which is blocked by default.

The bad news is that HTTPS uses a self-signed certificate. Depending on browser settings, this may give an error or even prevent access over HTTPS.

The options here are either to use HTTP rather than HTTPS or to add this certificate to your local computer. Neither is an option for a production server, but luckily Oracle provides a way to use your own SSL certificates.

Typical Management Tasks

SSH to Instances

During the setup process we provide Oracle with a public key, which is used to get SSH access to the instances. There is nothing cloud-specific about this. On Windows we may use PuTTY: just add the private key to Pageant and connect to the instance as user opc.

140-pageant.png

150-putty.gi
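
If you'd rather script the connection than go through PuTTY, the same key pair works from any SSH client or library. Below is a minimal Java sketch using the JSch library; the IP address and key path are placeholders, and the only fixed part is the opc username.

    import com.jcraft.jsch.JSch;
    import com.jcraft.jsch.Session;

    public class OpcSshSketch {
        public static void main(String[] args) throws Exception {
            JSch jsch = new JSch();
            jsch.addIdentity("/path/to/private_key");                      // private half of the key pair from provisioning
            Session session = jsch.getSession("opc", "203.0.113.10", 22);  // placeholder IP; the user is always opc
            session.setConfig("StrictHostKeyChecking", "no");              // fine for a first test, not for production
            session.connect();
            System.out.println("Connected: " + session.isConnected());
            session.disconnect();
        }
    }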

Opening Ports

By default only the absolute minimum of ports is open, so we can't connect to the OAC instance with the BI Admin tool or to the DB with SQL Developer. In order to do this, we need to create an access rule which allows access to these particular ports.

To get to the Access Rules interface, we use the instance menu and select the Access Rules option.

150-access-menu.png

This opens the Access Rules list. What I don't like about it is that it shows the full list of all rules, but we can only create a rule for this particular instance.

160-access-rules-list.png

The new rule creation form is simple and should cause no issues, but be careful not to open up too much to the wild Internet.

170-new-rule.png

Add More Users

The user who registered a Cloud Account becomes its administrator and can invite more users and manage privileges.

180-access-users.png

Here we can add and modify users.

190-users.png

When we add a user we specify a name, email and login, and set the user's roles. The user will get an email with these details and a link to register.

Obviously, the new user won't be asked for a credit card; they just start working and that's all.

Summary

My first steps with Oracle Analytics Cloud were not very easy, but I think it was worth it. Now I can create a new OBIEE instance in just a few minutes, and one hour later it will be up and running. That's pretty fast compared to the normal process of creating a new server in a typical organisation. We don't need to think about OS installation, licenses, or anything else. Just try it.

Categories: BI & Warehousing

ACE Alumni

Tim Tow - Tue, 2017-05-23 23:08
Today, I asked Oracle to move me from Oracle ACE Director status to Oracle ACE Alumni status.  There are a number of reasons why I decided to change status.  When I started answering questions on internet forums years ago, I did it to share what I had learned in order to help others.  The same goes for this blog which I originally started so that I could give better and more complete answers to questions on the forums.

After the Hyperion acquisition by Oracle, I was contacted by Oracle who asked if I would be interested in becoming an "Oracle ACE".  It was an honor.  But over time, things have changed.  As more people found out about the ACE program, more people wanted to become an ACE.  If you have ever monitored the OTN Essbase and Smart View forums, they have become cluttered with copy and paste posts from people obviously trying to increase their points.  As the ACE program grew, it also became harder for the OTN team to manage, and it now requires formal activity reporting - a time report, if you will - to track contributions to the community.  As I am already extremely pressed for time, I decided that tracking my contributions to the community - in exchange for a free pass to Open World - just didn't make sense.

All of that being said, just because I have moved to Oracle ACE Alumni status doesn't mean that I will stop contributing to the community.  My company will continue to provide free downloads and support for the Next Generation (Essbase) Outline Extractor and the Outline Viewer along with free downloads of Drillbridge Community Edition.  And maybe, just maybe, I will finally have time to write some new blog posts (maybe even some posts on some new Dodeca features inspired by our work with Oracle Analytics Cloud / Essbase Cloud!)

Categories: BI & Warehousing

Users of Analytics Applications

Dylan's BI Notes - Sun, 2017-05-21 15:08
Business Users are consuming the data and the reports.  They see the information pushed to them.  They can see alerts on their phones.  They see emails.  They bookmark the page in their browser and periodically look at it.   They are executives, managers, busy users who have other duties.   They don't […]
Categories: BI & Warehousing

Delivery to Oracle Document Cloud Services (ODCS) Like A Boss

Tim Dexter - Wed, 2017-05-17 11:53

We have moved to a new blogging platform. This was a post from Pradeep that missed the cut over ...

In release 12.2.1.1, BI Publisher added a new feature - Delivery to Oracle Document Cloud Services (ODCS). Around the same time, BI Publisher was also certified against JCS 12.2.1.x and therefore, today if you have hosted your BI Publisher instance on JCS then we recommend Oracle Document Cloud Services as the delivery channel. Several reasons for this:

  1. Easy to configure and manage ODCS in BI Publisher on Oracle Public Cloud. No port or firewall issues.

  2. ODCS offers a scalable, robust and secure document storage solution on cloud.

  3. ODCS offers document versioning and document metadata support similar to any content management server

  4. Supports all business document file formats relevant for BI Publisher

When to use ODCS?

ODCS can be used for many different scenarios where a document needs to be securely stored on a server and retained for any duration. The scenarios may include:

  • Bursting documents to multiple customers at the same time.

    • Invoices to customers

    • HR Payroll reports to its employees

    • Financial Statements

  • Storing large or extremely large reports for offline printing

    • End of the Month/Year Statements for Financial Institutions

    • Consolidated department reports

    • Batch reports for Operational data

  • Regulatory Data Archival

    • Generating PDF/A-1b or PDF/A-2 format documents

How to Configure ODCS in BI Publisher?

Configuration of ODCS in BI Publisher requires the  URI, username and password. Here the username is expected to have access to the folder where the files are to be delivered.


 

How to Schedule and Deliver to ODCS?

Delivery to ODCS can be managed through both - a Normal Scheduled Job and a Bursting Job.

A Normal Scheduled Job allows the end user to select a folder from a list of values as shown below


In case of Bursting Job, the ODCS delivery information is to be provided in the bursting query as shown below:

Accessing Document in ODCS

Once the documents are delivered to ODCS, they can be accessed by users based on their access to the folder, very similar to FTP or WebDAV access.

That's all for now. Stay tuned for more updates !

 

Categories: BI & Warehousing

OBIEE upgrades and Windows vulnerabilities

Rittman Mead Consulting - Mon, 2017-05-15 06:00
OBIEE upgrades and Windows vulnerabilities

These two topics may seem unrelated; however, the ransomware attacks over the last few days provide us with a reminder of what people can do with known vulnerabilities in an operating system.

Organisations consider upgrades a necessary evil; they cost money, take up time and often have little tangible benefit or return on investment (ROI). In the case of upgrades between major versions of software, for example moving from OBIEE 10g to 12c, there are significant architecture, security, functional and user interface changes that may justify the upgrade alone, but they are unlikely to significantly change the way an organisation operates and may introduce new components and management processes which produce an additional overhead.

There is another reason to perform upgrades: to keep your operating systems compliant with corporate security standards. OBIEE, and most other enterprise software products, come with certification matrices that detail the supported operating system for each product. The older the version of OBIEE, the older the supported operating systems are, and this is where the problem starts.

Take the example of an organisation running OBIEE 10g: the most recent certified version of Windows it can run on is Windows 2008 R2, which is likely to fall outside your company's security policy. You are also less likely to be patching the operating system on that server, as it will either have fallen off the radar or Microsoft may have stopped releasing patches for that version of the operating system.

The result leaves a system that has access to critical enterprise data vulnerable to known attacks.

The only answer is to upgrade, but how do we justify ROI and obtain budget? I think we need to recognise that there is a cost of ownership associated with maintaining systems, the benefit of which is the mitigation of the risk of an incident like the ransomware attacks. It is highly unlikely that anyone could have predicted those attacks, so you could never have used them as a reason to justify an upgrade. However, these things do happen, and a significant number of cyber attacks probably go undetected. The best protection you have is to make sure your systems are up to date.

Categories: BI & Warehousing

BIP and Mapviewer Mash Up V

Tim Dexter - Mon, 2017-05-08 11:38

The last part on maps, I promise ... it's been a fun ride, for me at least :0) If you need to catch up on previous episodes:

In this post we're looking at map quality. On the left a JPG map, to the right an SVG output.

If we ignore the fact that they have different levels of features or layers, imagine getting the maps into a PDF and then printing them. It's pretty clear that the SVG version of the map is going to render better on paper than the JPG.

Getting the SVG output from MapViewer is pretty straightforward; getting BIP to render it requires a little bit of effort. I have already mentioned the XML request that we construct and then run a variable substitution on in our servlet. All we need to do is add another option to the requested output. MapViewer supports several flavors of SVG:

  • If you specify SVG_STREAM, the stream of the image in SVG Basic (SVGB) format is returned directly;
  • If you specify SVG_URL, a URL to an SVG Basic image stored on the MapViewer host system is returned.
  • If you specify SVGZ_STREAM, the stream of the image in SVG Compressed (SVGZ) format is returned directly;
  • If you specify SVGZ_URL, a URL to an SVG Compressed image stored on the MapViewer host system is returned. SVG Compressed format can effectively reduce the size of the SVG map by 40 to 70 percent compared with SVG Basic format, thus providing better performance.
  • If you specify SVGTINY_STREAM, the stream of the image in SVG Tiny (SVGT) format is returned directly;
  • If you specify SVGTINY_URL, a URL to an SVG Tiny image stored on the MapViewer host system is returned. (The SVG Tiny format is designed for devices with limited display capabilities, such as cell phones.)

Don't panic, I've looked at them all for you and we need to use SVGTINY_STREAM. This sends back a complete XML representation of the map in SVG format. We have a couple of issues:

  1. We need to strip the XML declaration from the top of the file: <?xml version="1.0" encoding="utf-8"?>. If we don't, BIP will choke on the SVG. Being lazy, I just used a string function to strip the line out in my servlet (a sketch of this follows below the list).

  2. We need to stream the SVG back as text, so we need to set the CONTENT_TYPE for the servlet to 'text/javascript'.
  3. We need to handle the SVG when it comes back to the template. We do not use the
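
To make the first two points concrete, here is a minimal sketch of that kind of servlet code. It is illustrative rather than the actual servlet from this series: the fetchFromMapViewer helper simply stands in for the XML request and variable substitution described earlier.

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class SvgRelaySketch extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            String svg = fetchFromMapViewer(req);                    // placeholder for the MapViewer call
            svg = svg.replaceFirst("<\\?xml[^>]*\\?>\\s*", "");      // point 1: strip the XML declaration
            resp.setContentType("text/javascript");                  // point 2: stream the SVG back as text
            resp.getWriter().write(svg);
        }

        private String fetchFromMapViewer(HttpServletRequest req) {
            // stand-in for the SVGTINY_STREAM request built via variable substitution
            return "<?xml version=\"1.0\" encoding=\"utf-8\"?><svg xmlns=\"http://www.w3.org/2000/svg\"/>";
        }
    }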




Categories: BI & Warehousing

A focus on Higher Education, HEDW 2017

Rittman Mead Consulting - Wed, 2017-05-03 09:04

First, before I get into a great week of Higher Education Data Warehousing and analytics discussions, I want to thank the HEDW board and their membership. They embraced us with open arms in our first year of conference sponsorship. Our longtime friend and HEDW board member, Phyllis Wykoff, from Miami University of Ohio even spent some time with us behind the booth!

HEDW was in the lovely desert scape of Tucson, AZ at the University of Arizona. Sunday was a fantastic day of training, followed by three days of outstanding presentations from member institutions and sponsors. Rittman Mead wanted to show how important the higher education community is to us, so along with me, we had our CEO-Jon Mead, our CTO-Jordan Meyer, and our US Managing Director-Charles Elliott. If our AirBnB had ears, it would have heard several solutions to the problems of the world as well as discussions of the fleeting athleticism of days gone past. But alas, that will have to wait.

While at the conference, we had a multitude of great conversations with member institutions and there were a few themes that stuck out to us with regard to common issues and questions from our higher education friends. I will talk a little bit about each one below with some context on how Rittman Mead is the right fit to be partners in addressing some big questions out there.

Legacy Investment vs BI tool Diversification (or both)

One theme that was evident from hour one was the influx of Tableau in the higher education community. Rittman Mead is known for being the leader in the Oracle Business Intelligence thought and consulting space and we very much love the OBIEE community. With that said, we have, like all BI practitioners, seen the rapid rise of Tableau within departments and lately as an enterprise solution. It would be silly for the OBIEE community to close their eyes and pretend that it isn’t happening. There are great capabilities coming out of Oracle with Data Visualization but the fact is, people have been buying Tableau for a few years and Tableau footprints exist within organizations. This is a challenge that isn't going away.

Analytics Modernization Approaches

We had a ton of conversations about how to include newer technologies in institutions’ business intelligence and data warehousing footprints. There is clearly a desire to see how big data technologies like Hadoop, data science topics like the R statistical modeling language, and messaging services like Kafka could positively impact higher education organizations. Understanding how you may eliminate batch loads, predict student success, know if potential financial aid is not being used, know more about your students with analysis of student transactions with machine learning, and store more data with distributed architectures like Hadoop are all situations that are readily solvable. Rittman Mead can help you prioritize what will make the biggest value impact with a Modernization Assessment. We work with organizations to make good plans for implementation of modern technology at the right place and at the right time. If you want more info, please let us know.

Sometimes we need a little help from our friends

Members of HEDW need a different view or another set of eyes sometimes and the feedback we heard is that consulting services like ours can seem out of reach with budgets tighter than ever. That is why we recently announced the Rittman Mead Expert Service Desk. Each month, there are hours available to spend however you would like with Rittman Mead’s experts. Do you have a mini project that never seems to get done? Do you need help with a value proposition for a project or upgrade? Did production just go down and you can’t seem to figure it out? With Expert Service desk, you have the full Rittman Mead support model at your fingertips. Let us know if you might want a little help from your friends at Rittman Mead.

To wrap up

Things are a changing and sometimes it is tough to keep up with all of the moving parts. Rittman Mead is proud to be a champion of sharing new approaches and technologies to our communities. Spending time this week with our higher education friends is proof more that our time spent sharing is well worth it. There are great possibilities out there and we look forward to sharing them throughout the year and at HEDW 2018 in Oregon!

Categories: BI & Warehousing

Top 5 Quotes from Oracle’s 2017 Modern Finance Experience

Look Smarter Than You Are - Mon, 2017-05-01 12:40
Three days of Oracle’s Modern Finance Experience set my personal new record for “Most Consecutive Days Wearing a Suit.” Surrounded by finance professionals (mostly CFOs, VPs of FP&A, and people who make money from Finance execs), I came prepared to learn nothing… yet found myself quoting the content for days to come.

The event featured top notch speakers on cutting edge concepts: the opening keynote with Mark Hurd, a panel on the changing world of finance with Matt Bradley & Rondy Ng, Hari Sankar on Hybrid in the world of Oracle EPM, and even one of my competitors (more on that in a second).

For those of you who couldn’t be there (or didn’t want to pay a lot of money to dress up for three days), I thought I’d share my top five quotes as best as I could transcribe them.

“IT currently spends 80% of its budget on maintenance. Boards are demanding increased security, compliance, and regulatory investment. All these new investments come from the innovation budget, not maintenance.”
-          Mark Hurd, Oracle, Co-Chief Executive Officer

Mark Hurd was pulling double duty: he gave the opening keynote at Oracle HCM World (held at a nearby hotel) and then bolted over to Oracle Modern Finance Experience to deliver our keynote. He primarily talked Oracle strategy for the next few years which – to badly paraphrase The Graduate – can be summed up in one word: Cloud.

He gave a compelling argument for why the Cloud is right for Oracle and businesses (though server vendors and hosting providers should be terrified). Now let me be clear: much of this conference was focused around the Cloud, so many of these quotes will be too, but what I liked about Mark’s presentation was it gave clear, concise, and practically irrefutable arguments of the benefits of the Cloud.

The reason I liked the quote above is it answers the concerns from all those IT departments: what happens to my job if I don’t spend 80% of our resources on maintaining existing systems? You’ll get to spend your time on actually improving systems. Increased innovation, greater security, better compliance … the things you’ve been wanting to get to but never have time or budget to address.

“The focus is not on adding lots of new features to on-premises applications. Our priority is less on adding to the functional richness and more on simplifying the process of doing an upgrade.”

-          Hari Sankar, Oracle, GVP of Product Management

I went to a session on the hybrid world of Oracle EPM. I knew Hari would be introducing a customer who had both on-premises Hyperion applications and Cloud applications. What I didn’t know is that he would be addressing the future of Oracle EPM on-premises. As most of you know, the current version for the on-premises Oracle EPM products is 11.1.2.4.x. What many of you do not know is that Oracle has taken future major versions (11.1.2.5 and 12c) of those products off the roadmap.

Hari spoke surprisingly directly to the audience about why Oracle is not abandoning EPM on-prem, but why they will not be pushing the Cloud versions and all their cool new functionality back down to the historical user base. To sum up his eight+ minute monologue, the user base is not requesting new functionality. They want simplicity and an easy path to transition to the Cloud eventually, and that’s why Oracle will be focusing on PSUs (Patch Set Updates) for the EPM products and not on “functional richness.”

Or to put it another way: Hyperion Planning and other Hyperion product users who want impressive new features? Go to the Cloud because they’re probably never coming to on-premises. To quote Hari once more, “create a 1-3 year roadmap for moving to a Cloud environment” or find your applications increasingly obsolete.

 “Hackers are in your network: they’re just waiting to pull the trigger.”

-          Rondy Ng, Oracle, SVP of Applications Development

There was an entertaining Oracle panel led by Jeff Jacoby (Master Principal Sales Consultant and a really nice guy no matter what his family says) that included Rondy Ng (he’s over ERP development), Matt Bradley (he’s over EPM development), and Michael Gobbo (also a lofty Master Principal Sales Consultant). While I expected to be entertained (and Gobbo’s integrated ERP/HCM/EPM demo was one for the ages), I didn’t expect them to tackle the key question on everyone’s mind: what about security in the Cloud?

Mark Hurd did address this in his keynote and he gave a fun fact: if someone finds a security flaw in Oracle's software on a Tuesday, Oracle will patch it by Wednesday, and it will take an average of 18 months until that security patch gets installed in the majority of their client base. Rondy addressed it even more directly: if you think hackers haven't infiltrated your network, you're sticking your head in the sand.

Without going into all of Rondy’s points, his basic argument was that Oracle is better at running a data center than any of their customers out there. He pointed out that Oracle now has 90 data centers around the world and that security overrides everything else they do. He also said, “security is in our DNA” which is almost the exact opposite of “Danger is my middle name,” but while Rondy’s line won’t be getting him any dates, it should make the customer base feel a lot safer about letting Oracle host their Cloud applications.

 “Cloud is when not if.”

-          David Axson, Accenture, Managing Director

I have to admit, I have developed a man crush on one of my competitors. I wrote down more quotes from him than from every other speaker at the event put together. His take on the future of Finance and Planning so closely paralleled my thoughts that I almost felt like he had read the State of Business Analytics white paper we wrote. For instance, in that white paper, we wrote about Analysis Inversion: that the responsibility for analyzing the report should be in the hands of the provider of the report, not the receiver of the report. David Axson put it this way: “The reporting and analysis is only as good as the business decisions made from it. In finance, your job starts when you deliver the report and analysis. Most people think that's when it ends.”

The reason I picked the quote above is because it really sums up the whole theme of the conference: the Cloud is not doing battle with on-premises. The Cloud already did that battle: it won with a single sucker punch while on-prem was thinking it had it made, and the Cloud is currently dancing on the still unconscious body of on-prem, which right now is having a bad nightmare about losing its Blackberry while walking from Blockbuster to RadioShack.

David is right: the Cloud is coming to every company and the only question is when you’ll start that journey.

“Change and Certainty are the new normal. Combat with agility.”

-          Rod Johnson, Oracle, SVP North America ERP, EPM, SCM Enterprise Business

So, what can we do about all these changes coming to Finance? And for that matter, all the changes coming to every facet of every industry in every country on Earth? Rod Johnson (which he assures me is not his "stage" name) said it best: don't fight the change but rather embrace it and make sure you can change faster than everyone else.

"Change comes to those who wait, but it’s the ones bringing the change who are in control."

-          Edward Roske, interRel, CEO


To read more about some of those disruptive changes coming to the world of Finance, download the white paper I mentioned above.
Categories: BI & Warehousing

Deliver Reports to Document Cloud Services!

Tim Dexter - Fri, 2017-04-28 16:32

Greetings !

In release 12.2.1.1, BI Publisher added a new feature - Delivery to Oracle Document Cloud Services (ODCS). Around the same time, BI Publisher was also certified against JCS 12.2.1.x and therefore, today if you have hosted your BI Publisher instance on JCS then we recommend Oracle Document Cloud Services as the delivery channel. Several reasons for this:

  1. Easy to configure and manage ODCS in BI Publisher on Oracle Public Cloud. No port or firewall issues.
  2. ODCS offers a scalable, robust and secure document storage solution on cloud.
  3. ODCS offers document versioning and document metadata support similar to any content management server
  4. Supports all business document file formats relevant for BI Publisher

When to use ODCS?

ODCS can be used for many different scenarios where a document needs to be securely stored on a server and retained for any duration. The scenarios may include:

  • Bursting documents to multiple customers at the same time.
    • Invoices to customers
    • HR Payroll reports to its employees
    • Financial Statements
  • Storing large or extremely large reports for offline printing
    • End of the Month/Year Statements for Financial Institutions
    • Consolidated department reports
    • Batch reports for Operational data
  • Regulatory Data Archival
    • Generating PDF/A-1b or PDF/A-2 format documents

How to Configure ODCS in BI Publisher?

Configuration of ODCS in BI Publisher requires the  URI, username and password. Here the username is expected to have access to the folder where the files are to be delivered.



How to Schedule and Deliver to ODCS?

Delivery to ODCS can be managed through both - a Normal Scheduled Job and a Bursting Job.

A Normal Scheduled Job allows the end user to select a folder from a list of values as shown below


\

In case of Bursting Job, the ODCS delivery information is to be provided in the bursting query as shown below:


Accessing Document in ODCS

Once the documents are delivered to ODCS, they can be accessed by users based on their access to the folder, very similar to FTP or WebDAV access.

That's all for now. Stay tuned for more updates !

Categories: BI & Warehousing

The Case for ETL in the Cloud - CAPEX vs OPEX

Rittman Mead Consulting - Thu, 2017-04-27 11:12

Recently Oracle announced a new cloud service for Oracle Data Integrator. Because I was helping our sales team by doing some estimates and statements of work, I was already thinking of costs, ROI, use cases, and the questions behind making a decision to move to the cloud. I want to explore the business case for using or switching to ODICS.

Oracle Data Integration Cloud Services

First, let me briefly explain what Oracle Data Integration Cloud Services is. ODICS is ODI version 12.2.1.2 available on Oracle's Java Cloud Service, known as JCS. Several posts cover the implementation, migration, and technical aspects of using ODI in the cloud. Instead of covering the 'how', I want to talk about the 'when' and 'why'.

Use Cases

What use cases are there for ODICS?
1. You have or soon plan to have your data warehouse in Oracle’s Cloud. In this situation, you can now have your ODI J2EE agent in the same cloud network, removing network hops and improving performance.
2. If you currently have an ODI license on-premises, you are allowed to install that license on Oracle’s JCS at the JCS prices. See here for more information about installing on JCS. These use cases are described in a webinar posted in the PM Webcast Archive.

When and Why?

So when would it make sense to move towards using ODICS? These are the scenarios I imagine being the most likely:
1. A new customer or project. If a business doesn’t already have ODI, this allows them to decide between an all on-premises solution or a complete solution in Oracle’s cloud. With monthly and metered costs, the standard large start-up costs for hardware and licenses are avoided, making this solution available for more small to medium businesses.
2. An existing business with ODI already and considering moving their DW to the cloud. In this scenario, a possible solution would be to move the current license of ODI to JCS and begin using that to move data, all while tracking JCS costs. When the time comes to review licensing obligations for ODI, compare the calculation for a license to the calculation of expected usage for ODICS and see which one makes the most sense (cents?). For a more detailed explanation of this point, let’s talk CAPEX and OPEX!

CAPEX vs. OPEX

CAPEX and OPEX are short for Capital Expense and Operational Expense, respectively. From a finance and budgeting perspective, these two show up very differently on financial reports, which often has tax considerations for businesses. Traditionally, a data warehouse project was a very large initial capital expenditure, with hardware, licenses, and project costs, which would land it very solidly as CAPEX. Over the last several years, sponsorship for these projects has shifted from CIOs and IT Directors to CFOs and Business Directors. With this shift, many businesses would rather budget for and see these expenses monthly as an operating expense, as opposed to incurring large capital expenses every few years, putting these projects into OPEX instead.

Conclusion

Having monthly and metered service costs in the cloud that are fixed or predictable is appealing. As a bonus, this style of service is highly flexible and can scale up (or down) as demand changes. If you are or will soon be in the process of planning for your future business analytics needs, we provide expert services, assessments, accelerators, and executive consultations for assisting with these kinds of decisions. When it is time to talk about actual numbers, your Oracle Sales Representative will have the best prices. Please get in touch for more information.

Categories: BI & Warehousing

Breaking News! Dodeca Spreadsheet Management System Certified on Oracle Analytics Cloud!

Tim Tow - Thu, 2017-04-20 23:37
Now that the Oracle Analytics Cloud, or "OAC", has been released, we had to get serious about our work with one of the Oracle Analytics Cloud components, the Essbase Cloud Service, or "EssCS" for short.  You would think that we should have been working hard on EssCS for quite some time, but we had been assured by Oracle product management that the Essbase Java API would be available in EssCS.  Of course, Dodeca was built using the Essbase Java API and thus we expected that support for EssCS would be very easy.

We got access to a production version of the EssCS last week and started our work.  As promised by product management, the Essbase Java API is available in EssCS and, believe it or not, we did not need to change a single line of source code in order to support the Essbase Cloud.  We did, however, have to update our build processes to use Java 8 instead of the decrepit Java 6 used in Essbase 11.x.

As far as configuration inside Dodeca itself is concerned, the only change we made was to point the APSUrl in the Essbase Connection object at the Essbase Cloud APS instance.  Note that the URL format has changed in the cloud.  The Java API was accessible in Essbase 9.3.1 through Essbase 11.1.2.4 using the format:

http://<server>:<port>/aps/JAPI

In the cloud, this has changed to:

http://<server>:<port>/essbase/japi
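
For anyone connecting through the Java API directly rather than through Dodeca, the change really is just the provider URL. The following is a minimal sign-on sketch based on the standard JAPI pattern; the host, credentials and the EssbaseCluster-1 server name are placeholders rather than values from our environment.

    import com.essbase.api.datasource.IEssOlapServer;
    import com.essbase.api.domain.IEssDomain;
    import com.essbase.api.session.IEssbase;

    public class EssCsSignOnSketch {
        public static void main(String[] args) throws Exception {
            String apsUrl = "http://esscs-host:80/essbase/japi";   // was http://<server>:<port>/aps/JAPI on premise

            IEssbase ess = IEssbase.Home.create(IEssbase.JAPI_VERSION);
            IEssDomain dom = ess.signOn("admin", "password", false, null, apsUrl); // placeholder credentials
            IEssOlapServer srv = dom.getOlapServer("EssbaseCluster-1");            // placeholder server name
            srv.connect();
            System.out.println("Connected to " + srv.getName());
            ess.signOff();
        }
    }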

The Essbase Connection configuration looks pretty much the same as it does for an on premise connection:








Of course, the Dodeca views look identical when run against an on premise or a cloud server:



In summary, it was trivial to test Dodeca using EssCS.  Every piece of Essbase functionality that we use in the product, from data grid operations to metadata operations and even report scripts, worked exactly the same as it does against an on premise Essbase cube.  Based on our testing, we are certifying the Dodeca Spreadsheet Management System to work on the Oracle Analytics Cloud.

We have a number of innovations we plan to introduce in the near future aimed at improving the Essbase Cloud experience, so stay tuned.  If you are planning to come to Kscope17 in San Antonio, plan to attend the Dodeca Symposium and you may just be among the first to see some of these cool new things!


Categories: BI & Warehousing

SQL-on-Hadoop: Impala vs Drill

Rittman Mead Consulting - Wed, 2017-04-19 10:01
 Impala vs Drill

I recently wrote a blog post about Oracle's Analytic Views and how those can be used in order to provide a simple SQL interface to end users with data stored in a relational database. In today's post I'm expanding my horizons a little bit by looking at how to effectively query data in Hadoop using SQL. The SQL-on-Hadoop interface is key for many organizations - it allows querying the Big Data world using existing tools (like OBIEE, Tableau, DVD) and skills (SQL).

Analytic Views, together with Oracle's Big Data SQL provide what we are looking for and have the benefit of unifying the data dictionary and the SQL dialect in use. It should be noted that Oracle Big Data SQL is licensed separately on top of the database and it's available for Exadata machines only.

Nowadays there is a multitude of open-source projects covering the SQL-on-Hadoop problem. In this post I'll look in detail at two of the most relevant: Cloudera Impala and Apache Drill. We'll see details of each technology, define the similarities, and spot the differences. Finally we'll show that Drill is most suited for exploration with tools like Oracle Data Visualization or Tableau while Impala fits in the explanation area with tools like OBIEE.

As we'll see later, both tools are inspired by Dremel, a paper published by Google in 2010 that defines a scalable, interactive ad-hoc query system for the analysis of read-only nested data and that is the basis of Google's BigQuery. Dremel defines two aspects of big data analytics:

  • A columnar storage format representation for nested data
  • A query engine

The first point inspired Apache Parquet, the columnar storage format available in Hadoop. The second point provides the basis for both Impala and Drill.

Cloudera Impala

We started blogging about Impala a while ago, as soon as it was officially supported by OBIEE, testing it for reporting on top of big data Hadoop platforms. However, we never went into the details of the tool, which is the purpose of the current post.

Impala is an open source project inspired by Google's Dremel and one of the massively parallel processing (MPP) SQL engines running natively on Hadoop. As per Cloudera's definition, it is a tool that:

provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats.

Two important bits to notice:

  • High performance and low latency SQL queries: Impala was created to overcome the slowness of Hive, which relied on MapReduce jobs to execute the queries. Impala uses its own set of daemons running on each of the datanodes saving time by:
    • Avoiding the MapReduce job startup latency
    • Compiling the query code for optimal performance
    • Streaming intermediate results in memory, while MapReduce always writes to disk
    • Starting the aggregation as soon as the first fragment starts returning results
    • Caching metadata definitions
    • Gathering tables and columns statistics
  • Data stored in popular Apache Hadoop file formats: Impala uses the Hive metastore database. Databases and tables are shared between both components. The list of supported file formats include Parquet, Avro, simple Text and SequenceFile amongst others. Choosing the right file format and the compression codec can have enormous impact on performance. Impala also supports, since CDH 5.8 / Impala 2.6, Amazon S3 filesystem for both writing and reading operations.

One of the performance improvements is related to "Streaming intermediate results": Impala works in memory as much as possible, writing to disk only if the data size is too big to fit in memory; as we'll see later this is called optimistic and pipelined query execution. This has immediate benefits compared to standard MapReduce jobs, which for reliability reasons always write intermediate results to disk.
As per this Cloudera blog, the usage of Impala in combination with the Parquet data format achieves the performance benefits explained in the Dremel paper.

Impala Query Process

Impala runs a daemon, called impalad, on each Datanode (a node storing data in the Hadoop cluster). A query can be submitted to any daemon in the cluster, which will act as coordinator node for the query. Impala daemons are always connected to the statestore, a process that keeps a central inventory of all available daemons and their health, and pushes this information back to all daemons. A third component, called the catalog service, checks for metadata changes driven by Impala SQL in order to invalidate the related cache entries. Metadata is cached in Impala for performance reasons: accessing metadata from the cache is much faster than checking against the Hive metastore. The catalog service process is in charge of keeping Impala's metadata cache in sync with the Hive metastore.

Once the query is received, the coordinator verifies that the query is valid against the Hive metastore, then information about data location is retrieved from the Namenode (the node in charge of storing the list of blocks and their locations on the datanodes). The coordinator fragments the query and distributes the fragments to other impalad daemons to execute the query. All the daemons read the needed data blocks, process the query, and stream partial results to the coordinator (avoiding the write to disk), which collects all the results and delivers them back to the requester. The result is returned as soon as it's available: certain SQL operations, like aggregations or order by, require all the input to be available before Impala can return the end result, while others, like a select of pre-existing columns without an order by, can be returned with only partial results.

 Impala vs Drill

Apache Drill

Defining Apache Drill as SQL-on-Hadoop is limiting: also inspired by Google's Dremel, it is a distributed, datasource-agnostic query engine. The datasource-agnostic part is very relevant: Drill is not closely coupled with Hadoop; in fact it can query a variety of sources like MongoDB, Azure Blob Storage, or Google Cloud Storage, amongst others.

One of the most important features is that data can be queried schema-free: there is no need to define the data structure or schema upfront - users can simply point the query to a file directory, MongoDB collection or Amazon S3 bucket and Drill will take care of the rest. For more details, check our overview of the tool. One of Apache Drill's objectives is cutting down the data modeling and transformation effort, providing zero-day analysis as explained in this MapR video.
 Impala vs Drill

Drill is designed for high performance on large datasets, with the following core components:

  • Distributed engine: Drill processes, called Drillbits, can be installed on many nodes and are the execution engine of the query. Nodes can be added or removed manually to adjust performance. Queries can be sent to any Drillbit in the cluster, which will act as Foreman for the query.
  • Columnar execution: Drill is optimized for columnar storage (e.g. Parquet) and execution using the hierarchical and columnar in-memory data model.
  • Vectorization: Drill takes advantage of modern CPU design, operating on record batches rather than iterating on single values.
  • Runtime compilation: Compiled code is faster than interpreted code and is generated ad-hoc for each query.
  • Optimistic and pipelined query execution: Drill assumes that none of the processes will fail and thus does all the pipeline operations in memory rather than writing to disk, spilling to disk only when memory isn't sufficient.

Drill Query Process

Like Impala's impalad, Drill's main component is the Drillbit: a process running on each active Drill node that is capable of coordinating, planning, executing and distributing queries. Installing a Drillbit on all of Hadoop's data nodes is not compulsory; however, doing so gives Drill the ability to achieve data locality: executing the queries where the data resides, without the need to move it over the network.

When a query is submitted to Drill, the client/application sends a SQL statement to a Drillbit in the cluster (any Drillbit can be chosen), which acts as Foreman (the coordinator in Impala terminology), parsing the SQL and converting it into a logical plan composed of operators. The next step is the cost-based optimizer which, based on rule- and cost-based optimizations, data locality and storage engine options, rearranges operations to generate the optimal physical plan. The Foreman then divides the physical plan into phases, called fragments, which are organised in a tree and executed in parallel against the data sources. The results are then sent back to the client/application. The following image, taken from drill.apache.org, explains the full process:

 Impala vs Drill

Similarities and Differences

As we saw above, Drill and Impala have a similar structure - both take advantage of always-on daemons (faster than starting a MapReduce job) and assume an optimistic query execution, passing intermediate results through memory. Runtime code compilation and the distributed engine are also common to both, and both are optimized for columnar storage formats like Parquet.

There are, however, several differences. Impala works only on top of the Hive metastore while Drill supports a larger variety of data sources and can link them together on the fly in the same query. For example, implicit schema-defined files like JSON and XML, which are not supported natively by Impala, can be read immediately by Drill.
Drill usually doesn't require a metadata definition upfront, while for Impala a view or external table has to be declared before querying. It follows that for Drill there is no concept of a central and persistent metastore, and no metadata repository to manage. In OBIEE's world, both Impala and Drill are supported data sources, and the same applies to Data Visualization Desktop.
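
That difference is easy to see from a client's point of view. The sketch below queries both engines over JDBC from Java: Impala reads a table that must already exist in the Hive metastore, while Drill is pointed straight at a file through its dfs storage plugin. Host names, ports, the web_logs table and the events.json file are all illustrative, and the exact driver class and URL depend on the JDBC driver you deploy (the Impala call below goes through the HiveServer2-compatible endpoint).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SqlOnHadoopSketch {
        public static void main(String[] args) throws Exception {
            // Impala: the table must already be defined in the Hive metastore
            try (Connection c = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/default;auth=noSasl");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery("SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
                while (rs.next()) System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }

            // Drill: no upfront metadata - the query points straight at a file via the dfs plugin
            try (Connection c = DriverManager.getConnection("jdbc:drill:drillbit=drill-host:31010");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery("SELECT status, COUNT(*) FROM dfs.`/data/events.json` GROUP BY status")) {
                while (rs.next()) System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
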
 Impala vs Drill

The aim of this article isn't a performance comparison, since that depends on a huge number of factors including data types, file formats, configuration, and query types. A comparison dating back to 2015 can be found here. Please be aware that newer versions of both tools have been released since that comparison, bringing a lot of changes and performance improvements to both projects.

Conclusion

Impala and Drill share a similar structure - both inspired by Google's Dremel - relying on always-active daemons deployed on cluster nodes to provide the best query performance on top of Big Data structures. So which one should you choose, and when?
As described, Apache Drill's capability to query a raw data source without requiring an upfront metadata definition makes the tool perfect for insight discovery on top of raw data. The ability to join data coming from one or more storage plugins in a single query makes the mash-up of disparate data sources easy and immediate. Data science and prototyping before the design of a reporting schema are perfect use cases for Drill. However, as part of the discovery phase, a metadata definition layer is usually added on top of the data sources; this makes Impala a good candidate for reporting queries.
Summarizing, if all the data points are already modeled in the Hive metastore, then Impala is your perfect choice. If instead you need a mash-up with external sources, or need to work directly with raw data formats (e.g. JSON), then Drill's auto-exploration and openness are what you're looking for.
Even though both tools are fully compatible with Oracle BIEE and Data Visualization (DV), Drill's data exploration nature means it is more in line with DV use cases, while Impala is more suitable for standard reporting with OBIEE. The decision on tooling depends highly on the specific use case - source data types, file formats and configuration have a deep impact on the agility of the business analytics process and on query performance.

If you want to know more about Apache Drill, Impala and the use cases we have experienced, don't hesitate to contact us!

Categories: BI & Warehousing

Data Lake and Data Warehouse

Dylan's BI Notes - Fri, 2017-04-07 11:23
This is an old topic but I have learned more and come up with more perspectives over time. Raw Data vs Clean Data Metadata What kind of services are required? Data as a Service Analytics as a Service Raw Data and Clean Data I think that assuming that you can use raw data directly is a dangerous thing. […]
Categories: BI & Warehousing
