Rittman Mead Consulting

Rittman Mead consults, trains, and innovates within the world of Oracle Business Intelligence, data integration, and analytics.

Getting Apex 4.2 Up and Running on Oracle 12c Using PL/SQL Gateway

Mon, 2017-02-27 10:30

Recently I was asked to create a simple Master-Detail Apex form on top of some metadata tables stored within Oracle 12c. Apex version 4.2 comes pre-installed in 12c, but it needs configuring before it can be used. This is where I hit some trouble, which stems from a conflict of information across a number of different sources.

I was following the Oracle certified documentation on 4.2 and how to get it running on 12c specifically, but it wouldn't work. Time after time I kept getting 'Error 404' messages whenever I tried accessing Apex through the browser. I gave up on the Oracle documentation and decided to scout out some blogs, but they were virtually copy-and-paste of the same steps and resulted in the same error messages. I asked colleagues at work and also joined the Apex Slack community, but nobody quite understood why it wasn't working.

It seemed like a lost cause, until I decided to check the later versions of the Oracle documentation and came across an extra piece of information from an Apex 5.0 document:

Procedures CREATE_ACL, ASSIGN_ACL, ADD_PRIVILEGE and CHECK_PRIVILEGE in DBMS_NETWORK_ACL_ADMIN are deprecated in Oracle Database 12c. Oracle recommends using APPEND_HOST_ACE instead.

Given that 12c comes with Apex 4.2, I was surprised to find it not mentioned anywhere else but in the Apex 5.0 documentation.
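
For contrast, the ACL setup that older, 11g-era guides typically show relies on the now-deprecated procedures and looks roughly like the sketch below (illustrative only; the ACL file name and values are assumptions, not taken from the documentation referenced above):

    BEGIN
      -- Deprecated in 12c: CREATE_ACL / ASSIGN_ACL / ADD_PRIVILEGE
      DBMS_NETWORK_ACL_ADMIN.CREATE_ACL(
        acl         => 'apex_acl.xml',      -- hypothetical ACL file name
        description => 'ACL for APEX',
        principal   => 'APEX_040200',
        is_grant    => TRUE,
        privilege   => 'connect');
      DBMS_NETWORK_ACL_ADMIN.ASSIGN_ACL(
        acl  => 'apex_acl.xml',
        host => '*');
      COMMIT;
    END;
    /

On 12c, step 8 below uses the recommended APPEND_HOST_ACE call instead.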

So the intention of this blog entry is to bring together all the information needed to get Apex 4.2 running with the PL/SQL Gateway on Oracle 12c, so you're able to take a quick and easy look at what it can offer.

Steps to configure the PL/SQL Gateway on Oracle 12c

  1. Change your working directory to where Apex is installed:

    cd $ORACLE_HOME/apex  
    
  2. Connect to SQL*Plus as a user with SYSDBA privileges:

    sqlplus / as sysdba  
    
  3. Switch containers to the pluggable database container:

    ALTER SESSION SET CONTAINER = <PDB_NAME>;  
    
  4. Run the apex_epg_config.sql script, passing $ORACLE_HOME as the argument:

    @apex_epg_config.sql /u01/app/oracle/product/12.1.0/dbhome_1/
    

    Log back into SQL*Plus from the Apex directory if you get logged out

  5. Then unlock the ANONYMOUS user within the Root Container:

    ALTER SESSION SET CONTAINER = CDB$ROOT;
    ALTER USER ANONYMOUS ACCOUNT UNLOCK;
    
  6. Switch to the pluggable database container and run apxconf.sql:

    ALTER SESSION SET CONTAINER = <PDB_NAME>; 
    @apxconf.sql;
    

    When prompted, change the ADMIN password and assign a port for Apex to listen on - the default is 8080.

  7. Verify the Oracle XDB Protocol Server Port that was set after running the apxconf.sql script:

    ALTER SESSION SET CONTAINER = <PDB_NAME>;  
    SELECT DBMS_XDB.GETHTTPPORT FROM DUAL;
    

    If it returns 0, enable the XDB HTTP server:

    EXEC DBMS_XDB.SETHTTPPORT(port); COMMIT;  
    
  8. Enable network services in Oracle Database 12c - these are disabled by default. The following will give access to all hosts:

    BEGIN
      DBMS_NETWORK_ACL_ADMIN.APPEND_HOST_ACE(
        host => '*',
        ace  => xs$ace_type(privilege_list => xs$name_list('connect'),
                            principal_name => 'apex_040200',
                            principal_type => xs_acl.ptype_db));
    END;
    /
    
    

    NOTE: If you require stricter security, specify a host name in the host parameter instead, e.g. host => 'localhost'.

  9. Open your browser and connect to Apex:

    http://hostname:8080/apex
    
Summary

This blog has given you the guidelines for getting Apex 4.2, that comes pre-installed on Oracle 12c, up and running.

Note that 4.2 is now quite dated and it is recommended to upgrade to Apex 5. This ensures continued support from Oracle, and access to newer and more efficient features.

Categories: BI & Warehousing

Data Preparation for "Analyse Elections with Oracle Data Visualisation Desktop"

Fri, 2017-02-24 09:13

In the previous post I wrote about some interesting insights we can get from the elections data, and this post concentrates on the data preparation behind it. Like the first one, it is not intended as a guide; it aims to show some of the data manipulation an end user can do without an IT department or coding. The data preparation isn't strictly needed to understand the visual part, but it is still a piece of the story, so it was extracted into a separate post.

The Data

From the logical point of view I have two datasets.

  1. Voting results by election commissions: number of votes for every political party and a lot of technical measures like number of registered voters, number of good and damaged voting ballots and so on.
  2. Turnout figures at given times throughout the day.

From the physical point of view, both datasets are big files (about 500 MB combined), each line of which is a single record. These records represent information at various levels of aggregation. I will use this fact to show some of the Data Flows features (great introduction from Francesco Tisiot).

Generally, each record is a JSON which looks like this:

{  
 [...]
 "Data":{  
  "1 | Number of voters included in the voters list at the time of the end of voting":"102",
  "2 | The number of ballots received by the precinct election commission":"102",
  "3 | The number of ballots issued to voters who voted early":"0",
  [...]
  "19 | 1 Russian political party 'Rodina'":"0",
  "20 | 2 Political party 'COMMUNISTS COMMUNIST PARTY OF RUSSIA'":"3",
  "21 | 3 Political party 'Russian Party of Pensioners for Justice'":"0",
  [...]
  },
  "Loc":[  
   "Nizhny Novgorod Region",
   "129 Nizhny Novgorod Region - Nizhny Novgorod",
   "28 Krasnooktyabrskaya",
   "PEC #1148"
   ]
}

You can find a full sample of the JSON here.

The Data part is a set of 32 measures and Loc is an array of one to four elements. The more detailed the record, the more elements in the Loc array.

The whole votes dataset is slightly smaller than 100,000 records. As I said, records have various levels of aggregation, and this sample shows the most detailed data (its Loc has four elements). Nizhny Novgorod Region is obviously the region; 129 Nizhny Novgorod Region - Nizhny Novgorod is a concatenation of the district electoral commission number (129), the region name (Nizhny Novgorod Region) and the district electoral commission name (Nizhny Novgorod); 28 Krasnooktyabrskaya is the number and name of a territorial election commission; and PEC #1148 is the precinct electoral commission. For the most aggregated records Loc looks like:

"Loc":[  
      "DEC #129"
   ]

This is the aggregated data for district election commission #129.

Unfortunately, DVD doesn't work with the JSON format directly. One possibility is to use Apache Drill (for more information see previous posts by Francesco Tisiot and Robin Moffatt). Apache Drill is a supported option for Oracle DV, but it isn't an end-user tool, and this story is not about building a complete and reusable system - it is about storytelling. So I did a one-time manual conversion using Excel magic and got a plain table.
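
Had I gone the Drill route, a query over the raw JSON would have looked roughly like the sketch below (the file path, aliases and filter are assumptions):

SELECT t.Loc[0]  AS region,
       t.Loc[3]  AS pec,
       t.`Data`.`1 | Number of voters included in the voters list at the time of the end of voting` AS registered_voters
FROM   dfs.`/data/protocols.json` t
WHERE  REPEATED_COUNT(t.Loc) = 4;  -- keep only the most detailed records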

The second part of the data is the turnout. In JSON form it's basically the same: one record per line, a Data part with four values and a Loc part based on the same principles. The data is significantly smaller and has only four values - the turnout at four fixed moments. But here I need it in a more dimensional form: I want time as a dimension rather than pivoted.

In my opinion, pivot/unpivot is one more highly desired feature DVD needs. We can live without it in tools like OBIEE because its models aren't built on rapidly changing user-generated data structures, but for desktop data exploration tools like DVD powerful data transformation capabilities are crucial.

But until it is implemented, I did the unpivot in Excel and got this.
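
For SQL-minded readers, the reshaping is the classic unpivot; a conceptual equivalent (table and column names are assumptions, as are the four reporting times) would be:

SELECT region, tec, pec, time_of_day, turnout_pct
FROM   turnout_wide
UNPIVOT (turnout_pct FOR time_of_day IN (t_1000 AS '10:00',
                                         t_1200 AS '12:00',
                                         t_1500 AS '15:00',
                                         t_1800 AS '18:00'));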

It wasn't obvious at the beginning of the data preparation stage that I would need it in this particular form, but the good news is that both visualisation and data manipulation live in one product, so in reality I was able to build charts, realise that I needed more data/attributes/etc., add them and go back to the analysis. Very smooth and clean.

I Thought We Weren’t Supposed to Use Excel?

While DVD's Data Flows feature is powerful and very useful, there are some data manipulation tasks that Excel simply does more easily or better than DVD. I hope that Oracle will improve Data Flows over time, but right now it's not perfect.

I used Excel for the following:

  1. Splitting long strings into fields. I could do it using locate+substring+left+right functions in DVD, but that would produce unreadable code. Excel's Text to Columns tool is the weapon of choice so far.
  2. UNION. When I need to combine two datasets into one, Excel's copy and paste is beyond competition. Adding set operations (not only union) would greatly improve Oracle's tool.
  3. A few technical tasks like translating the data. Initially, the data was not in English and I had to translate it. I could use either Excel or DVD for this task but chose Excel in order to simplify the post. If I had done it in DVD, it would just have been a few more joins. I can't say which is simpler, so there is no clear winner here.

Note: I'm talking about static data sets. In the case of changing data that may not be true.

Data Transformations in DVD

While I was able to (or had to) do some of the data preparation in Excel, other tasks are better and easier to do in DVD using the relatively new Data Flows feature. I will start by building a clean dictionary of electoral commissions.

When I finished my Excel part, I got this:

It isn't so bad, but I can make it better in a few minutes. I want to rename Location1 to Subject, split Location2 into DEC # (129) and DEC name (Nizhny Novgorod), remove the leading digits from Location3 and rename it to TEC name, and also rename Location4 to PEC #.

Here is the data flow which does it. It may not look like the simplest thing in the world, but I spent about the same time building it as you spent reading what it does.

The second thing I want to do is to split my wide file with 32 measures into two: one for party results and another for, let's say, "technical" measures like the number of voters, ballots and so on. It's not absolutely necessary, but it will please me and anyone else who sees it. Both data flows for this task use the Commissions dictionary I made earlier and the same source file. I filter the source to get only the most detailed data, join it with commissions, select only the interesting columns and save the result. Simple and straightforward. The only difference is the sets of selected columns.

I could start building visualisations with this data, but I'd prefer to calculate some relative measures first. Absolute numbers won't help me because every commission has its own number of voters. We can't compare them directly, but relative measures will do the trick.
The data flow for these calculations is the simplest of all.
The only part that isn't immediately obvious is why I placed two consecutive Add Columns steps: the first one calculates a Total votes column as the sum of the votes for all parties, and the second one uses this column to calculate percentages.
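
Conceptually the two steps boil down to something like this (party and column names are assumptions):

SELECT v.*,
       ROUND(100 * united_russia_votes / total_votes, 2) AS united_russia_pct       -- second step: percentage
FROM  (SELECT p.*,
              party1_votes + party2_votes + party3_votes + party4_votes AS total_votes  -- first step: Total votes
       FROM   votes p) v;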

And it is really cool how we can use the results of different flows: I took Protocols, made Commissions, joined Protocols and Commissions to get the Votes data, and later made calculations based on that Votes data.

And here I want to show how the autocomplete feature in the formula editor works (because I really like it). I hope to see it in OBIEE one day.

I have one more data flow for the turnout data, but it simply joins source file with the dictionary of commissions.

Summary

I believe Oracle DVD is a very powerful tool for a data analyst. While many users prefer to stick with Excel exclusively, DVD can significantly increase their data preparation and transformation capabilities. Even now, at the very beginning of its journey, DVD makes many data manipulation operations fast and easy. It is obviously worth giving it a try and including it in your data analysis process. To see how the data from this post can be used for analysis and for drawing interesting conclusions, have a look at Analyse Elections with Oracle Data Visualisation Desktop.

Categories: BI & Warehousing

Analysing Elections Data with Oracle Data Visualisation Desktop

Fri, 2017-02-24 04:38

Disclaimer #1 This post is not about politics. Its dataset is about politics, but that's a coincidence. It could be about immunisation or transport or anything else. If you are strictly against any politics, here is a link to watch cute kittens.

Introduction

Let's pretend that I'm an analyst who has been given a supposedly interesting data set. Now I want to understand whether the data is actually interesting or total rubbish. Having been in IT for some time, I could use tools and technologies that a typical end user can't access, but this time I'm pretending to be an ordinary analyst who has only the data and desktop tools. My task is to do some research and tell whether there are anomalies in the data or whether everything looks the way it is supposed to.
The main tool for this work is obviously Oracle Data Visualisation Desktop (DVD), and, as a supplementary tool, I use Excel. This post is not a guide to any particular DVD feature; it won't give step-by-step instructions or anything like that. The main idea is to show how we can use Oracle Data Visualisation for the analysis of a real dataset - not simply to show that we can build bar charts, pie charts and other fancy charts, but to show how we can get insights from the data.

The Data

I should say a few words about the data. It is the official result of the Russian State Duma (parliament) elections in 2016. Half of the Duma was elected by party lists, and for this post I took that data. I should confess that I cheated a little: I decided not to spend my time downloading and parsing the data piece by piece from the official site, and took a prepared set instead.

From a bird's-eye view I have the following data:

  1. Voting results by election commissions: number of votes for every political party and a lot of technical measures like number of registered voters, number of good and damaged voting ballots and so on.
  2. Turnout figures at given times throughout the day.

From a more technical point of view, the data was stored in two big files with multiple JSON records in each. As the data preparation part is big enough, it was extracted into another post. This one concentrates purely on visualisation; the second one is about data flows and a comparison with Excel.

Analysing the Data

I did some cleaning, refining and enriching of the data, and now it's time to use it. I started with a standard Stacked Bar chart. It shows the percentages of parties by region, and in addition the width of the bars shows Total votes. The chart is sorted by ascending Total votes.

What can we say here?

Before I start talking I need a lawyer and a disclaimer #2:

Disclaimer #2 Some of the results may be interpreted in different ways, and some of them may not be so pleasant for some people. But I'm not a court and this post is only a data visualisation exercise; therefore I'm not accusing anyone of committing any crimes. I will draw some conclusions because the rules of the genre demand it, but they should be treated as hypotheses only.

That's not a proven charge (see disclaimer #2), but to me these regions look a bit suspicious. Their results are very uncommon: the ruling United Russia party (orange bars) got an extremely high result in these few regions. This may be a sign of some kind of interference with the election process there. But of course, other explanations (including a measure incorrectly selected for sorting) exist.

Just for reference so we don’t forget the names: Tatarstan, Kemerovo, Dagestan, Chechnya and Mordovia. There are a few more regions with similar results but their number of voters is lower so I don’t show them here.

At this point I'm starting to suspect something, but I need more visuals to support my position. My next hypothesis is that in these regions ballots were somehow added to the voting boxes (or protocols were changed, which is basically the same thing). From a data visualisation point of view, this means these regions should show a higher turnout (because of the added ballots) along with a higher United Russia result.

To check this I need one more measure - Turnout, %. It shows what share of the registered voters actually voted. I can create this field without leaving visualisation mode. Cool.

Note. This formula may not be absolutely correct, but it works well for demonstration purposes.

In order to visualise this hypothesis, I built a Scatter chart. Its horizontal axis is Turnout, % and its vertical axis is United Russia, %. I added a trend line to make things more obvious. Colour density shows Total votes.

I think my hypothesis just got strong support. As usual, this situation is not absolutely impossible, but it's hard to explain why the more people come to the voting stations, the higher one party's result is. I'd expect either a high result independent of turnout (more or less a uniform distribution) or at least a few exceptions with high turnout and a low result.

I consider this result strange because in real life I'd expect a higher turnout to mean more opposition voters (a notoriously passive group) coming to the voting stations. But that's only my opinion, and a highly arguable one, I agree. What I really want to show here is that these charts highlight an oddity that should be investigated and may or may not have a rational explanation.

And who are our heroes? Let’s zoom in on the chart …and we see the same regions.

But maybe other parties can show the same result? We can build the same Scatter charts for every party or we can visualise all at once with a usual Line chart. Here I’ve plotted the percent of vote won by each party (Y-axis) against the overall turnout % (X-axis).

United Russia is the only party that increases with turnout.

So far all my charts have been about relative measures; it's time to check some absolute values. Here is a Bar chart which shows the number of precinct commissions by turnout. I'd expect to see something close to a normal distribution - a bell-shaped chart with the maximum around 54% (the average turnout). Now look at the real chart (bin size = 1.0%): its maximum is at 36-37%.
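
Conceptually the chart is just a count of commissions per turnout bin, along these lines (column and table names are assumptions):

SELECT FLOOR(turnout_pct) AS turnout_bin,   -- 1% bins
       COUNT(*)           AS pec_count
FROM   votes
GROUP  BY FLOOR(turnout_pct)
ORDER  BY turnout_bin;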

Normal distribution

In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.

Strictly speaking, all the numbers I'm showing here are discrete and I should say binomial distribution rather than normal, but right now, for my purposes, the difference is not that big.

I'm definitely not Carl Gauss (and even not one of his best students) and you may ignore my opinion, but I expected something more like this:

And I don't have the slightest idea how it is possible that the most "popular" turnout is 100%.

What if we look at the same chart in more detail? The previous one was grouped into 1% bins; this one has 0.1% bins, and I had to add a filter for turnout not equal to 100%: even with the smaller bin size, that last bar stays at almost the same ~3K commissions, so much bigger than the others that the chart shows nothing useful if it is left in.
What can we see here? Well, people aren't really good at generating random numbers. It's perfectly OK to have some outliers on the chart, but it's not normal for them to sit mostly at round values. It looks like someone was trying to fit the result to some plan.

Here is my favourite part of the chart. I marked 1% intervals, and you can see that round turnout value is always more probable than its closest non-round neighbours. No exceptions. A possible explanation is that the number of commissions with results that high is relatively low and even the slightest manipulation is clearly visible.

But wait. What about that 64.3 percent? It's not round, but it is a big outlier. Let's take a closer look at this value and check whether there is anything interesting there or whether it is a normal situation. Here are a few interesting visualisations for it.

The first one is a Tree Diagram. It shows all the existing combinations of district and precinct commissions by region for the filtered data (turnout = 64.3%). In order to demonstrate how it works in this case, I made an animation. Most of the regions have a few commissions with 64.3% turnout, but Saratov region beats them all.

This visualisation has a serious flaw: the end user has to scroll it (for this set of data, at least) and can miss the point. Another visualisation improves the situation - a Network diagram doesn't need scrolling.

It looks good and shows exactly the same thing. But for this chart we must ensure that every data point is unique, which is not true in my case: different precinct commissions have the same numbers, so I had to create a unique field first (DEC #||PEC #). It's easy to forget this and get unpredictable or even misleading results.

Or, if you prefer more traditional charts, here is a Sunburst for you. Its sector size shows Total votes and the colour shows the PEC count. It gives a good representation of how uncommon Saratov's result is.

And the last picture on this topic is the boring but never-old classic Bar chart.

Considering all these charts, I'd say that the almost exclusive concentration of commissions with 64.3% turnout in Saratov doesn't look normal to me. It's pretty weird that sibling commissions show exactly the same figures.

A few more diagram types that could work well are Sankey and Parallel Coordinates; unfortunately, they are less informative here because of the high number of precinct commissions. But if the number were lower, I'd consider them too.

All the previous charts are based on the voting data, but I have one more dataset - the official turnout. Let's check if we can find anything interesting there. Unfortunately, a significant part of the commissions doesn't have official data, and in places I may use formulas that are not exactly the same as the official ones, so the numbers may differ slightly from what I got from the protocols data.

The first chart shows the number of commissions (vertical axis) by the official turnout (horizontal axis). Colour shows the moment in time. Strictly speaking, I shouldn't use continuous line charts for discrete values, but coloured overlapping bars don't give as clear a picture.

Except for the 100% tail, everything is more or less natural. The graph's shape looks more like a Gamma distribution than a Normal one, but I didn't test it.

What picture do I have for the various regions?
Moscow city is known for a relatively high number of poll watchers, so we might expect cleaner data there. Ignoring the long tail, this looks normal (or binomial, if you want to be precise).

Saratov region. The one with 64.3% turnout. Look at these peaks. Do they look natural to you?
Do you remember Tatarstan (it was one of the heroes at the beginning of this story)? Here it is. I don't know how anyone can explain how this is possible (without manual adjustment of the results, I mean).

Summary

This post shows how we can use Oracle DVD for the visualisation of a real data set, and I hope I was able to convince you that this tool can be useful and can give you really interesting insights. Of course, visualisation alone doesn't answer all questions - and this time it was actually less about answers and more about questions. It helps you ask the right questions.

More reading on the topic: 1, 2 (Russian language). If you can read Russian, there you will find more visualisations, discussions and interesting information. And this article is about the elections in 2011; its indisputable advantage is that it is in English.

Categories: BI & Warehousing

OBIEE 12c Time Dimension: Logical Sequence Number

Thu, 2017-02-23 02:59
 Logical Sequence Number

The key component of any successful OBIEE implementation is the metadata model known as the repository (RPD). Doing it right is sometimes considered "black magic", and small mistakes in the RPD can impact all the exposed Subject Areas, resulting in poor performance or, even worse, wrong results.

Working on an RPD requires dedicated knowledge of the tool, and we share it in our RPD modelling training for both OBIEE 11g and OBIEE 12c.

If you have ever worked on RPD modelling, one of the settings you have surely encountered is the Time dimension. This blog post, written back in 2007, explains the process of setting up a time dimension for OBIEE 10g. The process didn't change significantly until recently when, in 12.2.1, Logical Sequence Numbers were introduced. As per Oracle's documentation, this new feature "optimizes time series functions and in some cases improves query time", and in this post we'll see how to configure it and its impact on time-series calculations. The examples shown below are based on Oracle Sampleapp v607, a really good source of modelling and front-end examples.

Usual Time-series Query Behaviour

Time-series functions like Ago, ToDate and, more recently, PeriodRolling allow end users to compare results coming from different moments in time just by specifying the level in the time dimension hierarchy and the number of periods to look backwards or forwards. For example, if you needed to compare the current month's sales revenue with the previous month's figure, you would end up writing a formula like:

 AGO("F0 Sales Base Measures"."1- Revenue","H0 Time"."Month", 1)

Where:

  • AGO: is the Time-series function being called
  • "F0 Sales Base Measures"."1- Revenue": is the metric
  • "H0 Time"."Month": is the time hierarchy level
  • 1: is the amount of periods (months in our case) to look back in history

Once the time-series metric has been created, it can be used in an analysis like the following to compare Revenue of 2015-04 with the one of the previous month.

 Logical Sequence Number

The analysis generates the following Logical SQL which basically lists the columns retrieved and the filters applied.

SELECT  
   0 s_0,
   "A - Sample Sales"."Time"."T02 Per Name Month" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2,
   "A - Sample Sales"."Time Series"."127  Mth Ago Rev  (Fix Time Lvl)" s_3
FROM "A - Sample Sales"  
WHERE  
("Time"."T02 Per Name Month" = '2015 / 04')
ORDER BY 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY  

The translation to source SQL hugely depends on the data model created in the RPD and on the data source type. In our example an Oracle SQL gets generated containing the following steps:

  • Sequence Generation: a DENSE_RANK function is used to create a dense sequence based on Per_Name_Month, the chronological key defined in the time hierarchy for the month level. Mth_Key is also part of the query since it's used in the join between the dimension and the fact table.
WITH  
OBICOMMON0 AS (select DENSE_RANK() OVER ( ORDER BY T653.Per_Name_Month) as c1,  
     T653.Mth_Key as c2,
     T653.Per_Name_Month as c3
from  
     BISAMPLE.SAMP_TIME_MTH_D T653 /* D02 Time Month Grain */ ),
  • Period Shifting: The sequence generated above is now shifted by the number of periods defined in the column formula (D1.c1 + 1), in our example 1.
SAWITH0 AS (select D1.c1 + 1 as c1,  
     D1.c2 as c2,
     D1.c3 as c3
from  
     OBICOMMON0 D1),
  • Ago Period Query: Using the period shifted query the historical record (or set of records) is retrieved.
SAWITH1 AS (select distinct D1.c1 as c1,  
     D1.c3 as c2
from  
     OBICOMMON0 D1),
SAWITH2 AS (select sum(T418.Revenue) as c1,  
     D3.c2 as c2
from  
     BISAMPLE.SAMP_REVENUE_F T418 /* F10 Billed Rev */ ,
     SAWITH0 D4,
     SAWITH1 D3
where  ( T418.Bill_Mth_Key = D4.c2 and D3.c1 = D4.c1 and D3.c2 = '2015 / 04' )  
group by D3.c2, D4.c3),  

The period shifted query usage is explained visually by the image below

 Logical Sequence Number

  • Selected Period Query: the query for the selected period, in our case 2015-04, is executed using standard time dimension
SAWITH3 AS (select sum(T418.Revenue) as c1,  
     T653.Per_Name_Month as c2
from  
     BISAMPLE.SAMP_TIME_MTH_D T653 /* D02 Time Month Grain */ ,
     BISAMPLE.SAMP_REVENUE_F T418 /* F10 Billed Rev */ 
where  ( T418.Bill_Mth_Key = T653.Mth_Key and T653.Per_Name_Month = '2015 / 04' )  
group by T653.Per_Name_Month)
  • Resultsets joining: Results coming from Ago Period and Selected Period queries are then joined with an outer join.
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3, D1.c4 as c4 from ( select D1.c1 as c1,  
     D1.c2 as c2,
     D1.c3 as c3,
     D1.c4 as c4
from  
     (select 0 as c1,
               coalesce( D1.c2, D2.c2) as c2,
               D2.c1 as c3,
               D1.c1 as c4,
               ROW_NUMBER() OVER (PARTITION BY coalesce( D1.c2, D2.c2) ORDER BY coalesce( D1.c2, D2.c2) ASC) as c5
          from 
               SAWITH2 D1 full outer join SAWITH3 D2 On D1.c2 = D2.c2
     ) D1
where  ( D1.c5 = 1 )  
order by c2 ) D1 where rownum <= 5000001  

As you can see, it's a rather complex set of instructions that involves several steps, including analytical functions like the DENSE_RANK() in the sequence generation. In our case we have been lucky that the source system provides the DENSE_RANK() function; with other sources the sequence generation needs to be calculated directly by the BI Server (OBIS), with a possible degradation in performance.

What is Logical Sequence Number?

The Sequence Generation step mentioned above can sometimes be the bottleneck of the whole query especially when the time dimension is complex or huge in volume since DENSE_RANK() is a costly analytical function.

All that OBIEE's time-series functions need in order to work is a pure sequence, or in Oracle's words:

Sequence numbers are enumerations of time dimensional members at a certain level. The enumeration must be dense (no gaps) and must correspond to a real time order. For example, months in a year can be enumerated from 1 to 12.

So what if we could find a way of pre-calculating the sequence and storing it in the table, or calculating it on the fly using functions less expensive than DENSE_RANK()?

This is the idea behind the Logical Sequence Number (LSN): a way of avoiding the need for the BI Server (OBIS) to execute the DENSE_RANK(), by providing either a pre-calculated sequence column in the source table or a cheaper function to calculate it on the fly from existing columns.

The formula behind a Logical Sequence Number must resolve to a number (no varchars or dates are allowed) and can be either:

  • Absolute: when the sequence is consistent and doesn't need any external reference, e.g. calendar years are a self-defined sequence
  • Relative: when the sequence is relative to a parent level, e.g. creating a sequence for months between 1 and 12 would need the calendar year as parent level to provide correct cross-years references

As the name already says, "Sequence Numbers" must be sequential; no gaps can be included. Year and month numbers are good examples of this.

But what if we wanted to provide an absolute sequence number at month level?
One could think of using a month key in the YYYYMM format, but this is not a sequence: 201701 <> 201612 + 1. The best way of generating the sequence number is to add it as a column in the database table.

I'll take as an example the SAMP_TIME_MTH_D table that can be found in Sampleapp v607, which contains the MTH_KEY column in the YYYYMM format.
 Logical Sequence Number

I'll add to SAMP_TIME_MTH_D a column MONTH_ABS_SEQ containing the absolute sequence number. The formula behind the column is exactly what Oracle was doing under the covers using a DENSE_RANK.

SELECT MTH_KEY, DENSE_RANK() OVER (ORDER BY MTH_KEY) MONTH_ABS_SEQ FROM SAMP_TIME_MTH_D  

And the end result, as expected, is:

 Logical Sequence Number

How are Logical Sequence Number Used?

Previously I described the two types of Logical Sequence Numbers, Absolute and Relative; each one has its use case:

  • PERIODROLLING: this function uses absolute LSNs to calculate the starting sequence number based on the selected one: e.g. looking at the previous image, a 6-month rolling calculation starting from 2009-02 will include data from sequences in the range [9-14].
  • TO_DATE: uses relative LSNs, e.g. a Year-To-Date measure shown by month will use the month's relative LSN to gather the previous months in the same year.
  • AGO: the Ago function uses both absolute and relative LSNs. Absolute numbers are used if the grain of the query and the grain of the AGO are at the same level, e.g. a yearly analysis of Actual vs Previous Year. However, when the grain of the shift in the AGO is higher than the grain of the query, relative LSNs are used, e.g. a monthly analysis of Actual vs Previous Year.

How to Configure Logical Sequence Numbers

Logical Sequence Numbers are a new feature and as such require additional settings in the time dimension hierarchy before they will work. In our case we'll add two sequences: an absolute one at calendar year level and a relative one at calendar month level.
We'll add the sequences directly as formulas in the repository; however, those formulas should be pushed down as columns in the database table if optimal performance is sought.

In order to set the Logical Sequence Numbers we need to open the RPD (the SampleApp one in this test), and select the Time dimension we want to change.
 Logical Sequence Number

After selecting any level apart from the Grand Total (the top of the hierarchy), a new tab called "Sequence Numbers" should be visible. However, if, as in our case, none of the columns at that level are integers or doubles, the sequence number selector is disabled.

 Logical Sequence Number

In order to enable the selector we first need to create a sequence column in our dimension and bring it to the correct level in the hierarchy. For the Year level there is already an integer column named "T35 Cal Year" which can be used as a sequence. We simply need to drag the column to the Year level in the Time hierarchy and set it as the Absolute Sequence.

 Logical Sequence Number

I can do the same with the Month level in the hierarchy and the "T32 Cal Month" column. Note that the column contains only the month enumeration from 1 to 12, so we need to set the sequence number as Relative to the Year level.
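
If a suitable column didn't already exist, a relative month sequence could be derived in the database along the following lines (a sketch only; the column names follow the SampleApp examples shown elsewhere in this post):

SELECT Per_Name_Month,
       Cal_Year                                                   AS year_abs_seq,   -- absolute LSN at year level
       DENSE_RANK() OVER (PARTITION BY Cal_Year ORDER BY Mth_Key) AS month_rel_seq   -- relative LSN, restarts every year
FROM   BISAMPLE.SAMP_TIME_MTH_D;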

 Logical Sequence Number

Please note that both absolute and relative LSNs can (and should) be entered since, as discussed above, each has a different use case. In addition, relative LSNs should be set for all logical parent levels in the hierarchy since they will be used only if the grain of the time shift matches that of the parent level. For example, a monthly LSN based on the Year logical level will only be used in AGO functions with a year shift, and not in the case of a quarterly shift.

For optimal usage, every level of the time hierarchy should have one absolute LSN and a relative LSN for each of its parent levels in the hierarchy.

Impact on SQL

It's time now to review the SQL generated by our analysis and check the differences with the old-school time-series query.

When creating an analysis at year level like the following
 Logical Sequence Number

As Expected the BI Server (OBIS) uses the CAL_YEAR column as sequence instead of the DENSE_RANK() function over the PER_NAME_YEAR column.

WITH  
OBICOMMON0 AS (select T795.CAL_YEAR as c1,  
     T795.QTR_KEY as c2,
     T795.PER_NAME_YEAR as c3
from  
     BISAMPLE.SAMP_TIME_QTR_D T795 /* D03 Time Quarter Grain */ )

When using TODATE, the relative sequence is used, as in the following example where the measure "166 Revenue Year To Date" is defined by the formula:

TODATE("01 - Sample App"."F0 Sales Base Measures"."1- Revenue",  "01 - Sample App"."H0 Time"."Year" )  

 Logical Sequence Number

The following query is generated; note the usage of Cal_Year and Cal_Month in the sequence generation query instead of the DENSE_RANK() function, as per the RPD settings mentioned above.

WITH  
OBICOMMON0 AS (select T653.Cal_Year as c1,  
     T653.Cal_Month as c2,
     T653.Per_Name_Month as c3,
     T653.Per_Name_Year as c4
from  
     BISAMPLE.SAMP_TIME_MTH_D T653 /* D02 Time Month Grain */ )
...
Are Logical Sequence Numbers Useful?

Most of the time, the bottleneck when using time series is not the sequence generation, since the time dimension cardinality is rarely big enough to produce a noticeable delay in query time, but rather the poorly managed calculations performed at query time on top of massive fact tables.

Don't expect LSNs to solve all your performance problems with time series. However, the usage of Logical Sequence Numbers gives OBIEE a way of pre-cooking part of the calculation and so, in theory, should help performance. The small effort required to set them up centrally in the time hierarchy is repaid by the benefits at query time, without having to touch any pre-defined time-series calculation.

If you do have performance problems with your OBIEE system, or would like to ensure that a system you’re building will be performant from the outset, please get in touch to find out more about our Performance Analytics service!
We also provide expert OBIEE training, implementations, QA and health checks - to find out more about how we can help you, please contact us!

Categories: BI & Warehousing

Introducing Advanced Analytics Training from Rittman Mead!

Tue, 2017-02-21 09:00


Rittman Mead is proud to release our new training course: Advanced Analytics with Oracle's R Technologies.

Oracle has made significant investments in the R language with Oracle R, ROracle and Oracle R Enterprise. Using these tools, data scientists and business intelligence practitioners can work together more efficiently and can transition between their roles more easily.

Rittman Mead has developed a three-day course that tackles R's notoriously steep learning curve. It builds on Oracle professionals' existing skills to accelerate growth into R programming and data science.

What does the course include?

Day one is all about the R programming language, starting with a history and explanation of Oracle's R technologies. Hands-on coding begins right away, with practical labs comparing R's data types and data structures with those found in the Oracle Database. The day wraps up with R programming concepts like conditions and functions, providing a strong grasp of the fundamentals on the very first day.

Day two focuses on the analysis pipeline, from data acquisition to data visualization. You will use SQL and R to tidy and transform raw data into a structured format and then use visualization and basic statistics to gain insights.

Day three looks at statistical modeling—discussing linear models and the predictive modeling pipeline. We present pros and cons of different types of models and get hands-on with preprocessing, model tuning, cross-validation and interpreting model results.

Our course is a mixture of theory and practical exercises—ensuring that you'll understand the tools and know when to apply them.

Who should attend?

The course is suitable for Oracle practitioners having some experience with SQL and business intelligence. No previous knowledge of R is assumed or necessary.

Sounds great, where do I sign up?

Please view our UK & Europe or US training schedule for public courses. For any questions or queries, including on-site training requests, please contact Daniel Delgado (US) or Sam Jeremiah (UK & Europe) for more details.

Categories: BI & Warehousing

OBIEE 12c - Regression Tester for Application Roles and Mappings

Mon, 2017-02-20 10:08

Allow me to introduce you to the Application Role Mapping validation script for OBIEE 12c. With this bit of code, we can utilize OBIEE's runcat command script to export our application role and permissions mapping information from multiple environments, and ensure their consistency. The picture below is an example of what you'll see as the final product. While it does not show you a side by side comparison of both environments, what it does do is display those objects in your lower environment that contain inconsistent application role or permissions mappings vs the target environment, and their respective configuration. The items in the path column then point you to that object in your lower environment catalog for further examination.

sec_audit_pre

Our script only requires a few easy steps to generate a web-based view of any inconsistencies in application role and permissions mappings between your tested environments. Like the Baseline Validation Tool (BVT), it compares environments, but it goes one step further and performs a fine-grained examination of application role and permissions mappings, with a resulting view. The BVT only catches that something is different about the object, as indicated by the Name column value, and tells you where to look.

BVT_audit

While I'll be sure to go into more detail later, the first picture above shows us that we have a number of application role and/or permissions mappings that exist in the lower environment but not in the target OBIEE environment. Curious? Let's jump right into it.


Overview


The security audit is essentially a 3-step process, and was designed to be really accessible and simple to use. It breaks out like this:

  • Run the security_audit.py script in both OBIEE12c environments (being the lower environment that possesses the proper app role/permission mappings and the target environment).
  • security_audit.py will generate a .csv file in each environment.
  • Move the CSV from the target environment into the directory where you've got the CSV in the lower environment. After you've got the files moved, you'll run security_compare.py and simply pass in the locations of your lower environment CSV, and then that of the target environment. Lastly, a browser will pop up, giving you an immediate view of any inconsistencies that exist between your two OBIEE 12c instances.

Let's take a look at the process in a bit more detail!


Step by Step


Run security_audit.py in Lower Environment

First, let's make sure we've got a few libraries installed that we'll need to run our code. I recommend using pip for this. You'll need to install pandas and flask if you have not done so already. Simply navigate to the security_audit directory you got from GitHub and then from the command line run:

python3 security_audit.py

The script is smart enough to figure out which kind of OS it's running on. Then, if your DOMAIN_HOME variable is set correctly (probably something like ORACLE_HOME/user_projects/domains/bi), the runcat command will run, exporting a CSV that contains the information we need for the next script, which does the actual crunching between environments. On Windows, the default output location is C:\; on Linux, it is the /tmp directory.

script run

The tool will prompt you to enter your DOMAIN_HOME, should you not have one set in your environment.

enter domain home



Run security_audit.py in Target Environment

Next, in our target environment - the OBIEE 12c instance we want to verify contains the same app role mappings as our lower environment - run the security_audit.py script once again, following the same steps as outlined above. Rename the CSV to something different from the file that was written in your lower environment, as we're going to need to put both of these files in the same directory.

rename mapping files

So on that note, after you've renamed your security_mappings CSVs, move them to the same directory on your lower environment. We're simply renaming them so we don't clobber one version or the other, and for easy reference as to which file belongs to its corresponding environment.



Run security_compare.py in Lower Environment

At this point you should have two security_mappings.csv files (although the names should now be slightly different) in your lower environment. We're going to need them as inputs for the next part of the regression testing process. This next bit of code simply checks consistency between the two environments. Once run, it will instantly display any catalog objects in your lower environment whose application roles or permissions are mapped differently from those in your target environment. As of this writing, there is really no good way to do this with any native OBIEE tool, aside from running the runcat reports yourself and doing the crunching manually. So let's do it!

Open a command prompt in your environment, taking note of where your two CSV files are located. In this example, we'll be using a Windows command prompt, with our files located directly off of the C:\ drive.

In your command prompt, navigate to the location of your .py files. On that note, make sure you do not separate these from the other files in the security_audit directory - Flask needs them to render the resulting web page. In my example below, I've navigated to the security_audit directory and then run the following:

run_script

And then watch the magic happen! Make sure you have pop-ups enabled if you're having trouble rendering the page. The script will auto-magically figure out the host name for your environment and run it there.

permissions_list

  1. Owner - the owner of the catalog object that is showing the variance in permissions assignment.

  2. Name - this is the name of the object, as it is displayed in the catalog.

  3. Path - this is the path to the object in the web catalog

  4. ACL - these are the detailed permissions mappings based on each entity assigned to the particular object. There is an accompanying key to the left hand side to help you out.

  5. Permissions - detail level permissions mapped to each object by owner entity.


Summary


Having any trouble running the script? Please get in touch! I would also love to hear some feedback on how it might have helped you perform one task or the other. Feel free to use the comments section below for this or to report any issues.

Categories: BI & Warehousing

Financial Reports - which tool to use? Part 2

Thu, 2017-02-16 12:00
Financials in BI Publisher

Financial Reports - which tool to use? Part 2

I find it interesting that BI Publisher is mostly known for the creation of pixel-perfect repeating forms (invoices, labels, checks, etc.) and its ability to burst them. To me, BI Publisher is the best-kept secret for the most challenging reports known to mankind.

In my last blog - https://www.rittmanmead.com/blog/2017/02/financial-reports-which-tool-to-use-part-1/, I discussed some of the challenges of getting precisely formatted financial reports in OBIEE, as well as some pros and cons of using Essbase/HFR. Although we can work through difficult solutions and sometimes get the job done, BI Publisher is the tool that easily allows you to handle the strangest requirements out there!

If you have OBIEE, then you already have BI Publisher, so there is no need to purchase another tool. BI Publisher comes integrated with OBIEE, and they can both be used from the same interface. The transition between BI Publisher and OBIEE is often seamless to the user, so you don’t need to have concerns over training report consumers in another tool, or even transitioning to another url.

The BIP version that comes embedded with OBIEE 12c comes loaded with many more useful features like encryption and delivering documents to Oracle Document Cloud Service. Check out the detailed new features here: http://www.oracle.com/technetwork/middleware/bi-publisher/new-features-guide-for-12-2-1-1-3074557.pdf

In BI Publisher, you can leverage data from flat files, from different databases, from an Essbase cube, from the OBIEE RPD, from one (or multiple) OBIEE analyses, from web services and more:

Financial Reports - which tool to use? Part 2

So, if you already have very complex OBIEE analyses that you could not format properly, you can use these analyses, and all the logic in them, as sources for your perfectly formatted BI Publisher reports.

Every BI Publisher report consists of three main components:

  1. Data Model - data source that you will use across one or more reports

  2. Layout(s) - which will define how your data is presented

  3. Properties - which are related to how it generates, displays and more

You start a BI Publisher project by creating a data model that contains the different data sets that you would like to use on your report (or across multiple reports). These data sets, which reside inside of your data model, can be of the same source or can come from multiple sources and formats. If you regularly use OBIEE, you can think of a data model as the metadata for one or more reports. It is like a very small, but extremely flexible and powerful RPD.

Financial Reports - which tool to use? Part 2

Inside the data model you can connect your data sets using bind variables (which creates a hierarchical relationship between data sets), or you can leave them completely disconnected. You can also connect some of your data sets while leaving others disconnected.
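
As a small illustration of the bind-variable link, a child data set's query can simply reference a value from the parent data set as a bind parameter (the table and column names here are assumptions):

-- Parent data set
SELECT dept_id, dept_name FROM departments

-- Child data set, linked to the parent through the :DEPT_ID bind variable
SELECT emp_name, salary
FROM   employees
WHERE  dept_id = :DEPT_ID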

The most impressive component of this tool is that it will allow you to do math from the results of disconnected data sets, without requiring ETL behind the scenes. This may be one of the requirements of a very complex financial report, and one that is very difficult to accomplish with most tools. The data model can extract and transform data within a data set, or extract only, so that it can later be transformed during your report template design!

For example, within a data set, you can create new columns to suit most requirements - they can be filtered, concatenated, or have mathematical functions applied to them, if they come from the same data source.

Financial Reports - which tool to use? Part 2

If they do not come from the same source, you can transform your data in the middle tier, for example in Microsoft Word during your template creation. Using an RTF template, you can apply math and other functions to any result that comes from any of your data sets.

Financial Reports - which tool to use? Part 2

The example above was mentioned in Part 1 of this blog. It was created using BI Publisher and represents what I would call a "challenging report" to get done in OBIEE. The data model in this example consisted of several OBIEE analyses and their results were added/subtracted/multiplied as needed in each cell.

Financial Reports - which tool to use? Part 2

This second example was another easy transition into BI Publisher: the entire report contained 10 pages that were formatted entirely differently, one from the other. Totals from all pages needed to be added in some specific cells. Better yet, the user entered some measures at the prompt, and these measures needed to be accounted for in every sub-total and grand total. You may be asking: why prompt for a measure? Very good question indeed. In this case, there were very few measures coming from a disconnected system. They changed daily, and the preferred way for my client to deal with them was to enter them at the prompt.

So, do you always have to add apples to apples? Not necessarily! Adding apples and bananas may be meaningful to you.

Financial Reports - which tool to use? Part 2

And you can add what is meaningful with BI Publisher!

For example, here is a sample data model using sources from Excel, OBIEE and a database. As you see, two of these data sets have been joined, while the other two are disconnected:

Financial Reports - which tool to use? Part 2

A data model such as this one allows you to issue simultaneous queries across these heterogeneous sources and combine their results in the report template. Meaning, you can combine anything that you would like in a single cell - even that measure coming from the prompt! It goes without saying that you should have a precise purpose and logic behind this machination.

Once your data model is complete - your data sets are in place, you have created the relationships between them (where applicable), created custom columns, and created your parameters and filters - you generate some sample data (XML) and choose how you will create your actual report.

As I mentioned, there are additional functionalities that may be added when creating the report, depending on the format that you choose for your template:

Financial Reports - which tool to use? Part 2

One very simple option is to choose the online editor, which has a bit more limited formatting capability, but will allow you to interact with your results online.

In my experience, when I had to cross the bridge away from OBIEE and into BI Publisher, it was because I needed to do a lot of customization within my templates. For those customizations, I found that working with RTF templates gave me all the additional power I could possibly be missing everywhere else. Even when my financial report had to be read by a machine, BI Publisher/RTF was able to handle it.

The power of the BI Publisher data model, combined with the unlimited flexibility of the RTF templates, was finally the answer that eliminated the worst Excel monsters. With these two, you can recreate the most complex reports, and do it just ONCE - not every month. You can use your existing format - the one you either love or are forced to use for some reason - and reuse it within the RTF. Inside each RTF cell, you define (once!) what that cell is supposed to be. That specific cell, and all the others, will be tested and validated to produce accurate results every month.

Once this work is done, you are done forever. Or well, at least until the requirements change… So, if you are battling with any one of these monsters on a monthly basis, I highly encourage you to take a step forward and give BI Publisher a try. Once you are done with the development of your new report, you may find that you have hours per month back in your hands. Over time, many more hours than what you spent to create the report. Time worth spending.

Financial Reports - which tool to use? Part 2

Categories: BI & Warehousing

Financial Reports - which tool to use? Part 1

Wed, 2017-02-15 01:00
Financial Reports - which tool to use? Part 1

One of the treats of working in the Business Intelligence world is that we are asked to analyze different aspects of a business. In fact, we are asked to analyze many different types of businesses, too. Most of us using BI tools have come from some previous background. Be it Marketing, Finance, Supply Chain or any other, we most likely had work experience before we got here. Maybe one of our jobs even led to Business Intelligence. The fact is, we are not experts in all areas. It would take several lives to make such a claim, because each area can be very complex and take years to master. The truth, for most of us, is that we have our favorite areas. They are often related to what we are most familiar with.

Financial Reports - which tool to use? Part 1

Over time, I came to really appreciate how simple numbers can be, and developed this - hard to understand - favoritism towards financial reports. While some business areas can be artistic and even vague, numbers are never vague. I have a great appreciation for that. Working with numbers is always precise. In the end, they have to match. No matter how great your report looks, if the numbers don’t add up the report is always wrong. Plus, financial layouts are generally very defined going in, so there is little room for error.

Financials in OBIEE

So, the endeavor begins when you are a BI consultant and everything is supposed to add up properly and look very nice. OBIEE is an extremely powerful tool, and this gives users the impression that it can solve all problems. While it can solve most problems, it falls short on some key features needed for easy financial reporting. That is not to say that Financials can’t be handled in OBIEE - but it is definitely to say that it is not easy.

So, if financial reports are not easy to create in OBIEE, then we are left with two very simple options:

  1. Struggle through it and make it happen

  2. Choose another tool

I have made the mistake of choosing option 1 a few times, but quickly realized that option 2 couldn't be as bad. Countless times, I have been asked to create financial reports in OBIEE. Of course, they needed to tie out and match a specific format: they needed blank lines inserted between one section and another, and the alignment of the categories was very important. They often required very detailed variance calculations, so that a company could see where it stood in terms of change over time. Variance percentages are key on these types of reports, and if you have dealt with them in OBIEE, you know that different types of variances and their grand totals can often pose challenges for report writers.

So, in order to accomplish the formatting needed, you end up adding extra code here and there, in essence trying to make OBIEE do something that it's not supposed to do. Soon, you are experiencing performance issues and a whole new array of considerations comes into play. You start removing your "special code", and then you lose your formatting. The numbers on your financial statement are still correct, but your report looks something like this:

Financial Reports - which tool to use? Part 1

While, in reality, you were trying to get here:

Financial Reports - which tool to use? Part 1

** The Balance Sheet above was created using HFR for illustration of formatting only.

Looking at a different OBIEE financial report (below), you will see that a lot of formatting can be done in these reports, but they will always look like OBIEE reports, if you know what I mean.

Financial Reports - which tool to use? Part 1

In this example, the first column is out of order - as far as Income Statements go. This was left alone on purpose to display one of the issues with creating these statements in OBIEE. The tool does not easily allow you to choose which items will go in each row. So, in the criteria tab, in Answers, you choose the order of the columns, but if you need the rows in order, you will need to either:

  1. Use a hidden column created just for sorting purposes

  2. Leverage selection steps, or

  3. Create a measure column for each row that you will need, use a pivot table, and add the Measure Labels as rows on your pivot

I will illustrate the third option, as it is my preferred way of ordering rows. Suppose that you have a very simple criteria tab such as this:

Financial Reports - which tool to use? Part 1

Naturally, your results would default like this on a table:

Financial Reports - which tool to use? Part 1

If you use a Pivot table instead, you can drag your Measure Labels onto the Rows:

Financial Reports - which tool to use? Part 1

And now, you will be able to see your measures as rows. You can easily reorder them as needed by just moving the order of the columns in the Measures section of your Layout editor.

Financial Reports - which tool to use? Part 1

This seems like a simple solution if you know precisely what all your rows should be, and even better, if you don’t have a huge amount of measures on the report. In real life, this type of row ordering is high maintenance:

  1. You must label each measure to match the account category name for each row

  2. You must filter each measure by its account category (or account number)

  3. If the account category name changes in your DB, you must manually rename your columns to match the new naming convention

  4. If you add or delete account categories, you must manually add and delete columns from your report

OBIEE 12c offers a great improvement in this area: the ability to “save columns” is described very well by Jason Baer on this blog: https://www.rittmanmead.com/blog/2016/01/my-favorite-obiee-12c-feature-that-almost-no-one-is-talking-about/

With the new release of the product you can save as many financial columns as you would like in the web catalog, which allows you to reuse them. As a consequence, you will streamline report maintenance by updating the columns' format and formula directly from the catalog (instead of inside every report). In fact, if you are spending too much time maintaining your existing reports in OBIEE 11g, you will automatically benefit from an upgrade to 12c just based on this single feature. Check here for more info: https://www.rittmanmead.com/obiee-12c-upgrade/

This is a great improvement, but you will still need to deal with an overall lack of flexibility for dynamically adding and deleting columns, setting orders, adding blank space, indenting and calculating variances along with proper grand totals.

After spending more time than you should in order to create a simple report, you really start considering other tools. If you are already working in the Oracle stack, the obvious choices will be BI Publisher and Hyperion Financial Reporting (HFR).

Financials in Essbase/HFR

Hyperion Financial Reporting (HFR) brings a powerful solution to financial statements, because it allows you to create pixel-perfect reports that are pre-aggregated in an Essbase cube. With that alone, two big problems are solved: formatting and performance.

In the example below, you see that HFR allows you to place metrics on both sides of the Account Category (butterfly layout - difficult to accomplish in OBIEE):

Financial Reports - which tool to use? Part 1

In addition to formatting and performance, there are some definite pros to consider when choosing HFR:

  1. The calculations in HFR dynamically reference cells, as in Excel. So, if a cell changes, the cells that reference the original cell will automatically be updated

  2. HFR has the ability to create financial books and batches, and also has a powerful bursting feature

HFR is a great solution for Income Statements, Balance Sheets and other reports that come from Essbase cubes. In a simplistic way, an Essbase cube is a combination of tables that have been joined and pre-aggregated. Since most tables coming out of a financial module in a system can often be joined, you should be able to create Essbase cubes to use as a source for your HFR reports. You will rarely have a requirement that cannot be handled by HFR and Essbase, but some situations may be problematic: for example, if your report requires a measure to be entered at run-time, if results from multiple cubes need to be added, or if your layout is very complex. Here is why:

In an HFR report, you start by inserting a grid onto your report and then you associate that grid with a specific Essbase cube. If you need data from two cubes on the report, you can insert another grid and associate that with the second cube. You can also create a report that leverages calculations between existing grids (for the purpose of doing math with two or more separate cubes):

Financial Reports - which tool to use? Part 1

Many thanks to my colleague, Mark Cann https://www.rittmanmead.com/blog/author/mark-cann/, for working through this solution with me.

The challenge here is that you may end up with multiple layout grids on your HFR report, which will complicate the report creation and maintenance going forward. It is important to know that if your requirements call for strange offsetting of cells and multiple different-looking blocks, then HFR may not be the best tool for the job. If you choose HFR for this purpose, you will spend too much time trying to make things right.

Financial Reports - which tool to use? Part 1

*This is a simple Essbase implementation with 2 cubes (or databases): a Balance Sheet and an Income Statement cube.

The fact is, some financial reports are very tricky and do not come solely from a Financial module. For example, if your company is evaluated monthly for a line of credit, your bank may need to look at several components of your business in order to determine the amount that you can borrow. They will base their decision not only on your monthly revenue, but also on your liabilities, such as accounts payable, and some of your assets, such as inventory. What they ask for really depends on their internal lending requirements, and also on the type of business that you have. These are, therefore, highly customized reports that never come out-of-the-box anywhere. For this reason, most companies spend a lot of man-hours creating these reports as one huge Excel file, after employees have managed to pull together information from many different modules.

These Excel "monsters" do the job. They are accepted by the banks, and will get you that loan. On the downside, they need to be redone every month and will drag resource hours out of profitable projects. The flat Excel files are also prone to mistakes, as the values are manually keyed in each time. If you make a mistake favorable to the company, your bank will look at it as a very negative issue. If you make an unfavorable mistake, you will not be able to borrow as much as you qualify for. This is a no-win situation, so the reports must be accurate every time.

Financial Reports - which tool to use? Part 1

To check for accuracy, there is nothing like testing over time. But, since you must rework the report each month, you don't have that opportunity.

The solution is to create a template that will pull from all of these different modules, calculate the numbers, and add the results automatically to a pixel-perfect, formatted report. Over the development cycle, these mappings and calculations will be thoroughly tested, and from then on they are simply reused.

While you may spend some time pulling this logic together, you will only have to click a few buttons after you are done, for months or years to come. In fact, I have clients that have been running reports such as this one for years. They have been saving a couple of weeks in report creation every month.

Let’s look at an example of what I am talking about:

Financial Reports - which tool to use? Part 1

On this report, each number (disguised as $1234) has been mapped to a calculation that will be pulled dynamically, according to the date entered on the prompt. The inventory amounts are adjusted according to banking requirements, and a rate is allocated depending on the row. This amount is later added/subtracted from receivables and existing contracts. Most of these numbers were created as separate OBIEE analyses. Some amounts could even be tied into web services to get the daily futures prices to estimate the value of contracts when the report runs. All lines are considered in the final equation before the total borrowing amount can be calculated. Per this bank’s requirement, this form needed to be printed and signed, then submitted monthly.

Lending/financing reports may be the trickiest, and the most time-consuming for companies to generate every month. The reports may be required by the bank, or by a company that is leasing or financing valuable equipment to your company. These reports need to show your prospective lender everything about your business. They will often need to be done in a format that is specified by your lender. These formats are not negotiable; in fact, some lenders still use old forms that were once read by a machine.

Here is another small snippet of a financing report that I had to create recently. Now, which tool would you use for 10 different pages of something like this, which required some of the amounts to be entered in the prompt? *Note: the report had to look “exactly” like this:

Financial Reports - which tool to use? Part 1

Well, as I mentioned in the beginning of this article, OBIEE would not be your partner in this type of endeavor. I can guarantee that this relationship would fail: strange formatting with black boxes, line numbers, the need for headers (footers too, not shown here), indenting, etc.

You may consider the Essbase/HFR combo, for formatting and performance, but you will soon realize that:

  1. Performance does not tend to be an issue with these reports, as they are generally submitted to lenders on a monthly basis, and therefore can be scheduled to run automatically in the middle of the night.

  2. As mentioned earlier, HFR requires a layout grid to be inserted before the report can be designed. Here, you would end up with multiple grids to handle the calculation of different cells from multiple cubes - which can be cumbersome to create and maintain.

  3. The measures in an HFR report should come from the pre-aggregated cube. In this example, some of the measures were entered as part of the prompt and are calculated at run time. At this point, you must scratch the Essbase/HFR option for this one!

So now you are still stuck with your monster Excel spreadsheet, retyping the numbers onto the required form.
Financial Reports - which tool to use? Part 1

Before you marry this solution, let me present you with the tool that can do everything: BI Publisher.

Stay tuned for the second part of this blog, when I will share why I believe that BIP can solve the most challenging reporting requirements out there!

Categories: BI & Warehousing

Enabling Concurrent OBIEE RPD Development - for free

Fri, 2017-02-10 04:30

One of the most common and long-standing problems with developing in OBIEE is that of multiple developers working on the RPD at the same time. This blog explains the problem and explores the solution that we've developed and have been successfully using at clients over the last couple of years. We're pleased to announce the immediate availability of the supporting tools, as part of the Rittman Mead Open Source Project.
Before we get into the detail, I'll first explain a bit about the background to the requirement and the options that ship with OBIEE.

Why Concurrent Development

The benefits of concurrent development are obvious: scalability and flexibility of development. It enables you to scale your development team to meet the delivery demands of the business. The challenge is to manage all of the concurrent work and enable releases in a flexible manner - which is where source control comes in.

We couldn't possibly attempt to manage concurrent development on something as complex as the RPD without good version control in place. Source control (a.k.a. version control/revision control) systems like Git and Apache Subversion (SVN) are designed to track and retain all of the changes made to a code base so that you can easily backtrack to points in time where the code was behaving as expected. They track what people changed, who changed it, when they changed it and even why they made that change (via commit messages). They also possess merge algorithms that can automatically combine changes made to text files, as long as there are no direct conflicts on the same lines. Then there are the added benefits of code branching and tagging for releases. All of this leads to quicker and more reliable development cycles, no matter what the project - so good, in fact, that I rely on it even when working as a single developer. To (mis)quote StackOverflow, "A civilised tool for a civilised age".

All of these techniques are about reducing the risk during the development process, and saving time. Time spent developing, time spent fixing bugs, spent communicating, testing, migrating, deploying and just about every IT activity under the sun. Time that could be better spent elsewhere.

Out of the Box

Oracle provide two ways to tackle this problem in the software:

  • Online development, using the check-in and check-out mechanism of the Admin Tool connected to a running BI Server
  • Multi-User Development (MUD)

However I believe that neither of these is sufficient for high standards of reliable development and releases - the reasons for which I explore below (and which have been described previously). Additionally it is not possible to natively and fully integrate the RPD with version control, which again presents a significant problem for reliable development cycles.

Firstly the online check-in and check-out system does, by design, force all development to be conducted online. This in itself is not an issue for a single developer, and is in fact a practice that we advocate for ‘sandbox’ development in isolation. However, as soon as there is more than one developer on the same server it reduces development flexibility. Two developers cannot develop their solutions in isolation and can be made to wait for assets they want to modify to be unlocked by other developers. This may be workable for a small amount of developers but does not scale well. Furthermore, the risk of losing work is much higher when working online; we've all seen the infamous "Transaction Update Failed" message when saving online. This is usually because of an inconsistency in the RPD but can be caused by less obvious reasons and usually leads to repeating some redundant work. Lastly, very large RPDs like those from BI Apps or very mature enterprise deployments pose a problem when working online. They cause the Admin Tool to work very slowly because of the calls it has to make to the server, which can be frustrating for developers. To be clear, I am certainly not advocating developing without testing your code, but given the speed of uploading an RPD to the server and the fact that it can be automated, in my experience it is far more efficient to develop offline and upload frequently for testing.

The MUD system is much better and is quite close in methodology to what we recommend in this guide. The premise works on having a production quality master RPD and then having many other individual developers with their own RPDs. The check-in and check-out system will automatically handle three-way merges to and from the master RPD when development changes are made. This is good in theory but has been maligned for years when used in practice. The version control system in MUDE is not as robust as Git or SVN for example and the conflict resolution relies on developers managing their own issues, without the ability for a source master to intervene. Ultimately there is little flexibility in this method, which makes it difficult to use in the real world.

Source Controlling the RPD

Source control is another problem as the RPD is a binary file which cannot be merged or analysed by text-based comparison tools like SVN or Git. A potential solution at one point seemed to be MDS XML, a structured, textual representation of the RPD. However, this also seemed to have some drawbacks when put into practice. Whilst some use MDS XML with great success and there are tools on the market that rely on this system, here at Rittman Mead we’ve found that there are significant risks and issues with it. We’ve come up with what we believe is a robust, scalable, and flexible approach, based around the binary RPD.

The Rittman Mead Solution to Concurrent OBIEE RPD Development

A successful development lifecycle comes down to implementing the correct process and ensuring it is as quick and reliable as possible. Tools, like the ones described in this blog, can be used to help in both of those areas but are not a substitute for detailed knowledge of the processes and the product. A key feature of this approach is the Source Master, who owns and is responsible for the overall development process. They will have a detailed understanding of the method and tools, as well as the specifics of the current and future work being done on the RPD. Things will go wrong - it is as inevitable as death and taxes - and the key is to minimise the impact and frequency of these events.

The solution is based on the Gitflow method, which is one of the most established development models. The model is based on a few major concepts:

  • Features - Specific items of development; these are begun from the development branch and then merged back into development when complete.
  • Develop/Master Branches - Two branches of code, one representing the development stream, the other the production code.
  • Releases - A branch taken from development that is then eventually merged into production. This is the mechanism for getting the development stream into production.

I highly recommend reading the original Gitflow blog and this cheatsheet, as they explain the method excellently; what we've done here is support that model using binary RPDs and the 3-way merge facility in OBIEE. Also of relevance is this Rittman Mead blog which describes some of the techniques we're explaining here. We've open sourced some command line tools (written in Python) to ease and automate the process. You can download the code from the GitHub repository and need only an install of Python 2.7 and the OBIEE client to get started. The tooling works with both Git and Subversion (SVN). We recommend the use of Git, but realise that SVN is often embedded at organisations and so support that too.

Scenario

This section shows a simple example of how you might use this methodology for multiple developers to work on the RPD in a reliable way. Many of the screenshots below show SourceTree, a GUI for Git of which I'm a fan for both its UI and its GitFlow support.

We have two developers in our team, Basil and Manuel, who both want to work on the RPD and add in their own changes. They already have an RPD they've made and are using with OBIEE, named base.rpd. First they initialise a Git repository, committing a copy of their RPD (base.rpd).

Enabling Concurrent OBIEE RPD Development - for free

The production branch is called master and the development branch develop, following the standard naming convention for GitFlow.

Before we get started, let's take a look at the RPD we're working with:

Enabling Concurrent OBIEE RPD Development - for free

Simple Concurrent Development

Now Basil and Manuel both independently start features F01 and F02 respectively:

python obi-merge-git.py startFeature F01  

Each developer is going to add a measure column for Gross Domestic Product (GDP) to the logical fact table, but in different currencies. Basil adds "GDP (GBP)" as a logical column and commits it to his development branch, F01. Manuel does the same on his, adding "GDP (USD)" as a logical column and committing it to F02.

Enabling Concurrent OBIEE RPD Development - for free

Now Basil finishes his feature, which merges his work back into the develop branch.

Enabling Concurrent OBIEE RPD Development - for free

This doesn't require any work, as it's the first change to development to occur.

python obi-merge-git.py finishFeature F01

Checking out develop...  
Already on 'develop'

Merging feature/F01 into develop...  
Successfully merged feature/F01 to the develop branch.  

When Manuel does the same, there is some extra work to do. To explain what's happening we need to look at the 3 merge candidates in play, using the terminology of OBIEE’s 3-way merge functionality:

  • Original: This is the state of the development repository from when the feature was created.
  • Modified: This is your repository at the time of finishing the feature.
  • Current: This is the state of the development repository at the time of finishing the feature.

When Basil completed F01, the original and current RPDs were the same, so the development RPD could simply be overwritten with the new one. Now, however, the Original and Current RPDs are different, so we need to resolve the changes. Our RPDs are binary files and so we need to use the 3-way merge from the Admin Tool. The python script wrapped around this process uses Git's metadata to determine the appropriate merge candidates for invoking the OBIEE 3-way merge.
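
To make that concrete, here is a minimal sketch - not the actual obi-merge-git.py code - of how the three candidates can be derived from Git metadata using the merge-base commit. The file name base.rpd and the branch names are taken from the example above; everything else is an assumption:

import subprocess

def export_merge_candidates(feature_branch, rpd_name='base.rpd'):
    """Write original.rpd, modified.rpd and current.rpd for the OBIEE 3-way merge.

    original = the RPD as it was when the feature was started
               (the merge-base of develop and the feature branch)
    modified = the RPD at the tip of the feature branch
    current  = the RPD at the tip of develop
    """
    base_commit = subprocess.check_output(
        ['git', 'merge-base', 'develop', feature_branch]).decode().strip()
    candidates = {'original.rpd': base_commit,
                  'modified.rpd': feature_branch,
                  'current.rpd': 'develop'}
    for out_name, ref in candidates.items():
        # 'git show <ref>:<path>' emits the file content as it was at that commit
        blob = subprocess.check_output(['git', 'show', '{0}:{1}'.format(ref, rpd_name)])
        with open(out_name, 'wb') as f:
            f.write(blob)

# e.g. export_merge_candidates('feature/F02')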

Enabling Concurrent OBIEE RPD Development - for free

Since our changes do not conflict, this can happen automatically without user intervention. This is one of the critical differences from doing the same process in MDS XML, which would have thrown a git merge conflict (two changes to the same Logical Table, and thus the same MDS XML file) requiring user intervention.

python obi-merge-git.py finishFeature F02

Checking out develop...  
Already on 'develop'

Merging feature/F02 into develop...  
warning: Cannot merge binary files: base.rpd (HEAD vs. feature/F02)


Creating patch...

        Patch created successfully.

Patching RPD...

        RPD patched successfully.

RPD Merge complete.


Successfully merged feature/F02 to the develop branch.  

In the background the script uses the comparerpd and patchrpd OBIEE commands.
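
As an illustration of that step, a wrapper around the two utilities might look like the sketch below. This is not the script's actual implementation, and the executable location and flag names are assumptions that vary between OBIEE versions, so check comparerpd and patchrpd on your own install before relying on them:

import subprocess

# Location of the OBIEE client utilities - an assumption, adjust for your install
BIN_DIR = r'C:\obiee\bi\bitools\bin'

def three_way_merge(original, modified, current, output, pwd):
    # 1. comparerpd generates an XML patch describing how 'modified' differs
    #    from 'original' (flag names may vary by OBIEE version - check comparerpd -?)
    subprocess.check_call([BIN_DIR + r'\comparerpd.cmd',
                           '-P', pwd,          # password of the modified RPD
                           '-C', modified,     # modified (feature) RPD
                           '-W', pwd,          # password of the original RPD
                           '-G', original,     # original (merge-base) RPD
                           '-D', 'patch.xml']) # patch file to produce
    # 2. patchrpd applies the patch to the current (develop) RPD - the 3-way
    #    merge itself - and writes the merged repository to 'output'
    subprocess.check_call([BIN_DIR + r'\patchrpd.cmd',
                           '-P', pwd,          # password of the current RPD
                           '-C', current,      # current RPD to be patched
                           '-Q', pwd,          # password of the original RPD
                           '-G', original,     # original (merge-base) RPD
                           '-I', 'patch.xml',  # patch produced by comparerpd
                           '-O', output])      # merged output RPD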

Release

Now our development branch has both features in, which we can see using the Admin Tool:

Enabling Concurrent OBIEE RPD Development - for free

To get this into production we can start a release process:

python obi-merge-git.py startRelease v1.00  

This creates a new branch from develop that we can use to apply bug fixes if we need to. Any changes made to the release now will be applied back into development when the release is complete as well as being merged into the production branch. The developers realise they have forgotten to put the new columns in the presentation layer, so they do it now in the release branch as a bugfix. In GitFlow, bugfixes are last minute changes that need to be made for a release but do not interfere with the next development cycle, which may have already begun (in the develop branch) by the time the bug was spotted. The changes are merged back to develop as well as master so the fix isn't lost in the next cycle.

Enabling Concurrent OBIEE RPD Development - for free

This is committed to the repo and then the release is finished:

python obi-merge-git.py finishRelease v1.00  

Enabling Concurrent OBIEE RPD Development - for free

After the release we can see that the master and develop branches are at the same commit point, with a tag of the release name added in too. Additionally we can switch to the develop and master branches and see all of the changes including the columns in the presentation layer. The full commit history of course remains if we want to roll back to other RPDs.

Enabling Concurrent OBIEE RPD Development - for free

Conflicted Development

Basil and Manuel start their new features, F03 and F04 respectively. This time they're working on the same existing column - something that a "Source Master" should have helped avoid, but missed this time. Basil edits the column formula of the "Area" column and renames it to "Area (sqm)", and Manuel does the same, naming his column "Area (sqFt)".

Enabling Concurrent OBIEE RPD Development - for free

They both commit the changes to their own feature branches and Manuel merges his back to development with no problem.

python obi-merge-git.py finishFeature F04  

However, when Basil tries to finish his feature the obvious conflict occurs: the automatic merge cannot resolve it without some human intervention, since the same object in the RPD is affected by both changes. At this point, the script will open up the current RPD in the Admin Tool and tell Basil to merge his changes manually in the tool, going through the usual conflict resolution process. The script provides 3 RPDs to make the RPD choosing step unambiguous:

  • original.rpd
  • modified.rpd
  • current.rpd (Opened)

python obi-merge-git.py finishFeature F03

Checking out develop...  
Already on 'develop'

Merging feature/F03 into develop...  
warning: Cannot merge binary files: base.rpd (HEAD vs. feature/F03)


Creating patch...

        Patch created successfully.

Patching RPD...

        Failed to patch RPD. See C:\Users\Administrator\Documents\obi-concurrent-develop\patch_rpd.log for details.

        Conflicts detected. Can resolve manually using the Admin Tool.

        Original RPD:   C:\\Users\\Administrator\\Documents\\rpd-test\a.rpd (original.rpd)
        Current RPD:    C:\\Users\\Administrator\\Documents\\rpd-test\c.rpd (Opened)
        Modified RPD:   C:\\Users\\Administrator\\Documents\\rpd-test\b.rpd (modified.rpd)

Perform a full repository merge using the Admin Tool and keep the output name as the default or C:\\Users\\Administrator\\Documents\\rpd-test\base.rpd

Will open RPD using the Admin Tool.

Press Enter key to continue.

You must close the AdminTool after completing the merge manually in order for this script to continue.  

When Basil hits a key, the Admin Tool opens up, and from here he needs to manually initiate the merge and specify the merge candidates. This is made easy by the script which automatically names them appropriately:

Enabling Concurrent OBIEE RPD Development - for free

Note that the a, b and c RPDs are part of the merge logic with Git and can be ignored here.

Basil assigns the original and modified RPDs to the correct parts of the wizard and then resolves the conflict (choosing his change) in the next step of the wizard.

Enabling Concurrent OBIEE RPD Development - for free

Upon closing the Admin Tool, the Git merge to the develop branch is automatically completed.

Now when they look in the development RPD they can see the column named as "Area (sqm)", having accepted Basil's change. Of course this is a trivial example, but because the method relies on using the Admin Tool, it will be just as reliable as a manual 3-way merge you would perform in OBIEE.

In my experience, most of the problems with 3-way merging come from developers getting confused as to which candidates to choose, or losing track of a true original point from when both developers started working. Using this method eliminates both of those problems, with the added benefit of tight integration into source control. Even with an easier interface to the 3-way merge process, developers and/or the Source Master should be aware of some of the 'features' of OBIEE's 3-way merge. For example, if a change has occurred on the physical layer which does not have any representation at all in the business or presentation layers, it may be lost during a 3-way merge. Another is that the merge rules are not guaranteed to stay the same between OBIEE versions, which means that we cannot be certain our development lifecycle is stable after patching or upgrading OBIEE.

So given this, and as a general core tenet of good software development practice, you should be automatically testing your RPDs after the merge and before release.

Testing the RPD

There are still issues with OBIEE RPD merging that aren't rectified by the 3-way merge and so must be handled manually if and when they occur. As noted above, a change on the physical layer with no representation in the business or presentation layers may be lost during a 3-way merge, and the merge rules are not guaranteed to stay the same between OBIEE versions, so we cannot be certain our development lifecycle is stable after patching or upgrading OBIEE. Another thing I don't really like is the inherent bias the merge process has toward the modified RPD, instead of treating the modified and current RPDs equally. The merge candidates in the tool have been selected in such a way as to mitigate this problem, but I am wary it may have unforeseen consequences for some as yet untested scenarios. There may be other inconsistencies too, but it is difficult to pin down all of the scenarios precisely, and that's one of the main stumbling blocks when managing a file as complex as the RPD. Even if we didn't receive any conflicts, it is vital that RPDs are checked and tested (preferably automatically) before release.

The first step to testing is to create a representative test suite, which should cover as much of the functionality of your system as possible in as few reports as possible. The reason for this is that it is often impractical and sometimes invalid to check the entire catalogue at once. Furthermore, the faster the testing phase is, the quicker the overall release process will be. The purpose of a test suite is to take a baseline of the data of each report, against which we can validate consistency after making changes. This means your test suite should contain reports that are expected not to change after making changes to the RPD. Furthermore you need to be careful that the underlying data of the report does not change between the baseline capture and the regression validation phases, otherwise you will invalidate your test.
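
One hedged way of automating that baseline-and-validate cycle is to run each test report's Logical SQL through nqcmd and compare the output against a stored baseline. The sketch below is illustrative only - the paths, DSN name and file layout are assumptions, not part of the toolset described in this post:

import filecmp
import glob
import os
import subprocess

NQCMD = r'C:\obiee\bi\bitools\bin\nqcmd.cmd'   # path is an assumption - adjust for your install
DSN = 'AnalyticsWeb'                           # an assumed ODBC DSN pointing at the BI Server

def run_suite(lsql_dir, out_dir, user, password):
    # Execute every Logical SQL test file and capture its result set to a text file
    for lsql in glob.glob(os.path.join(lsql_dir, '*.lsql')):
        out_file = os.path.join(out_dir, os.path.basename(lsql) + '.out')
        subprocess.check_call([NQCMD, '-d', DSN, '-u', user, '-p', password,
                               '-s', lsql, '-o', out_file])

def compare_to_baseline(baseline_dir, out_dir):
    # Return the names of any test outputs that no longer match the baseline capture
    failures = []
    for baseline in glob.glob(os.path.join(baseline_dir, '*.out')):
        candidate = os.path.join(out_dir, os.path.basename(baseline))
        if not os.path.exists(candidate) or not filecmp.cmp(baseline, candidate, shallow=False):
            failures.append(os.path.basename(baseline))
    return failures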

In terms of tooling, Oracle provide the Baseline Validation Tool (BVT), which can be used outside of upgrades to perform automated regression tests. This is good as it provides both data checks as well as visual validation. Furthermore, it can be run directly on a specific Web/Presentation Catalog folder, as opposed to the whole system.

As well as Oracle’s BVT, we also have an in-house Regression Testing tool that was written prior to BVT’s availability, and is still used to satisfy specific test scenarios. Built in Python, it is part of a larger toolset that we use with clients for automating the full development lifecycle for OBIEE, including migrating RPDs and catalogue artefacts between environments.

This brings us onto the last piece in the DevOps puzzle: continuous integration (CI). This is the concept of automatically testing and deploying code to a higher environment as soon as the work is complete. It is not something explicitly covered by the tools in this blog, however it would work nicely with the testing and migration scripts described above. This could all be made seamless by invoking the processes via script calls, or better, using Git hooks.
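
As a sketch of that idea, a Git hook could trigger the regression tests automatically before code leaves a developer's machine. The hook below is hypothetical - run_regression_tests.py stands in for whatever test runner you use - and would be saved as an executable file named pre-push in the repository's .git/hooks folder:

#!/usr/bin/env python
# .git/hooks/pre-push - reject a push if the RPD regression tests fail.
# 'run_regression_tests.py' is a placeholder for whatever test runner you use
# (BVT, an in-house tool, or a script like the nqcmd sketch above).
import subprocess
import sys

if subprocess.call(['python', 'run_regression_tests.py']) != 0:
    sys.stderr.write('RPD regression tests failed - push aborted.\n')
    sys.exit(1)   # a non-zero exit code makes git abort the push
sys.exit(0)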

Summary

The success of an OBIEE concurrent development approach comes down to two things: the tooling, and the rigorous implementation of the process - and it is the latter that is key. In this article I’ve demonstrated the tooling that we’ve developed, along with the process required for a successful development method. Here at Rittman Mead we have detailed understanding and experience in the process and framework necessary to implement it at any client, adapting and advising to ensure the integration into existing in-house development and release requirements. The real world is messy and developers don't all work in the same way. A single tool in isolation is not going to succeed in making OBIEE - designed from the outset as a single-developer tool - scale to multiple developers. Instead of insisting that you change to accommodate our tool, we instead bring our tool and process and adapt to suit you.

You can find the code used in this blog up on GitHub and if you would like to discuss how Rittman Mead can help implement concurrent OBIEE RPD development successfully at your organisation, please get in touch.

Categories: BI & Warehousing

Working with OBIEE Data in Excel using ODBC

Thu, 2017-02-09 04:09

Look at this picture. I'm sure you've recognised the most popular data analysis tool of all time - Excel.

Excel

But what you can't see in this picture is the data source for the table and charts. And this source is OBIEE's BI Server. Direct. Without exports or plugins!

Querying OBIEE Directly from Excel? With No Plugins? What Is Going On!

The OBIEE BI Server (nqsserver / OBIS) exposes an ODBC interface (look here if you live in a world full of Java and JDBC) which is used by Presentation Services and the Administration Tool. A lesser-known benefit of this is that we can utilise this ODBC interface for our own needs. But there is a little problem with the OBIEE 12c client installation - its size. The full client installation (and the only one actually possible) is more than 2 gigabytes and consists of more than 31 thousand files. Not a huge problem considering HDD sizes and prices, but not so good if you have an average-sized SSD.

And there is a second point to consider. We don't want to give a full set of developer tools to an end user. Even if our security won't let them break anything, why would we stuff their head with unnecessary things? Let's keep things simple.

So what I had in mind with this research was to make a set of OBIEE ODBC libraries as small as possible. The second aim was to avoid a full installation by cutting out redundant pieces. I needed a small "thing" I could easily copy to any computer and then use.

Disclaimer. Everything below is a result of our investigation. It's not supported functionality or an Oracle recommendation.

I will not describe the investigation process in full detail as it is not too challenging. It's less a detective thriller and more a tedious story. But anyway, the main points will be highlighted.

Examine Working Copy

The first thing I needed to know was what changes Oracle's installer makes during an installation. Does it copy something to the Windows folder or does everything stay in its installation folder? Does it make any registry changes (apparently it does, but what exactly)?

For this task, I took a fresh Windows install, created a dump of the registry and of the Windows folder structure, then installed the OBIEE client using the normal installation process, made the same dumps again and compared them.

There were no surprises. The OBIEE installer doesn't copy a single byte to the Windows folder (which is good news, I think) but it creates a few registry keys (as expected). Anyone who has ever played around with Windows ODBC won't be surprised by that at all.

I deleted some keys in order to make these screenshots clearer and more readable.

So now I know the names of the DLLs and their locations - a good starting point. A small free utility, Dependency Walker, helped me find the set of DLLs I need. This tool is very easy to use and very useful for finding a missing DLL. Just give it a DLL to explore and it will show all the DLLs used by it and mark any that are missing.

Dependency walker

And a bit of educated guesswork helped to find one more folder, called locale, which stores all the language files.

So, as a result, we got a tiny ODBC-related OBIEE client. It's very small: with only the English locale it is about 20 megabytes in size and consists of 75 files. Compare that to the 31 thousand files of the full client.

So that was the short story of looking for and finding things. Now for some practical results.

Folder Structure

It seems that some paths are hard-coded, so we can't put the DLLs in any folder we like. It should be something\bi\bifoundation\server - C:\BI-client\bi\bifoundation\server, for example.

The List of DLLs

I tried to find the minimum viable set of libraries. The list has only 25 libraries but takes up a fair amount of space on screen, so I put them into a collapsible list to keep this post from getting too long. These libraries should go under the bin folder - C:\BI-client\bi\bifoundation\server\bin, for example.

The list of ODBC DLLs

  • BiEndPointManagerCIntf64.dll
  • mfc100u.dll
  • msvcp100.dll
  • msvcr100.dll
  • nqcryptography64.dll
  • nqerrormsgcompiler64.dll
  • nqmasutility64.dll
  • nqperf64.dll
  • nqportable64.dll
  • nqsclusterapi64.dll
  • nqsclusterclient64.dll
  • nqsclusterutility64.dll
  • nqsodbc64.dll
  • nqsodbcdriverconndlg64.dll
  • nqssetup.dll
  • NqsSetupENU.dll
  • nqstcpclusterclient64.dll
  • NQSTLU64.4.5.dll
  • nqutilityclient64.dll
  • nqutilitycomm64.dll
  • nqutilitygeneric64.dll
  • nqutilitysslall64.dll
  • perfapi64.dll
  • samemoryallocator864.dll
  • xerces-c_2_8.dll

Or you may take the full bin folder. Its size is about 240 megabytes. You won't win the smallest ODBC client contest but will save a few minutes of your time.

Locale

The second folder you need is locale, which is located next to bin - C:\BI-client\bi\bifoundation\server\locale, for example. Again, if you can live with not having the smallest client in the world, you may take the whole locale folder. But there are 29 locales and I think most of the time you will need only one or two of them. Every locale is about 1.5 megabytes and has 48 files. A good place for some optimisation, in my opinion.

Registry Key

And the last part is the registry keys. I need to tell Windows what my driver's name is, what its path is, and so on. If this were a normal part of the registry I'd create a file anything.reg, put code like this into it and import it into the registry.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBCINST.INI\ODBC Drivers]
"Oracle BI Server 12.2.1.2"="Installed"

[HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBCINST.INI\Oracle BI Server 12.2.1.2]
"ConnectFunctions"="YYN"
"Driver"="C:\\BI-client\\bi\\bifoundation\\server\\bin\\NQSODBC64.dll"
"DriverODBCVer"="03.52"
"FileUsage"="0"
"Setup"="C:\\BI-client\\bi\\bifoundation\\server\\bin\\nqssetup.dll"
"SQLLevel"="2"
"UsageCount"="dword:00000001"
"APILevel"="3"
"Regional"="Yes"
"PrimaryCCS"=""
"PrimaryCCSPort"=""
"SecondaryCCS"=""
"SecondaryCCSPort"=""
"Port"=""

But luckily there is a small console utility which makes the task easier and more elegant - it can be scripted. Microsoft provides a tool called odbcconf.exe, located in the C:\Windows\System32 folder. Its syntax is not very obvious, but not too hard either. Generally the syntax is the following: odbcconf.exe /a {action "parameters"}. In this case the call is odbcconf.exe {installdriver "Oracle BI Server 12.2.1.2|Driver=C:\BI-client\bi\bifoundation\server\bin\nqsodbc64.dll|Setup=C:\BI-client\bi\bifoundation\server\bin\nqssetup.dll|APILevel=2|SQLLevel=2|ConnectFunctions=YYN|DriverODBCVer=03.52|Regional=Yes"}. Here installdriver is the action and the long string is the set of parameters divided by |. It may look a bit complicated but in my opinion it leaves less space for manual work and therefore less space for error. Just one note: don't forget to start the cmd window as administrator.
odbcconf.exe
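
If you would rather script the same registry keys from Python than call odbcconf.exe, a minimal sketch using the standard winreg module might look like the following. The driver name and paths are taken from the example above; treat the rest as assumptions and adapt it to your own layout:

# A minimal sketch that creates the same driver entries as the odbcconf.exe call
# above, using the standard winreg module (named _winreg on Python 2). It assumes
# a 64-bit Python on 64-bit Windows and must be run from an elevated prompt.
import winreg

DRIVER_NAME = 'Oracle BI Server 12.2.1.2'
BIN_DIR = r'C:\BI-client\bi\bifoundation\server\bin'

with winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE,
                      r'SOFTWARE\ODBC\ODBCINST.INI\ODBC Drivers') as key:
    winreg.SetValueEx(key, DRIVER_NAME, 0, winreg.REG_SZ, 'Installed')

driver_values = {
    'Driver': BIN_DIR + r'\nqsodbc64.dll',
    'Setup': BIN_DIR + r'\nqssetup.dll',
    'APILevel': '3',
    'SQLLevel': '2',
    'ConnectFunctions': 'YYN',
    'DriverODBCVer': '03.52',
    'Regional': 'Yes',
}
with winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE,
                      r'SOFTWARE\ODBC\ODBCINST.INI' + '\\' + DRIVER_NAME) as key:
    for name, value in driver_values.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_SZ, value)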

Visual C++ Redistributable

If your computer is fresh and clean, you need to install a Visual C++ 2010 redistributable package. It's included in Oracle's client and placed in 'Oracle_Home\bi' folder. The file name is vcredist_x64.exe.

Result

And as a result I got an ODBC driver I can use as I want. A not-so-obvious but pleasant bonus is that I can give it any name I like: OBIEE version, path, whatever I want.

And I can create an ODBC DSN in the normal way using the ODBC Data Source Administrator. Just like always. It doesn't matter that this is a hand-made driver: it was properly registered and it is absolutely legitimate.

So just a brief intermediate summary. We can take the full 2+ gigabyte OBIEE client. Or we can spend some time to:
1. Create a folder and put into it some files from the Oracle OBIEE client;
2. Create a few registry keys;
3. Install the Visual C++ 2010 redistributable.
And we will get a working OBIEE ODBC driver that is slightly over 20 megabytes in size.

Demo

So now we have a working ODBC connection, what can it give us?

Meet the tool most beloved by end users all around the world - Excel.

At this point of the story, some may tell me "Hey, stop right there! Where did you get that SQL? And why is it so strange? That's not ANSI SQL". The evil part of me wants to simply give you a link to the official documentation - the Logical SQL Reference - and run away. But the kind part insists that even though documentation has never done anyone any harm, that's not enough.

In a nutshell, this is the SQL that Presentation Services sends to the BI Server. When anyone builds an analysis or runs a dashboard, Presentation Services creates and sends logical queries to the BI Server. And we can use it for our own needs. Create an analysis as usual (or open an existing one), navigate to the Advanced tab, and then copy and paste the analysis' Logical SQL. You may want to refine it - maybe remove some columns, change aliases, or add a clause or two from the evil part's documentation - but for the first step just take it and use it. That simple.

And of course, we can query our BI server using any ODBC query tool.
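
For example, a hedged sketch from Python using the pyodbc package might look like this - the DSN name, credentials and the Logical SQL itself are placeholders for whatever you copy from the Advanced tab of your own analysis:

import pyodbc

# 'OBIEE' is whatever DSN name you created in the ODBC Data Source Administrator
# for the hand-made driver; the credentials here are placeholders.
conn = pyodbc.connect('DSN=OBIEE;UID=weblogic;PWD=Password01', autocommit=True)
cursor = conn.cursor()

# Logical SQL copied from the Advanced tab of an analysis - the subject area and
# column names below are illustrative only.
logical_sql = """
SELECT
   "A - Sample Sales"."Time"."T05 Per Name Year" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2
FROM "A - Sample Sales"
ORDER BY 1
"""

for row in cursor.execute(logical_sql):
    print(row)

conn.close()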

And all these queries go directly to the BI Server. This method doesn't use Presentation Services, so OBIEE won't build complex HTML that we then have to parse. We use a fast and efficient route instead.

Categories: BI & Warehousing

Join Rittman Mead at the 2017 BIWA Summit!

Mon, 2017-01-30 15:46

We invite you to come join us at the annual 2017 BIWA Summit.

Join Rittman Mead at the 2017 BIWA Summit!

This year we are proud to announce that Robin Moffatt, Head of Research and Development, will be presenting on:

Analysing the Panama Papers with Oracle Big Data Spatial and Graph

January 31, 2017 | 3:45 pm – 4:15 pm | Room 103

Oracle Big Data Spatial and Graph enables the analysis of datasets beyond that of standard relational analytics commonly used. Through Graph technology relationships can be identified that may not otherwise have been. This has practical uses including in product recommendations, social network analysis, and fraud detection. In this presentation we will see a practical demonstration of Oracle Big Data Spatial and Graph to load and analyse the “Panama Papers” dataset. Graph algorithms will be utilised to identify key actors and organisations within the data, and patterns of relationships shown. This practical example of using the tool will give attendees a clear idea of the functionality of the tool and how it could be used within their own organisation. If Oracle Database 12cR2 on-premise is available by the time of this presentation, then its new property graph capabilities will also be covered here. The presentation will be based on a paper published on OTN: https://community.oracle.com/docs/DOC-1006400


Kafka’s Role in Implementing Oracle’s Big Data Reference Architecture on the Big Data Appliance

February 1, 2017 | 2:20 pm – 3:10 pm | Room 102

Big Data … Big Mess? Everyone wants Big Data, but without a good platform design up front there is the risk of a mess of point-to-point feeds. The solution to this is Apache Kafka, which enables stream or batch consumption of the data by multiple consumers. Implemented as part of Oracle’s Big Data Architecture on the Big Data Appliance, it acts as a data bus for the enterprise to both the data reservoir and discovery lab. This presentation will introduce the basics of Kafka, and explain how it fits within the Big Data Architecture. We’ll then see it used with Oracle GoldenGate to stream data into the data reservoir, as well as ad hoc population of discovery lab environments and microservices such as Flume, HBase, and Elasticsearch.

(Still) No Silver Bullets: OBIEE 12c Performance in the Real World

February 2, 2017 | 1:30 pm – 2:20 pm | Room 203

Are you involved in the design and development of OBIEE systems and want to know the best way to go about ensuring good performance? Maybe you've an existing OBIEE system with performance "challenges" that you need to diagnose? This presentation looks at the practical elements of diagnosing the causes of performance issues in OBIEE, and discusses good practices to observe when developing new systems. It includes discussion of OBIEE 12c, with additional emphasis on analysis of Usage Tracking data for the accurate profiling and diagnosis of issues. Why this would appeal to the audience:

  • Method-R time profiling technique applied to the OBIEE nqquery.log
  • A large number of the community use OBIEE; many will have their own performance horror stories, but fewer will have done a deep dive into analysing the time profile of long-running requests
  • Performance "right practices" will help those less familiar with performant OBIEE designs, and may prompt debate from those more experienced

As presented previously at OOW, OUGF, UKOUG, OUG Scotland, and POUG. Newly updated for OBIEE 12c.

  • Video: http://ritt.md/silver-bullets-video
  • Slides: http://ritt.md/silver-bullets-slides

Categories: BI & Warehousing

What's new in Training in 2017?

Thu, 2017-01-26 05:30
2016 - Thank you for a great year!

Rittman Mead would like to thank everyone that attended or showed an interest in our Training courses in 2016. Since we started back in 2007, Training has been a mainstay of our service offerings.

My personal opinion is that Q3 & Q4 saw OBIEE 12c start to be properly adopted within the marketplace. It made sense for companies to wait for some of the bugs from earlier releases to be ironed out, as well as for clarity around the release of things such as Data Visualization Desktop.

It meant that we started to really see numbers pick up in our OBIEE 12c bootcamp. For the first time we’ve really tried to stress the fact that different parts of the course can be suitable for different people based on their everyday use of the product. This has led to more business focused end-users of OBIEE attending our training.

We love travelling and 2016 yet again took us to some amazing places to deliver courses to a variety of different clients. Locations we visited included South Africa, India, Sweden, Jamaica, Bulgaria and Ireland to name a few.

Finally we were really proud to release our new On Demand Training platform in December 2016 with our first online course, OBIEE 12c Front End Development & Data Visualization.

What’s new in 2017

We’re looking forward to another busy year in 2017 and it’s certainly already underway!!

Our public training schedule has been published with a number of courses available in OBIEE & ODI.
Here you can find the course dates for UK & Europe

And here you can find the courses available in the US

2017 will also see the release of some new courses including:

Advanced Analytics and Oracle R

We are seeing more and more investment in Predictive Analytics projects from companies looking to get as much value out of their data as possible.
Our 3-day course will teach you about the tools available & the techniques required to start or continue your Predictive Analytics journey.

From acquiring, tidying and transforming data, through to the types of Predictive Models and how to deploy them, our course will strengthen your knowledge and teach you valuable techniques.

The Advanced Analytics & Oracle R course will be available from March 2017, please get in touch for more details.

ODI 12c Bootcamp

2017 will also see the refresh of our ODI 12c Bootcamp. There are some very handy new features in the latest version, such as Big Data Integration and Lifecycle Management. We'll also be including some lessons on advanced techniques such as Groovy Scripting in ODI.
We’re looking forward to teaching these extra modules soon.

The new course will be released in Q2 2017.

On Demand Training

We will be adding more courses to our On Demand Training platform throughout 2017. We recognise the value of classroom instructor-led training; however, we also understand that people have busy lives and that sometimes the flexibility to learn at your own pace is important.

Our On Demand Training platform provides this opportunity whether you’re trying to reaffirm your learning post-classroom training or looking to learn a new skill for the first time.

Courses that will be added online in 2017 include OBIEE 12c RPD Modeling, OBIEE 12c Systems Management & Performance, OBIEE 11g Front End Development, ODI 12c Bootcamp, ODI 11g for BI Apps and many more….

For more information and updates, please head to our webpage

Categories: BI & Warehousing

Rittman Mead at BIWA Summit 2017

Wed, 2017-01-25 08:00

I'm excited to be attending my first ever BIWA Summit next week (which will take me to Oracle HQ at Redwood Shores for the first time too!). This three-day conference is one of the major dates in the conference calendar for all Oracle Analytics folk, and I'm proud to have the opportunity to present three papers:

  • Analysing the Panama Papers with Oracle Big Data Spatial and Graph

    31st January, Room 103, 15:45

    Based on an article I wrote recently, I'll be talking about how to use property graph analysis through Oracle's Big Data Spatial and Graph tool to examine and analyse the relationships in the Panama Papers dataset. Complex relationships that would be all but impossible to query in relational SQL can be uncovered using built in algorithms as well as with Property Graph Query Language (PGQL). I'm using my new favourite tool, interactive notebooks, to demonstrate PGQL as well as the PGX interface.

  • Kafka's Role in Implementing Oracle's Big Data Reference Architecture on the Big Data Appliance

    1st February, Room 102, 14:20

    Apache Kafka is rapidly becoming accepted as a de-facto means of building a data pipeline through a business, ensuring availability of data to and from all systems that need it. In this presentation I go in to the detail of what Apache Kafka is, the problems that it solves - and then put this in context of the Oracle Information Management and Big Data Reference Architecture.

  • (Still) No Silver Bullets : OBIEE 12c Performance in the Real World

    2nd February, Room 203, 13:30

    One of my favourite presentations to deliver, this dives into what you should - and shouldn't - do when building an OBIEE system. It explains how to troubleshoot performance issues methodically - and not a best practice in sight!

There's a full listing of all sessions here, with a PDF to download here.

You can follow the conference proceedings on Twitter with the hashtag #BIWASummit, and I'll be tweeting about it too, as @rmoff. The presentations that I'm delivering will be available to download on speakerdeck.

Categories: BI & Warehousing

Race Against the Machine - Skilling in the Name

Fri, 2017-01-20 06:36

Firstly apologies for the awful title. We’ll see by the end of this post whether the pun works!

At Rittman Mead we see the Data & Analytics market and indeed the broader technology market continually changing.

Investment in technology to move organisations ahead of their competitors seems higher than ever and gone are the days that large IT projects are seen as purely a huge cost to the business.

They are getting genuinely measurable ROI now.

We've also observed an ever-changing landscape in terms of the new features and functionality of tools. It makes us wonder if we're heading for a situation where these systems end up way ahead of a person's ability to use them to their full potential.

Take the Hadoop Ecosystem for example. Every large organisation in the world is starting to take Big Data seriously, however barely anyone had even seen the different tools surrounding it until a few years ago.

That’s why we believe that it’s everyone’s responsibility to put Learning & Education towards the highest end of their priority lists at the start of each year.

  • That goes for course developers & deliverers such as Rittman Mead, Oracle and Cloudera to name a few. The onus is on us to provide the necessary learning opportunities and courses.

  • It also goes for organisations who must invest in their staff through training and education. The benefits to companies are huge. It shows that level of commitment to your staff which can lower attrition and increase productivity.

  • And it certainly goes for anyone in the technology space to constantly pick up new skills and experiences that will help them for years to come.

It would be mad for me to go out and buy a TaylorMade M1 golf club for £250 before I’ve learnt how to hit a golf ball straight (I really struggle with a slice!).

In the same respect, a company should never invest millions in a system and then fail to train its teams how to use it.

We'll always be racing against the machine, but with the right learning perspective in place we can definitely all keep up.

Look out for our post next week when we review our Training in 2016 and take a look at what's on the horizon in 2017.

Categories: BI & Warehousing

Time Series Visualisations: Kibana or Grafana?

Thu, 2017-01-19 05:49

Grafana has rapidly become one of the de-facto “DevOps” tools for real time monitoring dashboards of time series metrics. In addition to its powerful visualisations, Grafana is not tied to a particular stack or vendor, and supports multiple backend data sources including InfluxDB, Graphite, Elasticsearch and many others which can be added via plugins.

Another similar tool, Kibana is the data visualisation front end for the Elastic Stack, complementing the rest of the stack which includes Beats, Logstash (ingest) and Elasticsearch itself (storage). With the version 5.x release of the Elastic Stack, Kibana now includes Timelion for interactive time series charts.

Here at Rittman Mead we are big fans of both tools, and have written about them over the years (see 1, 2, 3). Our industry-leading Performance Analytics solution for OBIEE is built on top of these tools, and takes advantage of the time series features to provide interactive web-based dashboards presenting a “full stack" view of the important metrics relating to OBIEE's performance.

To give you an idea of what we’ve built, here is a sample dashboard from our Performance Analytics tool. We use both Grafana and Kibana, to present different views of data. The dense dashboards of time series metrics work brilliantly in Grafana:

To enable the user to view and analyse performance data across multiple dimensions we use Kibana, which does a sterling job:

With the recent release of Timelion - a time series visualisation plugin for Kibana - out of beta and into the big time, we wanted to ensure we were still using the right tool for the right job. Did we still need Grafana in our stack for visualisation of time series metrics, or could Timelion fill that gap now, and enable us to streamline our platform’s toolset?

In this article we’ll see how Timelion and Grafana stack up against each other. The intention is not to define which is “best” (a pointless exercise), nor create an unintelligible grid of down-in-the-weeds features that each may or may not support, but to see how the two tools compared in real-world usage, side by side. Which makes it easier to build charts? Which produces a nicer-looking dashboard at the end of it? Which has the best UI and UX for the end user reading and analysing the data? What limitations -if any- are there on data sources and functionality in analysing that data? And ultimately, can we unify our product’s front end on a single one of these tools?

Introduction to Timelion

Since version 5 of Kibana, Timelion (pronounced "Timeline") has been included as part of the default installation. Charts are defined using a bespoke query language, which specifies the source of the data, the functions to apply to it, and how it is presented. The query is specified in a textbox in the Timelion interface. In this simple chart here we’re using the expression .es(*) to show the total number of documents in Elasticsearch, over time:

Every Timelion expression starts with a data source function and continues with a chain of functions that are connected with a dot. Over 20 functions are provided, across three groups:

  1. Data sources - the default is Elasticsearch, and other APIs such as World Bank and Quandl are also available.
    For example in the graph above, the default expression .es(*) (similar to .elasticsearch(*)) shows a count of all documents in Elasticsearch. You can specify details of the Elasticsearch index, mappings and metrics here too, as well as filters.

  2. Data manipulations ranging from simple arithmetic to moving averages, cumulative sums and derivatives
    For example, adding a moving average to the data is as simple as appending the function to the end of the expression: .es(*).movingaverage(12)

  3. Themes and styles of the visual elements including bar/point/lines, labels, title and legends. The graph below shows the number of running queries by time extracted from the active session history data in the Oracle database. .es(index=ash*).lines(1,fill=1).title('Running Queries').legend(none).label(false)

With regards to the available documentation and guides for developers, the main documentation for Timelion is somewhat sparse. For details of each function you can refer to the documentation on GitHub. Compared to the rest of the excellent Elastic documentation this is surprising, and hopefully, now that Timelion is part of the core product, its documentation will be brought up to parity - full explanations of features and functions along with examples of usage.

On the positive side, the query builder text box supports auto-complete of functions and their arguments, and the Timelion interface provides online help too. A downside of this minimalist Timelion page is the size of the expression textbox. As you will read later in this post, it doesn’t take long before you need to add more than one metric and a few styles to a visualisation, which leaves the textbox with too much text to be seen, scrolled and edited easily:

If you are a beginner, to avoid confusion over typos and errors, try building expressions step by step, adding functions gradually. The blog here nicely explains how to build up Timelion expressions incrementally.

Of special note in the data manipulation functions that Timelion provides are the statistical analysis ones:

  • .trend(): adds a trendline to your graph using a specified regression algorithm
  • .holt(): an early version of a forecasting function, which samples the beginning of a series and uses it to forecast what should happen, via several optional parameters.

These are useful for our performance monitoring dashboards, enabling us to show things such as the point at which you would run out of memory/disk space if you continued to consume resources at your current rate.

Related to this concept is Prelert, which Elastic recently acquired and which is expected to be part of a future X-Pack release. Whilst dashboard-based analysis is useful, once a clear pattern on which we want to alert is identified, we can bring in Watcher to provide real time notifications to pager systems etc.

Introduction to Grafana

Grafana is an open source, feature-rich dashboard and graph editor that is rapidly becoming accepted as one of the best time-series metric visualisation tools available. Grafana has gained its popularity thanks to its simplicity, ease of use, and the snazzy look and feel that attracts many users. You can read more about Grafana in an earlier article that we wrote on the Rittman Mead blog here. Here is the kind of dashboard you can easily build with Grafana:

Most of the configurations in Grafana are done via a comprehensive graph editor interface:

In the Grafana editor queries are generally built entirely through the GUI. Manually specified queries are used in cases such as accessing advanced functionality, and for specifying Lucene queries in order to access data held in Elasticsearch. In terms of support for Elasticsearch, the latest version of Grafana at the time of writing this post (v4.1.1) supports both Elasticsearch v2 and v5. From my time spent working with Grafana 4.1.1 and Elasticsearch v5, I haven’t found it to be as stable as the long-standing data sources such as InfluxDB and Graphite (or even Elasticsearch v2). As an example, if a chart is configured incorrectly (for example, the settings for null values), Grafana does not simply return no results or throw a descriptive error explaining the issue; instead the graph seems locked, and the only apparent solution is to delete the chart and recreate it from scratch.

As well as data sources the graph editor includes settings covering display styles such as titles, templates, axis and legends.

An interesting new addition to the Grafana family is the alerting engine, which allows users to attach rules to dashboard panels. Once dashboards are saved, Grafana extracts the alert rules into separate alert rule storage and schedules them for evaluation.

Side-by-Side : Presenting the Data

On the face of it, the output from Grafana and Timelion can be remarkably similar:

Grafana is on the right and Timelion on the left

However, there are a few differences between the two tools that are worth digging into here. They are mainly around display configuration and the simplicity of the user experience.

As mentioned, Grafana’s chart editor has a clear interface over the multitude of options available for refining the presentation of the data.

Timelion also supports chart formatting, but with fewer options than Grafana. It also depends on the user concatenating the correct functions onto the data query expression, as we saw above. For example, to add a graph that has a “Running Queries” title, a legend on the top right of the plot, unlabelled axes and data shown with a 1px-width line, you would need to hand-code this expression: .lines(1,fill=1).title('Running Queries').legend(ne).label(false)

Grafana offers significantly greater flexibility in the formatting of the chart. One example is displaying metrics of different units such as time, currency and data. Grafana can automatically scale axes based on the units (bytes -> MB -> GB). The following Grafana graph shows disk usage from our monitored application stored in Elasticsearch. The disk usage metric on the Y axis is in Kilobytes, which Grafana has automagically scaled to the appropriate magnitude (MiB) in the labelling:

The same could be done manually in Timelion by specifying the appropriate conversion, but this is a hardcoded option compared to Grafana’s dynamic one, and even then it wouldn’t have the varying labelling that Grafana provides above (KiB initially, switching to MiB subsequently).

Grafana also supports the rendering of negative values on the Y axis, which is just not possible in Timelion. As well as genuinely negative data values (for example, temperature recordings below zero degrees), using the transform feature of Grafana it is possible to invert particular series so as to aid comprehension of the data, as seen here:

Another nice feature that Grafana has - and unfortunately Timelion doesn’t - is the ability to show metric values in the legend itself. It’s a great way to see key values at a glance, without requiring a separate table or the user to hover over the data points.

Side-by-Side : Interacting with the Data

Grafana and Kibana are also different in terms of the level and ease with which it is possible to interact with the charted data. Both Kibana and Grafana support the drag-select of time periods on a chart to zoom into detail, with the rest of the charts on the same dashboard updating to show the same time period too. However, Kibana is much more feature-rich in this area. As a front end to Elasticsearch it supports ad-hoc text search of your data. It also allows users to automatically drill down into data, by clicking on a value in a chart to show details just for that. In the OBIEE monitoring dashboard below (built in Kibana), Active Session History data is filtered for the session_states in “Waiting” and “On CPU” - this filter was created by the user simply by clicking on the data points in one of the charts, and can be toggled dynamically from the same interface.

This interactivity is supported by Timelion too. The es() datasource function includes an argument called “kibana”. This argument defines whether the visualisation should follow the filters applied to the rest of the Kibana dashboard or not, for example:
.es(index=dms_*,metric='avg:obips1-Current_Disk_Usage',fit='nearest',kibana='true')

Whilst it is possible to specify Elasticsearch Lucene queries in Grafana and use term filters in the editor, these are local to the graph. With some use of variables it can be possible to enable a degree of global filtering on a single Grafana dashboard but this is a bespoke solution per-dashboard, rather than the out-of-the-box functionality that Kibana provides.

Grafana does enable you to toggle the display of data in a chart, by clicking on the measure label in the legend, seen above.

Conclusion

Comparing Kibana and Timelion to Grafana, it is true that they do a similar job displaying time series metrics - with pros and cons on each side.

Grafana’s graph editor offers an amazing interface with regards to the options available for refining the presentation of the data. Grafana is not only a straightforward development tool but also adds a huge amount of value to the resulting dashboards, making them easier for end users to read and analyse.

On the other hand, Timelion is just one of many visualisations that Kibana provides (including Tile Map and Tag Cloud), meaning that dashboards can be built which are less dense with numbers and time series, with information instead shown through a variety of visualisations. Unfortunately, in its current version Timelion and its expression editor seem slightly immature and relatively limited. A few more display options plus a nicer editor would put Timelion in a much stronger position.

So, for now, we’ll be sticking with our dual approach of both Grafana and Kibana. Grafana provides our pure time-series metric dashboards, with the ease of building being one of the key factors, along with the rich formatting capabilities and its support for data sources beyond Elasticsearch. Kibana does an unbeatable job of dashboards enabling rich exploration of metrics across dimensions, rendered in a greater number of possible visualisation forms. Timelion is a great first step, but ultimately just can’t compete with Grafana.

This is a fast-moving area of tool development, and you can bet that Grafana and Kibana are going to continue developing at a rate of knots - which as users and developers is great news!

Categories: BI & Warehousing

Analyse Your Data on the Go with Oracle Synopsis App

Mon, 2017-01-16 02:42

How many times have you received an email on your phone with an Excel or CSV attachment that you wanted to analyse immediately in an app on your mobile, without having to wait until you reach your laptop? Not just viewing raw numbers, but also creating graphs and summaries with the possibility of sharing the end result with your colleagues?

Your prayers have been answered with the recent release from Oracle of Oracle Synopsis. This is a new mobile app available for Android devices (with an iOS version coming soon) that enables you to build data analyses on the go by interacting with data directly on your smartphone or tablet. It is a free application that doesn't require any OBIEE backend or additional licensing.


Happy to announce that we have launched Synopsis on Android. Now go analyze your data. https://t.co/tAxCrz4RAY pic.twitter.com/JT6dHMKgVu

— BI Mobile Labs (@bimobilelabs) December 21, 2016

Oracle's mobile ecosystem so far has been represented by Oracle BI Mobile HD: an app that has been available for several years, requires an additional license, and focuses on the visualisation of OBIEE's pre-built content such as dashboards, analyses and alerts. The main limitation of BI Mobile HD is that all content must be created upfront in a desktop browser in order to be accessed by the application; no "analysis" option is available beyond predefined drilling or navigation capabilities.

Synopsis extends Oracle's mobile ecosystem by adding an app capable of analysing data on the go, interacting directly with files on the phone/tablet in a visual and intuitive way.

Let’s have a look at how Synopsis works. For my test I'll use the Federal U.S. Electric Utility Companies and Rates data coming from en.openei.org.
I first downloaded the CSV file to my phone, then opened the Oracle Synopsis app and selected the file from the "Downloads" folder.

By default, just by opening the file with the Synopsis app, I get a project named after the source file (iouzipcodes2011), showing a bar chart of ZIP by UTILITY_NAME. I can switch the coordinates of the graph by changing either the dimension on the X-axis or the measure on the Y-axis, selecting another option from the sections at the top and bottom of the graph respectively.

The default graphs are provided just by opening the file, without having to define any measure or dimension.

The question now is: What am I looking at? How do I change the default behaviour?
I can get an idea about what Synopsis is doing by clicking on the project title itself. By default the application associates all text columns to dimensions and all numeric columns to measures, aggregating them with SUM. I can however change the default behaviour by:

  • Changing the aggregation method for the measures; possible alternatives are Average and Count
  • Changing the text and numbers (dimension and metrics) assignments by clicking on the cog icon and accessing the related screen
  • Hiding a column by simply tapping on the column name (in the grey box shown in the image below); a line will be shown on top of the hidden column name.

The format of a measure can be changed just by sliding the related number tile and setting decimals, currency, or percentage format, among others.

When clicking on a measure (e.g. ind_rate), a list of graphs, one per dimension, is presented. The type of graph depends on the attribute type (bar for text, line trend for a date) and its cardinality (a donut is presented when an attribute has a small number of distinct values).

I can however change the graph type by clicking on it. There are multiple options:

  • One or more metrics can be added and filters can be applied
  • Attributes can be added and filters can be applied
  • The graph type can be changed

An image of the resulting graph can then easily be sent via email by clicking on the relevant icon.

When clicking on the padlock icon, projects are secured with the fingerprint sensor. This option is enabled only on mobile devices supporting fingerprint recognition.

Projects can be exported: a .syn file will be created containing both data and metadata. The project file can then be shared and reopened with the Synopsis app on other devices. In this first release it is not possible to share projects between Synopsis and other tools like Visual Analyzer or Data Visualization Desktop, but it’s an obvious enhancement that one could imagine Oracle considering for the future.

A list of global settings is also available in order to change import options such as blank-cell handling and the CSV delimiter.

Be aware that Synopsis requires the Excel (or CSV) file to be in a pure tabular form; extra heading rows or collapsed cells will prevent Synopsis from parsing it. On top of this, in order for it to work at least one numeric and one text column are needed.

Conclusion

Oracle Synopsis is a great addition to Oracle's mobile offering. It provides an app capable of analysing data on the go in a very visual and intuitive way. So far only CSV and Excel files are supported, but my guess is that it will soon be possible to interface with a lot of other applications, especially in the Cloud. And - it’s free!

Categories: BI & Warehousing

Data Processing and Enrichment in Spark Streaming with Python and Kafka

Fri, 2017-01-13 10:00

In my previous blog post I introduced Spark Streaming and how it can be used to process 'unbounded' datasets. The example I did was a very basic one - simple counts of inbound tweets and grouping by user. All very good for understanding the framework and not getting bogged down in detail, but ultimately not so useful.

We're going to stay with Twitter as our data source in this post, but we're going to consider a real-world requirement for processing Twitter data with low-latency. Spark Streaming will again be our processing engine, with future posts looking at other possibilities in this area.

Twitter has come a long way from its early days as an SMS-driven "microblogging" site. Nowadays it's used by millions of people to discuss technology, share food tips, and, of course, track the progress of tea-making. But it's also used for more nefarious purposes, including spam, and sharing of links to pirated material. The requirement we had for this proof of concept was to filter tweets for suspected copyright-infringing links in order that further action could be taken.

The environment I'm using is the same as before - Spark 2.0.2 running on Docker with Jupyter Notebooks to develop the code (and this article!). You can download the full notebook here.

The inbound tweets are coming from an Apache Kafka topic. Any matched tweets will be sent to another Kafka topic. The match criteria are:

  • Not a retweet
  • Contains at least one URL
  • URL(s) are not on a whitelist (for example, we're not interested in links to spotify.com, or back to twitter.com)
  • The Tweet text must match at least two from a predefined list of artists, albums, and tracks. This is necessary to avoid lots of false positives - think of how many music tracks there are out there, with names that are common in English usage ("yesterday" for example). So we must match at least two ("Yesterday" and "Beatles", or "Yesterday" and "Help!").
    • Match terms will take into account common misspellings (Little Mix -> Litle Mix), hashtags (Little Mix -> #LittleMix), etc

We'll also use a separate Kafka topic for audit/debug purposes to inspect any non-matched tweets.

As well as matching the tweet against the above conditions, we will enrich the tweet message body to store the identified artist/album/track to support subsequent downstream processing.

The final part of the requirement is to keep track of the number of inbound tweets, the number of matched vs unmatched, and for those matched, which artists they were for. These counts need to be per batch and over a window of time too.

Getting Started - Prototyping the Processing Code

Before we get into the meat of the streaming code, let's take a step back and look at what we're wanting the code to achieve. From the previous examples we know we can connect to a Kafka topic, pull in tweets, parse them for given fields, and do windowed counts. So far, so easy (or at least, already figured out!). Let's take a look at the nub of the requirement here - the text matching.

If we peruse the BBC Radio 1 Charts we can see the popular albums and artists of the moment (Grant me a little nostalgia here; in my day people 'pirated' music from the Radio 1 chart show onto C90 cassettes, trying to get it without the DJ talking over the start and end. Nowadays it's done on a somewhat more technologically advanced basis). Currently it's "Little Mix" with the album "Glory Days". A quick Wikipedia or Amazon search gives us the track listing too:

  1. Shout Out to My Ex
  2. Touch
  3. F.U.
  4. Oops - Little Mix feat. Charlie Puth
  5. You Gotta Not
  6. Down & Dirty
  7. Power
  8. Your Love
  9. Nobody Like You
  10. No More Sad Songs
  11. Private Show
  12. Nothing Else Matters
  13. Beep Beep
  14. Freak
  15. Touch

A quick twitter search for the first track title gives us this tweet - I have no idea if it's legit or not, but it serves as an example for our matching code requirements:

DOWNLOAD MP3: Little Mix – Shout Out To My Ex (CDQ) Track https://t.co/C30c4Fel4u pic.twitter.com/wJjyG4cdjE

— Ngvibes Media (@ngvibes_com) November 3, 2016

Using the Twitter developer API I can retrieve the JSON for this tweet directly. I'm using the excellent Paw tool to do this.

From this we can get the text element:

"text": "DOWNLOAD MP3: Little Mix \u2013 Shout Out To My Ex (CDQ) Track https:\/\/t.co\/C30c4Fel4u https:\/\/t.co\/wJjyG4cdjE",

The obvious approach would be to have a list of match terms, something like:

match_text=("Little Mix","Glory Days","Shout Out to My Ex","Touch","F.U.")

But - we need to make sure we've matched two of the three types of metadata (artist/album/track), so we need to know which it is that we've matched in the text. We also need to handle variations in text for a given match (such as misspellings etc).

What I came up with was this:

filters=[]  
filters.append({"tag":"album","value": "Glory Days","match":["Glory Days","GloryDays"]})  
filters.append({"tag":"artist","value": "Little Mix","match":["Little Mix","LittleMix","Litel Mixx"]})  
filters.append({"tag":"track","value": "Shout Out To My Ex","match":["Shout Out To My Ex","Shout Out To My X","ShoutOutToMyEx","ShoutOutToMyX"]})  
filters.append({"tag":"track","value": "F.U.","match":["F.U","F.U.","FU","F U"]})  
filters.append({"tag":"track","value": "Touch","match":["Touch"]})  
filters.append({"tag":"track","value": "Oops","match":["Oops"]})

def test_matching(test_string):  
    # Check the test string against each filter's match variants (case-insensitive)
    # and report the tag and value of every term that matches
    print 'Input: %s' % test_string
    for f in filters:
        for a in f['match']:
            if a.lower() in test_string.lower():
                print '\tTag: %s / Value: %s\n\t\t(Match string %s)' % (f['tag'],f['value'],a)

We could then take the test string from above and test it:

test_matching('DOWNLOAD MP3: Little Mix \u2013 Shout Out To My Ex (CDQ) Track https:\/\/t.co\/C30c4Fel4u https:\/\/t.co\/wJjyG4cdjE')  
Input: DOWNLOAD MP3: Little Mix \u2013 Shout Out To My Ex (CDQ) Track https:\/\/t.co\/C30c4Fel4u https:\/\/t.co\/wJjyG4cdjE
    Tag: artist / Value: Little Mix
        (Match string Little Mix)
    Tag: track / Value: Shout Out To My Ex
        (Match string Shout Out To My Ex)

as well as making sure that variations in naming were also correctly picked up and tagged:

test_matching('DOWNLOAD MP3: Litel Mixx #GloryDays https:\/\/t.co\/wJjyG4cdjE')  
Input: DOWNLOAD MP3: Litel Mixx #GloryDays https:\/\/t.co\/wJjyG4cdjE
    Tag: album / Value: Glory Days
        (Match string GloryDays)
    Tag: artist / Value: Little Mix
        (Match string Litel Mixx)
Additional Processing

With the text matching figured out, we also needed to address the other requirements:

  • Not a retweet
  • Contains at least one URL
    • URL(s) are not on a whitelist (for example, we're not interested in links to spotify.com, or back to twitter.com)
Not a Retweet

In the old days retweets were simply reposts of the same tweet with an RT prefix; now it's done as part of the Twitter model, and Twitter clients display the original tweet with the retweeter shown. In the background though, the JSON is different from that of an original tweet (i.e. not a retweet).

Original tweet:

Because we all know how "careful" Trump is about not being recorded when he's not aware of it. #BillyBush #GoldenGate

— George Takei (@GeorgeTakei) January 12, 2017
{
  "created_at": "Thu Jan 12 00:36:22 +0000 2017",
  "id": 819342218611728384,
  "id_str": "819342218611728384",
  "text": "Because we all know how \"careful\" Trump is about not being recorded when he's not aware of it. #BillyBush #GoldenGate",

[...]

Retweet:

{
  "created_at": "Thu Jan 12 14:40:44 +0000 2017",
  "id": 819554713083461632,
  "id_str": "819554713083461632",
  "text": "RT @GeorgeTakei: Because we all know how \"careful\" Trump is about not being recorded when he's not aware of it. #BillyBush #GoldenGate",

[...]

  "retweeted_status": {
    "created_at": "Thu Jan 12 00:36:22 +0000 2017",
    "id": 819342218611728384,
    "id_str": "819342218611728384",
    "text": "Because we all know how \"careful\" Trump is about not being recorded when he's not aware of it. #BillyBush #GoldenGate",
[...]

So retweets have an additional set of elements in the JSON body, under the retweeted_status element. We can pick this out using the get method as seen in this code snippet, where tweet is a Python object created from a json.loads from the JSON of the tweet:

if tweet.get('retweeted_status'):  
    print 'Tweet is a retweet'
else:  
    print 'Tweet is original'
Contains a URL, and URL is not on Whitelist

Twitter are very good to us in the JSON they supply for each tweet. Every possible attribute of the tweet is encoded as elements in the JSON, meaning that we don't have to do any nasty parsing of the tweet text itself. To find out if there are URLs in the tweet, we just check the entities.urls element, and iterate through the array if present.

if not tweet.get('entities'):  
    print 'no entities element'
else:  
    if not tweet.get('entities').get('urls'):
        print 'no entities.urls element'

The URL itself is again provided to us by Twitter as the expanded_url within the urls array, and using the urlsplit library as I did previously we can extract the domain:

for url in tweet['entities']['urls']:

    expanded_url = url['expanded_url']
    domain = urlsplit(expanded_url).netloc

With the domain extracted, we can then compare it to a predefined whitelist so that we don't pick up tweets that are just linking back to sites such as Spotify, iTunes, etc. Here I'm using the Python set type and issubset method to compare the list of domain(s) that I've extracted from the tweet into the url_info list, against the whitelist:

if set(url_info['domain']).issubset(domain_whitelist):  
    print 'All domains whitelisted'
The Stream Processing Bit

With me so far? We've looked at the requirements for what our stream processing needs to do, and worked out the prototype code that will do this. Now we can jump into the actual streaming code. You can see the actual notebook here if you want to try this yourself.

Job control variables
batchIntervalSec=30  
windowIntervalSec=21600 # Six hours  
app_name = 'spark_twitter_enrich_and_count_rm_01e'  
Make Kafka available to Jupyter
import os  
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'  
os.environ['PYSPARK_PYTHON'] = '/opt/conda/envs/python2/bin/python'  
Import dependencies

As well as Spark libraries, we're also bringing in the KafkaProducer library which will enable us to send messages to Kafka. This is in the kafka-python package. You can install this standalone on your system, or inline as done below.

# Necessary to make Kafka library available to pyspark
os.system("pip install kafka-python")

#    Spark
from pyspark import SparkContext  
#    Spark Streaming
from pyspark.streaming import StreamingContext  
from pyspark.streaming.kafka import KafkaUtils  
#    Kafka
from kafka import SimpleProducer, KafkaClient  
from kafka import KafkaProducer

#    json parsing
import json  
#    url deconstruction
from urlparse import urlsplit  
#    regex domain handling
import re  
Define values to match
filters=[]  
filters.append({"tag":"album","value": "Glory Days","match":["Glory Days","GloryDays"]})  
filters.append({"tag":"artist","value": "Little Mix","match":["Little Mix","LittleMix","Litel Mixx"]})  
filters.append({"tag":"track","value": "Shout Out To My Ex","match":["Shout Out To My Ex","Shout Out To My X","ShoutOutToMyEx","ShoutOutToMyX"]})  
filters.append({"tag":"track","value": "F.U.","match":["F.U","F.U.","FU","F U"]})  
filters.append({"tag":"track","value": "Touch","match":["Touch"]})  
filters.append({"tag":"track","value": "Oops","match":["Oops"]})  
Define whitelisted domains
domain_whitelist=[]  
domain_whitelist.append("itun.es")  
domain_whitelist.append("wikipedia.org")  
domain_whitelist.append("twitter.com")  
domain_whitelist.append("instagram.com")  
domain_whitelist.append("medium.com")  
domain_whitelist.append("spotify.com")  
Function: Unshorten shortened URLs (bit.ly etc)
# Source: http://stackoverflow.com/a/4201180/350613
import httplib  
import urlparse

def unshorten_url(url):  
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    h.request('HEAD', parsed.path)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return response.getheader('Location')
    else:
        return url
Function: Send messages to Kafka

To inspect the Kafka topics as messages are sent use:

kafka-console-consumer --zookeeper cdh57-01-node-01.moffatt.me:2181 --topic twitter_matched2  

N.B. following the Design Patterns for using foreachRDD guide here.

# http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

def send_to_kafka_matched(partition):  
    from kafka import SimpleProducer, KafkaClient
    from kafka import KafkaProducer

    kafka_prod = KafkaProducer(bootstrap_servers='cdh57-01-node-01.moffatt.me:9092')
    for record in partition:
        kafka_prod.send('twitter_matched2', str(json.dumps(record)))

def send_to_kafka_notmatched(partition):  
    from kafka import SimpleProducer, KafkaClient
    from kafka import KafkaProducer

    kafka_prod = KafkaProducer(bootstrap_servers='cdh57-01-node-01.moffatt.me:9092')
    for record in partition:
        kafka_prod.send('twitter_notmatched2', str(record))

def send_to_kafka_err(partition):  
    from kafka import SimpleProducer, KafkaClient
    from kafka import KafkaProducer

    kafka_prod = KafkaProducer(bootstrap_servers='cdh57-01-node-01.moffatt.me:9092')
    for record in partition:
        kafka_prod.send('twitter_err2', str(record))
Function: Process each tweet

This is the main processing code. It implements all of the logic described in the requirements above. If a processing condition is not met, the function returns a negative code and description of the condition that was not met. Errors are also caught and returned.

You can see a syntax-highlighted version of the code in the notebook here.

def process_tweet(tweet):  
    # Check that there's a URLs in the tweet before going any further
    if tweet.get('retweeted_status'):
        return (-1,'retweet - ignored',tweet)

    if not tweet.get('entities'):
        return (-2,'no entities element',tweet)

    if not tweet.get('entities').get('urls'):
        return (-3,'no entities.urls element',tweet)

    # Collect all the domains linked to in the tweet
    url_info={}
    url_info['domain']=[]
    url_info['primary_domain']=[]
    url_info['full_url']=[]
    try:
        for url in tweet['entities']['urls']:
            try:
                expanded_url = url['expanded_url']
            except Exception, err:
                return (-104,err,tweet)

            # Try to resolve the URL (assumes it's shortened - bit.ly etc)
            try:
                expanded_url = unshorten_url(expanded_url)
            except Exception, err:
                return (-108,err,tweet)

            # Determine the domain
            try:
                domain = urlsplit(expanded_url).netloc
            except Exception, err:
                return (-107,err,tweet)
            try:
                # Extract the 'primary' domain, e.g. www36.foobar.com -> foobar.com
                #
                # This logic is dodgy for UK domains (foo.co.uk, foo.org.uk, etc) 
                # since it truncates to the last two parts of the domain only (co.uk)
                #
                re_result = re.search('(\w+\.\w+)$',domain)
                if re_result:
                    primary_domain = re_result.group(0)
                else:
                    primary_domain = domain
            except Exception, err:
                return (-105,err,tweet)

            try:
                url_info['domain'].append(domain)
                url_info['primary_domain'].append(primary_domain)
                url_info['full_url'].append(expanded_url)
            except Exception, err:
                return (-106,err,tweet)


        # Check domains against the whitelist
        # If every domain found is in the whitelist, we can ignore them
        try:
            if set(url_info['primary_domain']).issubset(domain_whitelist):
                return (-8,'All domains whitelisted',tweet)
        except Exception, err:
            return (-103,err,tweet)

        # Check domains against the blacklist
        # Unless a domain is found, we ignore it
        #Only use this if you have first defined the blacklist!
        #if not set(domain_blacklist).intersection(url_info['primary_domain']):
        #    return (-9,'No blacklisted domains found',tweet)


    except Exception, err:
        return (-102,err,tweet)

    # Parse the tweet text against list of trigger terms
    # --------------------
    # This is rather messy iterative code that maybe can be optimised
    #
    # Because match terms are not just words, it's not enough to break
    #  up the tweet text into words and match against the filter list.
    #  Instead we have to take each filter term and see if it exists
    #  within the tweet text as a whole
    #
    # Using a set instead of list so that duplicates aren't added
    #
    matched=set()
    try:
        for f in filters:
            for a in f['match']:
                tweet_text = tweet['text']
                match_text = a.decode('utf-8')
                if match_text in tweet_text:
                    matched.add((f['tag'],f['value']))
    except Exception, err:
        return (-101,err,tweet)

    #-----
    # Add the discovered metadata into the tweet object that this function will return
    try:
        tweet['enriched']={}
        tweet['enriched']['media_details']={}
        tweet['enriched']['url_details']=url_info
        tweet['enriched']['match_count']=len(matched)
        for match in matched:
            tweet['enriched']['media_details'][match[0]]=match[1]

    except Exception, err:
        return (-100,err,tweet)

    return (len(matched),tweet)
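As noted in the comments within process_tweet above, the regex used to derive the 'primary' domain is crude for multi-part TLDs such as .co.uk. A more robust sketch - assuming the third-party tldextract package, which is not used in the original notebook - could look like this:

import tldextract  # third-party package: pip install tldextract

def primary_domain_of(url):
    # registered_domain joins the domain and its public suffix, so
    # 'http://www36.foobar.com/x' -> 'foobar.com'
    # 'http://media.foo.co.uk/y'  -> 'foo.co.uk'
    return tldextract.extract(url).registered_domain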
Function: Streaming context definition

This is the function that defines the streaming context. It needs to be a function because we're using windowing and so the streaming context needs to be configured to checkpoint.

As well as processing inbound tweets, it performs counts of:

  • Inbound
  • Outbound, by type (match/no match/error)
  • For matched tweets, top domains and artists

The code is commented inline to explain how it works. You can see a syntax-highlighted version of the code in the notebook here.

def createContext():  
    sc = SparkContext(appName="spark_twitter_enrich_and_count_01")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, batchIntervalSec)

    # Define Kafka Consumer and Producer
    kafkaStream = KafkaUtils.createStream(ssc, 'cdh57-01-node-01.moffatt.me:2181', app_name, {'twitter':1})

    ## Get the JSON tweets payload
    ## >>TODO<<  This is very brittle - if the Kafka message retrieved is not valid JSON the whole thing falls over
    tweets_dstream = kafkaStream.map(lambda v: json.loads(v[1]))

    ## -- Inbound Tweet counts
    inbound_batch_cnt = tweets_dstream.count()
    inbound_window_cnt = tweets_dstream.countByWindow(windowIntervalSec,batchIntervalSec)

    ## -- Process
    ## Match tweet to trigger criteria
    processed_tweets = tweets_dstream.\
        map(lambda tweet:process_tweet(tweet))

    ## Send the matched data to Kafka topic
    ## Only treat it as a match if we hit at least two of the three possible matches (artist/track/album)
    ## 
    ## _The first element of the returned object is a count of the number of matches, or a negative 
    ##  to indicate an error or no URL content in the tweet._
    ## 
    ## _We only want to send the actual JSON as the output, so use a `map` to get just this element_

    matched_tweets = processed_tweets.\
                        filter(lambda processed_tweet:processed_tweet[0]>1).\
                        map(lambda processed_tweet:processed_tweet[1])

    matched_tweets.foreachRDD(lambda rdd: rdd.foreachPartition(send_to_kafka_matched))
    matched_batch_cnt = matched_tweets.count()
    matched_window_cnt = matched_tweets.countByWindow(windowIntervalSec,batchIntervalSec)

    ## Set up counts for matched metadata
    ##-- Artists
    matched_artists = matched_tweets.map(lambda tweet:(tweet['enriched']['media_details']['artist']))

    matched_artists_batch_cnt = matched_artists.countByValue()\
                                    .transform((lambda foo:foo.sortBy(lambda x:-x[1])))\
                                    .map(lambda x:"Batch/Artist: %s\tCount: %s" % (x[0],x[1]))

    matched_artists_window_cnt = matched_artists.countByValueAndWindow(windowIntervalSec,batchIntervalSec)\
                                    .transform((lambda foo:foo.sortBy(lambda x:-x[1])))\
                                    .map(lambda x:"Window/Artist: %s\tCount: %s" % (x[0],x[1]))

    ##-- Domains
    ## Since url_details.primary_domain is an array, need to flatMap here
    matched_domains = matched_tweets.flatMap(lambda tweet:(tweet['enriched']['url_details']['primary_domain']))

    matched_domains_batch_cnt = matched_domains.countByValue()\
                                    .transform((lambda foo:foo.sortBy(lambda x:-x[1])))\
                                    .map(lambda x:"Batch/Domain: %s\tCount: %s" % (x[0],x[1]))
    matched_domains_window_cnt = matched_domains.countByValueAndWindow(windowIntervalSec,batchIntervalSec)\
                                    .transform((lambda foo:foo.sortBy(lambda x:-x[1])))\
                                    .map(lambda x:"Window/Domain: %s\tCount: %s" % (x[0],x[1]))

    ## Display non-matches for inspection
    ## 
    ## Codes less than zero but greater than -100 indicate a non-match (e.g. whitelist hit), but not an error

    nonmatched_tweets = processed_tweets.\
        filter(lambda processed_tweet:(-99<=processed_tweet[0]<=1))

    nonmatched_tweets.foreachRDD(lambda rdd: rdd.foreachPartition(send_to_kafka_notmatched))

    nonmatched_batch_cnt = nonmatched_tweets.count()
    nonmatched_window_cnt = nonmatched_tweets.countByWindow(windowIntervalSec,batchIntervalSec)

    ##  Print any erroring tweets
    ## 
    ## Codes less than -100 indicate an error (try...except caught)

    errored_tweets = processed_tweets.\
                        filter(lambda processed_tweet:(processed_tweet[0]<=-100))
    errored_tweets.foreachRDD(lambda rdd: rdd.foreachPartition(send_to_kafka_err))

    errored_batch_cnt = errored_tweets.count()
    errored_window_cnt = errored_tweets.countByWindow(windowIntervalSec,batchIntervalSec)

    ## Print counts
    inbound_batch_cnt.map(lambda x:('Batch/Inbound: %s' % x))\
        .union(matched_batch_cnt.map(lambda x:('Batch/Matched: %s' % x))\
            .union(nonmatched_batch_cnt.map(lambda x:('Batch/Non-Matched: %s' % x))\
                .union(errored_batch_cnt.map(lambda x:('Batch/Errored: %s' % x)))))\
        .pprint()

    inbound_window_cnt.map(lambda x:('Window/Inbound: %s' % x))\
        .union(matched_window_cnt.map(lambda x:('Window/Matched: %s' % x))\
            .union(nonmatched_window_cnt.map(lambda x:('Window/Non-Matched: %s' % x))\
                .union(errored_window_cnt.map(lambda x:('Window/Errored: %s' % x)))))\
        .pprint()

    matched_artists_batch_cnt.pprint()
    matched_artists_window_cnt.pprint()
    matched_domains_batch_cnt.pprint()
    matched_domains_window_cnt.pprint()

    return ssc
Start the streaming context
ssc = StreamingContext.getOrCreate('/tmp/%s' % app_name,lambda: createContext())  
ssc.start()  
ssc.awaitTermination()  
Stream Output Counters

From the stdout of the job we can see the simple counts of inbound and output splits, both per batch and accumulating window:

-------------------------------------------
Time: 2017-01-13 11:50:30
-------------------------------------------
Batch/Inbound: 9
Batch/Matched: 0
Batch/Non-Matched: 9
Batch/Errored: 0

-------------------------------------------
Time: 2017-01-13 11:50:30
-------------------------------------------
Window/Inbound: 9
Window/Non-Matched: 9

-------------------------------------------
Time: 2017-01-13 11:51:00
-------------------------------------------
Batch/Inbound: 21
Batch/Matched: 0
Batch/Non-Matched: 21
Batch/Errored: 0

-------------------------------------------
Time: 2017-01-13 11:51:00
-------------------------------------------
Window/Inbound: 30
Window/Non-Matched: 30

The details of the identified artists within tweets are also tracked, per batch and accumulated over the window period (six hours, in this example):

-------------------------------------------
Time: 2017-01-12 12:45:30
-------------------------------------------
Batch/Artist: Major Lazer       Count: 4
Batch/Artist: David Bowie       Count: 1

-------------------------------------------
Time: 2017-01-12 12:45:30
-------------------------------------------
Window/Artist: Major Lazer      Count: 1320
Window/Artist: Drake    Count: 379
Window/Artist: Taylor Swift     Count: 160
Window/Artist: Metallica        Count: 94
Window/Artist: David Bowie      Count: 84
Window/Artist: Lady Gaga        Count: 37
Window/Artist: Pink Floyd       Count: 11
Window/Artist: Kate Bush        Count: 10
Window/Artist: Justice  Count: 9
Window/Artist: The Weeknd       Count: 8
Kafka Output
Matched

The matched Kafka topic holds a stream of tweets in JSON format, with the discovered metadata (artist/album/track) added. I'm using the Kafka console consumer to view the contents, parsed through jq to show just the tweet text and metadata that has been added.

kafka-console-consumer --zookeeper cdh57-01-node-01.moffatt.me:2181 \  
--topic twitter_matched1 \
--from-beginning|jq -r "[.text,.enriched.url_details.primary_domain[0],.enriched.media_details.artist,.enriched.media_details.album,.enriched.media_details.track,.enriched.match_count] "
[
  "Million Reasons by Lady Gaga - this is horrendous sorry @ladygaga https://t.co/rEtePIy3OT",
  "youtube.com",
  "https://www.youtube.com/watch?v=NvMoctjjdhA&feature=youtu.be",
  "Lady Gaga",
  null,
  "Million Reasons",
  2
]
Non-Matched

On our outbound Kafka topics we can see the non-matched messages. You'd probably disable this stream once the processing logic was finalised, but it's useful to be able to audit and validate the reasons for non-matches. Here a retweet is ignored, and we can see it's a retweet from the RT prefix of the text field. The -1 is the return code from the process_tweet function denoting a non-match:

(-1, 'retweet - ignored', {u'contributors': None, u'truncated': False, u'text': u'RT @ChartLittleMix: Little Mix adicionou datas para a Summer Shout Out 2017 no Reino Unido https://t.co/G4H6hPwkFm', u'@timestamp': u'2017-01-13T11:45:41.000Z', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 819873048132288513, u'favorite_count': 0, u'source':
Summary

In this article we've built on the foundations of the initial exploration of Spark Streaming on Python, expanding it out to address a real-world processing requirement. Processing unbounded streams of data this way is not as complex as you may think, particularly given the benefits that it can yield in reducing the latency between an event occurring and taking action on it.

We've not touched on some of the more complex areas, such as scaling this up to multiple Spark nodes, partitioned Kafka topics, and so on - that's another [kafka] topic (sorry...) for another day. The code itself is also rudimentary - before moving it into Production there'd be some serious refactoring and optimisation review to be performed on it.

You can find the notebook for this article here, and the previous article's here.

If you'd like more information on how Rittman Mead can help your business get the most of out its data, please do get in touch!

Categories: BI & Warehousing

Getting Started with Spark Streaming, Python, and Kafka

Thu, 2017-01-12 07:42

Last month I wrote a series of articles in which I looked at the use of Spark for performing data transformation and manipulation. This was in the context of replatforming an existing Oracle-based ETL and datawarehouse solution onto cheaper and more elastic alternatives. The processing that I wrote was very much batch-focussed; read a set of files from block storage ('disk'), process and enrich the data, and write it back to block storage.

In this article I am going to look at Spark Streaming. This is one of several libraries that the Spark platform provides (others include Spark SQL, Spark MLlib, and Spark GraphX). Spark Streaming provides a way of processing "unbounded" data - commonly referred to as "streaming" data. It does this by breaking it up into microbatches, and supporting windowing capabilities for processing across multiple batches. You can read more in the excellent Streaming Programming Guide.

(image src)

Why Stream Processing?

Processing unbounded data sets, or "stream processing", is a new way of looking at what has always been done as batch in the past. Whilst intra-day ETL and frequent batch executions have brought latencies down, they are still independent executions with optional bespoke code in place to handle intra-batch accumulations. With a platform such as Spark Streaming we have a framework that natively supports processing both within-batch and across-batch (windowing).

By taking a stream processing approach we can benefit in several ways. The most obvious is reducing the latency between an event occurring and taking an action driven by it, whether automatic or via analytics presented to a human. Other benefits include a more smoothed-out resource consumption profile. We can avoid the very 'spiky' demands on CPU/memory/etc every time a batch runs, by instead processing the same volume of data in smaller intervals. Finally, given that most data we process is actually unbounded ("life doesn't happen in batches"), designing new systems to be batch driven - with streaming seen as an exception - is actually an anachronism with roots in technology limitations that are rapidly becoming moot. Stream processing doesn't have to imply, or require, "fast data" or "big data". It can just mean processing data continually as it arrives, and not artificially splitting it into batches.

For more details and discussion of streaming in depth and some of its challenges, I would recommend:

Use-Case and Development Environment

So with that case made above for stream processing, I'm actually going to go back to a very modest example. The use-case I'm going to put together is - almost inevitably for a generic unbounded data example - using Twitter, read from an Apache Kafka topic. We'll start simply, counting the number of tweets per user within each batch and doing some very simple string manipulations. After that we'll see how to do the same but over a period of time (windowing). In the next blog we'll extend this further into a more useful example, still based on Twitter but demonstrating how to satisfy some real-world requirements in the processing.

I developed all of this code using Jupyter Notebooks. I've written before about how awesome notebooks are (along with Jupyter, there's Apache Zeppelin). As well as providing a superb development environment in which both the code and the generated results can be seen, Jupyter gives the option to download a Notebook to Markdown. This blog runs on Ghost, which uses Markdown as its native syntax for composing posts - so in fact what you're reading here comes directly from the notebook in which I developed the code. Pretty cool.

If you want, you can view the notebook online here, and from there download it and run it live on your own Jupyter instance.

I used the docker image all-spark-notebook to provide both Jupyter and the Spark runtime environment. By using Docker I don't have to really worry about provisioning the platform on which I want to develop the code - I can just dive straight in and start coding. As and when I'm ready to deploy the code to a 'real' execution environment (for example EMR), then I can start to worry about that. The only external aspect was an Apache Kafka cluster that I had already, with tweets from the live Twitter feed on an Apache Kafka topic imaginatively called twitter.

To run the code in Jupyter, you can put the cursor in each cell and press Shift-Enter to run each cell one at a time - or you can use the menu option Kernel -> Restart & Run All. When a cell is executing you'll see a [*] next to it, and once the execution is complete this changes to [y], where y is the execution step number. Any output from that step will be shown immediately below it.

To run the code standalone, you would download the .py from Jupyter, and execute it from the commandline using:

/usr/local/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 spark_code.py
Preparing the Environment

We need to make sure that the packages we're going to use are available to Spark. Instead of downloading jar files and worrying about paths, we can instead use the --packages option and specify the group/artifact/version based on what's available on Maven and Spark will handle the downloading. We specify PYSPARK_SUBMIT_ARGS for this to get passed correctly when executing from within Jupyter.

import os  
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'  
Import dependencies

We need to import the necessary pySpark modules for Spark, Spark Streaming, and Spark Streaming with Kafka. We also need the python json module for parsing the inbound twitter data

#    Spark
from pyspark import SparkContext  
#    Spark Streaming
from pyspark.streaming import StreamingContext  
#    Kafka
from pyspark.streaming.kafka import KafkaUtils  
#    json parsing
import json  
Create Spark context

The Spark context is the primary object under which everything else is called. The setLogLevel call is optional, but saves a lot of noise on stdout that otherwise can swamp the actual outputs from the job.

sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")  
sc.setLogLevel("WARN")  
Create Streaming Context

We pass the Spark context (from above) along with the batch duration which here is set to 60 seconds.

See the API reference and programming guide for more details.

ssc = StreamingContext(sc, 60)  
Connect to Kafka

Using the native Spark Streaming Kafka capabilities, we use the streaming context from above to connect to our Kafka cluster. The topic connected to is twitter, from consumer group spark-streaming. The latter is an arbitrary name that can be changed as required.

For more information see the documentation.

kafkaStream = KafkaUtils.createStream(ssc, 'cdh57-01-node-01.moffatt.me:2181', 'spark-streaming', {'twitter':1})  
Message Processing
Parse the inbound message as json

The inbound stream is a DStream, which supports various built-in transformations such as map which is used here to parse the inbound messages from their native JSON format.

Note that this will fail horribly if the inbound message isn't valid JSON.

parsed = kafkaStream.map(lambda v: json.loads(v[1]))  
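If you wanted to guard against that, one option - just a sketch, not part of the original notebook - is to wrap the parse in a small function that discards malformed messages rather than failing the whole batch:

# Hypothetical defensive variant of the parse step
def safe_json_loads(msg):
    try:
        return json.loads(msg)
    except ValueError:
        return None   # malformed message - drop it rather than kill the batch

parsed_safe = kafkaStream.map(lambda v: safe_json_loads(v[1])).filter(lambda t: t is not None)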
Count number of tweets in the batch

The DStream object provides native functions to count the number of messages in the batch, and to print them to the output:

We use the map function to add in some text explaining the value printed.

Note that nothing gets written to output from the Spark Streaming context and descendent objects until the Spark Streaming Context is started, which happens later in the code. Also note that pprint by default only prints the first 10 values.

parsed.count().map(lambda x:'Tweets in this batch: %s' % x).pprint()  

If you jump ahead and try to use Windowing at this point, for example to count the number of tweets in the last hour using the countByWindow function, it'll fail. This is because we've not set up the streaming context with a checkpoint directory yet. You'll get the error: java.lang.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint().. See later on in the blog for details about how to do this.
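The essence of that fix - shown here only as a quick sketch, with an example path - is to point the streaming context at a checkpoint directory before starting it:

ssc.checkpoint('/tmp/checkpoint_twitter')   # example path only; any writable directory will do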

Extract Author name from each tweet

Tweets come through in a JSON structure, of which you can see an example here. We're going to analyse tweets by author, which is accessible in the JSON structure at user.screen_name.

The lambda anonymous function is used to apply the map to each RDD within the DStream. The result is a DStream holding just the author's screenname for each tweet in the original DStream.

authors_dstream = parsed.map(lambda tweet: tweet['user']['screen_name'])  
Count the number of tweets per author

With our authors DStream, we can now count them using the countByValue function. This is conceptually the same as this quasi-SQL statement:

SELECT   AUTHOR, COUNT(*)
FROM     DSTREAM
GROUP BY AUTHOR

Using countByValue is a more legible way of doing the same thing that you'll see done in tutorials elsewhere with a map / reduceBy.
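For comparison, that longer-hand equivalent would look something like the following sketch (we'll stick with countByValue below):

# Hypothetical equivalent using map / reduceByKey instead of countByValue
author_counts_alt = authors_dstream.map(lambda author: (author, 1))\
                                   .reduceByKey(lambda x, y: x + y)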

author_counts = authors_dstream.countByValue()  
author_counts.pprint()  
Sort the author count

If you try and use the sortBy function directly against the DStream you get an error:

'TransformedDStream' object has no attribute 'sortBy'

This is because sort is not a built-in DStream function. Instead we use the transform function to access sortBy from pySpark.

To use sortBy you specify a lambda function to define the sort order. Here we're going to do it based on the number of tweets (index 1 of the RDD) per author. You'll note this index being referenced in the sortBy lambda function as x[1], negated to reverse the sort order.

Here I'm using \ as line continuation characters to make the code more legible.

author_counts_sorted_dstream = author_counts.transform(\  
  (lambda foo:foo\
   .sortBy(lambda x:( -x[1]))))
author_counts_sorted_dstream.pprint()  
Get top 5 authors by tweet count

To display just the top five authors, based on number of tweets in the batch period, we'll use the take function. My first attempt at this failed with:

AttributeError: 'list' object has no attribute '_jrdd'

Per my woes on StackOverflow a parallelize is necessary to return the values into a DStream form.

top_five_authors = author_counts_sorted_dstream.transform\  
  (lambda rdd:sc.parallelize(rdd.take(5)))
top_five_authors.pprint()  
Get authors with more than one tweet, or whose username starts with 'rm'

Let's get a bit more fancy now - filtering the resulting list of authors to only show the ones who have tweeted more than once in our batch window, or, arbitrarily, whose screenname begins with rm...

filtered_authors = author_counts.filter(lambda x:\  
                                                x[1]>1 \
                                                or \
                                                x[0].lower().startswith('rm'))

We'll print this list of authors matching the criteria, sorted by the number of tweets. Note how the sort is being done inline to the calling of the pprint function. Assigning variables and then pprinting them as I've done above is only done for clarity. It also makes sense if you're going to subsequently reuse the derived stream variable (such as with the author_counts in this code).

filtered_authors.transform\  
  (lambda rdd:rdd\
  .sortBy(lambda x:-x[1]))\
  .pprint()
List the most common words in the tweets

Every example has to have a version of wordcount, right? Here's an all-in-one with line continuations to make it clearer what's going on. Note that whilst it makes for tidier code, it also makes it harder to debug...

parsed.\  
    flatMap(lambda tweet:tweet['text'].split(" "))\
    .countByValue()\
    .transform\
      (lambda rdd:rdd.sortBy(lambda x:-x[1]))\
    .pprint()
Start the streaming context

Having defined the streaming context, now we're ready to actually start it! When you run this cell, the program will start, and you'll see the result of all the pprint functions above appear in the output to this cell below. If you're running it outside of Jupyter (via spark-submit) then you'll see the output on stdout.

ssc.start()  
ssc.awaitTermination()  
-------------------------------------------
Time: 2017-01-11 15:34:00
-------------------------------------------
Tweets in this batch: 188

-------------------------------------------
Time: 2017-01-11 15:34:00
-------------------------------------------
(u'jenniekmz', 1)
(u'SpamNewton', 1)
(u'ShawtieMac', 1)
(u'niggajorge_2', 1)
(u'agathatochetti', 1)
(u'Tommyguns_____', 1)
(u'zwonderwomanzzz', 1)
(u'Blesschubstin', 1)
(u'Prikes5', 1)
(u'MayaParms', 1)
...

-------------------------------------------
Time: 2017-01-11 15:34:00
-------------------------------------------
(u'RitaBezerra12', 3)
(u'xKYLN', 2)
(u'yourmydw', 2)
(u'wintersheat', 2)
(u'biebercuzou', 2)
(u'pchrin_', 2)
(u'uslaybieber', 2)
(u'rowblanchsrd', 2)
(u'__Creammy__', 2)
(u'jenniekmz', 1)
...

-------------------------------------------
Time: 2017-01-11 15:34:00
-------------------------------------------
(u'RitaBezerra12', 3)
(u'xKYLN', 2)
(u'yourmydw', 2)
(u'wintersheat', 2)
(u'biebercuzou', 2)

-------------------------------------------
Time: 2017-01-11 15:34:00
-------------------------------------------
(u'RitaBezerra12', 3)
(u'xKYLN', 2)
(u'yourmydw', 2)
(u'wintersheat', 2)
(u'biebercuzou', 2)
(u'pchrin_', 2)
(u'uslaybieber', 2)
(u'rowblanchsrd', 2)
(u'__Creammy__', 2)

-------------------------------------------
Time: 2017-01-11 15:34:00
-------------------------------------------
(u'RT', 135)
(u'Justin', 61)
(u'Bieber', 59)
(u'on', 41)
(u'a', 32)
(u'&', 32)
(u'Ros\xe9', 31)
(u'Drake', 31)
(u'the', 29)
(u'Love', 28)
...
[...]

You can see the full output from the job in the notebook here.

So there we have it, a very simple Spark Streaming application doing some basic processing against an inbound data stream from Kafka.

Windowed Stream Processing

Now let's have a look at how we can do windowed processing. This is where data is processed based on a 'window' which is a multiple of the batch duration that we worked with above. So instead of counting how many tweets there are every batch (say, 5 seconds), we could instead count how many there are per minute. Here, a minute (60 seconds) is the window interval. We can perform this count potentially every time the batch runs; how frequently we do the count is known as the slide interval.


Image credit, and more details about window processing, here.
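
In code terms the window and slide intervals are simply the two arguments passed to the windowed functions. A minimal sketch, assuming the five-second batch interval used throughout this article and the kafkaStream DStream defined earlier:

# 60 = window interval: each count covers the last 60 seconds (twelve 5-second batches)
#  5 = slide interval: the count is recalculated and emitted every 5 seconds
rolling_count = kafkaStream.countByWindow(60, 5)
# (this also requires the checkpoint directory described below to have been set)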

The first thing to do to enable windowed processing in Spark Streaming is to launch the Streaming Context with a checkpoint directory configured. This is used to store information between batches if necessary, and also to recover from failures. You need to rework your code into the pattern shown here. All the code to be executed by the streaming context goes in a function - which makes it less easy to present in a step-by-step form in a notebook as I have above.

Reset the Environment

If you're running this code in the same session as above, first go to the Jupyter Kernel menu and select Restart.

Prepare the environment

These are the same steps as above.

import os  
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'  
from pyspark import SparkContext  
from pyspark.streaming import StreamingContext  
from pyspark.streaming.kafka import KafkaUtils  
import json  
Define the stream processing code
def createContext():  
    sc = SparkContext(appName="PythonSparkStreamingKafka_RM_02")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 5)

    # Define Kafka Consumer
    kafkaStream = KafkaUtils.createStream(ssc, 'cdh57-01-node-01.moffatt.me:2181', 'spark-streaming2', {'twitter':1})

    ## --- Processing
    # Extract tweets
    parsed = kafkaStream.map(lambda v: json.loads(v[1]))

    # Count number of tweets in the batch
    count_this_batch = kafkaStream.count().map(lambda x:('Tweets this batch: %s' % x))

    # Count by windowed time period
    count_windowed = kafkaStream.countByWindow(60,5).map(lambda x:('Tweets total (One minute rolling count): %s' % x))

    # Get authors
    authors_dstream = parsed.map(lambda tweet: tweet['user']['screen_name'])

    # Count each value and number of occurrences
    count_values_this_batch = authors_dstream.countByValue()\
                                .transform(lambda rdd:rdd\
                                  .sortBy(lambda x:-x[1]))\
                              .map(lambda x:"Author counts this batch:\tValue %s\tCount %s" % (x[0],x[1]))

    # Count each value and number of occurrences in the batch windowed
    count_values_windowed = authors_dstream.countByValueAndWindow(60,5)\
                                .transform(lambda rdd:rdd\
                                  .sortBy(lambda x:-x[1]))\
                            .map(lambda x:"Author counts (One minute rolling):\tValue %s\tCount %s" % (x[0],x[1]))

    # Write total tweet counts to stdout
    # Done with a union here instead of two separate pprint statements just to make it cleaner to display
    count_this_batch.union(count_windowed).pprint()

    # Write tweet author counts to stdout
    count_values_this_batch.pprint(5)
    count_values_windowed.pprint(5)

    return ssc
Launch the stream processing

This uses local disk to store the checkpoint data. In a Production deployment this would be on resilient storage such as HDFS.

Note that, by design, if you restart this code using the same checkpoint folder, it will execute the previous code - so if you need to amend the code being executed, specify a different checkpoint folder.

ssc = StreamingContext.getOrCreate('/tmp/checkpoint_v01',lambda: createContext())  
ssc.start()  
ssc.awaitTermination()  
-------------------------------------------
Time: 2017-01-11 17:08:55
-------------------------------------------
Tweets this batch: 782
Tweets total (One minute rolling count): 782

-------------------------------------------
Time: 2017-01-11 17:08:55
-------------------------------------------
Author counts this batch:    Value AnnaSabryan   Count 8
Author counts this batch:    Value KHALILSAFADO  Count 7
Author counts this batch:    Value socialvidpress    Count 6
Author counts this batch:    Value SabSad_   Count 5
Author counts this batch:    Value CooleeBravo   Count 5
...

-------------------------------------------
Time: 2017-01-11 17:08:55
-------------------------------------------
Author counts (One minute rolling):    Value AnnaSabryan   Count 8
Author counts (One minute rolling):    Value KHALILSAFADO  Count 7
Author counts (One minute rolling):    Value socialvidpress    Count 6
Author counts (One minute rolling):    Value SabSad_   Count 5
Author counts (One minute rolling):    Value CooleeBravo   Count 5
...

[...]

-------------------------------------------
Time: 2017-01-11 17:10:10
-------------------------------------------
Tweets this batch: 5
Tweets total (One minute rolling count): 245

-------------------------------------------
Time: 2017-01-11 17:10:10
-------------------------------------------
Author counts this batch:    Value NowOnFR   Count 1
Author counts this batch:    Value IKeepIt2000   Count 1
Author counts this batch:    Value PCH_Intl  Count 1
Author counts this batch:    Value ___GlBBS  Count 1
Author counts this batch:    Value lauracoutinho24   Count 1

-------------------------------------------
Time: 2017-01-11 17:10:10
-------------------------------------------
Author counts (One minute rolling):    Value OdaSethre Count 3
Author counts (One minute rolling):    Value CooleeBravo   Count 2
Author counts (One minute rolling):    Value ArrezinaR Count 2
Author counts (One minute rolling):    Value blackpinkkot4 Count 2
Author counts (One minute rolling):    Value mat_lucidream Count 1
...

You can see the full output from the job in the notebook here. Let's take some extracts and walk through them.

Total tweet counts

First, the total tweet counts. In the first slide window, they're the same, since we only have one batch of data so far:

-------------------------------------------
Time: 2017-01-11 17:08:55
-------------------------------------------
Tweets this batch: 782
Tweets total (One minute rolling count): 782 

Five seconds later, we have 25 tweets in the current batch - giving us a total of 807 (782 + 25):

-------------------------------------------
Time: 2017-01-11 17:09:00
-------------------------------------------
Tweets this batch: 25
Tweets total (One minute rolling count): 807 

Fast forward just over a minute and we see that the windowed count for a minute is not just going up - in some cases it goes down - since our window is now not simply the full duration of the inbound data stream, but is shifting along and giving a total count for the last 60 seconds only.

-------------------------------------------
Time: 2017-01-11 17:09:50
-------------------------------------------
Tweets this batch: 28
Tweets total (One minute rolling count): 1012

-------------------------------------------
Time: 2017-01-11 17:09:55
-------------------------------------------
Tweets this batch: 24
Tweets total (One minute rolling count): 254
Count by Author

In the first batch, as with the total tweets, the batch tally is the same as the windowed one:

-------------------------------------------
Time: 2017-01-11 17:08:55
-------------------------------------------
Author counts this batch:    Value AnnaSabryan   Count 8
Author counts this batch:    Value KHALILSAFADO  Count 7
Author counts this batch:    Value socialvidpress    Count 6
Author counts this batch:    Value SabSad_   Count 5
Author counts this batch:    Value CooleeBravo   Count 5
...

-------------------------------------------
Time: 2017-01-11 17:08:55
-------------------------------------------
Author counts (One minute rolling):    Value AnnaSabryan   Count 8
Author counts (One minute rolling):    Value KHALILSAFADO  Count 7
Author counts (One minute rolling):    Value socialvidpress    Count 6
Author counts (One minute rolling):    Value SabSad_   Count 5
Author counts (One minute rolling):    Value CooleeBravo   Count 5    

But notice in subsequent batches the rolling totals are accumulating for each author. Here we can see KHALILSAFADO (with a previous rolling total of 7, as above) has another tweet in this batch, giving a rolling total of 8:

-------------------------------------------
Time: 2017-01-11 17:09:00
-------------------------------------------
Author counts this batch:    Value DawnExperience    Count 1
Author counts this batch:    Value KHALILSAFADO  Count 1
Author counts this batch:    Value Alchemister5  Count 1
Author counts this batch:    Value uused2callme  Count 1
Author counts this batch:    Value comfyjongin   Count 1
...

-------------------------------------------
Time: 2017-01-11 17:09:00
-------------------------------------------
Author counts (One minute rolling):    Value AnnaSabryan   Count 9
Author counts (One minute rolling):    Value KHALILSAFADO  Count 8
Author counts (One minute rolling):    Value socialvidpress    Count 6
Author counts (One minute rolling):    Value SabSad_   Count 5
Author counts (One minute rolling):    Value CooleeBravo   Count 5
Summary

What I've put together is a very rudimentary example, simply to get started with the concepts. In the examples in this article I used Spark Streaming because of its native support for Python, and the previous work I'd done with Spark. Jupyter Notebooks are a fantastic environment in which to prototype code, and for a local environment providing both Jupyter and Spark you can't beat the all-spark-notebook Docker image.

There are other stream processing frameworks and languages out there, including Apache Flink, Kafka Streams, and Apache Beam, to name but three. Apache Storm and Apache Samza are also relevant, but whilst they were early to the party they seem to crop up less frequently in stream processing discussions and literature nowadays.

In the next blog we'll see how to extend this Spark Streaming further with processing that includes:

  • Matching tweet contents to a predefined list of filter terms, and filtering out retweets
  • Including only tweets that include URLs, and comparing those URLs to a whitelist of domains
  • Sending tweets matching a given condition to a Kafka topic
  • Keeping a tally of tweet counts per batch and over a longer period of time, as well as counts for terms matched within the tweets
Categories: BI & Warehousing

Web-Based RPD Upload and Download for OBIEE 12c

Wed, 2017-01-11 06:15

I was among the people who were dancing and singing after finding out some of the OBIEE 12c new features. The feature I liked the most was a scripted deploy of an RPD file from a developer’s computer. I hated having to make dozens of clicks for every deploy of an RPD in 11g. You may object and say that there is WLST in 11g which can do the same and even more. Well, you are right. Except for one thing: WLST is a server-side thing. Information security folk don’t like giving OBIEE developers direct access to a server. And not every developer is capable of using it.

In OBIEE 12c the only way to upload and download RPDs between a developer’s local machine and the OBIEE server is through the command line. We’re big fans of the command-line approach because it enables automation, reduces the risk of error, and so on. But not everyone likes a script-everything approach as much as we do. Many OBIEE developers don’t want to use a command line to do what they have been doing with their mouse for years. And today we have a solution for them!

Disclaimer. Everything below is a result of our investigation. It’s not a supported functionality or Oracle’s recommendation. It makes use of undocumented web services that Oracle may remove or change at any time.

Some time ago Robin Moffatt lifted the lid on OBIEE 12c Web Services. He found out how to use curl to do the same things Oracle does with their data-model-cmd (now datamodel) script. But that was purely for geek interest, intended to give us more understanding of what's going on inside OBIEE rather than to give us a new tool. So the next obvious step was to put a user-friendly interface over those web services so any OBIEE developer could utilise this sacred knowledge.

The Simplest Sample

Modern computer technologies offer us a lot of tools to build GUIs, but we wanted to keep things as simple as possible, and because OBIEE’s front end is web-based, HTML was the obvious choice for our RPD tool too.

Download

Let's start with the RPD download.
Here is the curl script to call the OBIEE web service and get the RPD file.

curl -X "POST" "http://192.168.0.66:7780/bi-lcm/v1/si/ssi/rpd/downloadrpd" \  
--data-urlencode "target-password=Admin123" \
--basic --user weblogic:Admin123 \
> downloadrpd.rpd

an animated gif of curl download

As you can see it's pretty simple. We send a request to http://<host>:<port>/bi-lcm/v1/si/ssi/rpd/downloadrpd using the POST method. As parameters, we send the password to set on the downloaded RPD file (target-password=Admin123) and the authentication information (weblogic:Admin123). As a result, we get the bytes of the RPD, which we redirect to the downloadrpd.rpd file. And now we want a GUI for this script. Actually, Robin already did it.

<html>  
   <body>
      <FORM action="http://192.168.0.66:7780/bi-lcm/v1/si/ssi/rpd/downloadrpd"
         method="post" target=_blank>
         <P>
           New password for downloaded RPD file? <INPUT type="password" name="target-password"><BR>
            <INPUT type="submit" value="Send"> <INPUT type="reset">
      </FORM>
   </body>
</html>  

pic of a local download HTML form

This is not a snippet of code you somehow have to incorporate into your system. No. That's almost the complete GUI for RPD download! The only thing you need to do is change the hostname and port to match your system. That's all. Simply create an HTML file, put this code into it, change the host and port, open it with a browser, and enjoy!

This form has no field for authentication because the OBIEE server will ask us for a login and password on the first call and will maintain the session after that.
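
If you do want to script the download rather than click through a form, the curl call above translates directly to Python. Here's a minimal sketch using the requests library, with the same example host and credentials as above - adjust them to your environment:

import requests

# Same undocumented endpoint as the curl example above
url = "http://192.168.0.66:7780/bi-lcm/v1/si/ssi/rpd/downloadrpd"

resp = requests.post(url,
                     data={"target-password": "Admin123"},  # password set on the downloaded RPD
                     auth=("weblogic", "Admin123"))         # basic authentication
resp.raise_for_status()

# The response body is the RPD file itself
with open("downloadrpd.rpd", "wb") as f:
    f.write(resp.content)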

Upload

The upload was a little bit more tricky from the curl side. Let's take a look at the script:

curl -X POST \  
     "http://192.168.0.66:7780/bi-lcm/v1/si/ssi/rpd/uploadrpd" \
     --form "file=@sample.rpd;type=application/vnd.oracle.rpd" \
     --form "rpd-password=Admin123" \
     --basic --user weblogic:Admin123

an animated gif of curl upload

Here we call another service to upload our file. Our parameters are:

  1. sample.rpd - the RPD file to upload
  2. type=application/vnd.oracle.rpd - MIME type of this file (that was the main trick).
  3. rpd-password=Admin123 - the password of sample.rpd
  4. weblogic:Admin123 - information for authentication.
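
And again, for anyone who prefers to script this rather than use curl, here is a rough sketch of the same upload call with Python's requests library (example values as above; the multipart upload carries the MIME type, just as in the curl version):

import requests

url = "http://192.168.0.66:7780/bi-lcm/v1/si/ssi/rpd/uploadrpd"

with open("sample.rpd", "rb") as f:
    resp = requests.post(url,
                         files={"file": ("sample.rpd", f, "application/vnd.oracle.rpd")},  # MIME type is the trick
                         data={"rpd-password": "Admin123"},  # password of sample.rpd
                         auth=("weblogic", "Admin123"))      # basic authentication

print(resp.status_code)
print(resp.text)  # the JSON response described below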

But the GUI for this task is surprisingly simple. I expected it to be monstrous with lots of JS, but in fact it is small and easy. The minimum viable version is almost as simple as the download one.

<html>  
   <body>
      <FORM action="http://192.168.0.66:9502/bi-lcm/v1/si/ssi/rpd/uploadrpd"
         method="post" enctype="multipart/form-data" target=_blank>
           File to upload <INPUT type=file name="file"><BR>
           Password for uploaded RPD file? <INPUT type="password" name="rpd-password"><BR>
           <INPUT type="submit" value="Send"> <INPUT type="reset">
      </FORM>
   </body>
</html>  

pic of a local upload HTML form

The use of this piece of code is exactly the same as for the download. Simply put it into an HTML file, change the host and port, and use it.

Keep in mind that for both forms the field names are fixed and shouldn't be changed. For example, the field for the file to upload must be named "file" and the password field "rpd-password". Without them, the magic won't work.

But there is one thing about this part that we could still improve. Depending on the browser you use, the response message is either shown in the same window or downloaded as a text file. And this message is JSON.

In real life this message is a single line of JSON, but here it has been made more human-readable with jq and slightly polished by hand.

{
  "clazz": ["rpd-response"],
  "links": [
    {
      "href": "http://192.168.0.66:7780/bi-lcm/v1/si/ssi/rpd/uploadrpd",
      "rel": ["self"]
    }
  ],
  "properties": {
    "entry": [
      {
        "key": "si",
        "value": {
          "type": "string",
          "value": "ssi"
        }
      },
      {
        "key": "description",
        "value": {
          "type": "string",
          "value": "RPD upload completed successfully."
        }
      },
      {
        "key": "desc_code",
        "value": {
          "type": "string",
          "value": "DESC_CODE_RPD_UPLOAD_SUCCESSFUL"
        }
      },
      {
        "key": "status",
        "value": {
          "type": "string",
          "value": "SUCCESS"
        }
      }
    ]
  },
  "title": "RPD-LCM response, SI=ssi, action=Upload RPD"
}

As you can see, we have a "description" field which holds a human-readable message, a "desc_code" field which carries the same information in a form more suitable for automated processing, and a "status" field which is the first candidate for use in automated procedures.

It's easy to read this file but most of the time you'd prefer a simple "Success" message, right?

Going Further

These HTML forms do the trick. A developer can now download and upload an RPD file easily, with a minimum of clicks and without needing to learn a command-line interface. Security is managed by the WebLogic server. Sounds good, right? But we can do even better. From my point of view the absolutely necessary improvements are:

  1. Add some JS to make diagnostics more user-friendly.
  2. Put these forms to a server so every developer in an organisation can use them.
Adding Some JavaScript Magic

My intent from the very beginning was to keep things as simple as possible. I’m not sure that this time my choice of JavaScript library (jQuery) was the simplest for the task, but anyway the code I had to write is very small, so I like it.

<html>  
<head>  
<script src="./jquery-3.1.1.min.js"></script>  
</head>

<body>

<script>  
$(document).ready(function(){
$("#upload").on('submit', function( e ) {
        e.preventDefault();
    $.ajax( {
        url: $(this).attr('action'),
        type: 'POST',
        data: new FormData( this ),
        processData: false,
        contentType: false,
        dataFilter: function (data, type){$("#response").html(JSON.parse(data).properties.entry[1].value.value);$("#response").append('<details>'+data+'</details>');}
    });
});
});
</script>

<FORM id=upload action="http://192.168.0.136:9502/bi-lcm/v1/si/ssi/rpd/uploadrpd"  
         method="post" enctype="multipart/form-data">
           File to upload <INPUT type=file name="file"><BR>
           Password for uploaded RPD file? <INPUT type="password" name="rpd-password"><BR>
           <INPUT type="submit" value="Send"> <INPUT type="reset">
</FORM>

<div id=response></div>

</body>  
</html>  

The script reads the form and sends its content to the server, then reads the answer, parses it and shows it in a user-friendly way. Note that it does need the jQuery library to work. The problem with this code is that it won't work locally. If you try to use it in the same way as the previous samples, it won't do anything. But if we take a look at the browser's developer console we immediately find the answer: OBIEE blocks my cross-domain JavaScript call.

a pic of developer console

That could become a problem, but I was going to put these files on a server anyway so that all developers could access them.

Deploying It to a Server

What I want to do now is put my forms somewhere accessible with a browser on the server where OBIEE runs. To achieve that I need to do a few steps.

  1. Create a directory on the server.
  2. Put my files to that directory.
  3. Expose the directory with a web server.

There are no special requirements for the location of the directory I will create. It just needs to be accessible by a web server (WebLogic). I prefer to keep all user content in one place, so my choice is to place it somewhere inside $ORACLE_HOME/user_projects.

a pic of location

But there is one special requirement for the directory content. It should have a subdirectory WEB-INF with a web.xml file inside.

a pic of web.xml

For my current purposes, an almost empty web.xml is just fine. That may not be the best option for real life, but I'm trying to keep things simple, remember?

<?xml version="1.0" encoding="UTF-8"?>  
<web-app xmlns="http://java.sun.com/xml/ns/javaee"  
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"
    version="2.5"/>

I combined both the download and upload forms into one rpdtools.html file and added some styling for a nicer look. From the functional point of view, the forms and the script were not changed. I then put this combined file and the jQuery library into my "static" directory.

a pic with directory contents

Now everything is ready for the final step. I need to deploy this directory to the WebLogic server so that users can access it with a browser.

Log in to the WebLogic console and deploy the static folder as an application.
a pic of wls console

For more details on deploying folders to WLS, see the Official documentation on web.xml and Deploying Using Shared Folders.

And now the most exciting part of the process. Witness the power of this fully operational battle station! I mean RPD tools.
a pic of the final result

Summary

We showed here a very simple way of adding a web-based GUI for uploading and downloading RPDs to and from any OBIEE 12c system. You can take it and adjust it to suit your needs, and it can become a useful day-to-day tool. Deploying this code to a server gives every OBIEE developer in an organisation access to it, and lets you add some cool JavaScript effects. But keep in mind that it uses undocumented services and is not supported by Oracle in any way. This means that it can stop working after an upgrade. Well, in that case, we'll have to invent something new for you.

In case you want to play with this tool, here is a link to our GitHub obi-web-rpd-tools.

Categories: BI & Warehousing

Python for Analytics - Exploring Data with Pandas

Thu, 2016-12-22 12:24
A Crack Team!

At Rittman Mead, we're always encouraged to branch out and pursue new skills in the field in an effort to improve upon our skill sets, and as a result, become more technically fluent. One of the great things about working here, aside from the previous, is that while we all have a really solid foundation in Oracle technologies, there are many who possess an incredibly diverse range of skills, fostered by years of working in tech-agnostic engagements. It's a great feeling to know that if you ever run up against some sort of bizarre challenge, or have to integrate some piece of arcane tech into an architectural solution, more than likely, someone, somewhere within Rittman Mead has had to do it. It is this freedom to pursue, within reason of course, other technical exploits that has nurtured a real spirit of innovation around the company within the past year. Our GitHub is overflowing with open source visualizations and performance monitoring and maintenance scripts that are simply there for the taking! We've put a lot of time into developing this stuff so our clients and partners don't have to.

Python

python logo

But I digress. This blog is about Python, and well, I haven't really mentioned it up until this point. It is in this spirit of innovation, learning, and frankly, automating the boring stuff, however, that a lot of us have been pursuing automation and analytical endeavors using the Python language. It has been touted by many as THE language for data science, and rightfully so, given its accessibility and healthy selection of libraries perfectly suited to the task, such as NumPy, Seaborn, Pandas, Matplotlib. In today's exercise, we're going to walk through common data munging, transformation, and visualization tasks using some of these libraries in order to deliver quick insights into a data set that's near and dear to my heart, Game of Thrones battles and character deaths!

GoT Logo

Through this process, we will be creating our own data narrative that will help to expound upon the idle numbers and labels of the data set. We'll see that the process is less a hard-and-fast, rigid set of rules for approaching data exploration, and more akin to solving a crime, clue by clue, letting the data tell the story.

PYTHON FOR DATA SCIENCE

Aside from its myriad, community driven and maintained libraries, the greatest thing, to me anyway, about Python is its relatively low barrier to entry. Even with little to no previous programming skills, an enterprising lad or lady can get up and running, performing basic, functional programming tasks in no time. You'll be amazed at how quickly you'll start coming up with daily tasks that you can throw a bit (or a lot) of Python at. Today, we'll be tackling some tasks like these, common to the everyday processes of data analysis and data science. Utilizing the Pandas library, in addition to a few others, we'll see how we can programmatically go from question to answer in no time, and with most any structured or unstructured data set. The primary audience of this blog will be those with a bit of Python fluency, in addition to those with an interest in data science and analytics. I will be explaining the steps and providing a Jupyter notebook (link here) for those who wish to follow along, or for those who might need a little extra guidance. So don't bail now! Let's get to it. In this instance, we'll be downloading the Game of Thrones data set from kaggle, a great site that provides open data sets and accompanying analysis projects in multiple languages. Download the data set if you'd like to follow along.

GETTING STARTED

stepping stones

Let's begin by taking some steps to get our heads on straight and carve out a clear work flow. I find this is really helpful, especially in programming and/or analytical scenarios where one can begin to suffer from "analysis paralysis". So, at a high level, we'll be doing the following:

  1. First, we'll take a cursory look at the Python libraries we'll be incorporating into our data sleuthing exercise, how they're used, and some examples of their output and ideal use cases.

  2. Next we'll use the tools in these libraries to take a deeper dive into our data set and start to construct our initial line of questioning. This is where we get to be a bit creative in coming up with how we're going to wrap our heads around the data, and what kind of questions we're going to throw at it.

  3. We'll then chase down any leads, incorporating additional analyses where necessary, and begin to construct a narrative about our data set. At this point we'll be formulating hypotheses and attempting to construct visualizations that will help us to further or disprove our investigation.

PANDAS IN THE JUNGLE

Any great detective must always have with them a toolkit with which to thoroughly examine any crime scene, and that's essentially what we have in the Pandas, Seaborn, and Numpy ("num-pie") libraries for the Python programming language. They provide a set of methods (functions) that can take an input, or a number of inputs, do some magic, and then provide us with lots of really useful information. So let's begin by examining these libraries and what we can do with each.

Pandas and Numpy

pands_logo

Pandas is great at doing a bunch of really common tasks related to data exploration, including, but not limited to, indexing and selection, merging and joining data sets, grouping and aggregations, and visualizing data. This will be the library with which we'll be doing a lot of the heavy lifting. Pandas also provides us with the DataFrame object, which greatly expands on NumPy's comparatively more rigid ndarray object. These 'objects' are simply containers that hold data of some kind, and allow us to interact with that data.
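
As a tiny, self-contained taste of what that looks like - illustrative data only, not yet the Game of Thrones files:

import pandas as pd

# A DataFrame is a labelled, table-like container that copes well with missing values
battles_demo = pd.DataFrame({
    "attacker": ["Lannister", "Stark", "Lannister"],
    "outcome":  ["win", "loss", "win"],
    "size":     [15000, 18000, None],
})

# Indexing/selection, then grouping and aggregation, each in a single line
print(battles_demo[battles_demo["attacker"] == "Lannister"])
print(battles_demo.groupby("attacker")["size"].sum())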

Matplotlib and Seaborn

matplot_logo

Matplotlib is a robust visualization library built to enable interactive, MATLAB-style plotting on most any platform or back-end. This library, along with Seaborn, should be your go-to for producing super malleable graphs and visualizations. Working alongside matplotlib, seaborn pitches itself as a go-to for statistically based visualizations, but supports complex, grid and algorithm based charts as well. Both of these libraries will help us to make quick and insightful decisions about our data, and to gather evidence further supporting, or disproving, any hypotheses we might form.

THE INVESTIGATION

desk

Now that we've armed ourselves with the tools we need to thoroughly examine any potential data set we're given, let's do just that! Opening up the Game of Thrones csv files from above, let's first take a look at what kinds of information we have. The basic stats are as follows:

Synopsis

  • Battles - a complete listing of the battles in the book series and their stats! Attacker, defender, army size, you name it, we've got it.

  • Character Deaths - something the series/show is quite known for, who died? This contains some great info, such as allegiance and nobility status.

  • Character Predictions - The more morbid of the lot, this data set lists predictions on which character will die. We won't be using this sheet for our exercise.

A Hypothesis of Thrones

Having just finished the monumental series myself, you could say that at this point I'm somewhat of a subject matter expert; we have a situation not unlike one you might find in any organization. We've got an interested party that wants to look further into some aspect of their data. We can use our investigatory tool-set to get real results and gain some valuable and informative insights. As subject matter experts though, we should ideally be coming at our data with at least some semblance of a hypothesis, or something that we're trying to prove using our data (or disprove for that matter). For the sake of this exercise, and fitting in with the theme of the data, I'm going to try and dig up an answer to the following:

Does House Lannister, for as evil and scheming as they are, and as much as they get away with, eventually get what's coming to them?

No lannisters!

As much as I'd like to believe it's true, however, we're going to need to run the numbers, and let our data do the talking.

Importing the Data

You can follow along in the Jupyter notebook here now. Working with our Pandas library, we first need to get our data into some sort of workable object. As we stated before, this is the data frame. It is simply a table type object that is really good at handling empty values and data of many different types. We can easily perform operations on these objects and visualize them with minimal fuss. So, enough talk. Let's do it!

Working in your favorite IDE (Pycharm is easy to use and comes in a free version), we start a new project, import the libraries we need, and then drop in our first piece of code. This is the section that imports our csv data set and then converts it to a data frame. So, now that we have our object, what do we do with it?

dataframe
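
The import itself is only a couple of lines. Here's a sketch of what that cell might look like, assuming the file names from the kaggle download - adjust the paths to wherever you saved them:

import pandas as pd

# Load the two files we'll be using into DataFrames
battles = pd.read_csv("battles.csv")
deaths = pd.read_csv("character-deaths.csv")

# A quick sanity check of what we've got
print(battles.shape, deaths.shape)
print(battles.head())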

A Graph Has No Name

Now that we have our data frame object, we can begin to throw some code at it, crunch some numbers, and see if, in fact, the Lannisters really did get what was coming to them by the end of book 5. Starting with the battles data set, let's see how they fared in the field through the arc of the story. Did they lose more or fewer troops comparatively? We can do this easily by breaking our data frame into smaller, more manageable chunks, and then graphing these data points accordingly. We are going to use the data set to build a step-by-step set of analyses that examines the Lannister victories and defeats throughout the story.

Battle / Troop Loss Over Time

Did the Lannisters hit a losing streak, or did they do well throughout the story? Did they win or lose more of their battles over time?

  1. Start with new data frame based on house and troop sizes:
    troop sizes

  2. Filter to get new results (Lannisters only):

Lannister Troops
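
Here's a sketch of those two steps in pandas, assuming the column names from the kaggle battles file (attacker_1, attacker_size, attacker_outcome and so on) - adjust if yours differ:

# 1. New frame with just the columns we care about
troops = battles[["year", "attacker_1", "attacker_size", "attacker_outcome"]]

# 2. Filter down to the battles where the Lannisters were the primary attacker
lannister_troops = troops[troops["attacker_1"] == "Lannister"]
print(lannister_troops.sort_values("year"))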

Right away we see we have some data issues; that is, there are some holes in the attacker size column. The good thing is that we can more or less look at this small table and get all the info we need from it right away. The numbers drop down significantly through the years, and that's all there really is to it. But was this, in fact, because they lost more troops, or because they simply threw fewer at the problem as they began to carve out their claim to the kingdom? This analysis is not very telling. We're going to need to do some digging elsewhere to answer our question. Let's do some comparisons.

% of Battles Won / Lost

So how did the Lannisters do in the field? Of the 8 battles they fought in, how many did they win? How does this compare with the other armies of the Seven Kingdoms?

As we did before, let's get a new data frame together, and then do our grunt work on it to try and answer these questions. Grabbing the columns we need, let's run the numbers on how the Lannisters stack up against the other houses of Westeros in the field.

How many battles did they fight compared to the other Houses?

battles fought

How many did they actually win?

won-to-total
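
A rough sketch of how these figures can be pulled together, again assuming the kaggle column names and, as a simplification, counting only the battles where a house was the primary attacker:

import matplotlib.pyplot as plt

# Battles fought per attacking house
fought = battles["attacker_1"].value_counts()

# Battles won per attacking house (attacker_outcome holds 'win' or 'loss')
won = battles[battles["attacker_outcome"] == "win"]["attacker_1"].value_counts()

# Won-to-total ratio as a simple bar chart
(won / fought).sort_values(ascending=False).plot(kind="bar", title="Share of battles won")
plt.show()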

We can see right away that, out of all the battles they fought throughout the series (decidedly more than the other houses), they came out on top. Could the Lannisters be the dominating force on the field, as well as at court? The Starks are the only house that meet them conflict for conflict, and the Lannisters still reign supreme! Let's take things down to a finer grain and see how those who aligned themselves with the Lion did compared to those who didn't.

Death by Allegiance

Opening up our character deaths file, right away we see we have some pretty good info here. We have a laundry list of characters, their death year, and the house, if any, to which they were aligned. Let's start by building a data frame and first filtering out those who are unaligned, in the Night's Watch, or Wildlings. We want to get a comparison between houses, and these groups will just muck up the works. Let's do the numbers. We can then plot this info on a simple bar chart to get a rundown of the massacre.

death by alignment
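
A sketch of how that chart comes together, assuming the column names in the kaggle character-deaths file (Allegiances, Death Year) and allegiance labels along the lines of those below - yours may differ slightly:

import matplotlib.pyplot as plt

# Keep only the characters with a recorded death
died = deaths[deaths["Death Year"].notnull()]

# Drop the unaligned, the Night's Watch and the Wildlings so we compare house against house
aligned = died[~died["Allegiances"].isin(["None", "Night's Watch", "Wildling"])]

# Named-character deaths per allegiance, as a bar chart
aligned["Allegiances"].value_counts().plot(kind="bar", title="Deaths by allegiance")
plt.show()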

Things are starting to look up...depending on your point of view, I guess. The Lannisters, for all their dirty business, do seem to, in fact, lose the most named characters tied to their house. Of these, let's see how many were actually nobility, or rather, the most influential in furthering their cause!

noble deaths to total
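
Continuing the sketch above, and assuming the Nobility column is a 1/0 flag as in the kaggle file, the noble-to-total ratio is a one-liner:

# Proportion of each house's recorded deaths that were noble characters
noble_share = aligned.groupby("Allegiances")["Nobility"].mean().sort_values(ascending=False)
print(noble_share)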

It would seem our Lannisters aren't too good at keeping their hands clean and letting those of lesser station do their dirty work for them. Although they have the second most aligned character deaths in the series, roughly 75% of them are noble deaths, meaning that people important to their cause are dying! The only other houses that come close, unfortunately, are the Starks (the Red Wedding, no doubt) and the Greyjoys. What this also means, however, is that our claim is gathering more support; the Lannisters may have climbed the royal ladder, but at what cost?

Paying Your Debts

death_prop
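
For reference, here is a sketch of one way such a donut chart can be produced from the aligned frame built earlier - matplotlib's pie with a reduced wedge width gives the donut effect:

import matplotlib.pyplot as plt

# Share of all aligned named-character deaths, per house
death_share = aligned["Allegiances"].value_counts()

fig, ax = plt.subplots()
ax.pie(death_share, labels=death_share.index, wedgeprops={"width": 0.4}, startangle=90)
ax.set_title("Share of named-character deaths by allegiance")
plt.show()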

We can see from the donut chart above (excuse the repetition of colors) that indeed, the Lannisters have one of the highest % to total death numbers out of all the major houses in the Seven Kingdoms. This actually goes quite a long way in backing up our hypothesis; that of all the named characters in the series, the Lannisters lost the lion's share (pun intended). The disconcerting thing is that they either seem to bring down many others with them, or the other noble houses aren't terribly great at keeping themselves among the living either.

Conclusion

Are these figures, combined with their high number of noble deaths, enough to satisfy my desire for vengeance? Did they truly reap what they have sown? I have to say I am ultimately undecided in the matter, as, although they did lose a great many, they in turn took a greater number down along with them. It seems that despite these losses, any notion of vengeful satisfaction must be tempered by this fact; that although the Lannisters did end up getting hit pretty hard with significant losses, this is bittersweet when compared to the real and lasting damage they did throughout the span of the books' and show's history. Were you able to come up with any additional evidence for or against my case? Link out and show us! Thanks for reading.

Categories: BI & Warehousing
