Rittman Mead Consulting

Subscribe to Rittman Mead Consulting feed Rittman Mead Consulting
Rittman Mead consults, trains, and innovates within the world of Oracle Business Intelligence, data integration, and analytics.
Updated: 1 month 2 days ago

HVR | Real-Time CDC ::Oracle Autonomous DW::

Fri, 2019-09-13 09:49
Introduction :Oracle Autonomous DW::

High quality Business Intelligence is key in decision making for any successful organisation and for an increasing number of businesses this means being able to access real-time data.  At Rittman Mead we are seeing a big upturn in interest in technologies that will help our customers to integrate real-time data into their BI systems.  In a world where streaming is revolutionising the way we consume data, log-based Change Data Capture (CDC) comes into the picture as a way of providing access to real-time transactional data for reporting and analysis. Most databases support CDC nowadays; in Oracle for example the redo logs are a potential source of CDC data. Integration tools that support CDC are coming to the fore and one of these is HVR.

HVR is a real-time data integration solution that replicates data to and from many different data management technologies, both on-premise and cloud-based, including Oracle, SAP and Amazon. The HVR GUI makes it easy to connect together multiple databases, replicate entire schemas, initialise the source and target systems and then keep them in sync.

HVR have recently added Oracle’s Autonomous Data Warehouse (ADW) to their list of supported technologies and so in this blog we are stepping through the process of configuring an HVR Channel to replicate data from an Oracle database to an instance of Oracle ADW.


Before setting up replication you have to install HVR itself. This is simple enough, a fairly manual CLI job with a couple of files to create and save in the correct directories. Firewalls also needs to allow all HVR connections. HVR needs a database schema in which to store the repository configuration data and so we created a schema in the source Oracle database. It also needs some additional access rights on the Oracle source database.

Process 1.

The first step is to register the newly created Hub in the HVR GUI. The GUI can be run on any machine that is able to connect to the server on which the HVR hub is installed. We tested two GUI instances, one  running on a Windows machine and one on a MAC. Both were easy to install and configure.

:Oracle Autonomous DW::

The database connection details entered here are for the HVR hub database, where metadata about the hub configuration is stored.


Next we need to define our source and target. In both cases the connection between the HVR and the data uses standard Oracle database connectivity. The source connection is to a database on the same server as the HVR hub and the target connection uses a TNS connection pointing at the remote ADW instance.

Defining the source database involves right clicking on Location Configuration and selecting New Location:

:Oracle Autonomous DW::

Configuring the target involves the same steps:

:Oracle Autonomous DW::

You can see from the screenshot that we are using one of the Oracle-supplied tnsnames entries to connect to ADW and also that we are using a separate Oracle client install to connect to ADW. Some actions within HVR use the Oracle Call Interface and require a more recent version of the Oracle client than provided by our 12c database server install.

Next up is creating the “channel”. A channel channel in HVR groups together the source and target locations and allows the relationship between the two to be defined and maintained.  Configuring a new channel involves naming it, defining source and target locations and then identifying the tables in the source that contain the data to be replicated.


The channel name is defined by right clicking on Channel Definitions and selecting New Channel.

:Oracle Autonomous DW::

We then open the new channel and right click on Location Groups and select New Group to configure the group to contain source locations:

:Oracle Autonomous DW::

The source location is the location we defined in step 2 above. We then right click on the newly created group and select New Action, Capture  to define the role of the group in the channel:

:Oracle Autonomous DW::

The Capture action defines that data will be read from the locations in this group.

A second Location Group is needed for the for the target. This time we defined the target group to have the Integrate action so that data will be written to the locations in this group.


The final step in defining the channel is to identify the tables we want to replicate. This can be done using the Table Explore menu option when you right-click on Tables.:

:Oracle Autonomous DW:: 5.

With the channel defined we can start synchronising the data between the two systems. We are starting with an empty database schema in our ADW target so we use the HVR Refresh action to first create the target tables in ADW and to populate them with the current contents of the source tables.  As the Refresh action proceeds we can monitor progress:

:Oracle Autonomous DW:: 6.

Now with the two systems in sync we can start the process of real-time data integration using the HVR Initialise action. This creates two new jobs in  the HVR Scheduler which then need to be started:

:Oracle Autonomous DW::

One more thing to do of course: test that the channel is working and replications is happening in real-time. We applied a series of inserts, updates and deletes to the source system and monitored the log files for the two scheduled jobs to see the activity captured from the redo logs on the source:

:Oracle Autonomous DW::

and then applied as new transactions on the target:

:Oracle Autonomous DW::

The HVR Compare action allows us to confirm that the source and target are still in sync.

:Oracle Autonomous DW::

Clearly the scenario we are testing here is a simple one. HVR can do much more - supporting one-to-many, many-to-many and also bi-directional replication configurations. Nonetheless we were impressed with how easy it was to install and configure HVR and also with the simplicity of executing actions and monitoring the channel through the GUI. We dipped in to using the command line interface when executing some of the longer running jobs and this was straightforward too.

Categories: BI & Warehousing

Tableau | Dashboard Design ::Revoke A50 Petition Data::

Mon, 2019-09-02 03:00

Dashboards are most powerful through visual simplicity. They’re designed to automatically keep track of a specific set of metrics and keep human beings updated. Visual overload is like a binary demon in analytics that many developers seem possessed by; but less is more.

For example, many qualified drivers know very little about their dashboard besides speed, revs, temperature and fuel gauge. When an additional dash warning light comes on, even if it is just the tyre pressure icon let alone engine diagnostics light, most people will just take their car to the garage. The most obvious metrics in a car are in regard to its operation; if you didn't know your speed while driving you'd feel pretty blind. The additional and not so obvious metrics (i.e. dash warning lights) are more likely to be picked up by the second type of person who will spend the most time with that car: its mechanic. It would be pointless to overload a regular driver with all the data the car can possibly output in one go; that would just intimidate them. That's not what you want a car to do to the driver and that's certainly not what any organisation would want their operatives to feel like while their “car” is moving.

In light of recent political events, the exact same can metaphorically be applied to the big red Brexit bus. Making sense of it all might be a stretch too far for this article. Still, with appropriate use of Tableau dashboard design it is possible to answer seemingly critical questions on the topic with publicly available data.

There's An Ongoing Question That Needs Answering?
Where did 6 million+ signatures really come from?

Back in the UK, the Brexit fiasco is definitely still ongoing. Just before the recent A50 extensions took place, a petition to revoke article 50 and remain in the EU attracted more than 6 million signatures, becoming the biggest and fastest growing ever in history and sparking right wing criticism over the origin of thousands of signatures, claiming that most came from overseas and discrediting its legitimacy. Government responded by rejecting the petition.

Thankfully the data is publicly available (https://petition.parliament.uk/petitions/241584.json) for us to use as an example of how a dashboard can be designed to settle such a question (potentially in real time too as more signatures come in).

Tableau can handle JSON data quite well and, to nobody’s surprise, we quickly discover that over 95% of signatures are coming from the UK.

Now that we know what we're dealing with, lets focus the map on Britain and provide additional countries data in a format that is easier to digest visually. As cool as it is to hover over the world map, there's simpler ways to take this in.

Because in this case we know more than 95% of signatures originate from the UK, the heatmap above is far more useful, showing us the signature count for each constituency at a glance. The hotter the shading, the higher the count.

Scales Might Need Calibration
Bar Chart All The Way

Humans of all levels compute a bar chart well and it's perfect for what we need to know on how many signatures are coming from abroad altogether and from what countries in descending order.

With a margin so tiny, it's trickier to get a visual that makes sense. A pie chart, for example, would hardly display the smaller slice containing all of the non-UK origin signatures. Even with a bar chart we are struggling to see anything outside of the UK in a linear scale; but it is perfect if using logarithmic scales, which are definitely a must in this scenario.

And voila! The logarithmic scale allows the remaining counts to appear alongside the UK, even though France, the next country after the UK with most signatures, has a count below 50k. This means we can keep an eye on the outliers in more detail quite effortlessly. Not much looks out of place right now considering the number of expats Britain produces to the countries on the list. Now we know, as long as none of the other countries turn red, we have nothing to worry about!

Innovate When Needed

The logarithmic scale in Tableau isn't as useful for these %, so hacking the visualised values in order to amplify the data sections of interest is a perfectly valid way of thinking outside the box. In this example, half the graph is dedicated to 90-100% and the other half 0-90%. The blue chunk is the percentage of signatures coming from the UK, while every other country colour chunk is still so small. Since the totals from other countries are about the same as each mainland constituency, it's more useful to see it as one chunk. Lastly, adding the heat colour coding keeps the visual integrity.


Now that we have the count, percentage and location breakdown into 3 simple graphs we feel much wiser. So it's time to make them interact with each other.

The constituency heatmap doesn't need to interact with the bar charts. The correlation between the hottest bars and the heatmap is obvious from the get go, but if we were to filter the bars using the map, the percentages would be so tiny you wouldn't see much on the % graph. The same occurs for the Country bar chart, meaning that only the percentage chart can be usefully used as a filter. Selecting the yellow chunk will show the count of signatures for every country within it only.

Another way in which interactivity can be introduced is through adding further visualisations to the tooltip. The petition data contains the MP responsible for each constituency, so we can effectively put a count of signatures to each name. It's nice to be able to see what their parliamentary voting record has been throughout this Brexit deadlock, which was obtained publicly from the House of Commons portal https://commonsvotes.digiminster.comand blended in; as more votes come in, the list will automatically increase.

Keep It Simple

As you can see, 3 is a magic number here. The trio of visuals working together makes a dashing delivery of intel to the brain. With very little effort, we can see how many signatures come from the UK compared to rest of the world, how many thousands are coming from each country, how many from each constituency, who the MP you should be writing to is and how they voted in the indicative votes. Furthermore, this dashboard can keep track of all of that in real time, flagging any incoming surge of signatures from abroad, continuously counting the additional signatures until August 2019 and providing a transparent record of parliamentary votes in a format that is very easy to visually digest.

Categories: BI & Warehousing

Kafka | IoT Ecosystem ::Cluster; Performance Metrics; Sensorboards & OBD-II::

Wed, 2019-08-28 04:30
:Cluster; Performance Metrics; Sensorboards & OBD-II::

Infrastructure is the place to start and the keyword here is scalability. Whether it needs to run on premise, on cloud or both, Kafka makes it possible to scale at low complexity cost when more brokers are either required or made redundant. It is also equally easy to deploy nodes and nest them in different networks and geographical locations. As for IoT devices, whether it’s a taxi company, a haulage fleet, a racing team or just a personal car, Kafka can make use of the existing vehicle OBDII port using the same process; whether it’s a recording studio or a server room packed with sensitive electronic equipment and where climate control is critical, sensorboards can be quickly deployed and stream almost immediately into the same Kafka ecosystem. Essentially, pretty much anything that can generate data and touch python will be able to join this ecosystem.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

In large data centres it is fundamental to keep a close eye on misbehaving nodes, possibly overheating, constantly failing jobs or causing unexpected issues. Fires can occur too. This is quite a challenge with thousands and thousands of nodes. Though, Kafka allows for all of the node stats to individually stream in real time and get picked up by any database or machine, using Kafka Connect or kafka-python for consumption.

To demonstrate this on a smaller scale with a RaspberryPi 3 B+ cluster and test a humble variety of different conditions, a cluster of 7 nodes, Pleiades, was set up. Then, to make it easier to identify them, each computer was named after the respective stars of the Pleiades constellation.

  • 4 nodes {Alcyone; Atlas; Pleione; Maia} in a stack with cooling fans and heatsinks
:Cluster; Performance Metrics; Sensorboards & OBD-II::

  • 1 node in metal case with heatsink {Merope}
:Cluster; Performance Metrics; Sensorboards & OBD-II::

  • 1 node in plastic case {Taygeta}
:Cluster; Performance Metrics; Sensorboards & OBD-II::

  • 1 node in touchscreen plastic case {Electra}
:Cluster; Performance Metrics; Sensorboards & OBD-II::::Yes. It's a portable Retropie, Kafka broker & perfect for Grafana dashboards too::

Every single node has been equipped with the same python Kafka-producer script, from which the stream is updated every second in real-time under 1 topic, Pleiades. Measures taken include CPU-Percentage-%, CPU-Temperature, Total-Free-Memory, Available-System-Memory, CPU-Current-Hz.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

Kafka then connects to InfluxDB on Pleione, which can be queried using the terminal through a desktop or android SSH client. Nothing to worry about in terms of duplication, load balancing or gaps in the data. Worst case scenario InfluxDB, for example, crashes and the data will still be retrievable using KSQL to rebuild gap in DB depending on the retention policy set.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

We can query InfluxDB directly from the command line. The Measure (InfluxDB table) for Pleiades is looking good and holding plenty of data for us to see in Grafana next.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

A live feed is then delivered with Grafana dashboards. It's worth noting how mobile friendly these dashboards really are.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

At a glance, we know the critical factors such as how much available memory there is and how much processing power is being used, for the whole cluster as well as each individual node, in real time and anywhere in the world (with an internet connection).

It has then been observed that the nodes in the stack remain fairly cool and stable between 37 °C and 43 °C, whereas the nodes in plastic cases around 63 °C. Merope is in the metal casing with a heatsink, so it makes sense to see it right in the middle there at 52 °C. Spikes in temperature and CPU usage are directly linked to running processes. These spikes are followed by software crashes. Moving some of the processes from the plastic enclosures over to the stack nodes stopped Grafana from choking; this was a recurring issue when connecting to the dashboards from an external network. Kafka made it possible to track the problem in real time and allow us to come up with a solution much quicker and effortlessly; and then immediately also track if that solution was the correct approach. In the end, the SD cards between Electra and Pleione were quickly swapped, effectively moving Pleione to the fan cooled stack where it was much happier living.

If too many spikes begin to occur, we should expect for nodes to soon need maintenance, repair or replacement. KSQL makes it possible to tap into the Kafka Streams and join to DW stored data to forecast these events with increased precision and notification time. It's machine-learning heaven as a platform. KSQL also makes it possible to join 2 streams together and thus create a brand new stream, so to add external environment metrics and see how they may affect our cluster metrics, a sensor board on a RaspberryPi Zero-W was setup producing data into our Kafka ecosystem too.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

To keep track of the room conditions where the cluster sits, an EnviroPhat sensor board is being used. It measures temperature, pressure, colour and motion. There are many available sensorboards for SBCs like RaspberryPi that can just as easily be added to this Kafka ecosystem. Again, important to emphasize both data streams and dashboards can be accessed from anywhere with an internet connection.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

OBDII data from vehicles can be added to the ecosystem just as well. There are a few ways this can be achieved. The most practical, cable free option is with a Bluetooth ELM327 device. This is a low cost adaptor that can be purchased and installed on pretty much any vehicle after 1995. The adaptor plugs into the OBDII socket in the vehicle, connects via Bluetooth to a Pi-Zero-W, which then connects to a mobile phone’s 4G set up as a wi-fi hotspot. Once the data is flowing as far as needing a Kafka topic, the create command is pretty straight forward.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

With the obd-producer python script running, another equivalently difficult command opens up the console consumer for the topic OBD in Alcyone, and we can check if we have streams and if the OBD data is flowing through Kafka. A quick check on my phone reveals we have flow.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

To make things more interesting, the non-fan nodes in plastic and metal enclosures {Taygeta; Electra; Merope} were moved to a different geographical location and setup under a different network. This helps network outages and power cuts become less likely to affect our dashboard services or ability to access the IoT data. Adding cloud services to mirror this setup at this point would make it virtually bulletproof; zero point of failure is the aim of the game. When the car is on the move, Kafka is updating InfluxDB + Grafana in real time, and the intel can be tracked live as it happens from a laptop, desktop or phone from anywhere in the world.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

In a fleet scenario, harsh braking could trigger a warning and have the on-duty tracking team take immediate action; if the accelerometer spikes as well, then that could suggest an accident may have just occurred or payload checks may be necessary. Fuel management systems could pick up on driving patterns and below average MPG performance, even sense when the driver is perhaps not having the best day. This is where the value of Kafka in IoT and the possibilities of using ML algorithms really becomes apparent because it makes all of this possible in real time without a huge overhead of complexity.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

After plugging in the OBDII bluetooth adapter to the old e92-335i and driving it for 20 minutes, having it automatically stream data over the internet to the kafka master, Alcyone, and automatically create and update an OBD influxdb measure in Pleione, it can quickly be observed in Grafana that it doesn't enjoy idling that much; the coolant and intake air temperature dropped right down as it started moving at a reasonable speed. This kind of correlation is easier to spot in time series Grafana dashboards whereas it would be far less intuitive with standard vehicle dashboards that provide only current values.

:Cluster; Performance Metrics; Sensorboards & OBD-II::

So now that a real bare-metal infrastructure exists - and it’s a self-monitoring, low power consumption cluster, spread across multiple geographical locations, keeping track of enviro-sensor producers from multiple places/rooms, logging all vehicle data and learning to detect problems as far ahead as possible - adding sensor data pickup points to this Kafka ecosystem is as simple as its inherent scalability. As such, with the right Kafka-Fu, pretty much everything is kind of plug-&-play from this point onwards, meaning we can now go onto connecting, centralising and automating as many things in life as possible that can become IoT using Kafka as the core engine under the hood.

:Cluster; Performance Metrics; Sensorboards & OBD-II::
Categories: BI & Warehousing

OAC Row Limits and Scale Up or Down

Wed, 2019-08-28 04:26
OAC Row Limits and Scale Up or Down

I created an OAC instance the other day for some analysis in preparation of my OOW talk, and during the analytic journey I reached the row limit with the error Exceeded configured maximum number of allowed input records.

OAC Row Limits and Scale Up or Down

Since a few releases back, each OAC instance has fixed row limits depending by the number of OCPU assigned that can be checked in the related documentation, with the current ones shown in the table below.

OAC Row Limits and Scale Up or Down

If you plan using BI Publisher (included in OAC a few versions ago) check also the related limits.

OAC Row Limits and Scale Up or Down

Since in my analytical journey I reached the row limit, I wanted to scale up my instance, but surprise surprise, the Scale Up or Down option wasn't available.

OAC Row Limits and Scale Up or Down

After some research I understood that Scaling Up&Down is available only if you chose originally a number of OCPUs greater than one. This is in line with Oracle's suggestion to use 1 OCPU only for non-production instances as stated in the instance creation GUI.

OAC Row Limits and Scale Up or Down

When choosing originally an OAC instance with 4 OCPUs the Scale Up/Down option becomes available (you need to start the instance first).

OAC Row Limits and Scale Up or Down

When choosing the scale option, we can decide whether to increase/decrease the number of OCPUs.

OAC Row Limits and Scale Up or Down

Please note that we could have limited choice in the number of OCPUs we can increase/decrease by depending on the availability and current usage.

Concluding, if you want to be able to Scale Up/Down your OAC instances depending on your analytic/traffic requirements, always start your instance with a number of OCPUs greater than one!

Categories: BI & Warehousing

Rittman Mead at Oracle OpenWorld 2019

Fri, 2019-08-23 07:52
Rittman Mead at Oracle OpenWorld 2019

Oracle OpenWorld is coming soon! 16th-20th September in Moscone Center, San Francisco. It's Oracle's biggest conference and I'll represent Rittman Mead there with the talk "Become a Data Scientist"  exploring how Oracle Analytics Cloud can speed any analyst path to data science. If you are an analyst looking to move your first steps in data-science or a manager trying to understand how to optimize your business analytics workforce, look no further, this presentation is your kickstarter!

Rittman Mead at Oracle OpenWorld 2019

To have an introduction to the topic have a look at my blog post series episodes I, II and III.

If you'll be at OOW2019 and you see me around, don't hesitate to stop me! I’d be pleased to speak with you about OAC, Analytics, ML, and more important topics like food or wine as well!

Categories: BI & Warehousing

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

Wed, 2019-08-07 08:19
Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIsHow much did I spend so far? Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

It's a typical question we ask ourselves daily and what do we do? Probably check the bank account status on our phone and yell at ourselves for all the money we trashed the previous night at the pub.

The Cloud

One of the great benefits of the cloud is that there is no big upfront cost required to start playing with the latest tool or technology, we just need to fill in a few forms, write down the credit card details and there we go! However, the cloud doesn't mean free: most of the times we pay based on resource and time consumption and things can become pretty expensive if we don't manage our resources wisely.

The main Oracle Cloud Dashboard offers a preview of the Month to Date Cost and by clicking on it, we can easily understand the cost per product. Like in the example below we spend £322.8 month to date and precisely £262.80 on Oracle Analytics Classic.

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

We can go another step down the line: if for example, we have multiple versions of the same product, we'll see a line for each version or licensing method. In our case, all the money comes from a single B88303 - OAC Enterprise Edition OCPU per Hour product with an overall 60 hours of uptime (OAC billing is per hour).  

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

However, this requires a manual login into the Oracle Cloud to check the balance, which is not optimal if we want to display this information in external tools or automate part of the cost-checking procedures. Fortunately, we can retrieve the same information with Oracle Cloud Account Metering REST APIs.

Oracle Cloud Account Metering REST APIs

Oracle Cloud Account Metering REST APIs expose a lot of useful information about our Oracle Cloud account via REST APIs. We can, for example, check our subscription details, credit promotions, resource usage, cost and quotas. All we need to test the REST APIs is cURL, a command-line utility for sending HTTP requests. The syntax to retrieve the data is



  • <TENANT_ID> is the identity domain ID, you can find it under the Oracle Analytics Cloud -> Overview
Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs
  • <BASE_PATH> is the base URI of the resource, e.g. /metering/api/v1
  • <RESOURCE_NAME> is the name of the specific resource we are requesting
Checking the Cost

If, as per the example below, we want to understand the cost, we simply need to call the usagecost resource passing the <ACCOUNT_ID> parameter which can be found in the Overview page of every service we already have in our account.

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

The basic cURL command to check the cost then becomes the following


Where on top of the parameters defined above we have

  • <START_TIME> and <END_TIME> with the format YYYY-MM-DDTHH:mm:sssZ e.g. 2019-08-01T00:00:00.000
  • <TIMEZONE> we specify which timezone to use for the date and time filter

So if like before, we're aiming to understand the cost from the beginning of the month, our suffix becomes


The result is in JSON format which we can easily parse the result with the command line tool jq.

curl -X GET ... | jq '.'

The output is

  "accountId": "<ACCOUNT_ID>",
  "items": [
      "subscriptionId": "...",
      "subscriptionType": "PRODUCTION",
      "serviceName": "ANALYTICS",
      "resourceName": "ANALYTICS_EE_PAAS_ANY_OCPU_HOUR",
      "currency": "GBP",
      "gsiProductId": "B88303",
      "startTimeUtc": "2019-08-01T00:00:00.000",
      "endTimeUtc": "2019-08-10T23:00:00.000",
      "serviceEntitlementId": "...",
      "costs": [
          "computedQuantity": 60,
          "computedAmount": 262.8,
          "unitPrice": 4.38,
          "overagesFlag": "Y"
  "canonicalLink": "/metering/api/v1/usagecost/<ACCOUNT_ID>?timeZone=Europe%2FRome&startTime=2019-08-01T00%3A00%3A00.000Z&endTime=2019-08-10T23%3A59%3A00.000Z"

As expected, we get, within the items section, an entry for every product and license type we have used. In our case we have the "serviceName": "ANALYTICS", with the Enterprise Edition option billed per hour ("resourceName": "ANALYTICS_EE_PAAS_ANY_OCPU_HOUR") and we used it for 60 hours with a unit price of £4.38 for a total amount of £262.8 perfectly in line with what we see in the webpage.

We can further filter our query using one of the following parameters:

  • computeType: the nature of the cost (Usage, Overcharge...)
  • datacenter: the datacenter for which cost needs to be retrieved
  • dcAggEnabled: to roll up the cost by datacenter
  • resourceName: the type of resource billing (e.g. ANALYTICS_EE_PAAS_ANY_OCPU_HOUR)
  • serviceEntitlementId: the Id of the service, can be found in the Overview page
Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs
  • serviceName: the name of the service e.g. ANALYTICS
  • usageType: the type of usage we want to be reported either TOTAL, HOURLY or DAILY

Unfortunately, none of the above filters allows us to check the cost associated with a precise instance of the service. If, for example, we have two instances with the same edition and type of billing, we can't determine, with the above call, what the cost associated to each of the two instances is since it's rolled up and instance type level. But we're not alone! We can achieve more granularity in the billing metrics by using the /tagged REST API and properly performing instance tagging on our services.

Instance Tagging

We can group instances of various services with Tags. Tags are labels that we can attach to an instance to group them based on our company rules. Oracle allows two types of tagging for resources: free-form and defined.

With free-form tagging we can append any key-value label to our instances, e.g. we may want to tag an instance as Environment:Training with Environment being the key and Training being the label.

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

The problem with free-form tagging is that we don't have control of which tag keys get associated to a certain resource and it's an error-prone method since we have to type a key and value every time (and they're not visible for cost-tracking).

If instead, we want to use a more controlled approach to tagging, we can then go for the defined tagging: while with free-form anyone was able to associate any key or value, with define tagging we create a namespace which will contain a set of tag keys.

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

Once created the namespace, we can then create the set of keys within it. In this case, we create two pre-defined keys Environment and Project, please note that we flagged the COST-TRACKING checkbox to be able to use the tags with the Oracle Cloud Account Metering APIs.

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

Please be aware that there are limits on the number of namespaces, of tags per resource and of cost-tracking tags which are available under the tagging documentation.

Now it's time to attach the defined tags to our instances, we can do so in the web UI during instance creation or after by selecting "Add Tags". More information under the related documentation.

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

After we added the tags marked for cost-tracking to our instances we may have to wait up to 5 hours to see them in the "My Services" or via the REST APIs.

Querying Tagged Resources

There is an API within the Oracle Cloud Account Metering REST APIs which allows to querying the cost associated with tagged resources. The call is very similar to the one we used above, with the additional tagged prefix and tags=.... parameter. Taking the example above, if we can see the consumption associated with instances tagged as Operations:Project=Training then the call is the following


And the result is

  "accountId": "<ACCOUNT_ID>",
  "items": [
      "subscriptionId": "...",
      "subscriptionType": "PRODUCTION",
      "serviceName": "ADWC",
      "resourceName": "ADWC_PAAS_BYOL_OCPU_HOUR",
      "currency": "GBP",
      "gsiProductId": "B89039",
      "startTimeUtc": "2019-08-01T00:00:00.000",
      "endTimeUtc": "2019-08-10T23:00:00.000",
      "serviceEntitlementId": "...",
      "costs": [
          "computedQuantity": 23.0,
          "computedAmount": 8.06235468,
          "unitPrice": 0.35053716,
          "overagesFlag": "N"
  "canonicalLink": "/metering/api/v1/usagecost/<ACCOUNT_ID>/tagged?timeZone=UTC&startTime=2019-08-01T00%3A00%3A00.000Z&endTime=2019-08-10T23%3A59%3A00.000Z&usageType=TOTAL&tags=operations%3AProject%3DTraining"

A usage of ADWC for 23 hours for a total of £8.06 which is also visible from the My Services webpage.

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs

Appending the following jq command to the cURL call also displays the relevant information like serviceName, and cost details as separate columns

jq --raw-output '.items[] | "\(.serviceName)\t\(.subscriptionType)\t\(.resourceName)\t\(.currency)\t\(.costs[].computedAmount)\t\(.costs[].computedQuantity)\t\(.costs[].unitPrice)\t\(.costs[].overagesFlag)"' 

And the result is


Oracle Cloud Account Metering REST APIs offer an easy way to expose the Oracle cloud usage and cost externally. Used smartly in conjunction with instance tagging they provide a way to ensure cost and usage tracking down to the single resource or project.

If on the other way, the integration with REST APIs is not what you need, but you're looking into ways of getting notified when you're spending too much, check out the Alerts section of Managing and Monitoring Oracle Cloud.

Edit: You can also download your billing information as CSV from Oracle Cloud web-ui as per screenshot below which is very handy for one-off analysis. If, on the other side, you want to automate the export of billing information, then Oracle Cloud Account Metering REST APIs is the way to go!

Where is My Money Going? Checking the balance with Oracle Cloud Account Metering REST APIs
Categories: BI & Warehousing

Spatial Analytics Made Easy: Oracle Spatial Studio

Fri, 2019-07-26 10:31
 Oracle Spatial Studio

Let's say we need to understand where our company needs to open a new shop. Most of the time the decision is driven by gut feeling and some knowledge of the market and client base, but what if we could have visual insights about where are the high density zones with customers not covered by a shop nearby like in the map below?

 Oracle Spatial Studio

Well... welcome Oracle Spatial Studio!

Spatial Studio is Oracle's new tool for creating spatial analytics with a visual GUI. It uses Oracle Spatial database functions in the backen d exposed with an interface in line with the Oracle Analytics Cloud one. Let's see how it works!

QuickStart Installation

First of all we need to download Spatial Studio from the Oracle web page, for this initial test I downloaded the "Quick Start", a self contained version pre-deployed in a lightweight application server. For more robust applications you may want to download the EAR file deployable in Weblogic.

 Oracle Spatial Studio

Once downloaded and unzipped the file, we just need to verify we have a Java JDK 8 (update 181 or higher) under the hood and we can immediately start Oracle Spatial Studio with the ./start.sh command.

The command will start the service on the local machine that can be accessed at https://localhost:4040/spatialstudio. By default Oracle Spatial Studio Quickstart uses HTTPS protocol with self-signed certificates, thus the first time you access the URL you will need to add a security exception in your browser. The configurations such as port, JVM parameters, host and HTTP/HTTPS protocol can be changed in the conf/server.json file.

We can then login with the default credentials admin/welcome1

 Oracle Spatial Studio

The first step in the Spatial Studio setup is the definition of the metadata connection type. This needs to point to an Oracle database with the spatial option. For my example I initially used an Oracle Autonomous Data Warehouse, for which I had to drop the wallet and specify the schema details.

 Oracle Spatial Studio

Once logged in, the layout and working flows are very similar to Oracle Analytics Cloud making the transition between the two very easy (more details on this later on). In the left menu we can access, like in OAC, Projects (visualizations), Data, Jobs and the Console.

 Oracle Spatial Studio

In order to do Spatial Analysis we need to start from a Dataset, this can be existing tables or views, or we can upload local files. To create a Dataset, click on Create and Dataset

 Oracle Spatial Studio

We have then three options:

  • Upload a Spreadsheet containing spatial information (e.g. Addresses, Postcodes, Regions, Cities etc)
  • Upload a Shapefile containing geometric locations and associated attributes.
  • Use spatial data from one of the existing connections, this can point to any connection containing spatial information (e.g. a table in a database containing customer addresses)
 Oracle Spatial StudioSample Dataset with Mockaroo

I used Mockaroo, a realistic data generator service, to create two excel files: one containing customers with related locations and a second one with shops and related latitude and longitude. All I had to do was to select which fields I wanted to include in my file and the related datatype.

 Oracle Spatial Studio

For example, the list of shop dataset contained the following columns:

  • Id: as row number
  • Shop Name: as concatenation of Shop and the Id
  • Lat: Latitude
  • Long: Longitude
  • Dept: the Department (e.g. Grocery, Books, Health&Beauty)

Mockaroo offers a perfect service and has a free tier of datasets with less than 1000 rows which can be useful for demo purposes. For each column defined, you can select between a good variety of column types. You can also define your own type using regular expressions!

 Oracle Spatial StudioAdding the Datasets to Oracle Spatial Studio

Once we have the two datasources in Excel format, it's time to start playing with Spatial Studio. We first need to upload the datasets, we can do it via Create and Dataset. Starting with the Customer.xlsx one. Once selected the file to upload Spatial Studio provides (as OAC) an overview of the dataset together with options to change configurations like dataset name, target destination (metadata database) and column names.

 Oracle Spatial Studio

Once modified the table name to TEST_CUSTOMERS and clicked on Submit Spatial Studio starts inserting all the rows into the SPATIAL_STUDIO connection with a routine that could take seconds or minutes depending on the dataset volume. When the upload routine finishes I can see the TEST_CUSTOMERS table appearing in the list of datasets.

 Oracle Spatial Studio

We can immediately see the yellow warning sign next to the dataset name, it's due to the fact that we have a dataset with no geo-coded information, we can solve this problem by clicking on the option button and then Prepare and Geocode Addresses

 Oracle Spatial Studio

Oracle Spatial Studio will suggest, based on the column content, some geo-type matching e.g. City Name, Country and Postal Code. We can use the defaults or modify them if we feel they are wrong.

 Oracle Spatial Studio

Once clicked on Apply the geocoding job starts.

 Oracle Spatial Studio

Once the job ends, we can see the location icon next to our dataset name

 Oracle Spatial Studio

We can do the same for the Shops.xlsx dataset, starting by uploading it and store it as TEST_SHOPS dataset.

 Oracle Spatial Studio

Once the dataset is uploaded I can geo-locate the information based on the Latitude and Longitude, I can click on the option button and the selecting Prepare and Create Lon/Lat Index. Then I'll need to assign the Longitude and Latitude column correctly and click on Ok.

 Oracle Spatial StudioSpatial Analytics

Now it's time to do some Spatial Analysis so I can click on Create and Project and I'll face an empty canvas by default

 Oracle Spatial Studio

The first step is to add a Map, I can do that by selecting the visualizations menu and then dragging the map to the canvas.

 Oracle Spatial Studio

Next step is to add some data by clicking on Data Elements and then Add Dataset

 Oracle Spatial Studio

I select the TEST_CUSTOMERS dataset and add it to the project, then I need to drag it on top of the map to visualize my customer data.

 Oracle Spatial Studio

Oracle Spatial Studio Offers several options to change the data visualizations like color, opacity, blur etc.

 Oracle Spatial Studio

Now I can add the TEST_SHOPS dataset and visualize it on the map with the same set of steps followed before.

 Oracle Spatial Studio

It's finally time for spatial analysis! Let's say, as per initial example, that I want to know which of my customers doesn't have any shops in the nearest 200km. In order to achieve that I need to first create buffer areas of 200km around the shops, by selecting the TEST_SHOPS datasource and then clicking on the Spatial Analysis.

 Oracle Spatial Studio

This will open a popup window listing a good number of spatial analysis, by clicking on the Transform tab I can see the Add a buffer of a specified distance option.

 Oracle Spatial Studio

Unfortunately the buffer function is not available in ADW at the moment.

 Oracle Spatial Studio

I had to rely on an Oracle Database Cloud Service 18c Enterprise Edition - High Performance (which includes the Spatial option) to continue for my metadata storage and processing. Few Takeaways:

  • Select 18c (or anything above 12.2): I hit an issue ORA-00972: identifier is too long when importing the data in a 12.1 Database, which (thanks StackOverflow) is fixed as of 12.2.
  • High Performance: This includes the Spatial Option

Once I used the DBCS as metadata store, I can finally use the buffer function and set the parameter of 200km around the shops.

 Oracle Spatial Studio

The TEST_SHOPS_BUFFER is now visible under Analysis and can be added on top of the Map correctly showing the 200km buffer zone.

 Oracle Spatial Studio

I can understand which customers have a shop in the nearest 200k by creating an analysis and select the option "Return shapes within a specified distance of another"

 Oracle Spatial Studio

In the parameters I can select the TEST_CUSTOMERS as Layer to be filtered, the TEST_SHOPS as the Layer to be used as filter and the 200Km as distance.

 Oracle Spatial Studio

I can then visualize the result by adding the TEST_CUSTOMERS_WITHIN_DISTANCE layer in the map.

 Oracle Spatial Studio

TEST_CUSTOMERS_WITHIN_DISTANCE contains the customers already "covered" by a shop in the 200km range, what I may want to do now is remove them from my list of customers in order to do analysis on the remaining ones, how can I do that? Unfortunately in the first Spatial Studio version there is no visual way of doing DATASET_A MINUS DATASET_B but, hey, it's just the first incarnation and we can expect that type of functions and many others to be available in future releases!

The following paragraph is an in-depth analysis in the database of functions that will probably be exposed in Spatial Studio's future version, so if not interested, progress directly to the section named "Progressing in the Spatial Analysis".

A Look in the Database

Since we want to achieve our goal of getting the customers not covered by a shop now, we need to look a bit deeper where the data is stored: in the database. This gives us two opportunities: check how Spatial Studio works under the covers and freely use SQL to achieve our goals (DATASET_A MINUS DATASET_B).

First let's have a look at the tables created by Spatial Studio: we can see some metadata tables used by studio as well as the database representation of our two excel files TEST_CUSTOMERS and TEST_SHOPS.

 Oracle Spatial Studio

Looking in depth at the metadata we can also see a table named SGTECH$TABLE followed by an ID. That table collects the information regarding the geo-coding job we executed against our customers dataset which were located starting from zip-codes and addresses. We can associate the table to the TEST_CUSTOMERS dataset with the following query against the SGTECH_OBJECTS metadata table.

  JSON_VALUE(data, '$.gcHelperTableName') DATASET  
 Oracle Spatial Studio

The SGTECH$TABLEA004AA549110B928755FC05F01A3EF89 table contains, as expected, a row for each customer in the dataset, together with the related geometry if the geo-coding was successful and some metadata flags like GC_ATTEMPTED, GC_STATUS and GC_MATCH_CODE stating the accuracy of the geo-coding match.

 Oracle Spatial Studio

What about all the analysis like the buffer and the customers within distance? For each analysis Spatial Studio creates a separate view with the SGTECH$VIEW prefix followed by an ID.

 Oracle Spatial Studio

To understand which view is referring to which analysis we need to query the metadata table SGTECH_OBJECTS with a query like

  JSON_VALUE(data, '$.tableName') DATASET  

With the following result

 Oracle Spatial Studio

We know then that the TEST_CUSTOMERS_WITHIN_DISTANCE can be accessed by the view SGTECH$VIEW0B2B36785A28843F74B58B3CCF1C51E3 and when checking its SQL we can clearly see that it executes the SDO_WITHIN_DISTANCE function using the TEST_CUSTOMERS.GC_GEOMETRY, the TEST_SHOPS columns LONGITUDE and LATITUDE and the distance=200 unit=KILOMETER parameters we set in the front-end.

CREATE OR replace force editionable view "SPATIAL_STUDIO"."SGTECH$VIEW0B2B36785A28843F74B58B3CCF1C51E3"
    "TEST_CUSTOMERS"   "t1",
    "TEST_SHOPS"       "t2"
    spatial_studio.sgtech_ptf("t2"."LONGITUDE", "t2"."LATITUDE"), 
    'distance=200 unit=KILOMETER'
    ) = 'TRUE';

Ok, we now understood which view contains the data, thus we can create a new view containing only the customers which are not within the 200km distance with

    t1.id            AS id,
    t1.first_name    AS first_name,
    t1.last_name     AS last_name,
    t1.email         AS email,
    t1.gender        AS gender,
    t1.postal_code   AS postal_code,
    t1.street        AS street,
    t1.country       AS COUNTRY,
    t1.city          AS city,
    t1.studio_id     AS studio_id,
    t1.gc_geometry   AS gc_geometry
    test_customers t1
    id NOT IN (
Progressing in the Spatial Analysis

In the previous paragraph we created a view in the database named TEST_CUSTOMERS_NOT_WITHIN_DISTANCE containing the customer without a shop in a 200km radius. We can now import it into Spatial Studio by creating a new dataset, selecting the connection to the database (in our case named SPATIAL_STUDIO) as source and then the newly created TEST_CUSTOMERS_NOT_WITHIN_DISTANCE view.

 Oracle Spatial Studio

The dataset is added, but it has a yellow warning icon next to it

 Oracle Spatial Studio

Spatial Studio requests us to define a primary key, we can do that by accessing the properties of the dataset, select the Columns tab, choosing which column acts as primary key and validate it. After this step I can visualize this customer in a map.

 Oracle Spatial Studio

What's next? Well If I want to open a new shop, I may want to do that where there is a concentration of customers, which is easily visualizable with Spatial Studio by changing the Render Style to Heatmap.

 Oracle Spatial Studio

With the following output

 Oracle Spatial Studio

We can clearly see some major concentrations around Dallas, Washington and Minneapolis. Focusing more on Dallas, Spatial Studio also offers the option to simulate a new shop in the map and calculate the 200km buffer around it. I can clearly see that adding a shop halfway between Oklahoma City and Dallas would allow me to cover both clients within the 200km radius.

 Oracle Spatial Studio

Please remember that this is a purely demonstrative analysis, and some of the choices, like the 200km buffer are expressly simplistic. Other factors could come into play when choosing a shop location like the revenue generated by some customers. And here it comes the second beauty of Oracle Spatial Studio, we can export datasets as GeoJSON or CSV and include them in Data Visualization.

 Oracle Spatial Studio

For example I can export the data of TEST_CUSTOMERS_NOT_WITHIN_DISTANCE from Spatial Studio and include then in a Data Visualization Project blending them with the Sales related to the same customers.

 Oracle Spatial Studio

I can now focus not only on the customer's position but also on other metrics like Profit or Sales Amount that I may have in other datasets. For another example of Oracle Spatial Studio and Data Visualization interoperability check out this video from Oracle Analytics Senior Director Philippe Lions.


Spatial analytics made easy: this is the focus of Oracle Spatial Studio. Before spatial queries were locked down at database level with limited access from an analyst point of view. Now we have a visual tool with a simple GUI (in line with OAC) that easily enables spatial queries for everybody!

But this is only the first part of the story: the combination of capabilities achievable when mixing Oracle Spatial Studio and Oracle Analytics Cloud takes any type of analytics to the next level!

Categories: BI & Warehousing

KSQL in Football: FIFA Women’s World Cup Data Analysis

Mon, 2019-07-22 02:53

One of the football (as per European terminology) highlights of the summer is the FIFA Women’s World Cup. France, Brazil, and the USA are the favourites, and this year Italy is present at the event for the first time in 20 years.

From a data perspective, the World Cup represents an interesting source of information. There's a lot of dedicated press coverage, as well as the standard social media excitement following any kind of big event.

The idea in this blog post is to mix information coming from two distinct channels: the RSS feeds of sport-related newspapers and Twitter feeds of the FIFA Women’s World Cup. The goal will be to understand how the sentiment of official news related to the two teams involved in the final compares to that of the tweets.

In order to achieve our targets, we'll use pre-built connectors available in Confluent Hub to source data from RSS and Twitter feeds, KSQL to apply the necessary transformations and analytics, Google’s Natural Language API for sentiment scoring, Google BigQuery for data storage, and Google Data Studio for visual analytics.

Data sources

The beginning of our journey starts with connecting to various data sources. Twitter represents the default source for most event streaming examples, and it's particularly useful in our case because it contains high-volume event streaming data with easily identifiable keywords that can be used to filter for relevant topics.

Ingesting Twitter data

Ingesting Twitter data is very easy with Kafka Connect, a framework for connecting Kafka with external systems. Within the pre-built connectors we can find the Kafka Connect Twitter, all we need to do is install it using the Confluent Hub client.

confluent-hub install jcustenborder/kafka-connect-twitter:latest

To start ingesting the Twitter data, we need to create a configuration file containing the following important bits:

  • filter.keywords: We need to list all the keywords we are interested in, separated by a comma. Since we want to check tweets from the FIFA Women’s World Cup, we’ll use FIFAWWC, representing both the World Cup Twitter handle and the most common related hashtag.
  • kafka.status.topic: This topic that will be used to store the tweets we selected. twitter_avro: This is because the connector output format is AVRO.
  • twitter.oauth: This represents Twitter credentials. More information can be found on the Twitter’s developer website.

After the changes, our configuration file looks like the following:

twitter.oauth.accessToken=<TWITTER ACCESS TOKEN>
twitter.oauth.accessTokenSecret=<TWITTER ACCESS TOKEN SECRET>
twitter.oauth.consumerKey=<TWITTER ACCESS CUSTOMER KEY>
twitter.oauth.consumerSecret=<TWITTER CUSTOMER SECRET>

It's time to start it up! We can use the Confluent CLI load command:

confluent load twitter -d $TWITTER_HOME/twitter.properties

$TWITTER_HOME is the folder containing the configuration file. We can check the Kafka Connect status by querying the REST APIs with the following:

curl -s "http://localhost:8083/connectors/twitter/status" | jq [.connector.state] 

We can also check if all the settings are correct by consuming the AVRO messages in the twitter_avro topic with a console consumer:

confluent consume twitter_avro --value-format avro

And the result is, as expected, an event stream of tweets.

RSS feeds as another data source

The second data source that we'll use for our FIFA Women’s World Cup sentiment analytics are RSS feeds from sports-related newspapers. RSS feeds are useful because they share official information about teams and players, like results, episodes, and injuries. RSS feeds should be considered neutral since they should only report facts. For this blog post, we’ll use RSS feeds as a way to measure the average sentiment of the news. As per the Twitter case above, a prebuilt Kafka Connect RSS Source exists, so all we need to do is to install it via the Confluent Hub client:

confluent-hub install kaliy/kafka-connect-rss:latest

Then, create a configuration file with the following important parameters:

  • rss.urls: This is a list of space-separated RSS feed URLs. For our Women’s World Cup example, we’ve chosen the following sources: La Gazzetta dello Sport, Transfermarkt, Eurosport, UEFA, The Guardian, Daily Mail, The Sun Daily, BBC
  • topic: The Kafka topic to write to, which is rss_avro in our case

The full configuration file looks like the following:

rss.urls=https://www.transfermarkt.co.uk/rss/news https://www.eurosport.fr/rss.xml https://www.uefa.com/rssfeed/news/rss.xml https://www.theguardian.com/football/rss https://www.dailymail.co.uk/sport/index.rss https://www.thesundaily.my/rss/sport http://feeds.bbci.co.uk/news/rss.xml https://www.gazzetta.it/rss/home.xml

And again, we can start the ingestion of RSS feeds with the Confluent CLI:

confluent load RssSourceConnector -d $RSS_HOME/RssSourceConnector.properties

We can test the status of Kafka Connectors using this simple procedure, and calling it like:

RssSourceConnector  |  RUNNING  |  RUNNING
twitter             |  RUNNING  |  RUNNING

We can see that the both the RssSourceConnector and the twitter Connect are up and running. We can then check the actual data with the console consumer.

confluent consume rss_avro --value-format avro

Below is the output as expected.

Shaping the event streams

After ingesting the Twitter and RSS event streams into topics, it’s time to shape them with KSQL. Shaping the topics accomplishes two purposes:

  1. It makes the topics queryable from KSQL
  2. It defines additional structures that can be reused in downstream applications

The Twitter stream lands in Avro format with the fields listed in the related GitHub repo. We can easily declare a TWITTER_STREAM KSQL stream on top of TWITTER_AVRO with:


There is no need to define the single fields in the event stream declaration because they are already in AVRO and thus will be sourced from the Confluent Schema Registry. Schema Registry is the component within Kafka, in charge of storing, versioning and serving the topics Avro Schemas. When a topic is in AVRO format, its schema is stored in the Schema Registry, where downstream applications (like KSQL in this case) can retrieve it and use it to “shape” the messages in the topic.

The important bits of the above KSQL for our definition are:

  • KAFKA_TOPIC='TWITTER_AVRO': the definition of the source topic
  • VALUE_FORMAT='AVRO': the definition of the source topic format
  • TIMESTAMP='CREATEDAT': the Tweet's creation date, which is used as the event timestamp

We can now check that the fields’ definition has correctly been retrieved by the Schema Registry with:


Or, we can use the REST API by:

curl -X "POST" "http://localhost:8088/ksql" \
-H "Content-Type: application/vnd.ksql.v1+json; charset=utf-8" \
-d $'{
"streamsProperties": {}

The resulting fields section will be:

"fields": [
"name": "ROWTIME",
"schema": {
"type": "BIGINT",
"fields": null,
"memberSchema": null
"name": "ROWKEY",
"schema": {
"type": "STRING",
"fields": null,
"memberSchema": null
"name": "CREATEDAT",
"schema": {
"type": "BIGINT",
"fields": null,
"memberSchema": null
"name": "ID",
"schema": {
"type": "BIGINT",
"fields": null,
"memberSchema": null

The same applies to the RSS feed contained in rss_avro with:

create stream RSS_STREAM

The result will be:

ksql> describe RSS_STREAM;
Name                 : RSS_STREAM
Field   | Type
ROWTIME | BIGINT           (system)

We can also use the URL manipulation functions added in KSQL 5.2 to extract useful information from the LINK column with:


The result will be:

www.dailymail.co.uk | /sport/football/article-6919585/Paul-Scholes-backs-Manchester-United-spring-surprise-Barcelona.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
www.dailymail.co.uk | /sport/formulaone/article-6916337/Chinese-Grand-Prix-F1-race-LIVE-Shanghai-International-Circuit.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
www.dailymail.co.uk | /sport/football/article-6919403/West-Brom-make-approach-Preston-manager-Alex-Neil.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
www.dailymail.co.uk | /sport/football/article-6919373/Danny-Murphy-Jermaine-Jenas-fascinating-mind-games-thrilling-title-race.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
www.dailymail.co.uk | /sport/football/article-6919215/Brazilian-legend-Pele-successfully-undergoes-surgery-remove-kidney-stone-Sao-Paulo-hospital.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
Limit Reached
Query terminated
Sentiment analytics and Google’s Natural Language APIs

Text processing is a part of machine learning and is continuously evolving with a huge variety of techniques and related implementations. Sentiment analysis represents a branch of text analytics and aims to identify and quantify affective states contained in a text corpus.

Natural Language APIs provide sentiment scoring as a service using two dimensions:

  1. Score: Positive (Score > 0) or Negative (Score < 0) Emotion
  2. Magnitude: Emotional Content Amount

For more information about sentiment score and magnitude interpretation, refer to the documentation.

Using Natural Language APIs presents various benefits:

  • Model training: Natural Language is a pre-trained model, ideal in situations where we don't have a set of already-scored corpuses.
  • Multi-language: RSS feeds and tweets can be written in multiple languages. Google Natural Language is capable of scoring several languages natively.
  • API call: Natural Language can be called via an API, making the integration easy with other tools.
Sentiment scoring in KSQL with user- defined functions (UDFs)

The Natural Language APIs are available via client libraries in various languages, including Python, C#, and Go. For the purposes of this blog post, we'll be looking at the Java implementation since it is currently the language used to implement KSQL user-defined functions (UDFs). For more details on how to build a UD(A)F function, please refer to How to Build a UDF and/or UDAF in KSQL 5.0 by Kai Waehner, which we'll use as base for the GSentiment class definition.

The basic steps to implementing Natural Language API calls in a UDF are the following:

  1. Add the google.cloud.language JAR dependency in your project. If you are using Maven, you just need to add the following in your pom.xml

<dependency>: <groupId>com.google.cloud</groupId> <artifactId>google-cloud-language</artifactId> <version>1.25.0</version> </dependency>

  1. Create a new Java class called GSentiment.
  2. Import the required classes:

//KSQL UDF Classes import io.confluent.ksql.function.udf.Udf; import io.confluent.ksql.function.udf.UdfDescription; //Google NL Classes import com.google.cloud.language.v1.LanguageServiceClient; import com.google.cloud.language.v1.Sentiment; import com.google.cloud.language.v1.AnalyzeSentimentResponse; import com.google.cloud.language.v1.Document; import com.google.cloud.language.v1.Document.Type;

  1. Define the GSentimen class and add the Java annotations @UdfDescription(name = "gsentiment", description = "Sentiment scoring using Google NL API") public class Gsentiment { ... }.
  2. Within the class, declare gsentiment as the method accepting a String text as input. As of now, UDFs can't return two output values, so we are returning the sentiment score and magnitude as an array of double. @Udf(description = "return sentiment scoring") public List<Double> gsentiment( String text) { ... }.
  3. Within the gsentiment method, invoke the Natural Language API sentiment and cast the result as an array. Since a UDF can return only one parameter currently, we need to pipe the sentiment score and magnitude into an array of two elements.
Double[] arr = new Double[2];   
try (LanguageServiceClient languageServiceClient = LanguageServiceClient.create()) {
Document document = Document.newBuilder()
AnalyzeSentimentResponse response = languageServiceClient.analyzeSentiment(document);
Sentiment sentiment = response.getDocumentSentiment();

catch (Exception e) {
arr[0]=(double) 0.0;
arr[1]=(double) 0.0;
return Arrays.asList(arr);

  1. As mentioned in How to Build a UDF and/or UDAF in KSQL 5.0, build an uber JAR that includes the KSQL UDF and any dependencies, and copy it to the KSQL extension directory (defined in the ksql.extension.dir parameter in ksql-server.properties).
  2. Add an environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to the service account key that will be used to authenticate to Google services.
  3. Restart KSQL.

At this point, we should be able to call the GSentiment UDF from KSQL:

[0.10000000149011612, -0.10000000149011612]
[0.20000000298023224, 0.20000000298023224]
[0.5                , 0.10000000149011612]
[0.10000000149011612, 0.10000000149011612]
[0.0                , 0.0]
Limit Reached
Query terminated

As expected, the UDF returns an ARRAY of numbers. In order to get the sentiment score and magnitude in separated columns, we simply need to extract the relevant values:

0.20000000298023224 | 0.10000000149011612
0.30000001192092896 | 0.10000000149011612
0.800000011920929   | 0.800000011920929
0.0                 | 0.0
0.30000001192092896 | 0.30000001192092896
Limit Reached
Query terminated

However, we should note that Natural Language APIs are priced per API call and, in the above SQL, we are calling the API two times—one for each GSENTIMENT call. Therefore, the above SQL will cost us two API calls per document. To optimise the cost, we can create a new event stream TWITTER_STREAM_WITH_SENTIMENT, which will physicalize in Kafka the array.


Next, parse the sentiment SCORE and MAGNITUDE from the TWITTER_STREAM_WITH_SENTIMENT event stream:


With this second method, we optimize the cost with a single Natural Language API call per tweet. We can do the same with the RSS feeds by declaring:

Sink to Google BigQuery

The following part of this blog post focuses on pushing the dataset into Google BigQuery and visual analysis in Google Data Studio.

Pushing the data into BigQuery is very easy—just install the BigQuery Sink Connector with:

confluent-hub install wepay/kafka-connect-bigquery:latest

Next, configure it while applying the following parameters (amongst others):

  • topics: defines the topic to read (in our case RSS_STREAM_WITH_SENTIMENT_DETAILS and TWITTER_STREAM_WITH_SENTIMENT_DETAILS)
  • project: the name of the Google project that we’ll use for billing
  • datasets=.*=wwc: defines the BigQuery dataset name
  • keyfile=$GOOGLE_CRED/myGoogleCredentials.json: points to the JSON file containing Google's credentials (which in our case is the same file used in the Google Natural Language scoring)

Before starting the connector, we need to ensure the BigQuery dataset named wwc (as per configuration file) exists, otherwise, the connector will fail. To do so, we can log into BigQuery, select the same project defined in the configuration file, and click on CREATE DATASET. Then, we’ll need to fill in all the details (more information about the dataset creation in the Google documentation).

After creating the dataset, it’s time to start the connector with the Confluent CLI:

confluent load bigquery-connector -d $RSS_HOME/connectorBQ.properties

If the Kafka sink works, we should see one table per topic defined in the configuration file, which are RSS_STREAM_WITH_SENTIMENT_DETAILS and TWITTER_STREAM_WITH_SENTIMENT_DETAILS in our case.

Of course, we can query the data from BigQuery itself.

Visual Analysis in Google Data Studio

To start analysing the data in Google Data Studio, simply connect to the related console and select “Blank Report.”

We’ll be asked which data source to use for the project. Thus, we need to set up a connection to the wwc dataset by clicking on CREATE NEW DATASOURCE and selecting BigQuery as the connection. Then, select the project, dataset, and table ( TWITTER_STREAM_WITH_SENTIMENT_DETAILS).

We can then review the list of columns, types, and aggregations,  adding the data source to the report.

Finally, we can start creating visualisations like tiles to show record counts, line charts for the sentiment trend, and bar charts defining the most used languages.

A more advanced visualisation like a scatterplot shows the most common hashtags and the associated average sentiment value.

Below is a map visualising the average sentiment by country.

Analysing and comparing sentiment scores

Now that we have the two streams of data coming from Twitter and RSS feeds, we can do the analysis in KSQL and, in parallel, visually in Google Data Studio. We can, for example, examine the average sentiment over a timeframe and check how one source sentiment score compares to the other.

On the 27th of June, the quarterfinal match between Norway and England was played, with the result being that England beat Norway 3–0. Let’s check if we can somehow find significant similarities in the sentiment scoring of our dataset.

Starting with the Twitter feed, we can check all the tweets including ENGLAND and NORWAY by filtering the related hashtag #NORENG. To obtain the team related overall score, I’m then assigning to each team all the tweets containing the country full name and aggregating the  SENTIMENTSCORE with the following SQL:


We can check the overall sentiment score associated with the two teams using:


The GROUP BY 1 is necessary since KSQL currently requires a GROUP BY clause when using aggregation functions like SUM. Since we don’t aggregate for any columns other than the window time, we can use the number 1 as fix aggregator for the total. The result of the above query is in line with the final score, with the winner (England) having an average sentiment score of 0.212, and the loser (Norway) having a score of 0.0979.

We can also look at the behaviour per hour with the TUMBLING KSQL windowing function:


The query yields the following result:

Thu 27 Jun 16 | 0.18409091 | 0.16075758
Thu 27 Jun 17 | 0.14481481 | 0.13887096
Thu 27 Jun 18 | 0.14714406 | 0.12107647
Thu 27 Jun 19 | 0.07926398 | 0.34757579
Thu 27 Jun 20 | 0.10077705 | 0.13762544
Thu 27 Jun 21 | 0.08387538 | 0.17832865

We can clearly see that towards match time (19:00 BST), the ENGLAND average score has a spike, in coincidence with England’s first goal in the third minute. We can see the same on the Line Chart in Google Data Studio.

We can do a similar exercise on top of the RSS feeds stream, but first, we need to somehow filter it to get only FIFA Women’s World Cup 2019 data, since the predefined connector is ingesting all the news from the RSS sources without a topic filter. To do so, we create a new stream filtering only contents containing WOMEN and CUP:


We can now analyse the overall RSS sentiment with:


As before, the SUM(SENTIMENTSCORE)/COUNT(SENTIMENTSCORE) is calculating the average. We can then calculate the sentiment average for the selected team. Taking the same example of ENGLAND and NORWAY used previously, we just declare a stream filtering the sentiment for the two nations. For example:


Then, we can analyse the separate scoring with:

WHERE ROWTIME > STRINGTODATE('2019-06-27', 'yyyy-MM-dd')

The result is an average sentiment of 0.0575 for Norway and 0.111 for England, again in line with the match result where England won 3–0.

We can also understand the variation of the sentiment over time by using KSQL windowing functions like TUMBLING:

WHERE ROWTIME >= STRINGTODATE('2019-06-27', 'yyyy-MM-dd')

This yields the following results:

Thu 27 Jun 17 | 0.12876364 | 0.12876364
Thu 27 Jun 18 | 0.24957054 | 0.24957054
Thu 27 Jun 19 | 0.15606978 | 0.15606978
Thu 27 Jun 20 | 0.09970317 | 0.09970317
Thu 27 Jun 21 | 0.00809077 | 0.00809077
Thu 27 Jun 23 | 0.41298701 | 0.12389610

As expected from this source, most of the scoring of the two countries are the same since the number of articles is limited and almost all articles mention both ENGLAND and NORWAY.

Strangely, as we can see in the graph above, the NORWAY sentiment score on the 27th of June at 11:00 pm GMT (so after the match ended) is much higher than the ENGLAND one.

We can look at the data closely with:

https://www.theguardian.com/...      | 0.375974032  | 0.375974032
https://www.bbc.co.uk/.../48794550   | null         | -0.45428572
https://www.bbc.co.uk/.../48795487   | 0.449999988  | 0.449999988

We can see that  NORWAY is being associated with two articles: one from The Guardian with a positive 0.375 score and one from BBC with a positive 0.449 score. ENGLAND, on the other hand, is associated with another BBC article, having a negative -0.454 score.
We can also compare the hourly Twitter and RSS sentiment scores by creating two tables:

WHERE ROWTIME >= STRINGTODATE('2019-06-27', 'yyyy-MM-dd')

The key of both the tables is the window start date, as we can see from:

ksql> select rowkey from RSS_NORWAY_ENGLAND_TABLE limit 1;
1 : Window{start=1561557600000 end=-}

We can then join the results together with the following statement:


This yields the following result:

Thu 27 Jun 17 | 0.14481481 | 0.13887096 | 0.12876364 | 0.12876364
Thu 27 Jun 18 | 0.14714406 | 0.12107647 | 0.24957054 | 0.24957054
Thu 27 Jun 19 | 0.07926398 | 0.34757579 | 0.15606978 | 0.15606978
Thu 27 Jun 20 | 0.10077705 | 0.13762544 | 0.09970317 | 0.09970317
Thu 27 Jun 21 | 0.08387538 | 0.17832865 | 0.00809077 | 0.00809077

And the end result similarly in a Data Studio Line Chart

Interested in more?

If you’re interested in what KSQL can do, you can download the Confluent Platform to get started with the event streaming SQL engine for Apache Kafka. To help you get started, Rittman Mead provides a 30 day Kafka quick start package

This article was originally posted on the Confluent blog.

Categories: BI & Warehousing

Merging OBIEE 12c .RPD binary files directly in Git

Thu, 2019-07-11 15:41
Let's talk about OBIEE concurrent development!

Enabling concurrent development for OBIEE RPD is a recurring theme in OBIEE blogs. Full support for RPD development with Gitflow has long since been part of the Rittman Mead's BI Developer Toolkit and is described in great detail in Minesh's blog post. What you are currently reading is a follow-up to Minesh's post, but taking it one step further: instead of calling Python scripts to perform Gitflow steps, we want to perform all those steps directly from our Git client (including the ones performing a merge, like feature finish), be it command line or a visual application like Sourcetree.

RPD versioning directly in Git - do we need that?

How is versioning directly in a Git client better than calling Python scripts? First of all, it is the convenience of using the same approach, the same tool for all content your need to version control. A Python script will have to come with instructions for its use, whereas every developer knows how to use Git. Last but not least, a 3-way merge, which is used for Gitflow's feature finish, release finish and hotfix finish commands, requires three repositories that need to be passed to the script in the right order. Doing merges in your Git client would be quicker and less error prone.

What is a Git merge?

Before we proceed with discussing our options for merging OBIEE RPDs, let us quickly recap on how Git merges work.

There are two types of Git merges: Fast-forward Merges and 3-way Merges. Strictly speaking, Fast-forward Merges are no merges at all. If the base branch has not seen any changes whilst you worked on your feature branch, merging the feature back into the base simply means 'fast-forwarding' the base branch tip to your feature branch tip, i.e. your feature becomes the new base. That is allowed because the two branches have not diverged - the histories of the base and the feature branches form a single history line.

When the two branches have diverged, i.e. when the base has been modified by the time we want to merge our feature, a 3-way merge is the only option.

In the above diagram, feature 1 can be fast-forward merged whereas feature 2 must be 3-way merged into the develop branch.

Note that because a Fast-forward Merge is not an actual merge but rather a replacement, it is not relevant what content is being merged. The 3-way Merge however, depending on the content being merged, can be quite challenging or even impossible. And can result in merge conflicts that require manual resolution.

So... can Git 3-way merge RPDs?

OBIEE RPD can be saved in two formats: a single binary .rpd file or one or many .xml files (depending on what rpd-to-xml conversion method you use). The choice here seems obvious - it is common knowledge that Git cannot reliably 3-way merge binary files. So XML format it is. Or is it?

Like any other text file, Git certainly can merge XML files. But will it produce an XML that is still recognised as a consistent OBIEE RPD? Well, there are some OBIEE developer teams that have reported success with this approach. My own experience even with the most trivial of RPD changes shows that somewhere during the .xml to .rpd conversion, then introducing changes in the .rpd and in the end converting it back to .xml, the XML tags get reshuffled and sometimes their identifiers can change as well. (Equalising RPD objects is supposed to help with the latter.) I found no standard Git merge algorithm that would reliably and consistently perform RPD merge for XML format produced this way, be it a single large XML file or a collection of small XML files.

Fortunately, there is a better (and less risky) way.

Creating a Git custom merge driver

It is possible to create custom Git merge drivers and then assign them to specific file extensions (like .rpd) in the .gitattributes file - as described in Git documentation. According to the guide, defining a custom merge driver in Git is really straight forward: just add a new entry to the .git/config file:

[merge "filfre"]
	name = feel-free merge driver
	driver = filfre %O %A %B %L %P
	recursive = binary

Here, filfre is the code name of the custom merge driver, feel-free merge driver is the descriptive name of it (hardly used anywhere) and the driver value is where we define the driver itself. It is a shell command for your operating system. Typically it would call a shell script or a binary executable. It can be a java -jar execution or a python my-python-script.py call. The latter is what we want - we have already got a 3-way merge script for OBIEE RPD in the Rittman Mead's BI Developer Toolkit, as blogged by Minesh.

For the script to know about the content to be merged, it receives the following command line arguments: %O %A %B %L %P. These are the values that Git passes to the custom merge driver:

  • %O - this is the Base or the Original for the 3-way merge. If we are using Git Flow, this is the develop branch's version, from which our feature branch was created;
  • %A - this is the Current version for the 3-way merge. If we are using Git Flow, this is the feature branch that we want to merge back into develop;
  • %B - this is the Other or the Modified version of the 3-way merge. If we are using Git Flow, this is the develop branch as it is currently (diverged from the original Base), when we want to merge our feature branch back into it.

There are two more values, which we do not need and will ignore: %L is Conflict marker size, e.g. 7 for '>>>>>>>'. This is irrelevant for us, because we are handling binary files. %P is the full path name where the merge result will be stored - again irrelevant for us, because Python is capable of getting full paths for the files it is handling, in case it needs it.

Creating a Git custom merge driver for OBIEE .rpd binary files

What we need here is a Python script that performs a 3-way RPD merge by calling OBIEE commands comparerpd and patchrpd from command line. Please note that OBIEE creates a 4th file as the output of the merge, whereas a git merge driver is expected to overwrite the Current (%A) input with the merge result. In Python, that is quite doable.

Another important thing to note is that the script must return exit code 0 in case of a success and exit code 1 in case there were merge conflicts and automatic merge could not be performed. Git determines the success of the merge solely based on the exit code.

Once we have the Python script ready and have tested it standalone, we open our local Git repository folder where our OBIEE .rpd files will be versioned and open the file <repo root>/.git/config for editing and add the following lines to it:

[merge "rpdbin"]
    name = binary RPD file merge driver
    driver = python C:/Developer/bi_developer_toolkit/git-rpd-merge-driver.py %O %A %B

Our Python script expects 3 command line arguments - names of .rpd files: Base (%O), Current (%A) and Modified (%B). Those will be temporary files, created by Git in run time.

Once the config file is modified, create a new file <repo root>/.gitattributes and add the following line to it:

*.rpd merge=rpdbin

This assumes that your binary RPD files will always have the extension .rpd. If with a different extension, the custom merge driver will not be applied to them.

And that is it - we are done!

Note: if you see that the custom merge driver works from the Git command line tool but does not work in Sourcetree, you may need to run Sourcetree as Administrator.

Trying it out

We will use Sourcetree as our Git/Gitflow client - it is really good at visualising the versioning flow and shows the currently available Gitflow commands for the currently checked out branch.

We will use the RPD from Oracle Sample Application v602 for OBIEE 12c for our testing.

After initialising Gitflow in our Git repository, we add the out-of-the-box Sample Apps RPD to our repository's develop branch - that will be our Base.

Then we create two copies of it and modify each copy to introduce changes we would like to see merged. In the screenshots below, you can see Business Models and Databases renamed. But I did also change the content of those Business Models.

Repo 1:

Repo 2:

Now we create a new feature branch and overwrite the Base rpd it contains with our Repo 1 rpd.

As the next step, we check out the develop branch again and replace the Base rpd there with Repo 2 rpd.

Note that we need to make sure the develop branch is different from the original Base when we finish our feature. If the develop branch will be the same as the original Base when we finish the feature, a fast-forward merge will be done instead and our custom merge driver will not be applied.

The result should look like this in Sourcetree. You can see a fork, indicating that the develop and the feature8 branches have diverged:

We are ready to test our custom 3-way merge driver. In Sourcetree, from the Gitflow menu, select Finish Feature.

Confirm your intention to merge the feature back into develop.

If all goes as planned, Git will call your custom merge driver. In Sourcetree, click the Show Full Output checkbox to see the output from your script. In my script, I tagged all output with a [Git RPD Merge Driver] prefix (except the output coming from external functions). This is what my output looks like:

Now let us check the result: make sure the develop branch is checked out, then open the merged RPD in the Admin tool.

We can see that it worked - we can now do full Gitflow lifecycle for OBIEE .rpd files directly in Git.

But what if the merge fails?

If the merge fails, the feature branch will not be deleted and you will have to merge the .rpd files manually in the OBIEE Admin tool. Note that you can get the Current, the Modified and the Base .rpd files from Git. Once you are happy with your manual merge result, check out the develop branch and add it there.

Categories: BI & Warehousing

Kafka and Football: KSQL, Google Natural Language APIs, BigQuery and DataStudio

Thu, 2019-07-04 02:08

If you missed it, yesterday I wrote a guest blog post for Confluent! The blog post mixes two of my favorite topics: Apache Kafka and Football! The post starts by defining the data ingestion from Twitter and sport news RSS feeds via Kafka Connect, continues with the definition of a KSQL UDF Function using Google Natural Language APIs for Sentiment Analysis. Then it's time to define the data sink to Google Big Query and the data visualization with Google Data Studio.

The last bit of the post is dedicated to data analysis with both KSQL and DataStudio on top of the quarterfinal match won by England against Norway. If you are interested in the full article, check it out here!

Categories: BI & Warehousing

Oracle Analytics Summit 2019

Mon, 2019-07-01 09:16

Last week I attended the Oracle Analytics Summit at Skywalker Ranch. The event was live-streamed on Facebook and widely Tweeted, so a fair number of people may be up to speed with the announcements, I’ll summarise what I thought the key points were.

The purpose of the event, I think, was to relaunch/re-energise Oracle’s analytics products and services, make some specific announcements, and to try and better engage customers and analysts.

I think a lot of this stems from T. K. Anand taking over last summer and the changes in the team and direction he is looking to make. He was the first speaker at talked of openness and simplicity.

New Product Line

First, he announced a rationalisation of the product line with three offerings:

  • OAC (Oracle Analytics Cloud) - this is OAC on the second generation OCI (Oracle Cloud Infrastructure), with a simpler pricing model, details below.
  • OAS (Oracle Analytics Server) - this is a feature parity version of OAC available on premise.
  • OAA (Oracle Analytics Applications) - this is a new ‘BI Apps’ product for Oracle’s SaaS offerings such as ERP Cloud and HCM Cloud, more details below.
New Pricing

Second, he announced new pricing for OAC:

  • $20/month for Professional Edition
  • $2,000/month per OCPU for Enterprise Edition


Image from Doug Henschen

This also streamlined the different versions of OAC. Professional is essentially DV (Data Visualisation), and Enterprise is OBIEE and DV. I am not sure what this means for the existing Data Lake edition, however, I believe Essbase is being moved to a separate PaaS service.

I believe this also means that in the future if you are an on-premise OAS customer with the Enterprise Edition, then DV will be included in the price.

At the time of writing this, I’m not sure of the minimum number of users for the Professional Edition, however, given the price point, it looks like Oracle are targeting Power BI, Tableau etc., so I would expect a low number.

The Future of BI Apps

BI Apps has risen from the flames, and I believe there are now a couple of options:

  1. The existing ODI ‘version’ of BI Apps is being updated to support ODI 12c, OBIEE 12c/OAC and ADW. This means whether you have your application reporting on-premise or in the cloud, there is an upgrade path for the traditional BI Apps components, plus, due to the support of ADW, you will be able to migrate an existing on-premise system to Oracle’s cloud. Independent of anything that happens to DIPC, ODI will be supported on Oracle’s IaaS.
  2. There is a new cloud-only version of packaged applications, see OAA above. These provide a managed data pipeline between the SaaS application and an instance of ADW. Information is surfaced through what looks like DV, with some extended capabilities, allowing KPIs to be defined and these KPIs to arranged as a series of Tiles on a dashboard, not unlike the look of the Day by Day app.
Customer Commitment

One clear objective is more interaction and integration with customers, and to this end Oracle is increasing the number of people in support, there is more online content available, such as their Udemy channel and they are also creating something called the Idea Lab for the analytics community, where users can make suggestions and interact with the Product Managers.


I felt throughout the event that Oracle are aware of their shortcomings in the past and are committed to listening to customers, giving them a better experience and putting them more at the centre of their world.

The product line and pricing changes act as proof that they are committed to acting on this; it’s not just lip service. Through the pricing of the Professional Edition, I think you can also see Oracle going for the departmental or shadow IT world of data exploration and visualisation.

One customer commented at the end of the event that no one tried to sell him anything, that, if anything, is a good indicator for change.

Categories: BI & Warehousing

Rittman Mead at Kscope 2019

Thu, 2019-06-13 05:17
Rittman Mead at Kscope 2019

June is time for one of my favourite conferences: Kscope! This year the location is Seattle and the Agenda is impressive. The event starts on Sunday with the Symposiums driven by Oracle Product Managers and divided by stream of interest.

The main conference is Monday-Wednesday with the Thursday dedicated to Deep Dive sessions. On Tuesday morning the Analytics track attention will be shifted to Skywalker Ranch for live-stream of the Oracle Analytics Summit!

I'll be representing Rittman Mead with two talks: One about "Become an Equilibrista: Find the Right Balance in the Analytics Tech Ecosystem" (Jun 24, 2019 03:45 PM - 04:45 PM, Room 3B, Level 3) discussing how to make self-service and centralized Analytics coexist successfully thus avoiding the "Excel Hell".

Rittman Mead at Kscope 2019

The second session is about two topics I love: Machine Learning and Wine! The "Is It Corked? Wine Machine Learning Predictions with OAC" (Jun 25, 2019 03:45 PM - 04:45 PM, Room 3B, Level 3) will show the details on how Oracle Analytics Cloud democratizes data science using a Wine Dataset as example. During the session we'll explore how OAC provides an easy and visual interface to Machine Learning and how a predictive model can be built, tested and evaluated within the same platform.

Rittman Mead at Kscope 2019

If you'll be at Kscope 2019 and you see me around, don't hesitate to stop me! I’d be pleased to speak with you about Wine, Food, Coffee, and Analytics of course!

Categories: BI & Warehousing

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

Wed, 2019-04-17 10:03
 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

Hi, Game of Thrones aficionados, welcome to GoT Series 8 and my tweet analysis! If you missed any of the prior season episodes, here are I, II and III. Finally, after almost two years, we have a new series and something interesting to write about! If you didn't watch Episode 1, do it before reading this post as it might contain spoilers!

Let's now start with a preview of the starting scene of Episode 2:

 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

If you followed the previous season blog posts you may remember that I was using Kafka Connect to source data from Twitter, doing some transformations with KSQL and then landing the data in BigQuery using Connect again. On top of it, I was using Tableau to analyze the data.

 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

The above infrastructure was working fine and I have been able to provide insights like the sentiment per character and the "game of couples" analysing how a second character mentioned in the same tweet could change the overall sentiment.

 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

The sentiment scoring was however done at visualization time, with the data extracted from BigQuery into Tableau at tweet level, scored with an external call to R, then aggregated and finally rendered.

 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

As you might understand the solution was far from optimal since:

  • The Sentiment scoring was executed for every query sent to the database, so possibly multiple times per dashboard
  • The data was extracted from the source at tweet level, rather than aggregated

The dashboard indeed was slow to render and the related memory consumption huge (think about data volumes being moved around). Furthermore, Sentiment Scores were living only inside Tableau: if any other people/application/visualization tool wanted to use them, they had to recalculate from scratch.

My question was then: where should I calculate Sentiment Scores in order to:

  • Do it only once per tweet, not for every visualization
  • Provide them to all the downstream applications

The answer is simple, I need to do it as close to the source as possible: in Apache Kafka!

Sentiment Scoring in Apache Kafka

There are a gazillion different ways to implement Sentiment Scoring in Kafka, so I chose a simple method based on Python and Google's Natural Language API.

Google Natural Language API

Google's NL APIs is a simple interface over a pre-trained Machine Learning model for language Analysis and as part of the service it provides sentiment scoring.

The Python implementation is pretty simple, you just need to import the correct packages

from google.cloud import language_v1
from google.cloud.language_v1 import enums

Instantiate the LanguageServiceClient

client = language_v1.LanguageServiceClient()

Package the tweet string you want to be evaluated in a Python dictionary

content = 'I'm Happy, #GoT is finally back!'
type_ = enums.Document.Type.PLAIN_TEXT
document = {'type': type_, 'content': content}

And parse the response

response = client.analyze_sentiment(document)
sentiment = response.document_sentiment
print('Score: {}'.format(sentiment.score))
print('Magnitude: {}'.format(sentiment.magnitude))

The result is composed by Sentiment Score and Magnitude:

  • Score indicated the emotion associated with the content as Positive (Value > 0) or Negative (Value < 0)
  • Magnitude indicates the power of such emotion, and is often proportional with the content length.

Please note that Google's Natural Language API is priced per document so the more content you send for scoring, the bigger your bill will be!

Creating a Kafka Consumer/Producer in Python

Once we fixed how to do Sentiment Scoring, it's time to analyze how we can extract a tweet from Kafka in Python. Unfortunately, there is no Kafka Streams implementation in Python at the moment, so I created an Avro Consumer/Producer based on Confluent Python Client for Apache Kafka. I used the jcustenborder/kafka-connect-twitter Connect, so it's always handy to have the Schema definition around when prototyping.

Avro Consumer

The implementation of an Avro Consumer is pretty simple: as always first importing the packages

from confluent_kafka import KafkaError
from confluent_kafka.avro import AvroConsumer
from confluent_kafka.avro.serializer import SerializerError

then instantiating the AvroConsumer passing the list of brokers, group.id useful, as we'll see later, to add multiple consumers to the same topic, and the location of the schema registry service in schema.registry.url.

c = AvroConsumer({
    'bootstrap.servers': 'mybroker,mybroker2',
    'group.id': 'groupid',
    'schema.registry.url': ''})

Next step is to subscribe to a topic, in my case got_avro


and start polling the messages in loop

while True:
        msg = c.poll(10)

    except SerializerError as e:
        print("Message deserialization failed for {}: {}".format(msg, e))



In my case, the message was returned as JSON and I could extract the tweet Text and Id using the json package


Avro Producer

The Avro Producer follows a similar set of steps, first including needed packages

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

Then we define the Avro Key and Value Schemas, in my case I used the tweet Id as key and included the text in the value together with the sentiment score and magnitude.

key_schema_str = """
   "namespace": "my.test",
   "name": "value",
   "type": "record",
   "fields" : [
       "name" : "id",
       "type" : "long"
value_schema_str = """
   "namespace": "my.test",
   "name": "key",
   "type": "record",
   "fields" : [
       "name" : "id",
       "type" : "long"
       "name" : "text",
       "type" : "string"
       "name" : "sentimentscore",
       "type" : "float"
       "name" : "sentimentmagnitude",
       "type" : "float"

Then it's time to load the Key and the Value

value_schema = avro.loads(value_schema_str)
key_schema = avro.loads(key_schema_str)
key = {"id": id}
value = {"id": id, "text": text,"sentimentscore": score ,"sentimentmagnitude": magnitude}

Creating the instance of the AvroProducer passing the broker(s), the schema registry URL and the Key and Value schemas as parameters

avroProducer = AvroProducer({
    'bootstrap.servers': 'mybroker,mybroker2',
    'schema.registry.url': 'http://schem_registry_host:port'
    }, default_key_schema=key_schema, default_value_schema=value_schema)

And finally produce the event defining as well the topic that will contain it, in my case got_avro_sentiment.

avroProducer.produce(topic='got_avro_sentiment', value=value, key=key)

The overall Producer/Consumer flow is needless to say, very easy

 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

And it works!

 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and PythonParallel Sentiment Scoring

One thing I started noticing immediately, however, is that especially on tweeting peaks, the scoring routine couldn't cope with the pace of the incoming tweets: a single python Consumer/Producer was not enough. No problem! With Kafka, you can add multiple consumers to the same topic, right?

Of course Yes! But you need to be careful.

Consumer Groups and Topic Partitions

You could create multiple consumers on different Consumer Groups (defined by the group.id parameter mentioned above), but by doing this you're telling Kafka that those consumers are completely independent, thus Kafka will send each one a copy of every message. In our case, we'll simply end up scoring N times the same message, one for each consumer.

If, on the other hand, you create multiple consumers with the same consumer group, Kafka will treat them as unique consuming process and will try to share the load amongst them. However, it will do so only if the source topic is partitioned and will exclusively associate each consumer to one (or more) topic partitions! To read more about this check the Confluent documentation.

The second option is what we're looking for, having multiple threads reading from the same topic and splitting the tweet workload, but how do we split an existing topic into partitions? Here is where KSQL is handy! If you don't know about KSQL, read this post!

With KSQL we can define a new STREAM sourcing from an existing TOPIC or STREAM and the related number of partitions and partition key (the key's hash will be used to assign deterministically a message to a partition). The code is the following


Few things to keep in mind:

  • Choose the number of partitions carefully, the more partitions for the same topic, the more throughput but at the cost of extra complexity.
  • Choose the <PARTITION_KEY> carefully: if you have 10 partitions but only 3 distinct Keys, then 7 partitions will not be used. If you have 10 distinct keys but 99% of the messages have just 1 key, you'll end up using almost always the same partition.

Yeah! We can now create one consumer per partition within the same Consumer Group!

Joining the Streams

As the outcome of our process so far we have:

  • The native GOT_AVRO Stream coming from Kafka Connect, which we divided into 6 partitions using the tweet id as Key and named GOT_AVRO_PARTITIONED.
  • A GOT_AVRO_SENTIMENT Stream that we created using Python and Google's Natural Language API, with id as Key.

The next logical step would be to join them, which is possible with KSQL by including the WITHIN clause specifying the temporal validity of the join. The statement is, as expected, the following:

    ON A.ID=B.ID; 

Please note that I left a two minute window to take into account some delay in the scoring process. And as you would expect I get............ 0 results!

 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

Reading the documentation better gave me the answer: Input data must be co-partitioned in order to ensure that records having the same key on both sides of the join are delivered to the same stream task.

Since the GOT_AVRO_PARTITIONED stream had 6 partitions and GOT_AVRO_SENTIMENT only one, the join wasn't working. So let's create a 6-partitioned version of GOT_AVRO_SENTIMENT.


Now the join actually works!

 Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

Next topics are: pushdown to Google's BigQuery and visualization using Google's Data Studio! But this, sadly, will be for another post! See you soon, and enjoy Game of Thrones!

Categories: BI & Warehousing

Democratize Data Science with Oracle Analytics Cloud - Predictions and Final Considerations

Wed, 2019-04-10 06:44
Democratize Data Science with Oracle Analytics Cloud - Predictions and Final Considerations

Oracle Analytics Cloud as an enabler for Data Science: this is the third post of a series which started with Episode I where we discussed the path from a Data Analyst to a Data Scientist and how to start a Data Science journey with OAC. Episode II was focused on Feature Engineering, Data Analytics and Machine Learning, showing how those steps can be performed in OAC using a visual and easily understandable interface.

This last post will focus on how to perform Predictions in OAC and lastly, it will analyze how the Data Science option provided by OAC can fit in a corporate Data Science Strategy.

Step #6: Predict

In the second part of the blog series, we understood how to train a model and evaluate the scoring predictions via OAC. It's important to note that the various models we created are stored as independent objects in OAC: they will not be changed by modification or deletion of the DataFlow that originated them. We can set the sharing permissions directly in the Model Inspect window.

Democratize Data Science with Oracle Analytics Cloud - Predictions and Final Considerations

Note: It's worth mentioning that models are only identified by Model Name so multiple DataFlows generating models with the same name will be overwriting the same object.

As for model training, also for Prediction OAC provides two methods: On the Fly or via DataFlow. To create a project using On The Fly method, do the following:

  • Select Create Scenario
  • Select an ML Model from the list
  • Associate the columns used for the Model training with columns coming from the dataset
  • Start using the predicted fields in one (or more) visualization
Democratize Data Science with Oracle Analytics Cloud - Predictions and Final Considerations

The On the Fly method works well and can add multiple scenarios at the same time, however, it has two disadvantages:

  • The prediction is done at run time: the Python code behind the model is executed for every analysis and after every refresh, which could make the visualization time consuming and resource hungry to deliver.
  • It's difficult to create new formulas using predictions and source dataset: providing a prediction is only part of the game, as Analysts, we may want to create complex calculations dependent on both columns coming from the predictions and the original dataset. This is possible (the prediction is done via an EVALUATE_SCRIPT which can be found in the Developer view) but it's very hard and error prone since the columns coming from the evaluation are not provided in the Project's code editor for new columns.
Democratize Data Science with Oracle Analytics Cloud - Predictions and Final Considerations

The second point mentioned is not an actual disadvantage of the method itself and mainly related to the editor implementation which I hope it will be fixed in a future release.

Both the above disadvantages are addressed by the "Batch" mode using DataFlow. In DataFlow there is the Apply Model step which follows the same logic as before:

  • Select Apply Model
  • Select an ML Model from the list
  • Associate the columns used for the Model training with columns coming from the dataset
  • Select which Columns from the prediction to include in the target dataset
  • Store the resulting dataset
Democratize Data Science with Oracle Analytics Cloud - Predictions and Final Considerations

The DataFlow method will apply the predictions only once and store them in a new dataset. Since the resulting dataset will contain both columns from the original dataset as well as the ones from the prediction any combined metric can easily be created as part of the DataFlow or directly in the Project visualization mode. It's worth mentioning that the DataFlow method can't be executed in situations where real-time data is analyzed thus the predictions need to happen at query time. However Incremental DataFlows can be scheduled at every few min intervals, thus near-real-time analytics cases could be addressed.

Final Considerations

Bringing Data Science to people, this is one of Oracle Analytics Cloud's missions: providing automated insights about datasets, training and evaluating Machine Learning Models in an easy and low-code interface, all included within the same tool Business Analysts use in their day to day job. OAC does a great job by offering a rich and powerful set of options via the same GUI Business Users are already experiencing.

Other Data Science Use Cases and Alternatives

We can't, however, expect OAC to cover all the Data Science use cases. OAC is perfect for the initial Data Science enablement, providing tools to get deep insights on existing datasets and create fairly complex Machine Learning models. On the other hand when dealing with massive amounts of data, when complex data transformations need to happen or when parameter fine-tuning for perfect performances is needed, then OAC may not be the right tool.

When running any model training or prediction, OAC is extracting the data, executing the training/scoring python code on the OAC server, and then either displaying or storing the data. OAC is basically moving the data to ML which is inefficient for huge data volumes. In such cases, the approach of moving Machine Learning to Data, by using Oracle Advanced Analytics directly in the Database should be the selected one. Moreover, for more sophisticated use cases, usually in the hands of Data Scientist, Oracle provides Oracle Machine Learning, a Machine Learning SQL notebook based on Apache Zeppelin on top of Oracle Autonomous Data Warehouse.

Worth also mentioning that, for typical code-based Data Scientist type work, Oracle R Enterprise offers the powerful functions of R optimized for the usage in the Oracle database. If you're more a Python guy, Oracle Advanced Analytics contains Oracle Machine Learning for Python (OML4Py) enabling to run an optimized Python version directly in the Oracle Database.

The bonus point of doing Machine Learning in the database is that predictions are usually only a function-call away, thus can be packaged in a view and consumed as standard database objects by Oracle Analytics Cloud users. This allows us to mix and match various Machine Learning creation flows depending on the complexity, amount of data and skillset of the team involved. On a future outlook, Oracle Data Science Cloud is also coming soon, providing collaborative end-to-end ML environment in the cloud.


Oracle Analytics Cloud is the perfect tool to speed up a company's path to Data Science: not only it empowers Business Users with Augmented Analytics, Machine Learning training and prediction flows, it also provides an easy interface to learn tasks like Data Cleaning, Feature Engineering and Model Selection & Evaluation which are proper of Data Scientists. Furthermore, including OAC in an ecosystem with Oracle Advanced Analytics and Oracle Machine Learning, provides a unique, consistent and connected solution capable of handling all Data Science company needs.

In a future blog post series I'll be doing a technical deep dive on how ML works under the hood and how you can extend it with custom Models. Check this space for news!

Categories: BI & Warehousing

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Tue, 2019-04-09 04:01
Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Welcome back! In my previous Post, I described how the democratization of Data Science is a hot topic in the analytical industry. We then explored how Oracle Analytics Cloud can act as an enabler for the transformation from Business Analyst to Data Scientist and covered the first steps in a Data Science project: problem definition, data connection & cleaning. In today's post, we'll cover the second part of the path: from the data transformation and enrichment, the analysis, the machine learning model training and evaluation. Let's Start!

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine LearningStep #3: Transform & Enrich

In the previous post, we understood how to clean data in order to handle wrong values, outliers, perform aggregation, feature scaling and divide our dataset between train and test. Cleaning the data, however, is only the first step in data processing, and should be followed by what in Data Science is called Feature Engineering.

Feature Engineering is a fancy name to call what in ETL terms we always called data transformation, we take a set of columns in input and we apply transformation rules to create new columns. The aim of Feature Engineering is to create good predictors for the following machine learning model. Feature Engineering is a bit of black art and to achieve excellent results requires a deep understanding of the ML Model we intend to use. However, most of the basic transformations are actually driven by domain knowledge: we should create new columns that we think will improve the problem explanation. Let's see some examples:

  • If we're planning to predict the Taxi Fare in New York between any two given points and we have source and destination, a good predictor for the fare probably would be the Euclidean distance between the two.
  • If we have Day/Month/Year on separate columns, we may want to condense the information in a unique column containing the Date
  • In case our dataset contains location names (Cities, Regions, Countries) we may want to geo-tag those properly with ZIP codes or ISO Codes.
  • If we have personal information like Credit Cards details or Person Name, we may want to decide to obfuscate or extract features like the person's sex from the name (on this topic please check the blog post about GDPR and ML from Brendan Tierney).
  • If we have continuous values like the person's age, do we think there is much difference between a 35, 36 or 37 year-old person? If not we should think about binning them in the same category.
  • Most Machine Learning Models can't cope with categorical data, thus we need to transform them to numbers (aka encoding). The standard process, when no ordering exists between the labels, is to create a new column for each value and mark the rows with 1/0 accordingly.
Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Oracle Analytics Cloud again covers all the above cases with two tools: Euclidean distance, generic data transformation like data condensation and binning are standard steps of the Dataflow component. We only need to set the correct parameters or write simple SQL-like statements. Moreover, for binning, there are options to do it manually as well as automatically providing equal-width and equal-height bins therefore taking out the manual labour and related BIAS.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

On the other side the geo-tagging, data obfuscation, automatic feature extraction (like person's sex based on name) is something that with most of the other tools needs to be resolved by hand, with complex SQL statements or dedicated Machine Learning efforts.

OAC again does a great job during the Data Preparation Recommendation step: after defining a data source, OAC will scan column names and values in order to find interesting features and propose some recommendations like geo-tagging, obfuscation, data splitting (e.g. Full Name split into First and Last Name) etc.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

The accepted recommendations will be added to a Data Preparation Script that can be automatically applied when updating our dataset.

Step #4: Data Analysis

Data Analysis is declared as Step #4 however since the Data Transformation and Enrichment phase we started a circular flow in order to optimize our predictive model output.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

The analysis is a crucial step for any Data Science project; in R or Python one of the first steps is to check dataset head() that will show a first overview of the data like the below

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

OAC does a similar job with the Metadata Overview where we can see for each column the name, type and sample values as well as the Attribute/Metric definition and associated aggregation than we can then change later on.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Analysing Data is always a complex task and is where the expert eye of a data scientist makes the difference. OAC, however, can help with the excellent Explain feature. As described in the previous post, by right clicking on any column in the dataset and selecting Explain, OAC will start calculating statistics and metrics related to the column and display the findings in graphs that we can incorporate in the Data Visualization project.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Even more, there are additional tabs in the Explain window that provide Key Drivers, Segments and Anomalies.

  • Key Drivers provides the statistically significant drivers for the column we are examining.
  • Segments shows hidden groups in the dataset that can predict outcomes in the column
  • Anomalies does an outlier detection, showing which are corner cases in our dataset
Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Some Data Science projects could already end here. If the objective was to find insights, anomalies or particular segments in our dataset, Explain already provides that information in a clear and reusable format. We can add the necessary visualization to a Project and create a story with the Narrate option.

If on the other side, our scope is to build a predictive model, then it's time to tackle the next phase: Model Training & Evaluation.

Step #5: Train & Evaluate

Exciting: now it's time to tackle Machine Learning! The first thing to do is to understand what type of problem we are trying to solve. OAC allows us to solve problems in the following categories:

  • Supervised when we have a history of the problem's solution and we want to predict future outcomes, we can then identify two subcategories
    • Regression when we are trying to predict a continuous numerical value
    • Classification when we are trying to assign every sample to a category out of two or more
  • Unsupervised when we don't have a history of the solution, but we ask the ML tool to help us understanding the dataset.
    • Clustering when we try to label our dataset in categories based on similarity.
Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

OAC provides two different ways to apply Machine Learning on a dataset: On the Fly or via DataFlows. The On the Fly method is provided directly in the data visualization: when we create any chart, OAC provides the option to add Clusters, Outliers, Trend and Forecast Lines.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

When adding one of the Analytics, we have some control over the behaviour of the predictive model. For the clustering image above we can decide which algorithm to implement (between K-means and Hierarchical Clustering), the number of clusters and the trellis scope in case we visualize multiple scatterplots, one for each value of a dimension.

Applying Machine Learning models on the fly is very useful and could provide some great insights, however, it suffers from a limitation: the columns analysed by the model are only the ones included in the visualization, we have no control over other columns we may want to add to the model to increase predictions accuracy.

If we want to have granular control over columns, algorithm and parameters to use, OAC provides the Train Model step in the DataFlow component.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

As described above OAC provides the option to solve Regression problems via Numeric Prediction, apply Binary or Multi-Classifier for Classification, and Clustering. There is also an option to train Custom Models which can be scripted by a Data Scientist, wrapped in XML tags and included in OAC (more about this topic in a later post).

Once we've selected the class of problem we're aiming to solve, OAC lets us select which Model to train between various prebuilt ones. After selecting the model, we need to identify which is the target column (for Supervised ML classes) and fix the parameters. Note the Train Partition Percent providing an automated way to split the dataset in train/test and Categorical/Numerical Column Imputation to handle the missing values. As part of this process, the encoding for categorical data is executed.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

... But which Model should we use? What parameters should we pick? One lesson I got from my knowledge of Machine Learning is that there is no golden model and parameters set to solve all problems. Data Scientist will try to use different models, compare them and tune parameters based on experimentation (aka trial and error).

OAC allows us to create an initial Dataflow, select a model, set the parameters then save the Dataflow and model output.  Then restart by opening the Dataflow changing the model or the parameters and storing the artefacts with different names to compare them.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

After creating one or more Models, it's time to evaluate them, on OAC we can select a Model and Click on Inspect. In the Overview tab, Inspect shows the model description and properties. Far more interesting is the Quality tab which provides a set of Model scoring metrics based on the test dataset created following the Train Partition Percent parameter. In case of a Numeric Prediction problem, the Quality tab will show for each model quality metrics like the Root Mean Squared Error. OAC will provide similar metrics no matter which ML algorithm you're implementing, making the analysis and comparison easy.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

In the case of Classification, the Quality Tab will show the confusion matrix together with some pre-calculated metrics like Precision, Recall etc.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

The model selection then becomes an optimization problem for the metric (or set of) we picked during the problem definition (see TEP in the previous post). After trying several models, parameters, features, we'll then choose the model that minimizes the error (or increase the accuracy) of our prediction.

Note: as part of the model training, it's very important to select which columns will be used for the prediction. A blind option is to use all columns but adding irrelevant columns isn't going to provide better results and, for big or wide (huge number of columns) datasets, it becomes computationally very expensive. As written before, the Explain function provides the list of columns that represent statistically significant predictors. The columns listed there should represent the basics of the model training.

Ok, part II done, we saw how to perform Feature Engineering and Model Training and Evaluation, check my next post for the final piece of the Data Science journey: Predictions and final considerations!

Edit: Link to Part III of the series.

Categories: BI & Warehousing

Democratize Data Science with Oracle Analytics Cloud

Mon, 2019-04-08 08:37
Democratize Data Science with Oracle Analytics Cloud

In the last few weeks, I had the chance to speak both at Analytics and Data Summit held in Oracle HQ in San Francisco and OUG Norway Spring Conference 2019 on a wavy cruise between Oslo and Kiel. The underlying topic of the presentations was just one: demonstrate how Oracle Analytics Cloud can be used to bridge the gap between Business Analysts and Data Science.

Imagine: you only need to connect to the data, and the tool starts giving you suggestions on enrichments and transformations, important drivers, segments and anomalies, and an easy way to create machine learning predictions. Well, that's Oracle Analytics Cloud!

Democratize Data Science with Oracle Analytics Cloud

But let's start from the beginning, let's define the problem.

Problem Definition

Let's try to dive into the subject by answering a few questions.

What is Data Science? Why should I be curious about that?

Data science, as per Wikipedia, is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

Why should you be interested? Well, I'm not going to reply to this question directly, I leave it to Gartner to do that. In a recent publication, they claimed that Artificial Intelligence implementation adoption has growth by 270% in the last four years with companies implementing it raising from 10% to 37%. Data Science represents the basics of Artificial Intelligence, the spread of successful AI projects is directly related to the correct application of Data Science principles to raw or curated datasets. Again, as Gartner writes "By 2023, AI and deep-learning techniques will be the most common approaches for new applications of Data Science"

I'm a Business Analyst, I'm an expert in my field, why do I need Data Science?

I see three main drivers to answer this question:

  • Complexity of the data: nowadays we are working on top of huge volumes of highly dimensional datasets. Traditional tools (like Excel) and methods (visual analysis) can't fulfil anymore the analytical needs: data is too complex to be extracted, manipulated and analyzed manually. Thus Data Science techniques and Machine Learning need to be used in the analytical routines to automate parts of the daily workflow of insights generation.
  • Personal Bias: every person has a history that serves as the basis for the daily work. What if your prebuilt knowledge actually poses an obstacle to new insights discovery? With Data Science you can discover what is statistically significant in the data versus what you think it's significant.
  • Future Work Opportunities: I leave this driver to an article of insidebigdata.com. The Data Scientist skillset is rare and there is a boom in demand. What are you waiting for?

I'm a Business Analyst, can't I just rebrand myself as Data Scientist?

There is no single definition of a Business Analyst and Data Scientist, it hugely depends both on prior knowledge and study as well as the context where the person is working on. However, there have been multiple attempts of defining Data Scientist skills over the years, with the below Venn diagram being quite famous, coming from drewconway.com.

Democratize Data Science with Oracle Analytics Cloud

The diagram is already showing some common points and differences between the general idea of Business Analyst and the Data Scientist:

  • Substantive Expertise: this skillset is a must-have in both roles. Even if Data Science standard techniques can be applied in every context, domain knowledge is required to gain better insights into the data.
  • Math & Statistic Knowledge: this is the first difference, the math used in Data Science is probably more advanced than the one used in the day to day job as Business Analysts. As per datascience.com's suggestion, #1 step to go from a Data Analyst to a Data Scientist is to refresh statistics.
  • Hacking Skills: this is where the biggest difference is, Data Scientist mostly uses scripting languages like R and Python which are not traditionally included in the skillset of a Business Analyst who instead uses visual tools.

Summarizing, there is quite some difference between the skillset traditionally associated with a Business Analyst and the Data Scientist one. A simple "rebranding" is not enough.

How can I empower my Business Analysts with Data Science Skills?

This is what it's called Data Science Democratization: using tools and techniques to lower the barrier to Data Science!

One of the drivers of Data Science Democratization is Augmented Analytics! Gartner defines Augmented Analytics as the Future of Data Analytics, but what is it?

Augmented Analytics

It's the concept of using Machine Learning to automate some steps of data preparation, insights discovery and sharing. Those steps are included in the Analytical platform already used by Business Analysts.

Let's see an example: the typical task of a Business Analyst is to find the significant drivers of Sales. With the usual flow based on BA's prior knowledge: the experience was driving the analysis and thus insight generation. However, how we discussed before, this brings a personal BIAS that could stop BA from understanding new and hidden patterns. What if the analytical platform itself suggests, based on the dataset, what are the statistically significant drivers? Those drivers can then be verified by the Business Analyst and maybe generate new insights.

Democratize Data Science with Oracle Analytics CloudLow-code Machine Learning

The second trend in Data Science Democratization is empowering Business Users to create low-code Machine Learning models, in the same GUI-driven tool they use for their day-to-day job.

Let's see another example: if a Business Analyst wants to classify a client based on the fact that he'll accept an offer. With no prior knowledge, he starts from a 50/50 position (50% chance he will say yes or no). The traditional approach was either personal-experience based (with the related Bias problem) or the involvement of a Data Scientist. What if, instead, with a series of simple, visual and repeatable steps he could create an ML model correctly predicting 70% of the customers accepting the offer? The model will not be perfect but still giving the option to focus the calls to the clients more likely to accept the offer, generating potentially a 20% increase in sales.

Democratize Data Science with Oracle Analytics CloudHow Oracle Analytics Cloud Bridges the Gap to Data Science?

Oracle Analytics Cloud offers a series of features enabling Data Science, I've been blogging about them in 2018 when they were first published.  Let's see now how a Data Science process could be executed with OAC.

Democratize Data Science with Oracle Analytics CloudStep #0: Problem Definition

The basic step in Data Science (not in the picture above) actually doesn't require any tool: it's the Problem Definition. There is no general purpose Data Science! We need to define precisely what problem we're going to solve. A good approach is to define using TEP: Task, Experience and Performance.

  • Task represents the problem we're trying to solve: Classify Spam/Not-Spam, predict the temperature in the room
  • Experience represents the historical background that we'll use: Corpus of emails already classified as Spam/Not-Spam, history of hourly room temperatures
  • Performance is the metric we are using to understand how good/bad our prediction is. There is a multitude of metrics we can use, depending on the type of problem we are trying to solve. The metric(s) selection will guide the decisions in our Data Science process.
Democratize Data Science with Oracle Analytics CloudStep #1: Data Connection

This is the first step in the tool itself, and, I believe, very familiar for any Business Analyst working with Oracle Analytics Cloud. OAC provides users with the ability to connect to:

  • Curated Subject Areas coming from a prebuilt repository in the hands of the IT department where data is sourced, modelled and exposed following precise business logic and KPI definitions. The importance of a central curated data repository is detailed here
  • Raw Data Sources: do you have data in Excel, MongoDB, Kafka, Dropbox? No problem, OAC allows you to connect to it and start analysing within few clicks! All you need is specifying the connection URL, credentials and parameters.
  • Both: do you want to mix curated KPIs with data coming from a variety of datasets? OAC covers this case by allowing you to mash them up!
Democratize Data Science with Oracle Analytics CloudStep #2: Clean

Data cleaning, I guess many Business Analysts are already familiar with this. What are, on the other hand, the typical cleaning steps a data scientist performs?

Democratize Data Science with Oracle Analytics Cloud

The usual steps are the following

  • Handle Missing Values: What does a missing value mean? How should I treat it?  Understanding and handling missing values are fundamental since it could enhance our comprehension of the dataset. Should I remove samples with missing values? Should I substitute them with a default value? The response depends on the value meaning.
  • Correcting Wrong Values: Free Text, easy to implement, a nightmare to fix! In the example above the same content, MARK is written in two different ways (capital and lower case A). For a human eye, the two values are the same, but for a computer, they'll belong to two different classes. If we want to optimise the output of our model, we should optimise the inputs by correcting the wrong values.
  • Remove Irrelevant Observation: If we are trying to do predictions on the city of Milan in Italy, are the observations of Rome relevant? Or the ones from Milan in Michigan, USA? If not we should remove them because the following Machine Learning model a) could learn behaviours which are not relevant  b) the related process will run faster since it will need to handle less data
  • Labelling Columns: especially when connecting to raw datasources we can't expect to have proper naming of the columns. E.g. Col1 is containing ClientName. The Machine Learning Tools will probably don't care about column names, but you do in order to explain their behaviour.
  • Handling Outliers: outliers represent corner cases of our dataset. Is the CIO salary relevant in our dataset if we are trying to predict a person salary based on the years of experience in the company? Well, probably not since the CIO salary will be far away from any other salary. The images below show the same predictive model: trying to fit all salaries including the CIO one (10 Years Experience) in A, and excluding the CIO salary in B. We can clearly see that excluding the corner case in our dataset improves the fitting of our predicting model.
Democratize Data Science with Oracle Analytics Cloud
  • Feature Scaling: Some of the predictive models in data science try to optimise the Euclidean distance (the distance between the predicted points and the actual points). If we have features on different scales (e.g. the number of bedrooms in a house and house price) those models will try to optimise much better the feature having a larger scale since the Euclidean distance gain will be much higher than when optimising the feature having a smaller scale.
  • Aggregations: most of the times the dataset we'll face with will have a different granularity to the one we intend to work with, just imagine working with a website clickstream trying to predict customer buying. Probably we'll want to aggregate the clickstream at session or user level and count the number of clicks, the time within the session or between sessions.
  • Train/Test Split: it's very important to test our prediction against unseen data in order to avoid overfitting. The dataset division in train/test can be done randomly or based on specific columns (Ids, timestamps) depending on the problem we are facing.

How do we perform the above steps with OAC? In the OAC bundle, there is a component called Dataflow (described here), providing ETL capabilities in the same visual platform where we visualize the data.

Democratize Data Science with Oracle Analytics Cloud

DataFlows provides a variety of steps including:

  • CASE WHEN...THEN... statements useful to handle missing values
  • FILTERS to remove irrelevant observations and outliers and split train and test datasets.
  • COLUMN MANAGEMENT to rename, aggregate, combine, write simple transformations and aggregations useful in features scaling and error corrections.

An important topic during the data cleaning is the Outlier Identification for which we have several options with OAC. The basic one is by plotting the data and visually selecting the outliers.

Democratize Data Science with Oracle Analytics Cloud

OAC also offers the option to automatically label points a Outliers/Not Outliers.

Democratize Data Science with Oracle Analytics Cloud

The third option OAC offers is via the Explain option: by selecting any column in the dataset and clicking on Explain OAC, it provides information about the column itself which can be useful for spotting outliers. Like in the example below we can clearly see NULL values (on the right side of the chart) and odd values in the range 0-25 where most of the values are in the range 80-100.

Democratize Data Science with Oracle Analytics Cloud

So far we understood how OAC helps out in the first three steps in our data science path: objective definition, data connection and cleaning. Stay tuned for the next blog posts covering feature engineering, model creation & evaluation and prediction!

Edit: Links to Part II and III of the series.

Categories: BI & Warehousing

Oracle Analytics Cloud (OAC) training with Rittman Mead

Tue, 2019-03-19 11:36

Rittman Mead have today launched it's new Oracle Analytics Cloud (OAC) Bootcamp. Run on OAC, the course lasts four days and covers everything you need to know in order to manage your Cloud BI platform and assumes no prior knowledge up-front.

As the course is modular, you are able to choose which days you'd like to attend. Day 1 covers an OAC overview, provisioning, systems management, integration and security. Day 2 covers RPD Modelling and Data Modeller. Day 3 is devoted to creating reports, dashboards, alerts and navigation. Day 4 covers content creation using Oracle Data Visualization.

Book here: https://www.rittmanmead.com/training-schedule/

Got a team to train? You can also have our OAC Bootcamp delivered on-site at your location. For more information and prices contact training@rittmanmead.com

Categories: BI & Warehousing

Spark Streaming and Kafka, Part 3 - Analysing Data in Scala and Spark

Tue, 2019-03-12 10:50

In my first two blog posts of the Spark Streaming and Kafka series - Part 1 - Creating a New Kafka Connector and Part 2 - Configuring a Kafka Connector - I showed how to create a new custom Kafka Connector and how to set it up on a Kafka server. Now it is time to deliver on the promise to analyse Kafka data with Spark Streaming.

When working with Apache Spark, you can choose between one of these programming languages: Scala, Java or Python. (There is also support for Spark querying in R.) Python is admittedly the most popular, largely thanks to Python being the most popular (and easiest to learn) programming language from the selection above. Python's PySpark library is catching up with the Spark features available in Scala, but the fact that Python relies on dynamic typing, poses challenges with Spark integration and in my opinion makes Spark a less natural fit with Python than with Scala.

Spark and Scala - the Basics

Spark was developed in Scala and its look and feel resembles its mother language quite closely. In fact, before diving into Spark Streaming, I am tempted to illustrate that for you with a small example (that also nicely recaptures the basics of Spark usage):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SparkTellDifference extends App {

    // set up Spark Context
    val sparkSession = SparkSession.builder.appName("Simple Application").config("spark.master", "local[*]").getOrCreate()
    val sparkContext = sparkSession.sparkContext

    // step 0: establish source data sets
    val stringsToAnalyse: List[String] = List("Can you tell the difference between Scala & Spark?", "You will have to look really closely!")
    val stringsToAnalyseRdd: RDD[String] = sparkContext.parallelize(stringsToAnalyse)

    // step 1: split sentences into words
    val wordsList: List[String]   = stringsToAnalyse    flatMap (_.split(" "))
    val wordsListRdd: RDD[String] = stringsToAnalyseRdd flatMap (_.split(" "))

    // step 2: convert words to lists of chars, create (key,value) pairs.
    val lettersList: List[(Char,Int)]   = wordsList    flatMap (_.toList) map ((_,1))
    val lettersListRdd: RDD[(Char,Int)] = wordsListRdd flatMap (_.toList) map ((_,1))

    // step 3: count letters
    val lettersCount: List[(Char, Int)] = lettersList groupBy(_._1) mapValues(_.size) toList
    val lettersCountRdd: RDD[(Char, Int)] = lettersListRdd reduceByKey(_ + _)

    // step 4: get Top 5 letters in our sentences.
    val lettersCountTop5: List[(Char, Int)] = lettersCount sortBy(- _._2) take(5)
    val lettersCountTop5FromRdd: List[(Char, Int)] = lettersCountRdd sortBy(_._2, ascending = false) take(5) toList

    // the results
    println(s"Top 5 letters by Scala native: ${lettersCountTop5}")
    println(s"Top 5 letters by Spark: ${lettersCountTop5FromRdd}")

    // we are done

The code starts by setting up a Spark Session and Context. Please note that Spark is being used in local mode - I do not have Spark nodes installed in my working environment. With Spark Context set up, step 0 is to establish data sources. Note that the Spark RDD is based on the Scala native List[String] value, which we parallelize. Once parallelized, it becomes a Spark native.

Step 1 splits sentences into words - much like we have seen in the typical Spark word count examples. Step 2 splits those word strings into Char lists - instead of words, let us count letters and see which letters are used the most in the given sentences. Note that Steps 1 and 2 look exactly the same whilst the first one is Scala native whereas the second works with a Spark RDD value. Step 2 ends with us creating the familiar (key,value) pairs that are typically used in Spark RDDs.

Step 3 shows a difference between the two - Spark's reduceByKey has no native Scala analogue, but we can replicate its behaviour with the groupBy and mapValues functions.

In step 4 we sort the data sets descending and take top 5 results. Note minor differences in the sortBy functions.

As you can see, Spark looks very Scala-like and you may have to look closely and check data types to determine if you are dealing with Scala native or remote Spark data types.

The Spark values follow the typical cycle of applying several transformations that transform one RDD into another RDD and in the end the take(5) action is applied, which pulls the results  from the Spark RDD into a local, native Scala value.

Introducing Spark Streaming

A good guide on Spark Streaming can be found here.

A quick overview of Spark Streaming with Kafka can be found here, though it alone will unlikely be sufficient to understand the Spark Streaming context - you will need to read the Spark Streaming guide as well.

Working with Spark streams is mostly similar to working with regular RDDs. Just like the RDDs, on which you apply transformations to get other immutable RDDs and then apply actions to get the data locally, Spark Streams work similarly. In fact, the transformation part looks exactly the same - you apply a transformation on a Discretized Stream (DStream) to get another DStream. For example, you can have a val words: DStream[String] that represents a stream of words. You can define another DStream with those same words in upper case as

val wordsUpper: DStream[String] = words map (_.toUpperCase)

Note that both these values represent streams - data sources where new data production might be ongoing. So if you have an incoming stream of words, you can define another data stream of the same words but in upper case. That includes the words not yet produced into the stream.

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html

(If the values words were an RDD, the wordsUpper calculation would look almost the same: val wordsUpper: RDD[String] = words map (_.toUpperCase).) However, DStreams and RDDs differ when it comes to getting the data locally - for RDDs you call actions, for DStreams it is a bit more complicated. But... let us start from the beginning.

Setting up Spark Streaming

Much like a Spark Session and Context, Spark Streaming needs to be initialised.

We start by defining Spark Config - much like for SparkSession in the simple Spark example, we specify the application name and define the nodes we are going to use - in our case - local nodes on my developer workstation. (The asterisk means that Spark can utilise all my CPU threads.)

val sparkConfig = 
  new SparkConf().setMaster("local[*]").setAppName("SparkKafkaStreamTest")

The next step is creating a Spark StreamingContext. We pass in the config defined above but also specify the Spark Streaming batch interval - 1 minute. This is the same as the production interval by our Connector set up in Kafka. But we could also define a 5 minute batch interval and get 5 records in every batch.

val sparkStreamingContext = new StreamingContext(sparkConfig, Minutes(1))

Before we proceed, we would like to disable the annoying INFO messages that Spark likes to flood us with. Spark log level is set in Spark Context but we do not have SparkContext defined, do we? We only have StreamingContext. Well, actually, upon the creation of a StreamingContext, SparkContext is created as well. And we can access it via the StreamingContext value:


That is the Spark Streaming Context dealt with.

Setting up Access to Kafka

Setting up access to Kafka is equally straightforward. We start by configuring Kafka consumer:

val kafkaConfig = Map[String, Object](
    "client.dns.lookup" -> "resolve_canonical_bootstrap_servers_only",
    "bootstrap.servers" -> "",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "kafkaSparkTestGroup",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)

The parameters given here in a Scala Map are Kafka Consumer configuration parameters as described in Kafka documentation. is the IP of my Kafka Ubuntu VM.

Although I am referring to my Kafka server by IP address, I had to add an entry to the hosts file with my Kafka server name for my connection to work: kafka-box

The client.dns.lookup value did not have an impact on that.

The next step is specifying an array of Kafka topics - in our case that is only one topic - 'JanisTest':

val kafkaTopics = Array("JanisTest")
Getting First Data From Kafka

We are ready to initialise our Kafka stream in Spark. We pass our StreamingContext value, topics list and Kafka Config value to the createDirectStream function. We also specify our LocationStrategy value - as described here. Consumer Strategies are described here.

val kafkaRawStream: InputDStream[ConsumerRecord[String, String]] =
        KafkaUtils.createDirectStream[String, String](
            ConsumerStrategies.Subscribe[String, String](kafkaTopics, kafkaConfig)

What gets returned is a Spark Stream coming from Kafka. Please note that it returns Kafka Consumer record (key,value) pairs. The value part contains our weather data in JSON format. Before we proceed with any sort of data analysis, let us parse the JSON in a similar manner we did JSON parsing in the Part 1 of this blog post. I will not cover it here but I have created a Gist that you can have a look at.

The weatherParser function converts the JSON to a WeatherSchema case class instance - the returned value is of type DStream[WeatherSchema], where DStream is the Spark Streaming container:

val weatherStream: DStream[WeatherSchema] = 
   kafkaRawStream map (streamRawRecord => weatherParser(streamRawRecord.value))

Now our data is available for nice and easy analysis.

Let us start with the simplest - check the number of records in our stream:

val recordsCount: DStream[Long] = weatherStream.count()

The above statement deserves special attention. If you have worked with Spark RDDs, you will remember that the RDD count() function returns a Long value instead of an RDD, i.e. it is an action, not a transformation. As you can see above, count() on a DStream returns another DStream, instead of a native Scala long value. It makes sense because a stream is an on-going data producer. What the DStream count() gives us is not the total number of records ever produced by the stream - it is the number of records in the current 1 minute batch. Normally it should be 1 but it can also be empty. Should you take my word for it? Better check it yourself! But how? Certainly not by just printing the recordsCount value - all you will get is a reference to the Spark stream and not the stream content.

Displaying Stream Data

Displaying stream data looks rather odd. To display the recordsCount content, you need the following lines of code:



sparkStreamingContext.start() // start the computation
sparkStreamingContext.awaitTermination() // await termination

The DStream value itself has a method print(), which is different from the Scala's print() or println() functions. However, for it to actually start printing stream content, you need to start() stream content computation, which will start ongoing stream processing until terminated. The awaitTermination() function waits for the process to be terminated - typically with a Ctrl+C. There are other methods of termination as well, not covered here. So, what you will get is recordsCount stream content printed every batch interval (1 minute in our example) until the program execution is terminated.

The output will look something like this, with a new record appearing every minute:

Time: 1552067040000 ms

Time: 1552067100000 ms

Time: 1552067160000 ms

Notice the '...' between the recordsCount.print() and the stream start(). You can have DStream transformations following the recordsCount.print() statement and other DStream print() calls before the stream is started. Then, instead of just the count, you will get other values printed for each 1 minute batch.

You can do more than just print the DStream content on the console, but we will come to that a bit later.

Analysing Stream Data

Above we have covered all the basics -  we have initialised Spark Context and Kafka access, we have retrieved stream data and know how how to set up ongoing print of the results for our Stream batches. Before we proceed with our exploration, let us define a goal for our data analysis.

We are receiving a real-time stream of weather data. What we could analyse is the temperature change dynamics within the past 60 minutes. Note that we are receiving a new batch every minute so every minute our 60 minute window will move one step forward.

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html

What we have got is our weatherStream DStream value. First let us define a Stream window of 60 minutes (check Spark documentation for explanation on how Stream Windows work.)

val weatherStream1Hour: DStream[WeatherSchema] = weatherStream.window(Minutes(60))

The WeatherSchema case class contains many values. But all we need for our simple analysis is really just the timestamp and temperature. Let us extract just the data we need and put it in a traditional RDD (key,value) pair. And we print the result to verify it.

val weatherTemps1Hour: DStream[(String, Double)] = 
   weatherStream1Hour map (weatherRecord => 
     (weatherRecord.dateTime, weatherRecord.mainTemperature) 


Please note that the above code should come before the sparkStreamingContext.start()call.

The result we are getting looks something like this:

Time: 1552068480000 ms
(08/03/2019 16:57:27,8.42)
(08/03/2019 16:57:27,8.42)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)

Notice the ellipse at the end. Not all records get displayed if there are more than 10.

Of course, we will get a new result printed every minute. However, the latest results will be at the bottom, which means they will be hidden if there are more than 10 of them. Also note that the weather data we are getting is actually not refreshed every minute but more like every 10 minutes. Our 1 minute batch frequency does not represent the actual frequency of weather data updates. But let us deal with those problems one at a time.

For me, vanity always comes first. Let me convert the (key,value) pair output to a nice looking narrative.

val weatherTemps1HourNarrative = weatherTemps1Hour map { 
  case(dateTime, temperature) => 
    s"Weather temperature at ${dateTime} was ${temperature}" 


The result:

Time: 1552068480000 ms
Weather temperature at 08/03/2019 16:57:27 was 8.42
Weather temperature at 08/03/2019 16:57:27 was 8.42
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38

We are still limited to the max 10 records the DStream print() function gives us. Also, unless we are debugging, we are almost certainly going to go further than just printing the records on console. For that we use the DStream foreachRDD function, which works similar to the map function, but does not return any data. Instead, whatever we do with the Stream data - print it to console, save it into a CSV file or database - that needs to take place within the foreachRDD function.

The foreachRDD Function

The foreachRDD function accepts a function as a parameter, which receives as its input a current RDD value representing the current content of the DStream and deals with that content in the function's body.

Ok, at long last we are getting some results back from our Spark stream that we can use, that we can analyse, that we know how to deal with! Let us get carried away!

weatherTemps1Hour foreachRDD { currentRdd =>
  println(s"RDD content:\n\t${currentRdd.collect().map{case(dateTime,temperature) => s"Weather temperature at ${dateTime} was ${temperature}"}.mkString("\n\t")}")

  val tempRdd: RDD[Double] = currentRdd.map(_._2)

  val minTemp = if(tempRdd.isEmpty()) None else Some(tempRdd.min())
  val maxTemp = if(tempRdd.isEmpty()) None else Some(tempRdd.max())

  println(s"Min temperature: ${if(minTemp.isDefined) minTemp.get.toString else "not defined"}")

  println(s"Max temperature: ${if(maxTemp.isDefined) maxTemp.get.toString else "not defined"}")

  println(s"Temperature difference: ${if(minTemp.isDefined && maxTemp.isDefined) (maxTemp.get - minTemp.get).toString}\n")

Here we are formatting the output and getting min and max temperatures within the 60 minute window as well as their difference.  Let us look at the result:

RDD content:
	Weather temperature at 08/03/2019 16:57:27 was 8.42
	Weather temperature at 08/03/2019 16:57:27 was 8.42
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 18:01:10 was 7.99
	Weather temperature at 08/03/2019 18:01:10 was 7.99
	Weather temperature at 08/03/2019 18:01:10 was 7.99
Min temperature: 7.99
Max temperature: 8.42
Temperature difference: 0.4299999999999997

(Now there is no Time: 1552068480000 ms signature in our results printout because we are no longer using the DStream print() function).

However, I would like my analysis to be more detailed. It is time to involve Spark DataFrames.

Kafka Stream Data Analysis with Spark DataFrames

Just like in the previous statement, I need to extract Stream data with the currentRDD function. In fact, all the code that follows will be within the currentRDD function block:

weatherStream1Hour foreachRDD { currentRdd => {
... // the following code comes here

First, let us create a DataFrame from an RDD. Spark RDDs and DataFrames are two quite different representations of distributed data. And yet - look how simply the conversion works:

val spark =
import spark.implicits._

val simpleDF: DataFrame = currentRdd.toDF()

This trick works because our weatherStream1Hour DStream and consequently the currentRdd value that represents the Stream content, are based on the WeatherSchema case class. (data types - weatherStream1Hour: DStream[WeatherSchema] and currentRdd: RDD[WeatherSchema].) Therefore the currentRdd.toDF() implicit conversion works - Spark understands Scala case classes.

Once we have the DataFrame created, we can create a Temp view so we can query this DF with Spark SQL - that is what the createOrReplaceTempView function is for.

Let us start with the simplest queries - query the count(*) and the full content of the DataFrame:

val countDF: DataFrame = spark.sql("select count(*) as total from simpleDF")

val selectAllDF = spark.sql("select * from simpleDF")

The result:

DataFrame's show() function by default only shows 20 rows, but we can always adjust that to show more or less. However, as we had established earlier in our analysis, the weather data actually does not get updated every minute - we are getting lots of duplicate records that we could get rid of. It is easy with SQL's distinct:

val distinctDF = spark.sql("select distinct * from simpleDF")

The result - only 7 distinct weather measurements, confirming our suspicion that we are only getting a weather update approximately every 10 minutes:

Let us go back to our analysis - temperature change dynamics within the past 60 minutes. The temperature value in our DataFrame is named 'mainTemperature'. But where is the timestamp? We did have the dateTime value in our RDD. Why is it missing from the DataFrame? The answer is - because dateTime is actually a function. In RDD, when we referenced it, we did not care if it is a value or a function call. Now, when dealing with DataFrames, it becomes relevant.

As can be seen in the Gist, dateTime is a function, in fact it is a WeatherSchema case class method and is calculated from the dt value, which represents time in Unix format. The function that performs the actual conversion - dateTimeFromUnix - is defined in the WeatherParser object in the same Gist. If we want to get the save dateTime value in our DataFrame, we will have to register a Spark User Defined Function (UDF).

Creating a Spark User Defined Function (UDF)

Fortunately, creating UDFs is no rocket science - we do that with the Spark udf function. However, to use this function in a Spark SQL query, we need to register it first - associate a String function name with the function itself.

val dateTimeFromSeconds: Double => String = WeatherParser.dateTimeFromUnix(_)
val dateTimeFromSecondsUdf = udf(dateTimeFromSeconds)
spark.udf.register("dateTimeFromSecondsUdf", dateTimeFromSecondsUdf) // to register for SQL use

Now let us query the temperature and time:

val tempTimeDF = spark.sql(
   "select distinct 
      dt timeKey, 
      dateTimeFromSecondsUdf(dt) temperatureTakenTime, 
      mainTemperature temperatureCelsius 
    from simpleDF order by timeKey"


We show the results but also register the resulting DataFrame as a Temp View so we can from now on reference it in Spark SQL.

Note that we are converting the dt value to a String timestamp value but also keeping the original dt value - because dt is a number that can be sorted chronologically whereas the String timestamp cannot.

The result looks like this:

More Kafka Stream Data Analysis with Spark DataFrames

Now we have the times and the temperatures. But we want to see how temperatures changed between measurements. For example, between the two consecutive measurements at 17:06 and 17:17 the temperature (in London) dropped from 8.38 to 8.28 degrees Celsius. We want to have that value of minus 0.1 degrees in our result set.

If we were using Oracle database, the obvious choice would be the LAG analytic function. Do we have an analogue for that in Spark? Yes, we do! However, this time, instead of using Spark SQL, we will use the withColumn DataFrame function to define the LAG value:

val lagWindow = org.apache.spark.sql.expressions.Window.orderBy("timeKey")
val lagDF = tempTimeDF
  .withColumn("temperatureCelsiusPrev", lag("temperatureCelsius", 1, 0).over(lagWindow))
  .withColumn("rownum", monotonically_increasing_id())


The result:

Here we are actually adding two values - lag and rownum, the latter being an analogue to the Oracle SQL ROW_NUMBER analytic function.

Note the inputs for the Spark lag function: The first is the source column name, the second is the lag offset and the third is default value - 0. The default value in our case will mean zero degrees Celsius, which will mess up our temperature delta for the first temperature measurement. Fortunately, Spark SQL also supports the CASE function so we can deal with this challenge with ease. In addition, let us round the result to get rid of the floating point artefacts.

val tempDifferenceDF = spark.sql(
		temperatureCelsius - temperatureCelsiusPrev temperatureDiff, 
			WHEN (rownum = 0) 
			THEN 0 
			ELSE temperatureCelsius - temperatureCelsiusPrev 
		END, 2) AS temperatureDiffAdjusted 
	from tempTimeWithPrev")


And the result:


Kafka stream data analysis with Spark Streaming works and is easy to set up, easy to get it working. In this 3-part blog, by far the most challenging part was creating a custom Kafka connector. Once the Connector was created, setting it up and then getting the data source working in Spark was smooth sailing.

One thing to keep in mind when working with streams is - they are different from RDDs, which are static, immutable data sources. Not so with DStreams, which by their nature are changing, dynamic.

The challenging bit in the code is the

sparkStreamingContext.start() // start the computation
sparkStreamingContext.awaitTermination() // await termination

code block and its interaction with the foreachRDD function  - to somebody not familiar with how Spark Streaming works, the code can be hard to understand.

The ease of creating a DataFrame from the original RDD was a pleasant surprise.

So, is using Spark and Kafka with Scala a good idea? Definitely yes. It works like a charm. However, in real life, additional considerations like the availability and cost of Python vs Scala developers as well as your existing code base will come into play. I hate real life.

Categories: BI & Warehousing

Rittman Mead at Analytics and Data Summit 2019

Tue, 2019-03-05 09:49
Rittman Mead at Analytics and Data Summit 2019

The Analytics and Data Summit 2019 (formerly known as BIWA) is happening next week in Oracle HQ in Redwood Shores. I'm excited to participate since it's one of the best conferences for the Oracle Analytics crowd where you can get three days full of content from experts as well as hints on products future developments directly from the related Product Managers!

I'll be representing Rittman Mead in two sessions:

This two-hour workshop will cover all the details of OAC: Product Overview, Instance Creation and Management, Moving from on-prem OBIEE to OAC, Data Preparation and Data Visualization, Advanced Analytics.  With interactive labs where participants can experience Data Visualization and Data Flows.

Rittman Mead at Analytics and Data Summit 2019

Become a Data Scientist with OAC! This session will explain how Oracle Analytics Cloud acts as an  enabler for the transformation from a Data Analyst to a Data Scientist. Connection to the Data, Cleaning, Transformation, and Analysis will be the intermediate steps before training of several machine learning models which then will be evaluated and used to predict outcomes on unseen data. With a demo showing all the steps in a real example based on a wine dataset!

Rittman Mead at Analytics and Data Summit 2019

There is a full list of all sessions here. You can follow the conference on twitter with the hashtag #AnDSummit2019, and I'll be tweeting about it too as @ftisiot.

The presentations that I'm delivering will be available to download on speakerdeck.

Categories: BI & Warehousing

Spark Streaming and Kafka - Configuring a Kafka Connector

Wed, 2019-02-27 05:14

In my previous blog post, I covered the development of a custom Kafka Source Connector, written in Scala. This blog post is about deploying that Connector. We are getting closer to analysing the stream data in Spark - as promised in the title.

If you are installing a Connector from Confluent Hub, instead of installing a custom one, please follow this guide instead.

Setting up Kafka Server

This blog post does not cover setting up a Kafka instance. However, let me give a quick overview of the environment I am using:

  • OS: Ubuntu server 18.10.
  • Kafka version - at the time of writing, the latest version is 2.1.0. So that is the one I am using. Please note that the Kafka Connector API interface is slightly different for different Kafka versions. (In fact, early in my Connector development I saw Kafka throwing strange missing function errors. It turned out I was using a slightly older API for Kafka 2.0.0 instead of the one for 2.1.0. - the SourceConnector class's function `taskConfigs` had been renamed in the latest version.)
  • Java: OpenJDK 11.0.1. It seems the OpenJDK is preferred for Kafka over the Oracle Java JDK.

I got the Kafka install file from:

Setting up Kafka Topic

Confluent have an excellent user guide on how to set up Connectors. There is also a Quick Start guide in the Apache Kafka website, though it is much less detailed than the guide from Confluent.

The first step for me, once my Kafka instance is up and running, is to create a new topic for my Weather records. Let me call it 'JanisTest'. From Kafka root folder I execute:

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic JanisTest
$ bin/kafka-topics.sh --list --zookeeper localhost:2181

After creating it, I run the --list command to see that it is indeed created and available.

Install Connector

Copy jars from your Scala development environment to your Kafka instance. I chose to have a folder of jars, not one uber-jar with all library dependencies packaged in. But either approach should work fine.

$ cd ~/kafka
$ mkdir connectors
$ cd connectors
$ cp /VMSharedFolder/connectorJars/* .

Do not add the above folder to CLASSPATH! Instead, add it to the
connect-*.properties config files. In the later versions of Kafka, the server itself manages paths to plugins.

$ cd ~/kafka/config
$ vi connect-standalone.properties

We will run our connector in standalone mode, therefore the important file for us to edit is 'connect-standalone.properties'. But we can also edit
the 'connect-distributed.properties' config file if we ever decide to run Kafka on multiple nodes.

Add the following line to the connect-standalone.properties and connect-distributed.properties configuration files:


If you need to add several connector folders, the plugin.path value is comma-separated.

Configure Connector

Similar to the connect-standalone.properties file, our newly installed Source Connector also requires a configuration file. Let us create one:

$ cd ~/kafka/config
$ vi connect-http-source.properties

We specify the following configuration attributes in the config file - these attributes were defined in the `HttpSourceConnectorConstants` object we discussed in my previous blog post.


http.api.key=<YOUR API KEY>


  • name - name of the Connector.
  • http.url - the base URL for the HTTP API.
  • http.api.key - replace <YOUR API KEY> with the key you got when registering with OpenWeatherMap.
  • http.api.params - any additional parameters you would like to specify for your API request - as described in the API documentation. Follow the format name=value. In case you need to specify several attributes, separate them with a &.
  • service.name - you can name your service.
  • topic - give the topic name that was created previously, as described in this guide.
  • poll.interval.ms - the frequency at which the API will be queried. 60000ms = 1 minute. Because the weather does not change that quickly, this polling interval is frequent enough for us. The free API tier limits us to no more than 1 query per second.
  • tasks.max - the maximum number of concurrent tasks allowed. In our case it is 1.
  • connector.class - the full class path of the SourceConnector class we have implemented, which is described in my previous blog post.
Run Connector

Because we are only running the Connector in standalone mode, there is not much config to be done. We are ready to run it. To start our Source Connector, we run the connect-standalone.sh command:

$ cd ~/kafka
$ ./bin/connect-standalone.sh ./config/connect-standalone.properties ./config/connect-http-source.properties

The first argument passed to the connect-standalone.sh script is the worker properties configuration - only one configuration. It is followed by one or many (on our case - one) connector configuration.

Take a deep breath and hit Enter.

A long log output will follow. If you get any errors, the command will return to shell prompt. Some possible errors are missing classes - if that is the case, please check that you have successfully deployed all the jars and they are registered in the connect-standalone.properties file. Also, if any of the required configuration parameters are missing in the connect-http-source.properties file, you will get an error message.

If successful, the log listing will end with something like this:

[2019-02-19 14:14:55,294] INFO Kafka version : 2.1.0 (org.apache.kafka.common.utils.AppInfoParser:109)
[2019-02-19 14:14:55,297] INFO Kafka commitId : 809be928f1ae004e (org.apache.kafka.common.utils.AppInfoParser:110)
[2019-02-19 14:14:55,334] INFO Created connector http-source (org.apache.kafka.connect.cli.ConnectStandalone:104)
[2019-02-19 14:14:55,406] INFO Setting up an HTTP service for http://api.openweathermap.org/data/2.5/weather... (com.rittmanmead.kafka.connector.http.HttpSourceTask:40)
[2019-02-19 14:14:55,466] INFO Starting to fetch from http://api.openweathermap.org/data/2.5/weather each 60000ms... (com.rittmanmead.kafka.connector.http.HttpSourceTask:47)
[2019-02-19 14:14:55,469] INFO WorkerSourceTask{id=http-source-0} Source task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSourceTask:199)
[2019-02-19 14:14:56,560] INFO Http return code: 200 (com.rittmanmead.kafka.connector.http.HttpSourceTask:31)
[2019-02-19 14:14:56,569] INFO Weather Schema parser: JSON text to be parsed: {"coord":{"lon":-0.13,"lat":51.51},"weather":[{"id":802,"main":"Clouds","description":"scattered clouds","icon":"03d"}],"base":"stations","main":{"temp":10.09,"pressure":1019,"humidity":61,"temp_min":9,"temp_max":11},"visibility":10000,"wind":{"speed":4.6,"deg":250},"clouds":{"all":44},"dt":1550582400,"sys":{"type":1,"id":1414,"message":0.0038,"country":"GB","sunrise":1550559964,"sunset":1550597002},"id":2643743,"name":"London","cod":200} (com.rittmanmead.kafka.connector.http.HttpSourceTask:273)
[2019-02-19 14:14:57,481] INFO JSON parsed class content: WeatherSchema(Coord(-0.13,51.51),List(WeatherAtom(802.0,Clouds,scattered clouds,03d)),stations,Main(10.09,1019.0,61.0,9.0,11.0),10000.0,Wind(4.6,250.0),Clouds(44.0),1.5505824E9,Sys(1.0,1414.0,0.0038,GB,1.550559964E9,1.550597002E9),2643743.0,London,200.0) (com.rittmanmead.kafka.connector.http.HttpSourceTask:283)
[2019-02-19 14:14:57,499] INFO Got 1 results for CurrentWeatherData (com.rittmanmead.kafka.connector.http.HttpSourceTask:75)

Some of the above messages were generated by our Connector directly, like the last line above - 'Got 1 results for CurrentWeatherData'. Others are generated by Kafka itself.

Check Connector Output

Our ultimate goal is to analyse a Kafka Stream with Spark in Scala. However, that will have to wait till my next blog post. For now we can check if the topic is being populated. Please note that our polling interval is 1 minute - the topic will not be flooded with records right away.

$ cd ~/kafka
$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic JanisTest --from-beginning

If all is going well, you should see Weather records, generated according to our custom-defined schema:

Quite easy, was it not? Certainly setting up a Kafka connector is easier than developing one.

The next step will be analysing this data from Spark. But that will be my next blog post.

Categories: BI & Warehousing