BI & Warehousing

I gave up my cell phone & laptop for the weekend: This is what I learned

Look Smarter Than You Are - Mon, 2019-04-22 10:10
It was time for a technology detox. When I left work on Good Friday, I left my laptop at the office. I got home at 3PM and put my mobile phone on a charger that I wouldn't see until Monday at 9AM. And my life free of external, involuntary, technological distraction began... along with the stress of being out of touch for the next 3 days. Here's what I learned.

Biggest Lessons
  1. It's really stressful at first, but you get over it.
  2. All those people you told "if it's an emergency, contact my significant other" will not have any emergencies suitable for contacting your significant other.
  3. It will leave you wanting more.
I learned far more about myself and we'll get to that in a second.

Why in the name of God?
Thanks to the cruel "Screen Time" tracking feature of my Apple iPhone, I found that on the average day, I lift up my phone more than 30 times before 11AM and then it gets worse from there. In general, I am using my phone 6+ hours per day and many days are a lot worse. I pay more attention to my phone than the people around me: it's always within arm's reach and I use it for everything. As a CEO, my outward reason for my phone addiction is that I have to be connected: emails and text messages must be dealt with immediately and without my calendar, I might miss a Very Important Meeting. In reality, I am completely addicted to my cell phone and the whole "I have to stay connected" thing is largely rationalization.

But about a week ago, I looked around at the people in my life and realized that we're all addicted: for some of us, it's about communication. Others live in their games. Some people are on Instagram looking at puppies and kittens. Whatever your thing, you're getting it through either your phone or your laptop.

So why take a break? Mostly to find out 1) if I could make it for 42 hours; and 2) what I could learn from the experience. I settled on Easter weekend (April 19-22).

Things I thought I couldn't live without
Texting. According to the aforementioned Evil Screen Time, I knew that I spent 1.5 hours a day on text messaging. To be clear, I'm not a tween: my company uses text messaging more than any other communication vehicle, it's how I stay in contact with friends (who has time for phone calls?), and it's about the only way my kids will talk to me.

Email. While texting is great for short communications and quick back-and-forths, I get around 200 non-spam emails on the average day and about 50 on the average weekend. When you have something longer to say or it's not urgent, email is the way to go.

Navigation. I have long since forgotten how to drive without the little blue dot directing me. There are about four places I felt I could find on my own (work, home, airport, grocery store), but I was sure that I would be lost without Google Maps or Waze.

Games. I am level 40 on Pokemon Go (humble brag) and I have played it every day since July 2016. It's literally the only game on my phone, but I have to keep my daily streak going lest... I don't know, actually, but the stress of missing out on my 7-day rewards was seriously getting to me.

Turns out, I didn't miss Pokemon Go, I'm actually a decent driver without a phone (it's like falling off a bike: you never forget how), and if you're off email, you never know what you're missing. I did miss texting, but not in the way I thought I would. So what did I actually miss?

Things I actually missed
Bitmoji. I genuinely missed sending cute pictures around to my friends of me as the Easter Bunny and receiving their pictures dressed up inside Easter eggs. I kept wanting to sneak peeks at my wife's phone to see if she was getting anything cute, though I did manage to resist.

Information. I had forgotten the days when questions didn't have answers. What's the address of Academy Sports? I didn't know, so I just had to drive in the general area where I thought it was. What time does Salata open? No idea, so I drove there and got to wander outside for a bit until they opened for the day (fun fact: stores still post actual opening/closing hours on their front doors!). What time is the movie Little playing at the AMC Grapevine Mills 30? Who won the Texas Rangers game (when in doubt, assume it's the team they're playing against)? Who is the actor that plays that one character in that movie, oh, come on, you know who I'm talking about, that guy, let me just look it up for you, oh, damn, I can't until Monday, FML?

Calendar. I worried all weekend about my schedule for the upcoming week: when was my first appointment on Monday, what did I have scheduled for after work, was there anything I should be preparing for, when was I leaving town next, where was I supposed to be for Memorial Day weekend? It went on-and-on, and it turns out that none of it matters.

Photos. I didn't realize how many photos I take of the world around me, until I couldn't take any photos at all. I had to use a long-forgotten mental trick called "memory." It made me pay a lot more attention to the world around me, and I genuinely remember more of how I experienced the weekend than if I had been trying to catalog everything through pictures. I'm sure photos would have made this blog more appealing, but I'm doing all this from memory, so all we have are words.

Connection. I wanted to know what my friends and family were doing and to let them know I was thinking of them. Without technology, this is almost impossible nowadays. I had to resort to seeing them in-person: I met a couple of them at a restaurant and we got together with another friend for cycling, a movie, and Game of Thrones. But it turns out that those friends - the ones I spent time with in-person - I felt more deeply connected to than before the weekend started. Texting is about surface-level connecting, but facetime (note that this is different than FaceTime) is about bonding.

What changed over the weekend?
For one, I spent a lot more time outside. I played frisbee, went on a fourteen-mile bike ride, worked out at the gym, walked around some, went to the mall, saw a movie, and in general, I actually experienced more of the world than I normally do. I also didn't trip over a curb once, because unlike normal, I was looking up the whole time.

I read more instead of looking at my phone each night to fall asleep. I made it 100 pages into a book that I've been meaning to read for a year now. And in the morning I didn't reach for my phone on my bedside table either. I tend to forget how immersed you can get in a book when you don't have notifications popping up constantly telling you what you should be doing instead of reading in peace.

I spent a lot of time with my wife this weekend to the point that she was probably sick of me by Sunday night, but we spent real time with each other without any technological distractions. I finally gave her an Edward Break last night by heading off to take a long bath while reading more of my book (Stealing Snow, if you're curious). She fell asleep and I stayed up reading until midnight.

Any lasting effects?
I thought I would be longing for my phone and my laptop (particularly text and emails) at exactly 9AM this morning. I waited until 9AM and opened up my laptop to see what appointment I had at 9AM. It turns out no one needs me - or loves me? - until 10:30AM, so I opened up a browser window to write my first blog entry in many, many months. My cell phone is still face down, and as of 10AM, I still have no idea who texted or emailed me all weekend. I'm blissfully writing away, and I have to admit, I'm not looking forward to going back to my constantly-connected world.

Will giving up your technology addiction for a weekend give you some sort of mystical clarity, a purity of soul that lets you know how the Dalai Lama must feel when he's between text messages? No, but it will help you find out just how addicted you are, and how strong your willpower is. It'll help you understand what you're missing when you're disconnected, and if you're like me, you'll find that in some ways, you actually like it.

Now will I ever do this again? I'll let you know after I log into my email, read all my texts, and see just how bad the world got over the weekend. Until then, I'm blissfully unaware.
Categories: BI & Warehousing

Oracle Analytics Cloud (OAC) training with Rittman Mead

Rittman Mead Consulting - Tue, 2019-03-19 11:36

Rittman Mead have today launched its new Oracle Analytics Cloud (OAC) Bootcamp. Run on OAC, the course lasts four days, covers everything you need to know in order to manage your Cloud BI platform and assumes no prior knowledge.

As the course is modular, you are able to choose which days you'd like to attend. Day 1 covers an OAC overview, provisioning, systems management, integration and security. Day 2 covers RPD Modelling and Data Modeller. Day 3 is devoted to creating reports, dashboards, alerts and navigation. Day 4 covers content creation using Oracle Data Visualization.

Book here: https://www.rittmanmead.com/training-schedule/

Got a team to train? You can also have our OAC Bootcamp delivered on-site at your location. For more information and prices contact training@rittmanmead.com

Categories: BI & Warehousing

Spark Streaming and Kafka, Part 3 - Analysing Data in Scala and Spark

Rittman Mead Consulting - Tue, 2019-03-12 10:50

In my first two blog posts of the Spark Streaming and Kafka series - Part 1 - Creating a New Kafka Connector and Part 2 - Configuring a Kafka Connector - I showed how to create a new custom Kafka Connector and how to set it up on a Kafka server. Now it is time to deliver on the promise to analyse Kafka data with Spark Streaming.

When working with Apache Spark, you can choose between these programming languages: Scala, Java or Python. (There is also support for Spark querying in R.) Python is admittedly the most popular, largely thanks to Python being the most popular (and easiest to learn) programming language from the selection above. Python's PySpark library is catching up with the Spark features available in Scala, but the fact that Python relies on dynamic typing poses challenges with Spark integration and, in my opinion, makes Spark a less natural fit with Python than with Scala.

Spark and Scala - the Basics

Spark was developed in Scala and its look and feel resembles its mother language quite closely. In fact, before diving into Spark Streaming, I am tempted to illustrate that for you with a small example (that also nicely recaps the basics of Spark usage):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SparkTellDifference extends App {

    // set up Spark Context
    val sparkSession = SparkSession.builder.appName("Simple Application").config("spark.master", "local[*]").getOrCreate()
    val sparkContext = sparkSession.sparkContext
    sparkContext.setLogLevel("ERROR")

    // step 0: establish source data sets
    val stringsToAnalyse: List[String] = List("Can you tell the difference between Scala & Spark?", "You will have to look really closely!")
    val stringsToAnalyseRdd: RDD[String] = sparkContext.parallelize(stringsToAnalyse)

    // step 1: split sentences into words
    val wordsList: List[String]   = stringsToAnalyse    flatMap (_.split(" "))
    val wordsListRdd: RDD[String] = stringsToAnalyseRdd flatMap (_.split(" "))

    // step 2: convert words to lists of chars, create (key,value) pairs.
    val lettersList: List[(Char,Int)]   = wordsList    flatMap (_.toList) map ((_,1))
    val lettersListRdd: RDD[(Char,Int)] = wordsListRdd flatMap (_.toList) map ((_,1))

    // step 3: count letters
    val lettersCount: List[(Char, Int)] = lettersList groupBy(_._1) mapValues(_.size) toList
    val lettersCountRdd: RDD[(Char, Int)] = lettersListRdd reduceByKey(_ + _)

    // step 4: get Top 5 letters in our sentences.
    val lettersCountTop5: List[(Char, Int)] = lettersCount sortBy(- _._2) take(5)
    val lettersCountTop5FromRdd: List[(Char, Int)] = lettersCountRdd sortBy(_._2, ascending = false) take(5) toList

    // the results
    println(s"Top 5 letters by Scala native: ${lettersCountTop5}")
    println(s"Top 5 letters by Spark: ${lettersCountTop5FromRdd}")

    // we are done
    sparkSession.stop()
}

The code starts by setting up a Spark Session and Context. Please note that Spark is being used in local mode - I do not have Spark nodes installed in my working environment. With Spark Context set up, step 0 is to establish data sources. Note that the Spark RDD is based on the Scala native List[String] value, which we parallelize. Once parallelized, it becomes a Spark native.

Step 1 splits sentences into words - much like we have seen in the typical Spark word count examples. Step 2 splits those word strings into Char lists - instead of words, let us count letters and see which letters are used the most in the given sentences. Note that within Steps 1 and 2 the two lines look exactly the same, even though the first line works with a Scala native value whereas the second works with a Spark RDD value. Step 2 ends with us creating the familiar (key,value) pairs that are typically used in Spark RDDs.

Step 3 shows a difference between the two - Spark's reduceByKey has no native Scala analogue, but we can replicate its behaviour with the groupBy and mapValues functions.
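To make the reduceByKey comparison more concrete, here is a minimal sketch of how its behaviour could be reproduced on a local Scala collection with foldLeft. The reduceByKey helper below is hypothetical (it is not part of the Scala standard library or of Spark) and the sample string is made up for illustration.

```scala
object LocalReduceByKey {
  // Hypothetical helper (not part of Scala or Spark): merges the values of
  // equal keys with the supplied function, like reduceByKey does on an RDD.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(merge: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (key, value)) =>
      val merged = acc.get(key).map(merge(_, value)).getOrElse(value)
      acc.updated(key, merged)
    }

  // made-up sample data: one (letter, 1) pair per letter, spaces removed
  val letters: Seq[(Char, Int)] =
    "scala spark".toList.filter(_ != ' ').map(c => (c, 1))

  // s-c-a-l-a-s-p-a-r-k: 'a' occurs three times, 's' twice
  val counts: Map[Char, Int] = reduceByKey(letters)(_ + _)
}
```

Unlike groupBy followed by mapValues(_.size), this version never materialises the intermediate groups, which is closer in spirit to what reduceByKey does.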

In step 4 we sort the data sets descending and take top 5 results. Note minor differences in the sortBy functions.

As you can see, Spark looks very Scala-like and you may have to look closely and check data types to determine if you are dealing with Scala native or remote Spark data types.

The Spark values follow the typical cycle of applying several transformations that turn one RDD into another RDD. In the end the take(5) action is applied, which pulls the results from the Spark RDD into a local, native Scala value.

Introducing Spark Streaming

A good guide on Spark Streaming can be found here.

A quick overview of Spark Streaming with Kafka can be found here, though it alone is unlikely to be sufficient to understand the Spark Streaming context - you will need to read the Spark Streaming guide as well.

Working with Spark streams is mostly similar to working with regular RDDs. With RDDs, you apply transformations to get other immutable RDDs and then apply actions to get the data locally. In fact, the transformation part looks exactly the same for streams - you apply a transformation on a Discretized Stream (DStream) to get another DStream. For example, you can have a val words: DStream[String] that represents a stream of words. You can define another DStream with those same words in upper case as

val wordsUpper: DStream[String] = words map (_.toUpperCase)

Note that both these values represent streams - data sources where new data production might be ongoing. So if you have an incoming stream of words, you can define another data stream of the same words but in upper case. That includes the words not yet produced into the stream.

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html

(If the value words were an RDD, the wordsUpper calculation would look almost the same: val wordsUpper: RDD[String] = words map (_.toUpperCase).) However, DStreams and RDDs differ when it comes to getting the data locally - for RDDs you call actions, for DStreams it is a bit more complicated. But... let us start from the beginning.

Setting up Spark Streaming

Much like a Spark Session and Context, Spark Streaming needs to be initialised.

We start by defining Spark Config - much like for SparkSession in the simple Spark example, we specify the application name and define the nodes we are going to use - in our case - local nodes on my developer workstation. (The asterisk means that Spark can utilise all my CPU threads.)

val sparkConfig = 
  new SparkConf().setMaster("local[*]").setAppName("SparkKafkaStreamTest")

The next step is creating a Spark StreamingContext. We pass in the config defined above but also specify the Spark Streaming batch interval - 1 minute. This matches the interval at which our Connector, set up in Kafka, produces records. But we could also define a 5 minute batch interval and get 5 records in every batch.

val sparkStreamingContext = new StreamingContext(sparkConfig, Minutes(1))

Before we proceed, we would like to disable the annoying INFO messages that Spark likes to flood us with. Spark log level is set in Spark Context but we do not have SparkContext defined, do we? We only have StreamingContext. Well, actually, upon the creation of a StreamingContext, SparkContext is created as well. And we can access it via the StreamingContext value:

sparkStreamingContext.sparkContext.setLogLevel("ERROR")

That is the Spark Streaming Context dealt with.

Setting up Access to Kafka

Setting up access to Kafka is equally straightforward. We start by configuring the Kafka consumer:

val kafkaConfig = Map[String, Object](
    "client.dns.lookup" -> "resolve_canonical_bootstrap_servers_only",
    "bootstrap.servers" -> "192.168.1.13:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "kafkaSparkTestGroup",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
)

The parameters given here in a Scala Map are Kafka Consumer configuration parameters as described in Kafka documentation. 192.168.1.13 is the IP of my Kafka Ubuntu VM.

Although I am referring to my Kafka server by IP address, I had to add an entry to the hosts file with my Kafka server name for my connection to work:

192.168.1.13 kafka-box

The client.dns.lookup value did not have an impact on that.

The next step is specifying an array of Kafka topics - in our case that is only one topic - 'JanisTest':

val kafkaTopics = Array("JanisTest")

Getting First Data From Kafka

We are ready to initialise our Kafka stream in Spark. We pass our StreamingContext value, topics list and Kafka Config value to the createDirectStream function. We also specify our LocationStrategy value - as described here. Consumer Strategies are described here.

val kafkaRawStream: InputDStream[ConsumerRecord[String, String]] =
        KafkaUtils.createDirectStream[String, String](
            sparkStreamingContext,
            LocationStrategies.PreferConsistent,
            ConsumerStrategies.Subscribe[String, String](kafkaTopics, kafkaConfig)
        )

What gets returned is a Spark Stream coming from Kafka. Please note that it returns Kafka Consumer record (key,value) pairs. The value part contains our weather data in JSON format. Before we proceed with any sort of data analysis, let us parse the JSON in a similar manner to the JSON parsing in Part 1 of this blog post. I will not cover it here but I have created a Gist that you can have a look at.

The weatherParser function converts the JSON to a WeatherSchema case class instance - the returned value is of type DStream[WeatherSchema], where DStream is the Spark Streaming container:

val weatherStream: DStream[WeatherSchema] = 
   kafkaRawStream map (streamRawRecord => weatherParser(streamRawRecord.value))

Now our data is available for nice and easy analysis.

Let us start with the simplest - check the number of records in our stream:

val recordsCount: DStream[Long] = weatherStream.count()

The above statement deserves special attention. If you have worked with Spark RDDs, you will remember that the RDD count() function returns a Long value instead of an RDD, i.e. it is an action, not a transformation. As you can see above, count() on a DStream returns another DStream, instead of a native Scala Long value. It makes sense because a stream is an on-going data producer. What the DStream count() gives us is not the total number of records ever produced by the stream - it is the number of records in the current 1 minute batch. Normally it should be 1 but it can also be 0 if no record arrived during the batch. Should you take my word for it? Better check it yourself! But how? Certainly not by just printing the recordsCount value - all you will get is a reference to the Spark stream and not the stream content.
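The per-batch nature of DStream transformations and of count() can be illustrated with a toy model in plain Scala. This is only a sketch under a simplifying assumption: a stream is represented as a sequence of batches, and the names Batches, mapBatches and countBatches are hypothetical, not part of the Spark API.

```scala
object BatchModel {
  // Toy stand-in for a DStream: one inner Seq per batch interval.
  // This is plain Scala, not the Spark API.
  type Batches[A] = Seq[Seq[A]]

  // Like DStream map: the function is applied to the records of every batch.
  def mapBatches[A, B](stream: Batches[A])(f: A => B): Batches[B] =
    stream.map(_.map(f))

  // Like DStream count(): one count per batch, never a grand total.
  def countBatches[A](stream: Batches[A]): Seq[Long] =
    stream.map(_.size.toLong)

  // three one-minute batches; the second one received nothing
  val words: Batches[String] = Seq(Seq("rain"), Seq.empty, Seq("sun", "wind"))

  val wordsUpper: Batches[String] = mapBatches(words)(_.toUpperCase)
  val recordsCount: Seq[Long] = countBatches(words)
}
```

In this model recordsCount holds one number per batch, which mirrors why count() on a DStream yields another DStream rather than a single Long.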

Displaying Stream Data

Displaying stream data looks rather odd. To display the recordsCount content, you need the following lines of code:

recordsCount.print()

...

sparkStreamingContext.start() // start the computation
sparkStreamingContext.awaitTermination() // await termination

The DStream value itself has a method print(), which is different from Scala's print() or println() functions. However, for it to actually start printing stream content, you need to start() stream content computation, which will start ongoing stream processing until terminated. The awaitTermination() function waits for the process to be terminated - typically with a Ctrl+C. There are other methods of termination as well, not covered here. So, what you will get is recordsCount stream content printed every batch interval (1 minute in our example) until the program execution is terminated.

The output will look something like this, with a new record appearing every minute:

-------------------------------------------
Time: 1552067040000 ms
-------------------------------------------
1

-------------------------------------------
Time: 1552067100000 ms
-------------------------------------------
0

-------------------------------------------
Time: 1552067160000 ms
-------------------------------------------
1

Notice the '...' between the recordsCount.print() and the stream start(). You can have DStream transformations following the recordsCount.print() statement and other DStream print() calls before the stream is started. Then, instead of just the count, you will get other values printed for each 1 minute batch.

You can do more than just print the DStream content on the console, but we will come to that a bit later.

Analysing Stream Data

Above we have covered all the basics - we have initialised Spark Context and Kafka access, we have retrieved stream data and know how to set up ongoing printing of the results for our Stream batches. Before we proceed with our exploration, let us define a goal for our data analysis.

We are receiving a real-time stream of weather data. What we could analyse is the temperature change dynamics within the past 60 minutes. Note that we are receiving a new batch every minute so every minute our 60 minute window will move one step forward.

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html

What we have got is our weatherStream DStream value. First let us define a Stream window of 60 minutes. (Check the Spark documentation for an explanation of how Stream windows work.)

val weatherStream1Hour: DStream[WeatherSchema] = weatherStream.window(Minutes(60))
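How such a window moves forward one batch at a time can be sketched in plain Scala. This is a simplified model, not Spark code: the window function below is a hypothetical helper operating on a sequence of batches, the temperatures are sample values, and a window of 3 batches is used instead of 60 to keep the example short.

```scala
object WindowModel {
  // Toy stand-in for the DStream window function: for each batch, concatenate
  // it with the preceding batches so that at most `windowLength` batches are
  // covered. Early windows are shorter because there is no older data yet.
  def window[A](batches: Seq[Seq[A]], windowLength: Int): Seq[Seq[A]] =
    batches.indices.map { i =>
      val from = math.max(0, i - windowLength + 1)
      batches.slice(from, i + 1).flatten
    }

  // five one-minute batches of temperature readings, the fourth one empty
  val temps: Seq[Seq[Double]] =
    Seq(Seq(8.42), Seq(8.42), Seq(8.38), Seq.empty, Seq(8.28))

  // a 3-batch window instead of 60 keeps the example short
  val windowed: Seq[Seq[Double]] = window(temps, 3)
}
```

Each element of windowed corresponds to what one batch interval would see: the newest batch plus the batches that are still inside the window.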

The WeatherSchema case class contains many values. But all we need for our simple analysis is really just the timestamp and temperature. Let us extract just the data we need and put it in a traditional RDD (key,value) pair. And we print the result to verify it.

val weatherTemps1Hour: DStream[(String, Double)] = 
   weatherStream1Hour map (weatherRecord => 
     (weatherRecord.dateTime, weatherRecord.mainTemperature) 
   )

weatherTemps1Hour.print()

Please note that the above code should come before the sparkStreamingContext.start() call.

The result we are getting looks something like this:

-------------------------------------------
Time: 1552068480000 ms
-------------------------------------------
(08/03/2019 16:57:27,8.42)
(08/03/2019 16:57:27,8.42)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
(08/03/2019 17:06:02,8.38)
...

Notice the ellipsis at the end. Not all records get displayed if there are more than 10.

Of course, we will get a new result printed every minute. However, the latest results will be at the bottom, which means they will be hidden if there are more than 10 of them. Also note that the weather data we are getting is actually not refreshed every minute but more like every 10 minutes. Our 1 minute batch frequency does not represent the actual frequency of weather data updates. But let us deal with those problems one at a time.

For me, vanity always comes first. Let me convert the (key,value) pair output to a nice looking narrative.

val weatherTemps1HourNarrative = weatherTemps1Hour map { 
  case(dateTime, temperature) => 
    s"Weather temperature at ${dateTime} was ${temperature}" 
}

weatherTemps1HourNarrative.print()

The result:

-------------------------------------------
Time: 1552068480000 ms
-------------------------------------------
Weather temperature at 08/03/2019 16:57:27 was 8.42
Weather temperature at 08/03/2019 16:57:27 was 8.42
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
Weather temperature at 08/03/2019 17:06:02 was 8.38
...

We are still limited to the max 10 records the DStream print() function gives us. Also, unless we are debugging, we are almost certainly going to go further than just printing the records on console. For that we use the DStream foreachRDD function, which works similarly to the map function, but does not return any data. Instead, whatever we do with the Stream data - print it to console, save it into a CSV file or database - that needs to take place within the foreachRDD function.

The foreachRDD Function

The foreachRDD function accepts a function as a parameter, which receives as its input a current RDD value representing the current content of the DStream and deals with that content in the function's body.

Ok, at long last we are getting some results back from our Spark stream that we can use, that we can analyse, that we know how to deal with! Let us get carried away!

weatherTemps1Hour foreachRDD { currentRdd =>
  println(s"RDD content:\n\t${currentRdd.collect().map{case(dateTime,temperature) => s"Weather temperature at ${dateTime} was ${temperature}"}.mkString("\n\t")}")

  val tempRdd: RDD[Double] = currentRdd.map(_._2)

  val minTemp = if(tempRdd.isEmpty()) None else Some(tempRdd.min())
  val maxTemp = if(tempRdd.isEmpty()) None else Some(tempRdd.max())

  println(s"Min temperature: ${if(minTemp.isDefined) minTemp.get.toString else "not defined"}")

  println(s"Max temperature: ${if(maxTemp.isDefined) maxTemp.get.toString else "not defined"}")

  // without the else branch an empty window would print "()"
  println(s"Temperature difference: ${if(minTemp.isDefined && maxTemp.isDefined) (maxTemp.get - minTemp.get).toString else "not defined"}\n")
}

Here we are formatting the output and getting min and max temperatures within the 60 minute window as well as their difference. Let us look at the result:

RDD content:
	Weather temperature at 08/03/2019 16:57:27 was 8.42
	Weather temperature at 08/03/2019 16:57:27 was 8.42
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:06:02 was 8.38
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:17:32 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:30:52 was 8.28
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:37:51 was 8.19
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 17:51:10 was 8.1
	Weather temperature at 08/03/2019 18:01:10 was 7.99
	Weather temperature at 08/03/2019 18:01:10 was 7.99
	Weather temperature at 08/03/2019 18:01:10 was 7.99
Min temperature: 7.99
Max temperature: 8.42
Temperature difference: 0.4299999999999997

(Now there is no Time: 1552068480000 ms signature in our results printout because we are no longer using the DStream print() function).
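The min/max handling and the floating-point noise in the printed temperature difference could also be dealt with as in the following sketch. It works on a local Seq[Double] (for example the collected temperatures) rather than on an RDD, and the names TempStats, Stats, stats and describe are hypothetical. Rounding to two decimals is an assumption about the desired precision.

```scala
object TempStats {
  // min, max and their difference for one window of temperatures
  final case class Stats(min: Double, max: Double, difference: Double)

  // round to two decimals to hide binary floating-point noise
  private def round2(value: Double): Double =
    BigDecimal(value).setScale(2, BigDecimal.RoundingMode.HALF_UP).toDouble

  // None for an empty window, so callers cannot forget that case
  def stats(temps: Seq[Double]): Option[Stats] =
    if (temps.isEmpty) None
    else {
      val (lo, hi) = (temps.min, temps.max)
      Some(Stats(lo, hi, round2(hi - lo)))
    }

  // pattern matching replaces the isDefined / get combination
  def describe(temps: Seq[Double]): String =
    stats(temps) match {
      case Some(s) => s"Min: ${s.min}, Max: ${s.max}, Difference: ${s.difference}"
      case None    => "Min, max and difference are not defined"
    }
}
```

Returning a single Option for all three values means the empty-window case is handled in one place instead of in every println.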

However, I would like my analysis to be more detailed. It is time to involve Spark DataFrames.

Kafka Stream Data Analysis with Spark DataFrames

Just like in the previous statement, I need to extract Stream data with the foreachRDD function. In fact, all the code that follows will be within the foreachRDD function block:

weatherStream1Hour foreachRDD { currentRdd =>
... // the following code comes here
}

First, let us create a DataFrame from an RDD. Spark RDDs and DataFrames are two quite different representations of distributed data. And yet - look how simply the conversion works:

val spark =
   SparkSession.builder.config(currentRdd.sparkContext.getConf).getOrCreate()
import spark.implicits._

val simpleDF: DataFrame = currentRdd.toDF()
simpleDF.createOrReplaceTempView("simpleDF")

This trick works because our weatherStream1Hour DStream and consequently the currentRdd value that represents the Stream content, are based on the WeatherSchema case class. (data types - weatherStream1Hour: DStream[WeatherSchema] and currentRdd: RDD[WeatherSchema].) Therefore the currentRdd.toDF() implicit conversion works - Spark understands Scala case classes.

Once we have the DataFrame created, we can create a Temp view so we can query this DF with Spark SQL - that is what the createOrReplaceTempView function is for.

Let us start with the simplest queries - query the count(*) and the full content of the DataFrame:

val countDF: DataFrame = spark.sql("select count(*) as total from simpleDF")
countDF.show()

val selectAllDF = spark.sql("select * from simpleDF")
selectAllDF.show()

The result:

DataFrame's show() function by default only shows 20 rows, but we can always adjust that to show more or less. However, as we had established earlier in our analysis, the weather data actually does not get updated every minute - we are getting lots of duplicate records that we could get rid of. It is easy with SQL's distinct:

val distinctDF = spark.sql("select distinct * from simpleDF")
distinctDF.show()

The result - only 7 distinct weather measurements, confirming our suspicion that we are only getting a weather update approximately every 10 minutes:

Let us go back to our analysis - temperature change dynamics within the past 60 minutes. The temperature value in our DataFrame is named 'mainTemperature'. But where is the timestamp? We did have the dateTime value in our RDD. Why is it missing from the DataFrame? The answer is - because dateTime is actually a function. In RDD, when we referenced it, we did not care if it is a value or a function call. Now, when dealing with DataFrames, it becomes relevant.

As can be seen in the Gist, dateTime is a function, in fact it is a WeatherSchema case class method and is calculated from the dt value, which represents time in Unix format. The function that performs the actual conversion - dateTimeFromUnix - is defined in the WeatherParser object in the same Gist. If we want to get the same dateTime value in our DataFrame, we will have to register a Spark User Defined Function (UDF).
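The effect can be reproduced in isolation. The following is a minimal sketch, assuming Spark SQL on Scala 2.12 or 2.13 is on the classpath; Reading is a hypothetical, cut-down stand-in for the WeatherSchema case class and is not taken from the Gist. Only the constructor fields become DataFrame columns, while the dateTime method does not:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for WeatherSchema: two fields plus one method.
final case class Reading(dt: Double, mainTemperature: Double) {
  // A method, not a constructor field, so it does not become a column.
  def dateTime: String = java.time.Instant.ofEpochSecond(dt.toLong).toString
}

object CaseClassColumnsSketch {
  // Builds a one-row DataFrame locally and returns its column names.
  def columnsOfReading(): Seq[String] = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("case-class-columns-sketch")
      .getOrCreate()
    import spark.implicits._
    try Seq(Reading(1552068480.0, 8.1)).toDF().columns.toSeq
    finally spark.stop()
  }
}
```

Calling CaseClassColumnsSketch.columnsOfReading() should list dt and mainTemperature only.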

Creating a Spark User Defined Function (UDF)

Fortunately, creating UDFs is no rocket science - we do that with the Spark udf function. However, to use this function in a Spark SQL query, we need to register it first - associate a String function name with the function itself.

import org.apache.spark.sql.functions.udf

val dateTimeFromSeconds: Double => String = WeatherParser.dateTimeFromUnix(_)
val dateTimeFromSecondsUdf = udf(dateTimeFromSeconds)
spark.udf.register("dateTimeFromSecondsUdf", dateTimeFromSecondsUdf) // to register for SQL use

Now let us query the temperature and time:

// A multi-line SQL string needs triple quotes in Scala
val tempTimeDF = spark.sql(
   """select distinct
      dt timeKey,
      dateTimeFromSecondsUdf(dt) temperatureTakenTime,
      mainTemperature temperatureCelsius
    from simpleDF order by timeKey"""
)

tempTimeDF.show()
tempTimeDF.createOrReplaceTempView("tempTime")

We show the results but also register the resulting DataFrame as a Temp View so we can from now on reference it in Spark SQL.

Note that we are converting the dt value to a String timestamp value but also keeping the original dt value - because dt is a number that can be sorted chronologically whereas the String timestamp cannot.
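To illustrate why the String timestamp is a poor sort key, here is a small sketch in plain Scala. It assumes the timestamp format is dd/MM/yyyy HH:mm:ss, as the printed output suggests; the actual dateTimeFromUnix implementation lives in the Gist and may differ, and the two sample instants are chosen only for illustration.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object TimestampSortSketch {
  // Assumed format, modelled on the printed output (day first, then month).
  private val formatter =
    DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss").withZone(ZoneOffset.UTC)

  def asText(epochSeconds: Long): String =
    formatter.format(Instant.ofEpochSecond(epochSeconds))

  val march8: Long = 1552068480L // 8 March 2019, 18:08:00 UTC
  val april1: Long = 1554076800L // 1 April 2019, 00:00:00 UTC

  // Sorting by the numeric value gives chronological order.
  val byNumber: Seq[Long] = Seq(april1, march8).sorted

  // Sorting by the formatted text compares the day of month first,
  // so 1 April ends up before 8 March.
  val byText: Seq[Long] = Seq(april1, march8).sortBy(asText)
}
```

Here byNumber puts March before April, whereas byText puts April first, which is why the numeric dt is kept as timeKey.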

The result looks like this:

More Kafka Stream Data Analysis with Spark DataFrames

Now we have the times and the temperatures. But we want to see how temperatures changed between measurements. For example, between the two consecutive measurements at 17:06 and 17:17 the temperature (in London) dropped from 8.38 to 8.28 degrees Celsius. We want to have that value of minus 0.1 degrees in our result set.

If we were using Oracle database, the obvious choice would be the LAG analytic function. Do we have an analogue for that in Spark? Yes, we do! However, this time, instead of using Spark SQL, we will use the withColumn DataFrame function to define the LAG value:

import org.apache.spark.sql.functions.{lag, monotonically_increasing_id}

val lagWindow = org.apache.spark.sql.expressions.Window.orderBy("timeKey")
val lagDF = tempTimeDF
  .withColumn("temperatureCelsiusPrev", lag("temperatureCelsius", 1, 0).over(lagWindow))
  .withColumn("rownum", monotonically_increasing_id())

lagDF.show()
lagDF.createOrReplaceTempView("tempTimeWithPrev")

The result:

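Here we are actually adding two values - lag and rownum, the latter being a rough analogue to the Oracle SQL ROW_NUMBER analytic function: monotonically_increasing_id generates unique, increasing IDs but does not guarantee that they are consecutive.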

Note the inputs for the Spark lag function: the first is the source column name, the second is the lag offset and the third is the default value - 0. The default value in our case will mean zero degrees Celsius, which will mess up our temperature delta for the first temperature measurement. Fortunately, Spark SQL also supports the CASE expression so we can deal with this challenge with ease. In addition, let us round the result to get rid of the floating point artefacts.
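The effect of the default value is easy to see on a plain Scala collection. The sketch below is not Spark code; the function names are hypothetical, and the two temperatures are the 8.38 and 8.28 readings mentioned earlier. It mimics lag with a default of 0, then applies the same first-row guard and rounding.

```scala
object LagSketch {
  // Round to two decimals, similar in spirit to ROUND(x, 2) in SQL.
  def round2(x: Double): Double =
    BigDecimal(x).setScale(2, BigDecimal.RoundingMode.HALF_UP).toDouble

  // Previous value for each element; the first element gets the default.
  def lag(values: Seq[Double], default: Double): Seq[Double] =
    (default +: values).take(values.size)

  // Naive difference: the first row is compared against the default of 0.
  def naiveDiffs(temps: Seq[Double]): Seq[Double] =
    temps.zip(lag(temps, 0.0)).map { case (cur, prev) => round2(cur - prev) }

  // Guarded difference: the first row is forced to 0, like the CASE expression.
  def adjustedDiffs(temps: Seq[Double]): Seq[Double] =
    temps.zip(lag(temps, 0.0)).zipWithIndex.map {
      case ((_, _), 0)      => 0.0
      case ((cur, prev), _) => round2(cur - prev)
    }
}
```

For Seq(8.38, 8.28), naiveDiffs reports a bogus first delta of 8.38, while adjustedDiffs yields 0.0 followed by -0.1.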

// A multi-line SQL string needs triple quotes in Scala
val tempDifferenceDF = spark.sql(
	"""select
		temperatureTakenTime,
		temperatureCelsius,
		temperatureCelsius - temperatureCelsiusPrev temperatureDiff,
		ROUND(CASE
			WHEN (rownum = 0)
			THEN 0
			ELSE temperatureCelsius - temperatureCelsiusPrev
		END, 2) AS temperatureDiffAdjusted
	from tempTimeWithPrev""")

tempDifferenceDF.show()

And the result:

Conclusion

Kafka stream data analysis with Spark Streaming works, and it is easy to set up and get running. In this 3-part blog, by far the most challenging part was creating a custom Kafka connector. Once the Connector was created, setting it up and then getting the data source working in Spark was smooth sailing.

One thing to keep in mind when working with streams is that they are different from RDDs, which are static, immutable data sources. DStreams, by their nature, are dynamic and keep changing.

The challenging bit in the code is the

sparkStreamingContext.start() // start the computation
sparkStreamingContext.awaitTermination() // await termination

code block and its interaction with the foreachRDD function  - to somebody not familiar with how Spark Streaming works, the code can be hard to understand.

The ease of creating a DataFrame from the original RDD was a pleasant surprise.

So, is using Spark and Kafka with Scala a good idea? Definitely yes. It works like a charm. However, in real life, additional considerations like the availability and cost of Python vs Scala developers as well as your existing code base will come into play. I hate real life.

Categories: BI & Warehousing

Rittman Mead at Analytics and Data Summit 2019

Rittman Mead Consulting - Tue, 2019-03-05 09:49

The Analytics and Data Summit 2019 (formerly known as BIWA) is happening next week at Oracle HQ in Redwood Shores. I'm excited to participate since it's one of the best conferences for the Oracle Analytics crowd, where you can get three days full of content from experts as well as hints on future product developments directly from the related Product Managers!

I'll be representing Rittman Mead in two sessions:

This two-hour workshop will cover all the details of OAC: Product Overview, Instance Creation and Management, Moving from on-prem OBIEE to OAC, Data Preparation and Data Visualization, Advanced Analytics. It includes interactive labs where participants can experience Data Visualization and Data Flows.

Become a Data Scientist with OAC! This session will explain how Oracle Analytics Cloud acts as an enabler for the transformation from a Data Analyst to a Data Scientist. Connection to the Data, Cleaning, Transformation, and Analysis will be the intermediate steps before the training of several machine learning models, which will then be evaluated and used to predict outcomes on unseen data. It includes a demo showing all the steps in a real example based on a wine dataset!

There is a full list of all sessions here. You can follow the conference on twitter with the hashtag #AnDSummit2019, and I'll be tweeting about it too as @ftisiot.

The presentations that I'm delivering will be available to download on speakerdeck.

Categories: BI & Warehousing

Spark Streaming and Kafka - Configuring a Kafka Connector

Rittman Mead Consulting - Wed, 2019-02-27 05:14

In my previous blog post, I covered the development of a custom Kafka Source Connector, written in Scala. This blog post is about deploying that Connector. We are getting closer to analysing the stream data in Spark - as promised in the title.

If you are installing a Connector from Confluent Hub rather than a custom one, please follow this guide instead.

Setting up Kafka Server

This blog post does not cover setting up a Kafka instance. However, let me give a quick overview of the environment I am using:

  • OS: Ubuntu server 18.10.
  • Kafka version - at the time of writing, the latest version is 2.1.0. So that is the one I am using. Please note that the Kafka Connector API interface is slightly different for different Kafka versions. (In fact, early in my Connector development I saw Kafka throwing strange missing function errors. It turned out I was using a slightly older API for Kafka 2.0.0 instead of the one for 2.1.0. - the SourceConnector class's function `taskConfigs` had been renamed in the latest version.)
  • Java: OpenJDK 11.0.1. It seems the OpenJDK is preferred for Kafka over the Oracle Java JDK.

I got the Kafka install file from:
http://www-eu.apache.org/dist/kafka/2.1.0/kafka_2.12-2.1.0.tgz

Setting up Kafka Topic

Confluent have an excellent user guide on how to set up Connectors. There is also a Quick Start guide on the Apache Kafka website, though it is much less detailed than the guide from Confluent.

The first step for me, once my Kafka instance is up and running, is to create a new topic for my Weather records. Let me call it 'JanisTest'. From Kafka root folder I execute:

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic JanisTest
$ bin/kafka-topics.sh --list --zookeeper localhost:2181
JanisTest
__consumer_offsets
$

After creating it, I run the --list command to see that it is indeed created and available.

Install Connector

Copy jars from your Scala development environment to your Kafka instance. I chose to have a folder of jars, not one uber-jar with all library dependencies packaged in. But either approach should work fine.

$ cd ~/kafka
$ mkdir connectors
$ cd connectors
$ cp /VMSharedFolder/connectorJars/* .

Do not add the above folder to CLASSPATH! Instead, add it to the connect-*.properties config files. In the later versions of Kafka, the server itself manages paths to plugins.

$ cd ~/kafka/config
$ vi connect-standalone.properties

We will run our connector in standalone mode, therefore the important file for us to edit is 'connect-standalone.properties'. But we can also edit the 'connect-distributed.properties' config file if we ever decide to run Kafka on multiple nodes.

Add the following line to the connect-standalone.properties and connect-distributed.properties configuration files:

plugin.path=/home/kafka/kafka/connectors

If you need to add several connector folders, the plugin.path value is comma-separated.

Configure Connector

Similar to the connect-standalone.properties file, our newly installed Source Connector also requires a configuration file. Let us create one:

$ cd ~/kafka/config
$ vi connect-http-source.properties

We specify the following configuration attributes in the config file - these attributes were defined in the `HttpSourceConnectorConstants` object we discussed in my previous blog post.

name=http-source

http.url=http://api.openweathermap.org/data/2.5/weather
http.api.key=<YOUR API KEY>
http.api.params=units=metric&q=London,uk

service.name=CurrentWeatherData
topic=JanisTest

poll.interval.ms=60000
tasks.max=1
connector.class=com.rittmanmead.kafka.connector.http.HttpSourceConnector

  • name - name of the Connector.
  • http.url - the base URL for the HTTP API.
  • http.api.key - replace <YOUR API KEY> with the key you got when registering with OpenWeatherMap.
  • http.api.params - any additional parameters you would like to specify for your API request - as described in the API documentation. Follow the format name=value. In case you need to specify several attributes, separate them with a &.
  • service.name - you can name your service.
  • topic - give the topic name that was created previously, as described in this guide.
  • poll.interval.ms - the interval at which the API will be queried. 60000ms = 1 minute. Because the weather does not change that quickly, this polling interval is frequent enough for us. It also keeps us well within the free API tier limit of no more than 1 query per second.
  • tasks.max - the maximum number of concurrent tasks allowed. In our case it is 1.
  • connector.class - the fully qualified class name of the SourceConnector class we have implemented, which is described in my previous blog post.

Run Connector

Because we are only running the Connector in standalone mode, there is not much config to be done. We are ready to run it. To start our Source Connector, we run the connect-standalone.sh command:

$ cd ~/kafka
$ ./bin/connect-standalone.sh ./config/connect-standalone.properties ./config/connect-http-source.properties

The first argument passed to the connect-standalone.sh script is the worker properties configuration - only one configuration. It is followed by one or many (in our case - one) connector configurations.

Take a deep breath and hit Enter.

A long log output will follow. If you get any errors, the command will return to shell prompt. Some possible errors are missing classes - if that is the case, please check that you have successfully deployed all the jars and they are registered in the connect-standalone.properties file. Also, if any of the required configuration parameters are missing in the connect-http-source.properties file, you will get an error message.

If successful, the log listing will end with something like this:

[2019-02-19 14:14:55,294] INFO Kafka version : 2.1.0 (org.apache.kafka.common.utils.AppInfoParser:109)
[2019-02-19 14:14:55,297] INFO Kafka commitId : 809be928f1ae004e (org.apache.kafka.common.utils.AppInfoParser:110)
[2019-02-19 14:14:55,334] INFO Created connector http-source (org.apache.kafka.connect.cli.ConnectStandalone:104)
[2019-02-19 14:14:55,406] INFO Setting up an HTTP service for http://api.openweathermap.org/data/2.5/weather... (com.rittmanmead.kafka.connector.http.HttpSourceTask:40)
[2019-02-19 14:14:55,466] INFO Starting to fetch from http://api.openweathermap.org/data/2.5/weather each 60000ms... (com.rittmanmead.kafka.connector.http.HttpSourceTask:47)
[2019-02-19 14:14:55,469] INFO WorkerSourceTask{id=http-source-0} Source task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSourceTask:199)
[2019-02-19 14:14:56,560] INFO Http return code: 200 (com.rittmanmead.kafka.connector.http.HttpSourceTask:31)
[2019-02-19 14:14:56,569] INFO Weather Schema parser: JSON text to be parsed: {"coord":{"lon":-0.13,"lat":51.51},"weather":[{"id":802,"main":"Clouds","description":"scattered clouds","icon":"03d"}],"base":"stations","main":{"temp":10.09,"pressure":1019,"humidity":61,"temp_min":9,"temp_max":11},"visibility":10000,"wind":{"speed":4.6,"deg":250},"clouds":{"all":44},"dt":1550582400,"sys":{"type":1,"id":1414,"message":0.0038,"country":"GB","sunrise":1550559964,"sunset":1550597002},"id":2643743,"name":"London","cod":200} (com.rittmanmead.kafka.connector.http.HttpSourceTask:273)
[2019-02-19 14:14:57,481] INFO JSON parsed class content: WeatherSchema(Coord(-0.13,51.51),List(WeatherAtom(802.0,Clouds,scattered clouds,03d)),stations,Main(10.09,1019.0,61.0,9.0,11.0),10000.0,Wind(4.6,250.0),Clouds(44.0),1.5505824E9,Sys(1.0,1414.0,0.0038,GB,1.550559964E9,1.550597002E9),2643743.0,London,200.0) (com.rittmanmead.kafka.connector.http.HttpSourceTask:283)
[2019-02-19 14:14:57,499] INFO Got 1 results for CurrentWeatherData (com.rittmanmead.kafka.connector.http.HttpSourceTask:75)

Some of the above messages were generated by our Connector directly, like the last line above - 'Got 1 results for CurrentWeatherData'. Others are generated by Kafka itself.

Check Connector Output

Our ultimate goal is to analyse a Kafka Stream with Spark in Scala. However, that will have to wait till my next blog post. For now we can check if the topic is being populated. Please note that our polling interval is 1 minute - the topic will not be flooded with records right away.

$ cd ~/kafka
$ ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic JanisTest --from-beginning

If all is going well, you should see Weather records, generated according to our custom-defined schema:

Quite easy, was it not? Certainly setting up a Kafka connector is easier than developing one.

The next step will be analysing this data from Spark. But that will be my next blog post.

Categories: BI & Warehousing

Spark Streaming and Kafka - Creating a New Kafka Connector

Rittman Mead Consulting - Thu, 2019-02-21 08:39
More Kafka and Spark, please!

Hello, world!

Having joined Rittman Mead more than 6 years ago, I feel the time has come for my first blog post. Let me start by standing on the shoulders of blogging giants, revisiting Robin's old blog post Getting Started with Spark Streaming, Python, and Kafka.

The blog post was very popular, touching on the subjects of Big Data and Data Streaming. To put my own twist on it, I decided to:

  • not use Twitter as my data source, because there surely must be other interesting data sources out there,
  • use Scala, my favourite programming language, to see how different the experience is from using Python.

Why Scala?

Scala is admittedly more challenging to master than Python. However, because Scala compiles into Java bytecode, it can be used pretty much anywhere where Java is being used. And Java is being used everywhere. Python is arguably even more widely used than Java, however it remains a dynamically typed scripting language that is easy to write in but can be hard to debug.

Is there a case for using Scala instead of Python for the job? Both Spark and Kafka were written in Scala (and Java), hence they should get on like a house on fire, I thought. Well, we are about to find out.

My data source: OpenWeatherMap

When it comes to finding sample data sources for data analysis, the selection out there is amazing. At the time of this writing, Kaggle offers 14,470 freely available datasets, many of them in easy-to-digest formats like CSV and JSON. However, when it comes to real-time sample data streams, the selection is quite limited. Twitter is usually the go-to choice - easily accessible and well documented. Too bad I decided not to use Twitter as my source.

Another alternative is the Wikipedia Recent changes stream. Although in the stream schema there are a few values that would be interesting to analyse, overall this stream is more boring than it sounds - the text changes themselves are not included.

Fortunately, I came across the OpenWeatherMap real-time weather data website. They have a free API tier, which is limited to 1 request per second, which is quite enough for tracking changes in weather. Their different API schemas return plenty of numeric and textual data, all interesting for analysis. The APIs work in a very standard way - first you apply for an API key. With the key you can query the API with a simple HTTP GET request (Apply for your own API key instead of using the sample one - it is easy.):

This request

https://samples.openweathermap.org/data/2.5/weather?q=London,uk&appid=b6907d289e10d714a6e88b30761fae22

gives the following result:

{
  "coord": {"lon":-0.13,"lat":51.51},
  "weather":[
    {"id":300,"main":"Drizzle","description":"light intensity drizzle","icon":"09d"}
  ],
  "base":"stations",
  "main": {"temp":280.32,"pressure":1012,"humidity":81,"temp_min":279.15,"temp_max":281.15},
  "visibility":10000,
  "wind": {"speed":4.1,"deg":80},
  "clouds": {"all":90},
  "dt":1485789600,
  "sys": {"type":1,"id":5091,"message":0.0103,"country":"GB","sunrise":1485762037,"sunset":1485794875},
  "id":2643743,
  "name":"London",
  "cod":200
}

Getting data into Kafka - considering the options

There are several options for getting your data into a Kafka topic. If the data will be produced by your application, you should use the Kafka Producer Java API. You can also develop Kafka Producers in .Net (usually C#), C, C++, Python, Go. The Java API can be used by any programming language that compiles to Java bytecode, including Scala. Moreover, there are Scala wrappers for the Java API: skafka by Evolution Gaming and Scala Kafka Client by cakesolutions.

OpenWeatherMap is not my application and what I need is integration between its API and Kafka. I could cheat and implement a program that would consume OpenWeatherMap's records and produce records for Kafka. The right way of doing that however is by using Kafka Source connectors, for which there is an API: the Connect API. Unlike the Producers, which can be written in many programming languages, for the Connectors I could only find a Java API. I could not find any nice Scala wrappers for it. On the upside, Confluent's Connector Developer Guide is excellent, rich in detail though not quite a step-by-step cookbook.

However, before we decide to develop our own Kafka connector, we must check for existing connectors. The first place to go is Confluent Hub. There are quite a few connectors there, complete with installation instructions, ranging from connectors for particular environments like Salesforce, SAP, IRC, Twitter to ones integrating with databases like MS SQL, Cassandra. There is also a connector for HDFS and a generic JDBC connector. Is there one for HTTP integration? Looks like we are in luck: there is one! However, this connector turns out to be a Sink connector.

Ah, yes, I should have mentioned - there are two flavours of Kafka Connectors: the Kafka-inbound are called Source Connectors and the Kafka-outbound are Sink Connectors. And the HTTP connector in Confluent Hub is Sink only.

Googling for Kafka HTTP Source Connectors gives few interesting results. The best I could find was Pegerto's Kafka Connect HTTP Source Connector. Contrary to what the repository name suggests, the implementation is quite domain-specific, for extracting Stock prices from particular web sites and has very little error handling. Searching Scaladex for 'Kafka connector' does yield quite a few results but nothing for http. However, there I found Agoda's nice and simple Source JDBC connector (though for a very old version of Kafka), written in Scala. (Do not use this connector for JDBC sources, instead use the one by Confluent.) I can use this as an example to implement my own.

Creating a custom Kafka Source Connector

The best place to start when implementing your own Source Connector is the Confluent Connector Development Guide. The guide uses JDBC as an example. Our source is an HTTP API, so early on we must establish whether our data source is partitioned, whether we need to manage offsets for it, and what the schema is going to look like.

Partitions

Is our data source partitioned? A partition is a division of source records that usually depends on the source medium. For example, if we are reading our data from CSV files, we can consider the different CSV files to be a natural partition of our source data. Another example of partitioning could be database tables. But in both cases the best partitioning approach depends on the data being gathered and its usage. In our case, there is only one API URL and we are only ever requesting current data. If we were to query weather data for different cities, that would be a very good partitioning - by city. Partitioning would allow us to parallelise the Connector data gathering - each partition would be processed by a separate task. To make my life easier, I am going to have only one partition.
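If we did partition by city, the cities would have to be shared out between the available tasks. The following is a rough sketch in plain Scala of one way to do that (round-robin); the function name and the city list are hypothetical and are not part of the connector described here.

```scala
object CityPartitionSketch {
  // Distribute cities round-robin over at most maxTasks groups,
  // one group per task. Never creates more groups than there are cities.
  def assignCities(cities: Seq[String], maxTasks: Int): Seq[Seq[String]] = {
    val groups = math.max(1, math.min(maxTasks, cities.size))
    cities.zipWithIndex
      .groupBy { case (_, index) => index % groups }
      .toSeq
      .sortBy { case (group, _) => group }
      .map { case (_, members) => members.map { case (city, _) => city } }
  }
}
```

With three cities and a maximum of two tasks, assignCities(Seq("London", "Riga", "Paris"), 2) gives the first task London and Paris, and the second task Riga.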

Offsets

Offsets are for keeping track of the records already read and the records yet to be read. An example of that is reading the data from a file that is continuously being appended - there can be rows already inserted into a Kafka topic and we do not want to process them again to avoid duplication. Why would that be a problem? Surely, when going through a source file row by row, we know which row we are looking at. Anything above the current row is processed, anything below - new records. Unfortunately, most of the time it is not as simple as that: first of all Kafka supports concurrency, meaning there can be more than one Task busy processing Source records. Another consideration is resilience - if a Kafka Task process fails, another process will be started up to continue the job. This can be an important consideration when developing a Kafka Source Connector.

Is it relevant for our HTTP API connector? We are only ever requesting current weather data. If our process fails, we may miss some time periods but we cannot recover them later on. Offset management is not required for our simple connector.
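For contrast, here is a sketch of what offset tracking buys us for a source where it does matter, such as the appended file from the example above. It is plain Scala with hypothetical names and does not use the Kafka Connect offset storage API; it only shows the idea that a restarted reader resumes from the last committed position instead of re-reading everything.

```scala
object OffsetSketch {
  // The committed position: how many lines have already been handed over.
  final case class Offset(linesRead: Int)

  // Return only the lines after the committed offset, plus the new offset.
  def readNew(lines: Seq[String], committed: Offset): (Seq[String], Offset) = {
    val fresh = lines.drop(committed.linesRead)
    (fresh, Offset(committed.linesRead + fresh.size))
  }
}
```

A first call with Offset(0) returns every line; after the file grows, a second call with the returned offset yields only the appended lines, even if it is made by a newly started process.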

So that is Partitions and Offsets dealt with. Can we make our lives just a bit more difficult? Fortunately, we can. We can create a custom Schema and then parse the source data to populate a Schema-based Structure. But we will come to that later.
First let us establish the Framework for our Source Connector.

Source Connector - the Framework

The starting point for our Source Connector are two Java API classes: SourceConnector and SourceTask. We will put them into separate .scala source files but they are shown here together:

import org.apache.kafka.connect.source.{SourceConnector, SourceTask}

class HttpSourceConnector extends SourceConnector {...}
class HttpSourceTask extends SourceTask {...}

These two classes will be the basis for our Source Connector implementation:

  • HttpSourceConnector represents the Connector process management. Each Connector process will have only one SourceConnector instance.
  • HttpSourceTask represents the Kafka task doing the actual data integration work. There can be one or many Tasks active for an active SourceConnector instance.

We will have some additional classes for config and for HTTP access.
But first let us look at each of the two classes in more detail.

SourceConnector class

SourceConnector is an abstract class that defines an interface that our HttpSourceConnector needs to adhere to. The first function we need to override is config:

  private val configDef: ConfigDef =
      new ConfigDef()
          .define(HttpSourceConnectorConstants.HTTP_URL_CONFIG, Type.STRING, Importance.HIGH, "Web API Access URL")
          .define(HttpSourceConnectorConstants.API_KEY_CONFIG, Type.STRING, Importance.HIGH, "Web API Access Key")
          .define(HttpSourceConnectorConstants.API_PARAMS_CONFIG, Type.STRING, Importance.HIGH, "Web API additional config parameters")
          .define(HttpSourceConnectorConstants.SERVICE_CONFIG, Type.STRING, Importance.HIGH, "Kafka Service name")
          .define(HttpSourceConnectorConstants.TOPIC_CONFIG, Type.STRING, Importance.HIGH, "Kafka Topic name")
          .define(HttpSourceConnectorConstants.POLL_INTERVAL_MS_CONFIG, Type.STRING, Importance.HIGH, "Polling interval in milliseconds")
          .define(HttpSourceConnectorConstants.TASKS_MAX_CONFIG, Type.INT, Importance.HIGH, "Kafka Connector Max Tasks")
          .define(HttpSourceConnectorConstants.CONNECTOR_CLASS, Type.STRING, Importance.HIGH, "Kafka Connector Class Name (full class path)")

  override def config: ConfigDef = configDef

This defines and validates all the required configuration parameters. We also provide a description for each configuration parameter, which will be shown in the missing configuration error message.

HttpSourceConnectorConstants is an object where config parameter names are defined - these configuration parameters must be provided in the connector configuration file:

object HttpSourceConnectorConstants {
  val HTTP_URL_CONFIG               = "http.url"
  val API_KEY_CONFIG                = "http.api.key"
  val API_PARAMS_CONFIG             = "http.api.params"
  val SERVICE_CONFIG                = "service.name"
  val TOPIC_CONFIG                  = "topic"
  val TASKS_MAX_CONFIG              = "tasks.max"
  val CONNECTOR_CLASS               = "connector.class"

  val POLL_INTERVAL_MS_CONFIG       = "poll.interval.ms"
  val POLL_INTERVAL_MS_DEFAULT      = "5000"
}

Another simple function to be overridden is taskClass - for the SourceConnector class to know its corresponding SourceTask class.

  override def taskClass(): Class[_ <: SourceTask] = classOf[HttpSourceTask]

The last two functions to be overridden here are start and stop. These are called upon the creation and termination of a SourceConnector instance (not Task instance). JavaMap here is an alias for java.util.Map - a Java Map, which is not to be confused with the native Scala Map - that cannot be used here. (If you are a Python developer, a Map in Java/Scala is similar to the Python dictionary, but strongly typed.) The interface requires Java data structures, but that is fine - we can convert them from one to another. By far the biggest problem here is the assignment of the connectorConfig variable - we cannot have a functional programming friendly immutable value here. The variable is defined at the class level

  private var connectorConfig: HttpSourceConnectorConfig = _

and is set in the start function and then referred to in the taskConfigs function further down. This does not look pretty in Scala. Hopefully somebody will write a Scala wrapper for this interface.

Because there is no logout/shutdown/sign-out required for the HTTP API, the stop function just writes a log message.

  override def start(connectorProperties: JavaMap[String, String]): Unit = {
    Try (new HttpSourceConnectorConfig(connectorProperties.asScala.toMap)) match {
      case Success(cfg) => connectorConfig = cfg
      case Failure(err) =>
        connectorLogger.error(s"Could not start Kafka Source Connector ${this.getClass.getName} due to error in configuration.", err)
        // Fail fast: without a valid config, taskConfigs would later hit an unset connectorConfig.
        throw new ConnectException(err)
    }
  }

  override def stop(): Unit = {
    connectorLogger.info(s"Stopping Kafka Source Connector ${this.getClass.getName}.")
  }

HttpSourceConnectorConfig is a thin wrapper class for the configuration.
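The conversion between Java and Scala maps used in these functions comes from Scala's standard collection converters. A minimal sketch, assuming Scala 2.12 (on 2.13 the same calls are available through scala.jdk.CollectionConverters); the object and function names are hypothetical:

```scala
import java.util.{HashMap => JavaHashMap, Map => JavaMap}
import scala.collection.JavaConverters._

object MapConversionSketch {
  // Java map (what the Connect API hands us) to an immutable Scala Map.
  def toScala(props: JavaMap[String, String]): Map[String, String] =
    props.asScala.toMap

  // Scala Map back to a Java map (what the Connect API expects in return).
  def toJava(props: Map[String, String]): JavaMap[String, String] =
    props.asJava

  // Small sample built the Java way, for illustration only.
  def sampleJavaProps(): JavaMap[String, String] = {
    val props = new JavaHashMap[String, String]()
    props.put("topic", "JanisTest")
    props
  }
}
```

The toMap call matters: asScala alone only wraps the mutable Java map, whereas toMap copies it into an immutable Scala Map.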

We are almost done here. The last function to be overridden is taskConfigs.
This function is in charge of producing (potentially different) configurations for different Source Tasks. In our case, there is no reason for the Source Task configurations to differ. In fact, our HTTP API will benefit little from parallelism, so, to keep things simple, we can assume the number of tasks always to be 1.

  override def taskConfigs(maxTasks: Int): JavaList[JavaMap[String, String]] = List(connectorConfig.connectorProperties.asJava).asJava

The name of the taskConfigs function was changed in Kafka version 2.1.0 - please consider that when using this code for older Kafka versions.

Source Task class

In a similar manner to the Source Connector class, we implement the Source Task abstract class. It is only slightly more complex than the Connector class.

Just like for the Connector, there are start and stop functions to be overridden for the Task.

Remember the taskConfigs function from above? This is where task configuration ends up - it is passed to the Task's start function. Also, similarly to the Connector's start function, we parse the connection properties with HttpSourceTaskConfig, which is the same as HttpSourceConnectorConfig - configuration for Connector and Task in our case is the same.

We also set up the Http service that we are going to use in the poll function - we create an instance of the WeatherHttpService class. (Please note that start is executed only once, upon the creation of the task and not every time a record is polled from the data source.)

  override def start(connectorProperties: JavaMap[String, String]): Unit = {
    Try(new HttpSourceTaskConfig(connectorProperties.asScala.toMap)) match {
      case Success(cfg) => taskConfig = cfg
      case Failure(err) =>
        taskLogger.error(s"Could not start Task ${this.getClass.getName} due to error in configuration.", err)
        // Fail fast: the lines below would otherwise dereference an unset taskConfig.
        throw new ConnectException(err)
    }

    val apiHttpUrl: String = taskConfig.getApiHttpUrl
    val apiKey: String = taskConfig.getApiKey
    val apiParams: Map[String, String] = taskConfig.getApiParams

    val pollInterval: Long = taskConfig.getPollInterval

    taskLogger.info(s"Setting up an HTTP service for ${apiHttpUrl}...")
    Try( new WeatherHttpService(taskConfig.getTopic, taskConfig.getService, apiHttpUrl, apiKey, apiParams) ) match {
      case Success(service) =>  sourceService = service
      case Failure(error) =>    taskLogger.error(s"Could not establish an HTTP service to ${apiHttpUrl}")
                                throw error
    }

    taskLogger.info(s"Starting to fetch from ${apiHttpUrl} each ${pollInterval}ms...")
    running = new JavaBoolean(true)
  }

The Task also has the stop function. But, just like for the Connector, it does not do much, because there is no need to sign out from an HTTP API session.

Now let us see how we get the data from our HTTP API - by overriding the poll function.

The fetchRecords function uses the sourceService HTTP service initialised in the start function. sourceService's sourceRecords function requests data from the HTTP API.

  override def poll(): JavaList[SourceRecord] = this.synchronized { if(running.get) fetchRecords else null }

  private def fetchRecords: JavaList[SourceRecord] = {
    taskLogger.debug("Polling new data...")

    val pollInterval = taskConfig.getPollInterval
    val startTime    = System.currentTimeMillis

    val fetchedRecords: Seq[SourceRecord] = Try(sourceService.sourceRecords) match {
      case Success(records)                    => if(records.isEmpty) taskLogger.info(s"No data from ${taskConfig.getService}")
                                                  else taskLogger.info(s"Got ${records.size} results for ${taskConfig.getService}")
                                                  records

      case Failure(error: Throwable)           => taskLogger.error(s"Failed to fetch data for ${taskConfig.getService}: ", error)
                                                  Seq.empty[SourceRecord]
    }

    val endTime     = System.currentTimeMillis
    val elapsedTime = endTime - startTime

    if(elapsedTime < pollInterval) Thread.sleep(pollInterval - elapsedTime)

    fetchedRecords.asJava
  }

Phew - that is the interface implementation done. Now for the fun part...

Requesting data from OpenWeatherMap's API

The fun part is rather straightforward. We use the scalaj.http library to issue a very simple HTTP request and get a response.

Our WeatherHttpService implementation will have two functions:

  • httpServiceResponse that will format the request and get data from the API
  • sourceRecords that will parse the Schema and wrap the result within the Kafka SourceRecord class.

Please note that error handling takes place in the fetchRecords function above.

    override def sourceRecords: Seq[SourceRecord] = {
        val weatherResult: HttpResponse[String] = httpServiceResponse
        logger.info(s"Http return code: ${weatherResult.code}")
        val record: Struct = schemaParser.output(weatherResult.body)

        List(
            new SourceRecord(
                Map(HttpSourceConnectorConstants.SERVICE_CONFIG -> serviceName).asJava, // partition
                Map("offset" -> "n/a").asJava, // offset
                topic,
                schemaParser.schema,
                record
            )
        )
    }

    private def httpServiceResponse: HttpResponse[String] = {

        @tailrec
        def addRequestParam(accu: HttpRequest, paramsToAdd: List[(String, String)]): HttpRequest = paramsToAdd match {
            case (paramKey,paramVal) :: rest => addRequestParam(accu.param(paramKey, paramVal), rest)
            case Nil => accu
        }

        val baseRequest = Http(apiBaseUrl).param("APPID",apiKey)
        val request = addRequestParam(baseRequest, apiParams.toList)

        request.asString
    }

Parsing the Schema

Now the last piece of the puzzle - our Schema parsing class.

The short version of it, which would do just fine, is just 2 lines of class (actually - object) body:

object StringSchemaParser extends KafkaSchemaParser[String, String] {
    override val schema: Schema = Schema.STRING_SCHEMA
    override def output(inputString: String) = inputString
}

Here we say we just want to use the pre-defined STRING_SCHEMA value as our schema definition. And pass inputString straight to the output, without any alteration.

Looks too easy, does it not? Schema parsing could be a big part of Source Connector implementation. Let us implement a proper schema parser. Make sure you read the Confluent Developer Guide first.

Our schema parser will be encapsulated into the WeatherSchemaParser object. KafkaSchemaParser is a trait with two type parameters - the inbound and the outbound data type. The declaration below says that the parser receives data in String format and returns a Kafka Struct value.

object WeatherSchemaParser extends KafkaSchemaParser[String, Struct]

The first step is to create a schema value with the SchemaBuilder. Our schema is rather large, therefore I will skip most fields. The field names given are a reflection of the hierarchy structure in the source JSON. What we are aiming for is a flat, table-like structure - a likely Schema creation scenario.

For JSON parsing we will be using the Scala circe library, which in turn is based on the Scala Cats library. (If you are a Python developer, you will see that Scala JSON parsing is a bit more involved (this might be an understatement), but, on the flipside, you can be sure about the result you are getting out of it.)

    override val schema: Schema = SchemaBuilder.struct().name("weatherSchema")
        .field("coord-lon", Schema.FLOAT64_SCHEMA)
        .field("coord-lat", Schema.FLOAT64_SCHEMA)

        .field("weather-id", Schema.FLOAT64_SCHEMA)
        .field("weather-main", Schema.STRING_SCHEMA)
        .field("weather-description", Schema.STRING_SCHEMA)
        .field("weather-icon", Schema.STRING_SCHEMA)
        
        // ...
        
        .field("rain", Schema.FLOAT64_SCHEMA)
        
        // ...

Next we define case classes, into which we will be parsing the JSON content.

   case class Coord(lon: Double, lat: Double)
   case class WeatherAtom(id: Double, main: String, description: String, icon: String)

That is easy enough. Please note that the case class attribute names match one-to-one with the attribute names in JSON. However, our Weather JSON schema is rather relaxed when it comes to attribute naming. You can have names like type and 3h, both of which are invalid value names in Scala. What do we do? We give the attributes valid Scala names and then implement a decoder:

    case class Rain(threeHours: Double)
    object Rain {
        implicit val decoder: Decoder[Rain] = Decoder.instance { h =>
            for {
                threeHours <- h.get[Double]("3h")
            } yield Rain(
                threeHours
            )
        }
    }

The Rain case class is rather short, with only one attribute. The corresponding JSON name is 3h, so the decoder maps '3h' to the Scala attribute threeHours.

Not quite as simple as JSON parsing in Python, is it?

In the end, we assemble all sub-case classes into the WeatherSchema case class, representing the whole result JSON.

    case class WeatherSchema(
                                coord: Coord,
                                weather: List[WeatherAtom],
                                base: String,
                                mainVal: Main,
                                visibility: Double,
                                wind: Wind,
                                clouds: Clouds,
                                dt: Double,
                                sys: Sys,
                                id: Double,
                                name: String,
                                cod: Double
                            )

Now, the parsing itself. (Drums, please!)

structInput here is the input JSON in String format. WeatherSchema is the case class we created above. The circe decode function returns a Scala Either: the error on the Left, the successfully parsed result on the Right - nice and tidy. And safe.

        val weatherParsed: WeatherSchema = decode[WeatherSchema](structInput) match {
            case Left(error) => {
                logger.error(s"JSON parser error: ${error}")
                emptyWeatherSchema
            }
            case Right(weather) => weather
        }

Now that we have the WeatherSchema object, we can construct our Struct object that will become part of the SourceRecord returned by the sourceRecords function in the WeatherHttpService class. That in turn is called from the HttpSourceTask's poll function that is used to populate the Kafka topic.

        val weatherStruct: Struct = new Struct(schema)
            .put("coord-lon", weatherParsed.coord.lon)
            .put("coord-lat", weatherParsed.coord.lat)

            .put("weather-id", weatherParsed.weather.headOption.getOrElse(emptyWeatherAtom).id)
            .put("weather-main", weatherParsed.weather.headOption.getOrElse(emptyWeatherAtom).main)
            .put("weather-description", weatherParsed.weather.headOption.getOrElse(emptyWeatherAtom).description)
            .put("weather-icon", weatherParsed.weather.headOption.getOrElse(emptyWeatherAtom).icon)

            // ...

Done!

Considering that Schema parsing in our simple example was optional, creating a Kafka Source Connector for us meant creating a Source Connector class, a Source Task class and a Source Service class.

Creating JAR(s)

JAR creation is described in Confluent's Connector Development Guide. The guide mentions two options. Either all the library dependencies are added to the target JAR file, a.k.a. an 'uber-JAR', or the dependencies are copied to the target folder. In the latter case they must all reside in the same folder, with no subfolder structure. For no particular reason, I went with the latter option.

The Developer Guide says it is important not to include the Kafka Connect API libraries there. (Instead they should be added to CLASSPATH.) Please note that for the latest Kafka versions it is advised not to add these custom JARs to CLASSPATH. Instead, we will add them to connectors' plugin.path. But that we will leave for another blog post.

Scala - was it worth using it?

Only if you are a big fan. The code I wrote is very Java-like and it might have been better to write it in Java. However, if somebody writes a Scala wrapper for the Connector interfaces, or, even better, if a Kafka Scala API is released, writing Connectors in Scala would be a very good choice.

Categories: BI & Warehousing

The Importance of Feature Engineering and Selection

Rittman Mead Consulting - Tue, 2019-02-19 10:27

In machine learning, your model is only ever as good as the data you train it on. As such, a significant proportion of your effort should be focused on creating a dataset that maximises the information density of your data. Feature engineering and selection are the methods used for achieving this goal.

In this context, the definition of a feature will be a column or attribute of the data.

Feature engineering is a broad term that covers a number of manipulations that may be carried out on your dataset, so many processes could be considered part of it. In this post I introduce some of the high-level activities carried out as part of feature engineering, as well as some of the most common methods of feature selection, but this is by no means an exhaustive list.

Engineering Features

Feature engineering is the process by which knowledge of data is used to construct explanatory variables, features, that can be used to train a predictive model. Engineering and selecting the correct features for a model will not only significantly improve its predictive power, but will also offer the flexibility to use less complex models that are faster to run and more easily understood.

At the start of every machine learning project the raw data will inevitably be messy and unsuitable for training a model. The first step is always data exploration and cleaning, which involves changing data types and removing or imputing missing values. With the understanding of the data gained through exploration, it can be prepared in such a way that it is useful for the model. This may include removing outliers or specific features you don’t want the model to learn, as well as creating features that better represent the underlying problem, facilitating the machine learning process and resulting in improved model accuracy.

Unprocessed data will likely contain features with the following problems:

Issue                                    Solution
Missing values                           Imputed in data cleaning
Does not belong to the same dimension    Normalisation/standardisation
Information redundancy                   Filtered out in feature selection

Decomposing or Splitting Features

One form of feature engineering is to decompose raw attributes into features that will be easier to interpret patterns from. For example, decomposing dates or timestamp variables into a variety of constituent parts may allow models to discover and exploit relationships. Common time frames for which trends occur include: absolute time, day of the year, day of the week, month, hour of the day, minute of the hour, year, etc. Breaking dates up into new features such as this will help a model better represent structures or seasonality in the data. For example, if you were investigating ice cream sales, and created a “Season of Sale” feature, the model would recognise a peak in the summer season. However, an “Hour of Sale” feature would reveal an entirely different trend, possibly peaking in the middle of each day.
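As a minimal sketch of this kind of decomposition, the Python snippet below splits a single timestamp into several calendar features. The function name, the feature names and the month-to-season mapping (meteorological seasons, northern hemisphere) are assumptions made for illustration, not part of any particular library.

```python
from datetime import datetime

# Assumed mapping: meteorological seasons for the northern hemisphere.
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}

def decompose_timestamp(ts: datetime) -> dict:
    """Split one timestamp into simpler features a model can use directly."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_year": ts.timetuple().tm_yday,
        "day_of_week": ts.weekday(),          # Monday = 0 ... Sunday = 6
        "hour_of_day": ts.hour,
        "minute_of_hour": ts.minute,
        "season_of_sale": SEASONS[ts.month],
    }

# One hypothetical ice cream sale.
sale_features = decompose_timestamp(datetime(2019, 7, 14, 13, 30))
print(sale_features)
```

Each key becomes its own column, so a model can pick up a seasonal peak and an hourly peak independently.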

Your data can also be binned into buckets and converted into factors (numerical categories) or flattened into a column per category with flags. Which of these will work best for your data depends on a number of factors, including how many categorical values you have and their frequency. (A similar process can be utilised for natural language processing or textual prediction; see bag of words.)
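Both options can be sketched in a few lines of plain Python. The bucket edges, labels and helper names below are hypothetical and only meant to show the shape of the result.

```python
def bin_value(value, edges, labels):
    """Return the label of the first bucket whose upper edge is >= value."""
    for upper, label in zip(edges, labels):
        if value <= upper:
            return label
    return labels[-1]  # anything above the last edge falls in the final bucket

def one_hot(values):
    """Flatten a categorical column into one 0/1 flag column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

temperatures = [4.0, 17.5, 29.0]
# Hypothetical bucket edges in degrees Celsius.
buckets = [bin_value(t, edges=[10, 20], labels=["cold", "mild", "hot"])
           for t in temperatures]
flags = one_hot(buckets)

print(buckets)
print(flags)
```

With many rare categories the flag approach produces a lot of sparse columns, which is one reason the choice depends on how many values you have and how often they occur.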

Data Enrichment

Data enrichment is the process of creating new features by introducing data from external sources. Externally collated data can be invaluable for prediction success, and there is a plethora of publicly accessible datasets that will, in most situations, yield impactful features.

Third party datasets could include attributes that are challenging or costly to collect directly; or are possibly more accurately available online.

It is important when enriching a dataset to consider the relevance of sources, as irrelevant features will unnecessarily complicate the model, adding to the noise and increasing the chance of overfitting. For example, when working with dates it is generally insightful to introduce data on national holidays. In the case of our ice cream sales example, you may want to include national holidays, temperature and weather features, as these would be expected to influence sales. However, temperature or weather data from another country or region will not be relevant: in the best case it will have no relation to the data, and in the worst case it will show a spurious correlation and mislead the model when training.

Feature Transformations

Feature transformations can include aggregating or combining attributes to create new features. Useful and relevant features will depend on the problem at hand but averages, sums and ratios over different groupings can better expose trends to a model.

Multiplying or aggregating features to create new combined features can help with this. Categorical features can be combined into a single feature containing all combinations of the two categories. This can easily be overdone, and care is needed so as not to overfit due to misleading combined features.
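To make this concrete, here is a small sketch that adds a group average, a ratio against that average, and a combined categorical feature to a handful of rows. The data and the column names are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ice cream sales rows.
sales = [
    {"store": "A", "season": "summer", "units": 120},
    {"store": "A", "season": "winter", "units": 30},
    {"store": "B", "season": "summer", "units": 80},
    {"store": "B", "season": "winter", "units": 40},
]

# Aggregate: average units sold per store.
units_by_store = defaultdict(list)
for row in sales:
    units_by_store[row["store"]].append(row["units"])
store_mean = {store: mean(units) for store, units in units_by_store.items()}

for row in sales:
    # Aggregated feature and a ratio against it.
    row["store_mean_units"] = store_mean[row["store"]]
    row["units_vs_store_mean"] = row["units"] / store_mean[row["store"]]
    # Combined categorical feature: one value per store/season combination.
    row["store_season"] = f'{row["store"]}_{row["season"]}'

for row in sales:
    print(row)
```

Note that the number of combined categories grows multiplicatively with the number of values in each input, which is where the risk of overfitting comes from.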

It is possible to identify higher order interactions via a simple decision tree: the initial branches can be used to identify which features to combine.

A general requirement for some machine learning algorithms is standardisation/normalisation. Standardisation rescales each feature so that it is centred around 0 with a standard deviation of 1, as in a standard normal distribution. The benefits are that variables with larger magnitudes are not over-emphasised and that measurements in different units become comparable.
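A minimal sketch of standardisation (the z-score) in plain Python follows. The helper name and the two sample columns are assumptions; the population standard deviation is used here, which is one common choice rather than the only one.

```python
from statistics import mean, pstdev

def standardise(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)          # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - mu) / sigma for v in values]

# Two hypothetical features on very different scales.
temperature_c = [10.0, 20.0, 30.0]
units_sold = [1000.0, 2000.0, 3000.0]

z_temperature = standardise(temperature_c)
z_units = standardise(units_sold)

print(z_temperature)
print(z_units)
```

After rescaling, both columns take the same values even though the raw magnitudes differ by a factor of 100, so neither dominates a distance-based or gradient-based model purely because of its units.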

Automated Feature Engineering

Engineering features manually as described above can be very time consuming and requires a good understanding of the underlying data, structures in the data, the problem you are trying to solve and how best to represent the data to have the desired effect. Manual feature engineering is problem specific and cannot be applied to another dataset or problem.

There has been some progress made in the automation of feature engineering. FeatureTools, for example, is a Python framework for transforming datasets into feature matrices. In my opinion there are positives and negatives to such an approach. Feature engineering is time-consuming and any automation of this process would be beneficial. However, creating many useless features will lead to overfitting, and automatically created features can result in loss of interpretability and understanding.

Feature Selection

Of the features now available in your data set, some will be more influential than others on the model accuracy. Feature selection aims to reduce the dimensionality of the problem by removing redundant or irrelevant features. A feature may be redundant if it is highly correlated with another feature because both are based on the same underlying information. These types of features can be removed from the data set without any loss of information. In our ice cream example, sales may be correlated with both temperature and suncream usage, but the relationship with suncream arises only because suncream usage is itself correlated with the confounding variable, temperature.

Reducing the number of features through feature selection ensures training the model will require less memory and computational power, leading to shorter training times and will also help to reduce the chance of overfitting. Simplification of the training data will also make the model easier to interpret, which can be important when justifying real-world decision making as a result of model outputs.

Feature Selection Methods

Feature selection algorithms rank or score features based on a number of methods so that the least significant features can be removed. In general, the features are chosen from two perspectives: feature divergence, and correlations between features and the dependent variable (the value being predicted). Some models have built-in feature selection that aims to reduce or discount features as part of the model building process, for example LASSO Regression.

Methods that can be used to reduce features include:

Correlation

A feature that is strongly correlated with the dependent variable may be important to the model. The correlation coefficients produced are univariate and therefore only correspond to each individual feature’s relationship to the dependent variable, as opposed to combinations of features.
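As a rough illustration, the sketch below scores each feature by its Pearson correlation with the dependent variable and ranks the features by absolute score. The data and feature names are made up, and Pearson is only one of several possible correlation measures.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)

# Hypothetical features and the dependent variable (ice cream sales).
features = {
    "temperature": [12, 18, 24, 30],
    "day_of_month": [7, 21, 3, 14],
}
sales = [40, 65, 90, 110]

scores = {name: pearson(column, sales) for name, column in features.items()}
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

print(scores)
print(ranked)
```

Because each score looks at one feature in isolation, a feature that only matters in combination with another can still rank low here.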

Near Zero Variance

Depending on the problem you are dealing with you may want to remove constant and almost constant features across samples. There are functions that will remove these automatically such as nzv() in R. They can be tuned from removing only features which have a single unique value across all samples or those that have a few unique values across the set, to those with a large ratio of the most common value to the second most common.
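The two tuning knobs described above can be sketched in Python as follows. This is not the R function itself, just an illustration of the idea; the function name and the default thresholds are assumptions chosen for the example.

```python
from collections import Counter

def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column as constant or almost constant.

    freq_cut:   minimum ratio of the most common value to the second
                most common value for the column to be flagged.
    unique_cut: maximum percentage of distinct values (out of all
                samples) for the column to be flagged.
    Both defaults are illustrative assumptions.
    """
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # a single unique value: zero variance
    freq_ratio = counts[0][1] / counts[1][1]
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut

constant = [1] * 100
almost_constant = [0] * 98 + [1] * 2
balanced = [0] * 50 + [1] * 50
continuous = list(range(100))

for name, column in [("constant", constant), ("almost_constant", almost_constant),
                     ("balanced", balanced), ("continuous", continuous)]:
    print(name, near_zero_variance(column))
```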

Principal component analysis (PCA)

PCA is an unsupervised dimensionality reduction method. Its purpose is to find the directions (the so-called principal components) in feature space that maximise the variance in the dataset. You are essentially finding the axes of feature space that follow the shape of the data, where there is the greatest variation and therefore the most information. A very simple example would be a 3D feature space of x, y, z. If you looked at the data in the x, y plane and all of your points were tightly clustered together, this would not be a very good plane to view your data structure through. However, if you viewed it in the x, z plane and your data was spread out, this would be much more useful, as you would be able to observe a trend in the data. Principal components are the dimensions along which your data points are most spread out but, unlike the example above, feature space will have n dimensions rather than 3, and a principal component can be expressed as a single feature or as a combination of many existing features.
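The x, y, z example can be sketched in plain Python. The snippet below finds only the first principal component, using power iteration on the covariance matrix; this is one simple way to get the leading direction, not how a production PCA implementation would necessarily do it. The data points and the function name are invented for illustration.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return the unit vector along which the data varies the most."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical points: spread out along x and z, tightly clustered in y.
points = [
    (1.0, 0.5, 2.1),
    (2.0, 0.5, 3.9),
    (3.0, 0.6, 6.2),
    (4.0, 0.5, 7.8),
    (5.0, 0.4, 10.1),
]

pc1 = first_principal_component(points)
print(pc1)
```

The resulting vector has almost no y component and is a mix of x and z, which matches the intuition that the useful view of this data is the x, z plane.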

Linear discriminant analysis (LDA)

LDA is a supervised dimensionality reduction method, using known class groupings. It achieves a similar goal to PCA, but instead of finding the axes that maximise the variance, it will represent the axes that maximise the separation between multiple classes. These are called linear discriminants.

For multi-class classification, it would be assumed that LDA would achieve better results than PCA, but this is not always the case.
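For the simplest case of two classes and two features, the discriminant direction can be computed by hand. The sketch below uses Fisher's linear discriminant (the within-class scatter matrix inverted and applied to the difference of the class means); the data and the function name are assumptions for illustration, and the multi-class case needs more machinery than shown here.

```python
def fisher_direction(class_a, class_b):
    """Direction that best separates two classes of 2D points."""
    def mean2(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    mean_a, mean_b = mean2(class_a), mean2(class_b)
    scatter_a, scatter_b = scatter(class_a, mean_a), scatter(class_b, mean_b)
    # Within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = (scatter_a[i] + scatter_b[i] for i in range(3))
    det = sxx * syy - sxy * sxy
    dx, dy = mean_b[0] - mean_a[0], mean_b[1] - mean_a[1]
    # Apply the inverse of the 2x2 scatter matrix to the mean difference.
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    return wx, wy

# Hypothetical data: both classes vary widely in x, but differ only in y.
class_a = [(0, 0.0), (10, 0.5), (20, 0.0), (30, 0.5)]
class_b = [(0, 2.0), (10, 2.5), (20, 2.0), (30, 2.5)]

direction = fisher_direction(class_a, class_b)
print(direction)
```

Here the greatest variance lies along x, which is the direction PCA would favour, whereas the discriminant points almost entirely along y, the feature that actually separates the classes.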

Summary

The features in your data will influence the results that your predictive model can achieve.

Having and engineering good features will allow you to most accurately represent the underlying structure of the data and therefore create the best model.

Features can be engineered by decomposing or splitting existing features, by bringing in external data sources, or by aggregating or combining features to create new ones.

Feature selection reduces the computation time and resources needed to create models as well as preventing overfitting which would degrade the performance of the model. The flexibility of good features allows less complex models, which would be faster to run and easier to understand, to produce comparable results to the complex ones.

Complex predictive modelling algorithms perform feature importance and selection internally while constructing models. These models can also report on the variable importance determined during the model preparation process. However, this is computationally intensive and by first removing the most obviously unwanted features, a great deal of unnecessary processing can be avoided.

Categories: BI & Warehousing

Understanding grid disks in Exadata

Amardeep Sidhu - Mon, 2019-02-18 07:07

Use of Exadata storage cells seems to be a very poorly understood concept. A lot of people are confused about how exactly ASM makes use of disks from storage cells. Many folks assume there is some sort of RAID configured in the storage layer, whereas there is nothing like that. I will try to explain some of the concepts in this post.

Let’s take an example of an Exadata quarter rack that has 2 db and 3 storage nodes (node means a server here). A few things to note:

  • The space for binaries installation on db nodes comes from the local disks installed in db nodes (600GB * 4 (expandable to 8) configured in RAID5). If you are using OVM, the same disks are used for keeping configuration files, virtual disks for VMs, etc.
  • All of the ASM space comes from storage cells. The minimum configuration is 3 storage cells.

So let’s try to understand what makes a storage cell. There are 12 disks in each storage cell (the latest X7 cells come with 10 TB disks). As I mentioned above, there are 3 storage cells in a minimum configuration. So we have a total of 36 disks. There is no RAID configured in the storage layer. All the redundancy is handled at ASM level. So to create a disk group:

  • First of all cell disks are created on each storage cell. 1 physical disk makes 1 cell disk. So a quarter rack has 36 cell disks.
  • To divide the space between the various disk groups (by default only two disk groups are created: DATA & RECO; you can choose how much space to give to each of them), grid disks are created. A grid disk is a partition on the cell disk, a slice of a disk in other words. A slice from each cell disk must be part of both disk groups. We can’t have something like, say, DATA having 18 disks out of 36 and RECO having the other 18. That is not supported. Let’s say you decide to allocate 5 TB to DATA grid disks and 4 TB to RECO grid disks (out of 10 TB on each disk, approx 9 TB is what you get as usable). So you will divide each cell disk into 2 parts – 5 TB and 4 TB – and you will have 36 slices of 5 TB each and 36 slices of 4 TB each.
  • DATA disk group will be created using the 36 5 TB slices where grid disks from each storage cell constitute one failgroup.
  • Similarly RECO disk group will be created using the 36 4 TB slices.
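The arithmetic of this layout can be sketched in a few lines of Python. The grid disk and failgroup names generated below are hypothetical placeholders, and the totals are raw capacity before any ASM redundancy is applied.

```python
STORAGE_CELLS = 3          # quarter rack
DISKS_PER_CELL = 12        # High Capacity disks, one cell disk each
SLICES_TB = {"DATA": 5, "RECO": 4}   # grid disk size carved from every cell disk

grid_disks = []
for cell in range(1, STORAGE_CELLS + 1):
    for disk in range(DISKS_PER_CELL):
        # Every cell disk contributes one slice to each disk group.
        for group, size_tb in SLICES_TB.items():
            grid_disks.append({
                "name": f"{group}_CD_{disk:02d}_cell{cell:02d}",  # hypothetical name
                "disk_group": group,
                "failgroup": f"cell{cell:02d}",  # one failgroup per storage cell
                "size_tb": size_tb,
            })

cell_disk_count = STORAGE_CELLS * DISKS_PER_CELL

summary = {}
for group in SLICES_TB:
    members = [g for g in grid_disks if g["disk_group"] == group]
    summary[group] = {
        "grid_disks": len(members),
        "failgroups": len({g["failgroup"] for g in members}),
        "raw_tb": sum(g["size_tb"] for g in members),  # before ASM redundancy
    }

print(cell_disk_count)
print(summary)
```

Changing DISKS_PER_CELL to 6 gives the eighth rack figures, and 8 gives the Extreme Flash count, with everything else unchanged.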

What we have discussed above is a quarter rack scenario with High Capacity (HC) disks. There can be somewhat different configurations too:

  • Instead of HC disks, you can have the Extreme Flash (EF) configuration which uses flash cards in place of disks. Everything remains the same except the number. Instead of 12 HC disks there will be 8 flash cards.
  • With X3, I think, Oracle introduced an eighth rack configuration. In an eighth rack configuration db nodes come with half the cores (of quarter rack db nodes) and storage cells come with 6 disks in each cell. So here you would have only 18 disks in total. Everything else works in the same way.

Hope this clarifies some of the doubts about grid disks.


Categories: BI & Warehousing

Is ETL still necessary?

Dylan's BI Notes - Tue, 2019-01-22 23:10
ETL stands for Extract, Transform, and Load. Extract and Load, their existence itself implies that the source data and target data are stored separately, so you need to extract from source and load the data into the target data store. Extract and Load won’t go away if the data used for reporting is not stored […]
Categories: BI & Warehousing

ORA-04080: trigger ‘PRICE_HISTORY_TRIGGERV1’ does not exist

Amardeep Sidhu - Tue, 2019-01-22 07:45

It is actually a dumb one. I was disabling triggers in a schema and ran this SQL to generate the disable statements. (Example from here)

HR@test> select 'alter trigger '||trigger_name|| ' disable;' from user_triggers where table_name='PRODUCT';

'ALTERTRIGGER'||TRIGGER_NAME||'DISABLE;'
--------------------------------------------------------------------------------
alter trigger PRICE_HISTORY_TRIGGERv1 disable;

HR@test> alter trigger PRICE_HISTORY_TRIGGERv1 disable;
alter trigger PRICE_HISTORY_TRIGGERv1 disable
*
ERROR at line 1:
ORA-04080: trigger 'PRICE_HISTORY_TRIGGERV1' does not exist


HR@test>

WTF? It is there but the disable didn’t work. I was in a hurry, tried to connect through SQL Developer and disable it there, and it worked! Double WTF! Then I spotted the problem. Someone had created it with one lowercase letter in the name. Oracle folds unquoted identifiers to upper case, so the unquoted name never matches. To make it work, we need to use double quotes.

HR@test> alter trigger "PRICE_HISTORY_TRIGGERv1" disable;

Trigger altered.

HR@test>

One of the reasons why you shouldn’t use case sensitive names in Oracle. That is stupid.
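The mismatch, and a safer way to generate the statements, can be sketched in Python. The helper names are made up for illustration; the only behaviour modelled is that an unquoted identifier is looked up in upper case while a double-quoted one is taken as written.

```python
def resolve_unquoted(identifier: str) -> str:
    """Model how an unquoted identifier is looked up: folded to upper case."""
    return identifier.upper()

def disable_statement(trigger_name: str) -> str:
    """Build the statement with the name double-quoted, so case is preserved."""
    return f'alter trigger "{trigger_name}" disable;'

stored_name = "PRICE_HISTORY_TRIGGERv1"   # as stored in the data dictionary

# Unquoted: the lookup name differs from the stored name, hence ORA-04080.
lookup_name = resolve_unquoted(stored_name)
print(lookup_name, lookup_name == stored_name)

# Quoted: the generated statement refers to the exact stored name.
statement = disable_statement(stored_name)
print(statement)
```

The same idea applies to the generating query: wrapping trigger_name in double quotes when building the statements makes them work whether or not a name contains lowercase letters.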

Categories: BI & Warehousing

Oracle OpenWorld Europe : London 2019

Rittman Mead Consulting - Tue, 2019-01-22 07:39

Some eleven thousand people descended on Oracle OpenWorld Europe in London last week for two days of business and technical sessions delivered by a mixture of members of Oracle’s product team and end users giving real-world case studies of adoption of Oracle’s Cloud offerings and product roadmaps.

Something that may not surprise anyone is that at OpenWorld, to speak of anything other than Cloud or Autonomous would be somewhat blasphemous.

It’s a shrewd move this by Oracle to branch outside of their flagship annual conference held in San Francisco in October, and the attendance backed up the rationale that offering free entry was the right thing to do.

Some of the observations that I made after attending were:

The future is Autonomous

Oracle’s Autonomous Database offering is being heavily pushed despite being a relatively immature product with very few real-world examples yet. The concept is certainly valid and it’s worth new and existing customers of Oracle seriously considering trialling.

There are two autonomous offerings. The autonomous data warehouse (ADW) and autonomous transaction processing (ATP).

Both are fully cloud managed by Oracle, are elastic so that they can be scaled up and down on demand, and most importantly - are autonomous. So the marketing spiel goes, they are self driving, self securing, self repairing. You’ll see this a lot but basically it means that the manual tasks that a DBA would normally perform are taken care of by Oracle. Think patching etc…

AI & ML

You can tell that Oracle are really getting behind the latest trends in the technology market. AI will be a feature of all of their Cloud applications, with Mark Hurd (Oracle CEO) predicting that by 2025 all applications on the market will have AI factored in (fair prediction).

Furthermore, Oracle's 2018 acquisition of DataScience.com shows the strategic vision of the company's board.

Blockchain

Also picking up on the cyber security side of things, Oracle spoke a lot about the role that Blockchain will play in enterprises going forwards. Oracle’s Blockchain cloud platform offering gives enterprises a rapid and simplified deployment of blockchain networks.

Final Thoughts

In summary, this was a really good event for Oracle to run and I really hope they continue to do so. It gave a chance for the Oracle community to come together again, and in an increasingly competitive market for Cloud, Oracle needs to keep investing in its community going forwards.

Conceptually Oracle has some very timely cloud offerings in their armoury and it will be interesting to come back in 12 months' time and see how the adoption of these applications & platforms is going.

Categories: BI & Warehousing

Is Star Schema necessary?

Dylan's BI Notes - Fri, 2019-01-18 12:30
A star schema describes the data by fact and dimension. From one angle, it is a data modeling technique for designing the data warehouse based on relational database technology.  In the old OLAP world, even though a cube also links to the dimensions that describe the measure, we typically won’t call them Star Schema. […]
Categories: BI & Warehousing

Error while running ggsci

Amardeep Sidhu - Sat, 2019-01-12 08:21

This was another issue that I faced while trying to configure GoldenGate in HA mode. ggsci was working fine after normal installation but after configuring it in HA mode and trying to run ggsci, it resulted in this:

[oragg@node2 product]$ ggsci
Oracle GoldenGate Command Interpreter for Oracle
Version 12.3.0.1.4 OGGCORE_12.3.0.1.0_PLATFORMS_180415.0359_FBO
Linux, x64, 64bit (optimized), Oracle 12c on Apr 16 2018 00:53:30
Operating system character set identified as UTF-8.
Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
2019-01-08 16:28:37.913
CLSD: An error occurred while attempting to generate a full name. Logging may not be active for this process
Additional diagnostics: CLSU-00100: operating system function: sclsdgcwd failed with error data: -1
CLSU-00103: error location: sclsdgcwd2
(:CLSD00183:)
GGSCI (exadatadb02.industowers.com) 1>

No obvious clues in the error message, but a little searching revealed that it had something to do with permissions. It was on Exadata, so I tried to do a strace of ggsci to see if it could give some clues. There we go:

[oragg@node2 product]$ strace ggsci
.
.
mkdir("/u01/app/oracle/product/12.1.0.2/dbhome_4/log/exadatadb02", 01777) = -1 EACCES (Permission denied)

That is the Oracle database home that GoldenGate is supposed to use. It is trying to create a directory at the mentioned path and is not able to do it. There was another directory called client needed inside this. I created both of them, set the needed permissions & the sticky bit, and it worked fine. It was working fine on the other node, so I could check the permissions over there and do the same on this node.

Categories: BI & Warehousing

Failed to execute the command “”/u01/app/xag/bin/clsecho”

Amardeep Sidhu - Tue, 2019-01-08 11:22

I was configuring GoldenGate in HA mode by following this document. Everything worked OK, but at the end, while running agctl config goldengate to view the configuration of the GoldenGate resource, it failed with the following error:

[oracle@exadatadb02 ~]$ agctl config goldengate GG_TARGET
Failed to execute the command ""/u01/app/xag/bin/clsecho" -p xag -f xag -m 5080 "GG_TARGET"" (rc=134), with the message:
Oracle Clusterware infrastructure fatal error in clsecho.bin (OS PID 126367_140570897783808): Internal error (ID (:CLSB00107:)) - Error -1 (ORA-08275) determining Oracle base
/u01/app/xag/bin/clsecho: line 45: 126367 Aborted (core dumped) ${CRS_HOME}/bin/clsecho.bin "$@"
Failed to execute the command ""/u01/app/xag/bin/clsecho" -p xag -f xag -m 5081 "/u01/app/oragg/product"" (rc=134), with the message:

If you look at the error (Error -1 (ORA-08275) determining Oracle base), it sounds kind of obvious that it is not able to figure out where the ORACLE_BASE is. But somehow it didn’t strike me at that moment, so I started looking around. If we look at the command it is running, it runs clsecho. This is simply a shell script which in turn calls $CRS_HOME/bin/clsecho.bin. The script sets various environment variables, and that is where the problem was. There are lines like:

ORACLE_BASE=
export ORACLE_BASE

Nowhere in the script is the value of ORACLE_BASE actually set, and that was causing the issue. I changed the first line to set the ORACLE_BASE location and it worked fine after that. There was another issue I faced with ggsci after doing the xag configuration. I will do another blog post on that.
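To spot this kind of thing faster next time, a small check along these lines could scan a wrapper script for variables that are assigned nothing. It is a rough sketch: `find_empty_assignments` is a hypothetical helper, and it only catches the plain `NAME=` form seen above, not quoted empty strings or values set elsewhere.

```python
import re

# Matches lines such as "ORACLE_BASE=" where nothing follows the equals sign.
_EMPTY_ASSIGNMENT = re.compile(r"^\s*(?:export\s+)?([A-Za-z_]\w*)=\s*$")


def find_empty_assignments(script_text):
    """Return (line_number, variable_name) for every empty assignment."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        match = _EMPTY_ASSIGNMENT.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

Feeding it the contents of the clsecho script would report the ORACLE_BASE line directly.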

Categories: BI & Warehousing

dbca doesn’t list diskgroups

Amardeep Sidhu - Wed, 2018-12-26 09:31

This is an Exadata machine running GI version 18.3.0.0.180717 and DB version 12.1.0.2.180717. On one of the DB nodes, dbca doesn’t list the diskgroups. It works fine on the other node.

I checked the dbca trace and found that the kfod command was failing. I tried to run it manually and got the same error:

[oracle@exadb01 ~]$ /u01/app/18.0.0.0/grid/bin/kfod op=groups verbose=true
KFOD-00300: OCI error [-1] [OCI error] [Could not fetch details] [-105777048]

KFOD-00105: Could not open pfile 'init@.ora'
[oracle@exadb01 ~]$

I ran it with strace then:

[oracle@exadb01 ~]$ strace /u01/app/18.0.0.0/grid/bin/kfod op=groups verbose=true
execve("/u01/app/18.0.0.0/grid/bin/kfod", ["/u01/app/18.0.0.0/grid/bin/kfod", "op=groups", "verbose=true"], [/* 18 vars */]) = 0
brk(0) = 0x2641000
.
.
.
.
.
open("/u01/app/18.0.0.0/grid/dbs/ab_+ASM1.dat", O_RDONLY) = -1 EACCES (Permission denied)
geteuid() = 1003
open("/u01/app/18.0.0.0/grid/rdbms/mesg/kfodus.msb", O_RDONLY) = 13
fcntl(13, F_SETFD, FD_CLOEXEC) = 0
lseek(13, 0, SEEK_SET) = 0
read(13, "\25\23\"\1\23\3\t\t\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"…, 280) = 280
lseek(13, 512, SEEK_SET) = 512
read(13, "\352\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"…, 512) = 512
lseek(13, 1024, SEEK_SET) = 1024
read(13, ".\1=\1E\1M\1X\1\352\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"…, 512) = 512
lseek(13, 1536, SEEK_SET) = 1536
read(13, "\n\0d\0\0\0D\0e\0\1\0e\0f\0\1\0\230\0g\0\1\0\306\0h\0\2\0\325\0"…, 512) = 512
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), …}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f43f85f2000
write(1, "KFOD-00300: OCI error [-1] [OCI "…, 78KFOD-00300: OCI error [-1] [OCI error] [Could not fetch details] [-132605848]
) = 78
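A trace like this can run to thousands of lines, so it helps to filter it for failed calls. The snippet below is a minimal sketch that assumes the trace was saved to a file first (strace has an `-o` option for that) and that failed calls look like the EACCES line above; `failed_calls` is a name made up for this example.

```python
import re

# A failed call in strace output looks like:
#   open("/some/path", O_RDONLY) = -1 EACCES (Permission denied)
_FAILED_CALL = re.compile(r'^(\w+)\("([^"]*)".*\)\s+=\s+-1\s+(E[A-Z]+)')


def failed_calls(strace_output, errno_name="EACCES"):
    """Return (syscall, path) pairs for calls that failed with errno_name."""
    hits = []
    for line in strace_output.splitlines():
        match = _FAILED_CALL.match(line.strip())
        if match and match.group(3) == errno_name:
            hits.append((match.group(1), match.group(2)))
    return hits
```

Only calls whose first argument is a quoted path are matched, which is enough for permission problems on files and directories.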

The EACCES (Permission denied) line for ab_+ASM1.dat just before the kfod error caught my attention. When I checked, the oracle user indeed wasn’t able to read the file. The permissions looked like this:

[root@exadb01 dbs]# ls -ltr
total 20
-rw-r--r-- 1 oragrid oinstall 3079 May 14 2015 init.ora
-rw-r--r-- 1 oragrid oinstall 587 Dec 12 15:33 initbackuppfile.ora
-rw-rw---- 1 oragrid asmadmin 1656 Dec 20 14:26 ab_+ASM1.dat
-rw-rw---- 1 oragrid oinstall 1544 Dec 20 14:26 hc_+APX1.dat
-rw-rw---- 1 oragrid oinstall 1544 Dec 21 16:57 hc_+ASM1.dat
[root@exadb01 dbs]#

Whereas on node2 they were like:

[oracle@exadb02 dbs]$ ls -ltr 
total 16
-rwxrwxrwx 1 oragrid oinstall 3079 Dec 12 14:52 init.ora
-rwxrwxrwx 1 oragrid oinstall 1544 Dec 21 16:57 hc_+ASM2.dat
-rw-rw---- 1 oragrid oinstall 1720 Dec 21 16:57 ab_+ASM2.dat
-rwxrwxrwx 1 oragrid oinstall 1544 Dec 21 16:57 hc_+APX2.dat
[oracle@exadb02 dbs]$

Since the oracle user isn’t a member of the asmadmin group, it is not able to read the mentioned file. Changing the owner to oracle:oinstall fixed the issue.
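The reasoning behind this can be written down as the classic owner/group/other check. This is a simplified sketch: `can_read` is a hypothetical helper, it ignores ACLs and root, and the assumption that the oracle user belongs to oinstall is mine (the listings above only show that it is not in asmadmin).

```python
def can_read(mode, owner, group, user, user_groups):
    """Apply the basic Unix owner/group/other check for read access.

    mode is the numeric permission (for example 0o660 for -rw-rw----).
    ACLs and the root user are deliberately ignored in this sketch.
    """
    if user == owner:
        return bool(mode & 0o400)
    if group in user_groups:
        return bool(mode & 0o040)
    return bool(mode & 0o004)


# ab_+ASM1.dat on node1: -rw-rw---- oragrid asmadmin
node1 = can_read(0o660, "oragrid", "asmadmin", "oracle", {"oinstall"})
# ab_+ASM2.dat on node2: -rw-rw---- oragrid oinstall
node2 = can_read(0o660, "oragrid", "oinstall", "oracle", {"oinstall"})
```

With those inputs the node1 file is unreadable for oracle while the node2 file is readable, which matches the behaviour of kfod on the two nodes.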

Categories: BI & Warehousing

New web based OEDA for Exadata

Amardeep Sidhu - Wed, 2018-11-21 03:17

It started with an xls sheet (that was called dbm configurator). Then OEDA (Oracle Exadata Deployment Assistant) was introduced, a Java based GUI tool to enter all the information needed to configure an Exadata machine. Now, with the latest patch released in Oct, OEDA has changed again to become a web based tool. It is deployed on WebLogic and comes with some new features as well. SuperCluster deployments will continue to use the Java based OEDA tool. The new interface has support for Exadata, ZDLRA and ExaCC. It is backward compatible and can import the XMLs generated by older versions of OEDA. Some of the new features include the ability to configure single instance homes, create more than 2 diskgroups, create more than 1 database home and database, allow ILOMs to have a different subnet etc.

To configure the OEDA application you need to unzip the contents and run the installWls script with the -p switch (which specifies the port). It will deploy the application on WebLogic and give you the URL to access OEDA. The interface is similar to the older version, except that it runs in a browser and has some new features. MOS note 2460104.1 and the Exadata documentation have more details:

Using Oracle Exadata Deployment Assistant

 

 

Categories: BI & Warehousing

How Are My Users Connecting? Analyzing OAC and OBIEE entry points

Rittman Mead Consulting - Mon, 2018-11-12 05:18

Are you managing an OAC or OBIEE instance, and is your life nice and easy because you feel you have everything under control? Your users browse existing dashboards, create content via Analysis, Data Visualization or SmartView, and deliver data via Agents or download dashboard content to use in Excel. You feel safe since you designed your platform to provide aggregated data and track every query via Usage Tracking.


But one day you start noticing new BI tools appearing in your company that provide similar KPIs to the ones you are already exposing, and you start questioning where that data is coming from. Then you suddenly realize they are automagically sourcing data from your platform in ways you don't think you can control or manage.
Well, you're not alone. Let me show you how to monitor OAC/OBIEE connections via network sniffing and usage tracking in this new world of self-service BI platforms.

A Bit of History


Anybody who has been in the Analytics market for some time will recognise the situation described in the image above from direct experience: multiple people having different views on a KPI calculation and therefore on its results. Back in the day, that problem was strictly related to the usage of Excel as a BI tool and the fact that everybody was directly accessing raw data to build up their own KPIs.

Centralised BI Solutions

The landscape started to change when Centralised Enterprise BI Solutions (like OBIEE or, in more recent times, OAC) started appearing and being developed in the market. The key point of those solutions was to provide a unique source of truth for a certain set of KPIs across the organization.

However, the fact that those tools were centralised in the hands of the IT department most of the time meant a lack of agility for the Business Departments: every new KPI had to be well defined, understood, documented, implemented by IT, validated and delivered in a process that could take months. Even when the development phase was optimised, via DevOps practices for example, time was still burned due to the communication and coordination efforts which are necessary between Business and IT teams.

Self Service BI Platforms

In order to solve the agility problem, in the last few years a new bottom-up approach has been suggested by the latest set of self-service Analytics tools: a certain set of KPIs is developed locally directly by the Business Department and then, once the KPI has been validated and accepted, its definition and the related data model is certified to allow a broader audience to use it.

Oracle has historically been a leader in the Centralised BI platform space, with OBIEE being the perfect tool for this kind of reporting. In recent years, Data Visualization closed the gap in Self-Service Analytics, putting tools for data preparation, visualization and machine learning directly in the hands of Business Users. Oracle Analytics Cloud (OAC) combines in a single tool both the traditional centralised BI and the self-service analytics, providing the best option for each use case.

What we have seen at various customers is a proliferation of BI tools being acquired by various departments: most of the time a centralised BI tool is used side by side with one or more self-service tools, with little or no control over data source usage or KPI calculation.

The transition from the old-school centralised BI platform to the new bottom-up certified systems is not immediate and there is no automated solution for it. Moreover, centralised BI platforms are still key in most corporates, with big investments associated with them in order to get fully automated KPI management. A complete rewrite of the well-working legacy BI solutions following the latest BI trends and tools is not doable or affordable in the short term, and definitely not a priority for the business.

A Mix of The Two

So, how can we make the old and the new world coexist in a solution which is efficient, agile, and doesn't waste all the well-defined KPIs that are already produced? The solution that we are suggesting more and more is the reuse of the central BI solution as a curated data source for the self-service tools.

Just imagine the case where we have a very complex Churn Prediction formula, based on a series of fields in a star schema that has been already validated and approved by the Business. Instead of forcing a new user to rewrite the whole formula from the base tables we could just offer, based on the centralised BI system, something like:

Select "Dim Account"."Account Code", "Fact Churn"."Churn Prediction" from "Churn"

There are various benefits to this:

  • No mistakes in formula recalculation
  • No prior knowledge of joining conditions, filtering or aggregation needed
  • Security system inheritance: if specific filters or security-sensitive fields were defined, those settings will still be valid.
  • No duplication of code, with different people accessing various versions of the same KPIs.

Using the centralised BI system to query existing KPIs and mashing-up with new datasources is the optimal way of giving agility to the business but at the same time certifying the validity of the core KPIs.

OBIEE as a datasource

A lot of our customers have OBIEE as their own centralised BI reporting tool and are now looking into expanding the BI footprint with a self-service tool. If the chosen tool is Oracle Data Visualization then all the hard work is already done: it natively interfaces with OBIEE's RPD and all the Subject Areas are available together with the related security constraints since the security system is shared.

But what if the self-service tool is not Oracle Data Visualization? How can you expose OBIEE's Data to an external system? Well, there are three main ways:

The first one is by using web services: OAC (OBIEE) provides a set of SOAP web services that can be called via Python, for example, with one of them being executeSQLQuery. After passing the SQL in a string, the results are returned in XML format. This is the method used, for example, by Rittman Mead Insights. SOAP web services, however, can't be queried directly by BI tools, which is why we created Unify to allow OBIEE connections from Tableau (and it is now available for FREE!).
If you aren't using Tableau, a more generic connection method that is accessible by most BI tools is ODBC: OBIEE's BIServer (the component managing the RPD) can be exposed via ODBC by installing the AdminTool drivers and creating an ODBC connection.

Please note that the ODBC method is only available if the BIServer port is not blocked by firewalls. Once the port is open, the ODBC datasource can be queried by any tool having ODBC querying capabilities.


The last method is obviously Smartview, which allows sourcing from pre-existing Analyses or creating new ones, with the option of refreshing the data on demand. Smartview is the perfect choice if your target analytical tool is one of the two supported: Excel or PowerPoint.

Good for all use-cases?

Are the above connection methods good in every situation?


The solutions described above work really well if you let OBIEE do its job: KPI calculations, aggregations, group by and joins or, in other terms, if your aim is to extract aggregated data. OBIEE is not a bulk data export tool: if your plan is to export 100k rows (just a random number) every time, then you may need to rethink the solution, since you:

  • will experience poor performance, because you're adding a layer (OAC) between where the data resides (DB) and yourself
  • will put the OBIEE environment under pressure, since it has to run the query and transform the result set into XML before pushing it to you

If that's the use case you're looking for then you should think about alternative solutions like sourcing the data directly from the database and possibly moving your security settings there.

How Can You Monitor Who is Connecting?

Let's face reality: these days everyone tries to make their work as easy as possible. Business Analysts are tech savvy, and configurations and connection options are just a Google search away. Stopping people from finding alternative solutions to accelerate their work is counterproductive: there will be tension since the analysts' work is slowed down, and the usage of the centralized BI platform will decline quickly since analysts will just move to other platforms that give them the required flexibility.

Blocking ports and access methods is not the correct way of providing a (BI) service that should be centrally controlled but used by the maximum number of people in an organization. Therefore monitoring solutions should be created in order to:

  • Understand how users are interacting with the platform
  • Provide specific workarounds in cases when there is a misuse of the platform

But how can you monitor users' access? Well, you really have two options: network sniffing or usage tracking.

Network Sniffing

Let's take the example of ODBC connections directly to BI Server (RPD). Those connections can be of three main types:

  • From/To the Presentation Service in order to execute queries in the front-end (e.g. via analysis) and to retrieve the data
  • From the OBI administrators' Admin Tool to modify OAC/OBIEE's metadata, although this shouldn't happen in Production systems
  • From End Users ODBC connections to query OAC/OBIEE data with other BI tools

In the first type of connection both the sender and receiver (Presentation and BI server) share the same IP (or IPs in the case of a cluster), while in the second and third types (the ones we are interested in) the IP address of the packet sender/receiver is different from the IP of the OBIEE server.
We can then simply use a Linux network analysis tool like tcpdump to check the traffic. With the following command, we are able to listen on port 9516 (the BI Server one) and exclude all the traffic generated from the Presentation Server (IP 192.168.1.30):

sudo tcpdump  -i eth0 -ennA 'port 9516' | grep -v "IP 192.168.1.30" 
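If the tcpdump output is saved to a file, a few lines of Python can summarise which client IPs are talking to the BI Server. This is a rough sketch under some assumptions: `external_clients` is a hypothetical helper, and it expects each packet summary line to contain the usual `src-ip.src-port > dst-ip.dst-port:` pair for IPv4 traffic; payload lines printed by the -A switch are simply skipped.

```python
import re
from collections import Counter

# Assumed packet summary shape: "<src-ip>.<src-port> > <dst-ip>.<dst-port>:"
_PACKET = re.compile(
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def external_clients(capture_text, server_ip="192.168.1.30", port=9516):
    """Count packets sent to server_ip:port by every other source IP."""
    counts = Counter()
    for line in capture_text.splitlines():
        match = _PACKET.search(line)
        if not match:
            continue  # payload lines or unrelated output
        src_ip, _src_port, dst_ip, dst_port = match.groups()
        if dst_ip == server_ip and int(dst_port) == port and src_ip != server_ip:
            counts[src_ip] += 1
    return counts
```

The default server IP and port are the ones used in the command above; traffic between the Presentation Server and the BI Server on the same host is ignored, just as the grep does.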

The following is a representation of the traffic


We can clearly see the traffic passing between the user's machine (IP ending with 161) and the BI Server (IP ending with 30 and port 56639).
This is the first tracking effort and it already provides us with some information (like the user's IP address); however, it is limited to ODBC and doesn't tell us the username. Let's now see what we can get from Usage Tracking.

Usage Tracking

We wrote a lot about Usage Tracking, how to enhance it and how to use it, so I don't want to repeat that. A very basic description: it is a database table containing statistics about every query generated by OBIEE.
The "every query" bit is really important: the query doesn't have to be generated by the standard front-end (analytics); a record is created even if it is coming from Smartview or from direct ODBC access to the BIServer.

Looking into S_NQ_ACCT (the default table name) there is an interesting field named QUERY_SRC_CD that, according to the Oracle documentation, contains:

The source of the request.

Checking the values of that column in the table, we can see several different sources. Analysing them in detail:

  • DashboardPrompt and ValuePrompt are related to display values in Prompts
  • DisplayValueMap, Member Browser Display Values and Member Browser Path to Value seem related to items display when creating analysis
  • Report is an Analysis execution
  • SOAP is the webservices
  • rawSQL is the usage of Raw SQL (shouldn't be permitted)

So SOAP identifies the web services, but what about the direct ODBC connections? They don't seem to be logged! Well, not really: looking in more detail at a known dataset, we discovered that ODBC connections are marked with a NULL value in QUERY_SRC_CD, together with some other traffic.
Looking into the details of the NULL QUERY_SRC_CD transactions we can see two types of logs:


  • The ones starting with SELECT are proper queries sent via an ODBC call
  • The ones starting with CALL are requests from the Presentation Server to the BI Server

Summarizing all the findings, the following query should give you the list of users accessing OBIEE via either ODBC, SOAP or using rawSQL.

SELECT DISTINCT 
  USER_NAME,
  NVL(QUERY_SRC_CD, 'RPD ODBC') SOURCE, 
  TRUNC(START_TS) TS
FROM S_NQ_ACCT 
WHERE 
    (
     QUERY_SRC_CD IS NULL OR 
     UPPER(QUERY_SRC_CD) IN ('SOAP', 'RAWSQL')
    ) 
   AND QUERY_TEXT NOT LIKE '{CALL%'
ORDER BY 3 DESC;

You can, of course, do more than this, like analysing query volumes (ROW_COUNT column) and the Subject Areas affected in order to understand any potential misuse of the platform!
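As an illustration of that kind of follow-up analysis, here is a minimal Python sketch that applies the same rules as the query above to rows already fetched from S_NQ_ACCT and totals the rows returned per user and entry point. The function names `classify_entry_point` and `summarise` and the tuple layout are assumptions made for this example, not part of OBIEE.

```python
def classify_entry_point(query_src_cd, query_text):
    """Map one S_NQ_ACCT row to the entry point it most likely came from."""
    text = (query_text or "").lstrip().upper()
    if query_src_cd is None:
        # NULL source: internal Presentation Server calls or direct ODBC.
        if text.startswith("{CALL") or text.startswith("CALL"):
            return "Presentation Server call"
        return "RPD ODBC"
    code = query_src_cd.upper()
    if code == "SOAP":
        return "SOAP"
    if code == "RAWSQL":
        return "rawSQL"
    return "front-end"


def summarise(rows):
    """Total ROW_COUNT per (user, entry point) for non front-end traffic.

    rows is an iterable of (user_name, query_src_cd, query_text, row_count).
    """
    totals = {}
    for user, src, text, row_count in rows:
        entry = classify_entry_point(src, text)
        if entry in ("front-end", "Presentation Server call"):
            continue
        key = (user, entry)
        totals[key] = totals.get(key, 0) + (row_count or 0)
    return totals
```

The resulting dictionary makes it easy to see which users pull the most rows through ODBC, SOAP or raw SQL.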

Real Example

Let's see an example: I'll try logging in via ODBC and executing a query. For this I'm using RazorSQL, a SQL query tool, and OBIEE. Exactly the same logs can be found in Oracle Analytics Cloud (OAC) once Usage Tracking is enabled so, administrators, don't be afraid: your job is not going extinct right now.

Small note: Usage Tracking may be available only on the non-Autonomous version of Oracle Analytics Cloud, since some parts of the setup need command line access and server configuration changes which may not be available on the Autonomous version.

Setup

First a bit of setup: in order to connect to OAC all you need to do is download OBIEE's Administration Tool, install it and create an ODBC connection. After this we can open RazorSQL and create a connection.


Then we need to specify our connection details, by selecting Add Connection Profile, specifying OTHER as Connection Profile, then selecting ODBC as Connection Type and filling in the remaining properties. Please note that:

  • Datasource Name: Select the ODBC connection entry created with the Admin tool drivers
  • Login/Password: Enter the OAC/OBIEE credentials


Querying and Checking the Logs

Then it's time to connect. As expected we see in RazorSQL the list of Subject Areas as datapoints which depend on the security settings configured in Weblogic and RPD.


The Login action is not visible in the Usage Tracking S_NQ_ACCT table; it should be logged in S_NQ_INITBLOCK if you have Init Blocks associated with the login. Let's start checking the data and see what's going to happen. First of all, let's explore which Tables and Columns are part of the Usage Tracking Subject Area, by clicking on the + icon next to it.


The various Dims and Facts are exposed as Tables by the ODBC driver. Now let's see if this action is logged in the database with the query

SELECT USER_NAME, 
  QUERY_TEXT, 
  QUERY_SRC_CD, 
  START_TS, 
  END_TS, 
  ROW_COUNT 
FROM S_NQ_ACCT


We can clearly see that even checking the columns within the Measures table is logged as an ODBC call, with the column QUERY_SRC_CD being NULL as expected.
Now let's try to fire a proper SQL statement. We need to remember that the SQL we are writing needs to be in the Logical SQL syntax. An example can be

select `Topic`.`Repository Name` from `Usage Tracking`

Which in RazorSQL returns the row


And in the database is logged as


We can see the user who ran the query, the execution time (START_TS and END_TS) as well as the number of rows returned (ROW_COUNT).
We demonstrated that we now have all the info necessary to start tracking any misuse of OAC/OBIEE as a datasource via ODBC connections.

Automating the Tracking

The easiest solution to properly track this type of OBIEE usage is to have an Agent that reports, on a daily basis, the users accessing OAC/OBIEE via ODBC. This solution is very easy to implement since all the Usage Tracking tables are already part of the Repository. Creating an Agent that reports on Usage Tracking rows having the QUERY_SRC_CD field as NULL, SOAP or rawSQL covers all the "non traditional" use-cases we have been talking about.

As mentioned above, sourcing aggregated data from OAC/OBIEE should be considered a "good practice" since it provides the unique source of truth across the company. On the other hand, exporting massive amounts of data should be avoided, since end-user performance will be slow and there will be an impact on the OAC/OBIEE server. Thus adding a filter on the number of rows (e.g. ROW_COUNT > 100k) to the Agent's report could also help identify the specific data-export cases that should drive an impact assessment and a possible solution redesign.
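Outside of an Agent, the same threshold check can be sketched in a few lines of Python over rows exported from the Usage Tracking table. The helper name `flag_large_exports` and the dictionary layout are assumptions for this example, and the 100k default is just the illustrative number used above.

```python
def flag_large_exports(rows, max_rows=100_000):
    """Return the rows whose ROW_COUNT exceeds max_rows, largest first.

    rows is an iterable of dicts with at least the USER_NAME and ROW_COUNT
    keys, for example the result of a query against S_NQ_ACCT.
    """
    flagged = [row for row in rows if (row.get("ROW_COUNT") or 0) > max_rows]
    return sorted(flagged, key=lambda row: row["ROW_COUNT"], reverse=True)
```

Rows with a missing ROW_COUNT are treated as zero, so they are never flagged.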

Conclusion

Tools and Options in the Analytical Market are exploding and more and more we'll see companies using a number of different solutions for specific purposes. Centralised BI solutions, built over the years, provide the significant advantage of containing the unique source of truth across the company and should be preserved. Giving agility to Analysts and at the same time keeping the centrality of well defined and calculated KPIs is a challenge we'll face more and more often in the future.
OAC (or OBIEE on-premises) offers the duality of both Centralised and Self-Service Analytics methods, together with a variety of connection methods (web services, ODBC, Smartview), which makes it the perfect cornerstone of a company's analytical system.
Tracking usage and discovering potential misuse of the platform is very easy, so inefficiencies can be addressed quickly to provide adequate agility and performance to all analytical business cases!

Categories: BI & Warehousing

Exciting News for Unify

Rittman Mead Consulting - Tue, 2018-11-06 02:37
Announcement: Unify for Free

We are excited to announce we are going to make Unify available for free. To get started, send an email to unify@rittmanmead.com. We will ask you to complete a short set of qualifying questions, then we can give you a demo, provide a product key and a link to download the latest version.

The free version of Unify will come with no support obligations or SLAs. On sign up, we will give you the option to join our Unify Slack channel, through which you can raise issues and ask for help.

If you’d like a supported version, we have built a special Expert Service Desk package for Unify which covers

  • Unify support, how to, bugs and fixes
  • Assistance with configuration issues for OBIEE or Tableau
  • Assistance with user/role issues within OBIEE
  • Ad-hoc support queries relating to OBIEE, Tableau and Unify

Beyond supporting Unify, the Expert Service Desk package can also be used to provide technical support and expert services for your entire BI and analytics platform, including:

  • An agreed number of hours per month for technical support of Oracle and Tableau's BI and DI tools
  • Advisory, strategic and roadmap planning for your platform
  • Use of any other Rittman Mead accelerators including support for our other Open Source tools and DevOps Developer Toolkits
  • Access to Rittman Mead’s On Demand Training

New Release: Unify 10.0.17

10.0.17 is the new version of Unify. This release doesn’t change how Unify looks and feels, but there are some new features and improvements under the hood.

The most important feature is that now you can get more data from OBIEE using fewer resources. While we are not encouraging you to download all your data from OBIEE to Tableau all the time (please use filters, aggregation etc.), we realise that downloading large datasets is sometimes required. With the new version, you can do it. Hundreds of thousands of rows can be retrieved without causing your Unify host to grind to a halt.

The second feature we would like to highlight is that now you can use OBIEE instances configured with self-signed SSL certificates. Self-signed certificates are often used for internal systems, and now Unify supports such configurations.

The final notable change is that you can now run Unify Server as a Windows service. It wasn't impossible to run Unify Server at system startup before, but now it is even easier.

And, of course, we fixed some bugs and enhanced the logging. We would like to see our software function without bugs, but sometimes they just happen, and when they do, you will get a better explanation of what happened.

On most platforms, Unify Desktop should auto update. If it doesn’t, then please download it manually.

Unify is 100% owned and maintained by Rittman Mead Consulting Ltd, and while this announcement makes it available for free, all copies must be used under an End User Licence Agreement (EULA) with Rittman Mead Consulting Ltd.

Categories: BI & Warehousing

Connect to DV Datasets and explore many more new features in OAC / OAAC 18.3.3.0

Tim Dexter - Wed, 2018-10-17 05:26

Greetings !

Oracle Analytics Cloud (OAC) and Oracle Autonomous Analytics Cloud (OAAC) version 18.3.3.0 (also known as V5) were released last month. A rich set of new features has been introduced in this release across different products (with product version 12.2.5.0.0) in the suite. You can check all the new features of OAC / OAAC in the video here.

The focus for BI Publisher on OAC / OAAC in this release has been to complement Data Visualization with pixel perfect reporting, performance optimizations and added self service abilities. Here is a list of new features added in this release:

BI Publisher New Features in OAC V5.0

1. DV Datasets

Now you can leverage a variety of data sources covered by Data Visualization data sets, including Cloud based data sources such as Amazon Redshift and Autonomous Data Warehouse Cloud; Big Data sources such as Spark, Impala and Hive; and Application data sources such as Salesforce, Oracle Applications etc. BI Publisher is here to complement DV by creating pixel perfect reports using DV datasets.

Check the documentation for additional details. Also, check this video to see how this feature works.

2. Upload Center

You can now upload all files for custom configuration, such as fonts, ICC Profile, Private Keys, Digital Signature etc., from the Upload Center, a self service feature available in the Administration page.

Additional details can be found in the documentation here.

3. Validate Data Model

Report Authors can now validate a data model before deploying the report in a production environment. This will help during a custom data model creation where data sets, LOVs and Bursting Queries can be validated against standard guidelines to avoid any undesired performance impact to the report server. 

Details available here.

4. Skip unused data sets

When a data model contains multiple data sets for different layouts, each layout might not use all the data sets defined in the data model. Now Report Authors can select data model property to skip the execution of the unused data sets in a layout. Setting this property reduces the data extraction time, memory usage and improves overall report performance.

Additional details can be found here.

5. Apply Digital Signature to PDF Documents

Digital Signature is a widely used feature in on-prem deployments, and it has now been added to OAC too, so a digital signature can be applied to a PDF output. Digital signatures can be uploaded from the Upload Center, the required signature can be selected under the security center, and it can then be applied to PDF outputs by configuring attributes under report properties or run-time properties.

You can find the documentation here. Also check this video for a quick demonstration.

6. Password protect MS Office Outputs - DocX, PPTX, XLSX

Now protect your MS Office output files with a password defined at report or server level.

Check the PPTX output properties, DocX output properties and Excel 2007 output properties.

7. Deliver reports in compressed format

You can select this option to compress the output by including the file in a zip file before delivery via email, FTP, etc.

Additional details can be found here.

8. Request read-receipt and delivery confirmation notification 

You can opt to get delivery and read-receipt notification for scheduled job delivery via email.

Check documentation for additional details. 

9. Add scalability mode for Excel Template to handle large data size

Now you can set up scalability mode for an Excel template. This can be done at system level, report level or template level. When this attribute is set to true, the engine will flush memory after a threshold value, and when the data exceeds 65K rows it will roll the data over into multiple sheets.

You can find the documentation here.
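To picture what rolling data over into multiple sheets means, here is a small conceptual sketch in Python. It is not BI Publisher's implementation: the function name `split_into_sheets` is made up, and the 65,000 default only approximates the "65K rows" mentioned above.

```python
def split_into_sheets(rows, max_rows_per_sheet=65_000):
    """Split rows into consecutive chunks of at most max_rows_per_sheet."""
    if max_rows_per_sheet <= 0:
        raise ValueError("max_rows_per_sheet must be positive")
    rows = list(rows)
    return [
        rows[start:start + max_rows_per_sheet]
        for start in range(0, len(rows), max_rows_per_sheet)
    ]
```

Each chunk would correspond to one sheet in the output workbook, with only the last one partially filled.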

 

Stay tuned to hear more updates on features and functionalities! Happy BIP'ing...

 

Categories: BI & Warehousing

Fixing* Baseline Validation Tool** Using Network Sniffer

Rittman Mead Consulting - Wed, 2018-10-17 05:22

* Sort of
** Not exactly

In the past, Robin Moffatt wrote a number of blogs showing how to use various Linux tools for diagnosing OBIEE and getting insights into how it works (one, two, three, ...). Some time ago I faced a task which allowed me to continue Robin's cycle of posts and show you how to use Wireshark to understand how a certain Oracle tool works and how to search for the solution of a problem more effectively.

To be clear, this blog is not about the issue itself. I could simply write a tweet like "If you faced issue A then patch B solves it". The idea of this blog is to demonstrate how you can use somewhat unexpected tools and get things done.

Obviously, my way of doing things is not the only one. If you are good at searching My Oracle Support, you can possibly do it even faster, but what is good about my way (apart from it being mine, which is enough for me) is that it doesn't involve uneducated guessing. I make an observation and get a clear answer.

Most of my blogs have disclaimers. This one is no exception, although its disclaimer is rather small. There is still no silver bullet. This won't work for every single problem in OBIEE. I didn't say this.

Now, let's get started.

The Task

The problem was the following: a client was upgrading its OBIEE system from 11g to 12c and obviously wanted to test for regression, making sure that the upgraded system worked exactly the same as the old one. Manual comparison wasn't an option since they have hundreds or even thousands of analyses and dashboards, so Oracle Baseline Validation Tool (usually called just BVT) was the first candidate as a solution to automate the checks.

Using BVT is quite simple:

  • Create a baseline for the old system.
  • Upgrade
  • Create a new baseline
  • Compare them
  • ???
  • Profit! Congratulations. You are ready to go live.

Right? Well, almost. The problem we faced was that the BVT Dashboards plugin for 11g (a very old 11.1.1.7.something) gave exactly what was expected, but for 12c (12.2.1.something) we got all numbers with a decimal point even though all analyses had the "no decimal point" format. So our first feeling at this point was that BVT doesn't work well for 12c, and that was somewhat disappointing.

SPOILER That wasn't true.

I made a simple dashboard demonstrating the issue.

OBIEE 11g

11g-dash-vs-bvt
Measure values in the XML produced by BVT are exactly as on the dashboard. Looks good.

OBIEE 12c

12c-dash-vs-bvt-1
Dashboard looks good, but values in the XML have decimal digits.

failed

As you can see, the analyses are the same, or at least they look very similar, but the XMLs produced by BVT aren't. From a regression point of view this dashboard should get a "DASHBOARDS PASSED" result, but it got "DASHBOARDS DIFFERENT".
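To see why two dashboards that look identical can still be reported as different, here is a minimal sketch. It assumes, for illustration only, that result values are compared as text; how BVT really compares them is not described here. The values and function names are made up.

```python
from decimal import Decimal

# Hypothetical measure values, as they might appear in two baseline XML files.
baseline_11g = ["1250", "980", "43"]
baseline_12c = ["1250.40", "979.75", "43.00"]


def same_as_text(a, b):
    """Strict comparison: values must match character for character."""
    return a == b


def same_when_rounded(a, b):
    """Lenient comparison: round both sides to whole numbers first.

    Uses Decimal's default rounding mode (round half to even).
    """
    def to_int(values):
        return [int(Decimal(v).quantize(Decimal("1"))) for v in values]

    return to_int(a) == to_int(b)


strict_result = same_as_text(baseline_11g, baseline_12c)
lenient_result = same_when_rounded(baseline_11g, baseline_12c)
```

With these made-up numbers the strict check fails while the lenient one passes, which is the kind of gap between "looks the same on the dashboard" and "is the same in the XML" that we ran into.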

Reading the documentation gave us no clear explanation for this behaviour. We had to go deeper and understand what actually caused it. Is it BVT screwing up the data it gets from 12c? Well, that is a highly improbable theory. The decimals were not simply present in the result, they were correct, as in "the same as stored in the database". We had to reject this theory.
Or maybe the problem is that BVT works differently with 11g and 12c? Well, this looks more plausible. A few years have passed since 11.1.1.7 was released and it would not be too surprising if the old version and the modern one had different APIs used by BVT and causing this problem. Or maybe the problem is that 12c itself ignores formatting settings. Let's find out.

The Tool

Neither BVT nor OBIEE logs gave us any insights. From every point of view, everything was working fine, except that we were getting a 100% mismatch between the source and the target. My hypothesis was that BVT worked differently with OBIEE 11g and 12c. How can I check this? Decompiling the tool and reading its code would possibly give me the answer, but it is not legal. And even if it were legal, the latest BVT is more than 160 megabytes in size, which would give an insane amount of code to read, especially considering the fact that I don't actually know what I'm looking for. Not an option. But BVT talks to OBIEE via the network, right? Therefore we can intercept the network traffic and read it. Shall we?

There are a lot of ways to do it. I work with OBIEE quite a lot and Windows is the obvious choice for my platform. And hence the obvious tool for me was Wireshark.

Wireshark is the world’s foremost and widely-used network protocol analyzer. It lets you see what’s happening on your network at a microscopic level and is the de facto (and often de jure) standard across many commercial and non-profit enterprises, government agencies, and educational institutions. Wireshark development thrives thanks to the volunteer contributions of networking experts around the globe and is the continuation of a project started by Gerald Combs in 1998.

What this "About" doesn't say is that Wireshark is open-source and free. Which is quite nice I think.

Installation Details

I'm not going to go into too many details about the installation process. It is quite simple and straightforward. Keep all the defaults unless you know what you are doing, reboot if asked and you are fine.

If you've never used Wireshark or similar tools, the main question would be "Where to install it?". The answer is pretty simple: install it on your workstation, the same workstation where BVT is installed. We're going to intercept our own traffic, not someone else's.

A Bit of Wireshark

Before going to the task we want to solve, let's spend some time familiarizing ourselves with Wireshark. Its starting screen shows all the network adapters I have on my machine. The one I'm using to connect to the OBIEE servers is "WiFi 2".

Screenshot-2018-10-09-13.50.44

I double-click it and immediately see a constant flow of network packets flying back and forth between my computer, local network machines and the Internet. It's a bit hard to see any particular server in this stream. And "a bit hard" is quite an understatement. To be honest, it is impossible.

wireshark

I need a filter. For example, I know that my OBIEE 12c instance IP is 192.168.1.226. So I add the filter ip.addr==192.168.1.226, saying that I only want to see traffic to or from this machine. There is nothing to see right now, but if I open the login page in a browser, for example, I can see traffic between my machine (192.168.1.25) and the server. It is much better now but still not perfect.

Screenshot-2018-10-09-14.08.52

If I add http to the filter, like this: http and ip.addr==192.168.1.226, I definitely get a much clearer view.
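What this filter selects can be mimicked in a few lines of Python. This is only a model of the filter's logic, not Wireshark code: the `Packet` class and the `matches` function are hypothetical, and a real packet has several protocol layers rather than the single label used here.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    src: str
    dst: str
    protocol: str  # simplification: one label instead of a stack of layers


def matches(packet: Packet, server_ip: str) -> bool:
    """Mimic the display filter `http and ip.addr==<server_ip>`."""
    # ip.addr matches when either the source or the destination equals the address.
    touches_server = server_ip in (packet.src, packet.dst)
    return packet.protocol == "HTTP" and touches_server


captured = [
    Packet("192.168.1.25", "192.168.1.226", "TCP"),   # right host, but not HTTP
    Packet("192.168.1.25", "192.168.1.226", "HTTP"),  # request to the server
    Packet("192.168.1.226", "192.168.1.25", "HTTP"),  # response from the server
    Packet("192.168.1.25", "8.8.8.8", "DNS"),         # unrelated traffic
]
visible = [p for p in captured if matches(p, "192.168.1.226")]
```

Only the request and the response survive the filter, which is why the view becomes so much easier to read.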

For example, here I opened the http://192.168.1.226:9502/analytics page just like any other user would. There are quite a lot of requests and responses. The browser asked for the /analytics URL, the server replied after a few redirects that the actual address for this URL is the login.jsp page, then the browser requested the /bi-security-login/login.jsp page using the GET method and got it with HTTP code 200. Code 200 shows that there were no issues with the request.

startpage

Let's try to log in.

login

The top window is a normal browser and the bottom one is Wireshark. Note that my credentials were sent in clear text, and I think that is a very good argument in defence of using HTTPS everywhere.
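To show how little effort it takes to read such a capture, here is a small sketch. The request body, field names and values are all made up for this example; they are not taken from the actual OBIEE login form.

```python
from urllib.parse import parse_qs

# Hypothetical body of a login POST as it would appear in a capture over plain HTTP.
# Field names and values are invented for illustration.
captured_body = "username=weblogic&password=Secret%21123"

# URL-encoded form data is an encoding, not encryption: decoding it is trivial.
fields = parse_qs(captured_body)
recovered = {name: values[0] for name, values in fields.items()}
```

Anyone who can see the traffic can do the same, which is exactly what HTTPS prevents.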

That is a very basic use of Wireshark: start monitoring, do something, see what was captured. I barely scratched the surface of what Wireshark can do, but that is enough for my task.

Wireshark and BVT 12c

The idea is quite simple. I should start capturing my traffic then use BVT as usual and see how it works with 12c and then how it works with 11g. This should give me the answer I need.

Let's see how it works with 12c first. To make things simpler I created a catalogue folder with just one analysis placed on a dashboard.

bvt-dashboard-1

It's time to run BVT and see what happens.

Screenshot-2018-10-11-17.49.59

Here is the dataset I got from OBIEE 12c. I slightly edited and formatted it to make it easier to read, but didn't change anything important.

dataset12--1

What did BVT do to get this result? What API did it use? Let's look at Wireshark.

Screenshot-2018-10-11-19.09.27

The first three lines are the same as with a browser. I don't know why BVT needs them, but I don't mind. Then BVT gets the WSDL from OBIEE (GET /analytics-ws/saw.dll/wsdl/v6/private). There are multiple similar request-response pairs flying back and forth because the WSDL is quite big and is downloaded in chunks. A purely technical thing, nothing strange or important here.
But now we know what API BVT uses to get data from OBIEE. I don't think anyone is surprised that it is the Web Services API. Let's take a look at the Web Services calls.

First comes the logon method from nQSessionService. It logs into OBIEE and starts a session.

Screenshot-2018-10-11-19.36.59
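For readers who have not seen a SOAP call before, here is a rough sketch of what such a logon request body could look like when built by hand. The blog only tells us that a logon method is called; the namespace `urn://oracle.bi.webservices/v6` and the element names `name` and `password` are assumptions that should be checked against the WSDL of your own instance. Nothing is sent over the network.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Assumed namespace for the v6 web services; verify it against your own WSDL.
BI_NS = "urn://oracle.bi.webservices/v6"


def build_logon_envelope(user: str, password: str) -> bytes:
    """Build a SOAP 1.1 envelope for a logon call (element names are assumptions)."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    logon = ET.SubElement(body, f"{{{BI_NS}}}logon")
    ET.SubElement(logon, f"{{{BI_NS}}}name").text = user
    ET.SubElement(logon, f"{{{BI_NS}}}password").text = password
    return ET.tostring(envelope, encoding="utf-8")


# Placeholder credentials, only used to build the XML.
request_xml = build_logon_envelope("weblogic", "not-a-real-password")
```

The point is simply that each line in the Wireshark capture corresponds to a small XML document like this one, which makes the traffic quite readable once you filter it.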

The next requests get catalogue item descriptions for objects in my /shared/BVT folder. We can see a set of calls to webCatalogService methods. These calls are reading my web catalogue structure: all folders, subfolders, dashboards and analyses. Pretty simple, nothing really interesting or unexpected here.

ws01

Then we can see how BVT uses generateReportSQLResult from reportService to get logical SQL for the analysis.

Screenshot-2018-10-11-19.42.07

And it gets the analysis's logical SQL as the response.

Screenshot-2018-10-11-19.45.10

And the final step - BVT executes this SQL and gets the data. Unfortunately, it is hard to show the data on a screenshot, but the line starting with [truncated] is the XML I showed before.

Screenshot-2018-10-12-12.19.58

And that's all. That is how BVT gets data from OBIEE.

I did the same for 11g and saw absolutely the same procedure.

Screenshot-2018-10-11-21.01.35

My initial theory that BVT may have been using different APIs for 11g and 12c was busted.

From my experiment, I found out that BVT used xmlViewService to actually get the data. I also know now that it uses logical SQL to get the data. Looking at the documentation, I can see that xmlViewService has no options related to formatting. It is a purely data-retrieval service. It can't preserve any formatting and is supposed to give only the data. But hey, I started with the statement "11g preserves formatting", so how is that possible? Well, that was a simple coincidence. It doesn't.

In the beginning, I had very little understanding of what keywords to use on MoS to solve the issue. "BVT for 12c doesn't preserve formatting"? "BVT decimal part settings"? "BVT works differently for 11g and 12c"? Now I have something much better - "executeSQLQuery decimal". 30 seconds of searching and I know the answer.

mos-1

This was fixed in 11.1.1.9, but there is a patch for 11.1.1.7.some_of_them. The patch fixes an 11g issue which prevents BVT from getting decimal parts of numbers.

pass

As you may have noticed, I had no chance of finding this using my initial problem description. Neither BVT, nor 12c, nor 11.1.1.7 was mentioned. This thread looks completely unrelated to the issue, so I had zero chance of finding it.

Conclusion

OBIEE is complex software and solving issues is not always easy. Unfortunately, no single method is enough for solving all problems. Usually, log files will help you. But when something works, just not the way you expect, log files can be useless. In my case BVT was working fine, 11g was working fine, and 12c was working fine too. Nothing special enough to be written to the logs was happening. That is why sometimes you may need unexpected tools. Just like this. Thanks for reading!

Categories: BI & Warehousing
