BI & Warehousing

Using Jupyter Notebooks with Big Data Discovery 1.2

Rittman Mead Consulting - Wed, 2016-06-15 09:00
Using Jupyter Notebooks with Big Data Discovery 1.2

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

Whilst BDD Shell is command-line based, there's also the option to run Jupyter Notebooks (previous iPython Notebooks) which is a web-based interactive "Notebook". This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a 'Notebook' enables you to modify and re-run commands, and then once correct retain them as a fully functioning script for future use.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you'd find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2.

For information how on to set up BDD Shell and Jupyter Notebooks, see this previous post. For the purpose of this article I'm running Jupyter on port 18888 so as not to clash with Hue:

  
cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell  
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It's important that you run this from the bdd-shell folder, otherwise the BDD shell won't initialise properly
  • Jupyter by default only listens locally, so you need to use a web browser local to the server, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and from the New menu select a Python 2 notebook:

Using Jupyter Notebooks with Big Data Discovery 1.2

You should then see an empty notebook, ready for use:

Using Jupyter Notebooks with Big Data Discovery 1.2

The 'cell' (grey box after the In [ ]:) is where you enter code to run - type in execfile('ipython/00-bdd-shell-init.py') and press shift-Enter. This will execute it - if you don't press shift you just get a newline. Whilst it's executing you'll notice the line prefix changes from [ ] to [*], and in the terminal window from which you launched Jupyter you'll see some output related to the BDD Shell starting

  
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).  
WARNING: Running spark-class from user-defined location.  
spark.driver.cores is set but does not apply in client mode.  

Now back in the Notebook, enter the following - use Enter, not Shift-enter, between lines:

  
dss = bc.datasets()  
dss.count  

Using Jupyter Notebooks with Big Data Discovery 1.2

Now press shift-enter to execute it. This uses the pre-defined bc BDD context to get the datasets object, and return a count from it.

Using Jupyter Notebooks with Big Data Discovery 1.2

By clicking the + button on the toolbar, using the up and down arrows on the toolbar, and the Code/Markdown dropdown, it's possible to insert "cells" which are not code but instead commentary on what the code is. This way you can produce fully documented, but executable, code objects.

Using Jupyter Notebooks with Big Data Discovery 1.2

From the File menu give the notebook a name, and then Close and Halt, which destroys the Jupyter process ('kernel') that was executing the BDD Shell session. Back at the Jupyter main page, you'll note that a ipynb file has been created, which holds the notebook definition and can be downloaded, sent to colleagues, uploaded to blogs to share, saved in source control, and so on. Here's the file for the notebook above - note that it's hosted on gist, which automagically previews it as a Notebook, so click on Raw to see the actual code behind it.

The fantastically powerful thing about the Notebooks is that you can modify and re-run steps as you go -- but you never lose the history of how you got somewhere. Most people will be familar with learning or exploring a tool and its capabilities and eventually getting it to work - but no idea how they got there. Even for experienced users of a tool, being able to prove how to replicate a final result is important for (a) showing the evidence for how they got there and (b) enabling others to take that work and build on it.

With an existing notebook file, whether a saved one you created or one that someone sent you, you can reopen it in Jupyter and re-execute it, in order to replicate the results previously seen. This is an important tenet of [data] science in general - show your workings, and it's great that Big Data Discovery supports this option. Obviously, showing the count of datasets is not so interesting or important to replicate. The real point here is being able to take datasets that you've got in BDD, done some joining and wrangling on already taking advantage of the GUI, and then dive deep into the data science and analytics world of things like Spark MLLib, Pandas, and so on. As a simple example, I can use a couple of python libraries (installed by default with Anaconda) to plot a correlation matrix for one of my BDD datasets:

Using Jupyter Notebooks with Big Data Discovery 1.2

As well as producing visualisations or calculations within BDD shell, the real power comes in being able to push the modified data back into Hive, and thus continue to work with it within BDD.

With Jupyter Notebooks not only can you share the raw notebooks for someone else to execute, you can export the results to HTML, PDF, and so on. Here's the notebook I started above, developed out further and exported to HTML - note how you can see not only the results, but exactly the code that I ran in order to get them. In this I took the dataset from BDD, added a column into it using a pandas windowing function, and then saved it back to a new Hive table:
(you can view the page natively here, and the ipynb here)

https://gist.github.com/rmoff/f1024dd043565cb8a58a0c54e9a782f2

Once the data's been written back to Hive from the Python processing, I ran BDD's dataprocessingCLI to add the new table back into BDD

/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI --table updated_accident_data

And once that's run, I can then continue working with the data in BDD:

Using Jupyter Notebooks with Big Data Discovery 1.2

Using Jupyter Notebooks with Big Data Discovery 1.2

This workflow enables a continual loop of data wrangling, enrichment, advanced processing, and visualisation - all using the most appropriate tools for the job.

You can also use BDD Shell/Jupyter as another route for loading data into BDD. Whilst you can import CSV and XLS files into BDD directly through the web GUI, there are limitations - such as an XLS workbook with multiple sheets has to be imported one sheet at a time. I had a XLS file with over 40 sheets of reference data in it, which was not going to be time-efficient to load one at a time into BDD.

Pandas supports a lot of different input types - including Excel files. So by using Pandas to pull the data in, then convert it to a Spark dataframe I can write it to Hive, from where it can be imported to BDD. As before, the beauty of the Notebook approach is that I could develop and refine the code, and then simply share the Notebook here

https://gist.github.com/rmoff/3fa5d857df8ca5895356c22e420f3b22

Categories: BI & Warehousing

Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

Rittman Mead Consulting - Mon, 2016-06-13 09:19

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets, particularly when used in conjunction with Jupyter Notebooks. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you’d find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2. In this article we’ll see how to configure BDD Shell on Big Data Lite 4.5 (along with Jupyter Notebooks), and in a subsequent post dive into how to actually use them.

Setting up BDD Shell on Big Data Lite

You can find the BDD Shell installation document here.

Login to BigDataLite 4.5 (oracle/welcome1) and open a Terminal window. The first step is to download Anaconda, which is a distribution of Python that also includes “[…] over 100 of the most popular Python, R and Scala packages for data science” as well as Jupyter notebook, which we’ll see in a moment.

cd ~/Downloads/
wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh

Then install it: (n.b. bash is part of the command to enter)

bash Anaconda2-4.0.0-Linux-x86_64.sh

Accept the licence when prompted, and then select a install location – I used /u01/anaconda2 where the rest of the BigDataLite installs are

Anaconda2 will now be installed into this location:
/home/oracle/anaconda2

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/oracle/anaconda2] >>> /u01/anaconda2

After a few minutes of installation, you’ll be prompted to whether you want to prepend Anaconda’s location to the PATH environment variable. I opted not to (which is the default) since Python is used elsewhere on the system and by prepending it it’ll take priority and possibly break things.

Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /home/oracle/.bashrc ? [yes|no]
[no] >>> no

Now edit the BDD Shell configuration file (/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.conf) in your favourite text editor to add/amend the following lines:

SPARK_EXECUTOR_PYTHON=/u01/anaconda2/bin/python
LOCAL_PYTHON_HOME=/u01/anaconda2

Amend the path if you didn’t install Anaconda into /u01

In the same configuration file, add/amend:

SPARK_HOME=/usr/lib/spark/
SPARK_EXTRA_CLASSPATH=/usr/lib/oozie/oozie-sharelib-yarn/lib/spark/spark-avro_2.10-1.1.0-cdh5.7.0.jar

Now run the BDD Shell setup:

/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/setup.sh

This should succeed:

[bigdatalite.localdomain] Validating pre-requisites...
[bigdatalite.localdomain] Validation Success
[bigdatalite.localdomain] Setting up BDD Shell...
[bigdatalite.localdomain] Setup Success
[oracle@bigdatalite Downloads]$

Assuming it does, you can now launch the shell bdd-shell.sh:

[oracle@bigdatalite Downloads]$ /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.sh
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).
WARNING: Running spark-class from user-defined location.
spark.driver.cores is set but does not apply in client mode.
Welcome to
     ___   ___   ___       __   _     ____  _     _
    | |_) | | \ | | \     ( (` | |_| | |_  | |   | |
    |_|_) |_|_/ |_|_/     _)_) |_| | |_|__ |_|__ |_|__

SparkContext available as sc, HiveContext available as sqlContext.
BDD Context available as bc.

>>>

From the BDD Shell you can interact with BDD, for example to list out the datasets currently defined in the Catalog:

>>> bc.datasets().count
17
>>> for ds in bc.datasets():
...     print ds
...

media_demo_customer     edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        Hive    default.media_demo_customer

movie_genre     default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        Hive    default.movie_genre

media_demo_customer     default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        Hive    default.media_demo_customer

Whilst BDD Shell is command-line based, there’s also the option to run Jupyter Notebooks (previous iPython Notebooks) which is a web-based interactive “Notebook”. This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a ‘Notebook’ enables you to modify and re-run commands, and then once correct, retain them as a fully functioning script for future use.

To launch it, run:

cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It’s important that you run this from the bdd-shell folder, otherwise the BDD shell won’t initialise properly
  • By default Jupyter uses 8888, which is already in use on BigDataLite by Hue, so use a different one by specifying --port
  • Jupyter by default only listens locally, so you need to either be using BigDataLite desktop to run Firefox, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and you should see the default Jupyter screen with a list of files:

In the next article, we’ll see how to use Jupyter Notebooks with Big Data Discovery, and get an idea of just how powerful the combination can be.

The post Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5 appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

Rittman Mead Consulting - Mon, 2016-06-13 08:19
Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets, particularly when used in conjunction with Jupyter Notebooks. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you'd find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2. In this article we'll see how to configure BDD Shell on Big Data Lite 4.5 (along with Jupyter Notebooks), and in a subsequent post dive into how to actually use them.

Setting up BDD Shell on Big Data Lite

You can find the BDD Shell installation document here.

Login to BigDataLite 4.5 (oracle/welcome1) and open a Terminal window. The first step is to download Anaconda, which is a distribution of Python that also includes "[...] over 100 of the most popular Python, R and Scala packages for data science" as well as Jupyter notebook, which we'll see in a moment.

cd ~/Downloads/  
wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
Then install it: (n.b. bash is part of the command to enter)
  
bash Anaconda2-4.0.0-Linux-x86_64.sh  

Accept the licence when prompted, and then select a install location - I used /u01/anaconda2 where the rest of the BigDataLite installs are

Anaconda2 will now be installed into this location:  
/home/oracle/anaconda2

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/oracle/anaconda2] >>> /u01/anaconda2

After a few minutes of installation, you'll be prompted to whether you want to prepend Anaconda's location to the PATH environment variable. I opted not to (which is the default) since Python is used elsewhere on the system and by prepending it it'll take priority and possibly break things.

Do you wish the installer to prepend the Anaconda2 install location  
to PATH in your /home/oracle/.bashrc ? [yes|no]  
[no] >>> no

Now edit the BDD Shell configuration file (/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.conf) in your favourite text editor to add/amend the following lines:

  
SPARK\_EXECUTOR\_PYTHON=/u01/anaconda2/bin/python  
LOCAL\_PYTHON\_HOME=/u01/anaconda2  

Amend the path if you didn't install Anaconda into /u01

In the same configuration file, add/amend:

  
SPARK_HOME=/usr/lib/spark/  
SPARK\_EXTRA\_CLASSPATH=/usr/lib/oozie/oozie-sharelib-yarn/lib/spark/spark-avro_2.10-1.1.0-cdh5.7.0.jar  

Now run the BDD Shell setup:

  
/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/setup.sh

This should succeed:

  
[bigdatalite.localdomain] Validating pre-requisites...
[bigdatalite.localdomain] Validation Success
[bigdatalite.localdomain] Setting up BDD Shell...
[bigdatalite.localdomain] Setup Success
[oracle@bigdatalite Downloads]$

Assuming it does, you can now launch the shell bdd-shell.sh:

  
[oracle@bigdatalite Downloads]$ /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.sh
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).  
WARNING: Running spark-class from user-defined location.  
spark.driver.cores is set but does not apply in client mode.  
Welcome to  
     ___   ___   ___       __   _     ____  _     _
    | |_) | | \ | | \     ( (` | |_| | |_  | |   | |
    |_|_) |_|_/ |_|_/     _)_) |_| | |_|__ |_|__ |_|__

SparkContext available as sc, HiveContext available as sqlContext.  
BDD Context available as bc.

>>>

From the BDD Shell you can interact with BDD, for example to list out the datasets currently defined in the Catalog:

>>> bc.datasets().count  
17  
>>> for ds in bc.datasets():
...     print ds
...

media\_demo\_customer     edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        Hive    default.media_demo_customer

movie_genre     default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        Hive    default.movie_genre

media\_demo\_customer     default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        Hive    default.media_demo_customer  

Whilst BDD Shell is command-line based, there's also the option to run Jupyter Notebooks (previous iPython Notebooks) which is a web-based interactive "Notebook". This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a 'Notebook' enables you to modify and re-run commands, and then once correct, retain them as a fully functioning script for future use.

To launch it, run:

  
cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell  
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It's important that you run this from the bdd-shell folder, otherwise the BDD shell won't initialise properly
  • By default Jupyter uses 8888, which is already in use on BigDataLite by Hue, so use a different one by specifying --port
  • Jupyter by default only listens locally, so you need to either be using BigDataLite desktop to run Firefox, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and you should see the default Jupyter screen with a list of files:

Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

In the next article, we'll see how to use Jupyter Notebooks with Big Data Discovery, and get an idea of just how powerful the combination can be.

Categories: BI & Warehousing

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

Rittman Mead Consulting - Wed, 2016-06-01 18:17

One of the big changes in OBIEE 12c for end users is the ability to upload their own data sets and start analysing them directly, without needing to go through the traditional data provisioning and modelling process and associated leadtimes. The implementation of this is one of the big architectural changes of OBIEE 12c, introducing the concept of the Extended Subject Areas (XSA), and the Data Set Service (DSS).

In this article we’ll see some of how XSA and DSS work behind the scenes, providing an important insight for troubleshooting and performance analysis of this functionality.

What is an XSA?

An Extended Subject Area (XSA) is made up of a dataset, and associated XML data model. It can be used standalone, or “mashed up” in conjunction with a “traditional” subject area on a common field

How is an XSA Created?

At the moment the following methods are available:

  1. “Add XSA” in Visual Analzyer, to upload an Excel (XLSX) document

  2. CREATE DATASET logical SQL statement, that can be run through any interface to the BI Server, including ‘Issue Raw SQL’, nqcmd, JDBC calls, and so on

  3. Add Data Source in Answers. Whilst this option shouldn’t actually be present according to a this doc, it will be for any users of 12.2.1 who have uploaded the SampleAppLite BAR file so I’m including it here for completeness.

Under the covers, these all use the same REST API calls directly into datasetsvc. Note that these are entirely undocumented, and only for internal OBIEE component use. They are not intended nor supported for direct use.

How does an XSA work?

External Subject Areas (XSA) are managed by the Data Set Service (DSS). This is a java deployment (datasetsvc) running in the Managed Server (bi_server1), providing a RESTful API for the other OBIEE components that use it.

The end-user of the data, whether it’s Visual Analyzer or the BI Server, send REST web service calls to DSS, storing and querying datasets within it.

Where is the XSA Stored?

By default, the data for XSA is stored on disk in SINGLETON_DATA_DIRECTORY/components/DSS/storage/ssi, e.g. /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi

[oracle@demo ssi]$ ls -lrt /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi|tail -n5
-rw-r----- 1 oracle oinstall    8495 2015-12-02 18:01 7e43a80f-dcf6-4b31-b898-68616a68e7c4.dss
-rw-r----- 1 oracle oinstall  593662 2016-05-27 11:00 1beb5e40-a794-4aa9-8c1d-5a1c59888cb4.dss
-rw-r----- 1 oracle oinstall  131262 2016-05-27 11:12 53f59d34-2037-40f0-af21-45ac611f01d3.dss
-rw-r----- 1 oracle oinstall 1014459 2016-05-27 13:04 a4fc922d-ce0e-479f-97e4-1ddba074f5ac.dss
-rw-r----- 1 oracle oinstall 1014459 2016-05-27 13:06 c93aa2bd-857c-4651-bba2-a4f239115189.dss

They’re stored using the format in which they were created, which is XLSX (via VA) or CSV (via CREATE DATASET)

[oracle@demo ssi]$ head 53f59d34-2037-40f0-af21-45ac611f01d3.dss
"7 Megapixel Digital Camera","2010 Week 27",44761.88
"MicroPod 60Gb","2010 Week 27",36460.0
"MP3 Speakers System","2010 Week 27",36988.86
"MPEG4 Camcorder","2010 Week 28",32409.78
"CompCell RX3","2010 Week 28",33005.91

There’s a set of DSS-related tables installed in the RCU schema BIPLATFORM, which hold information including the XML data model for the XSA, along with metadata such as the user that uploaded the file, when they uploaded, and then name of the file on disk:

How Can the Data Set Service be Configured?

The configuration file, with plenty of inline comments, is at ORACLE_HOME/bi/endpointmanager/jeemap/dss/DSS_REST_SERVICE.properties. From here you can update settings for the data set service including upload limits as detailed here.

XSA Performance

Since XSA are based on flat files stored in disk, we need to be very careful in their use. Whilst a database may hold billions of rows in a table with with appropriate indexing and partitioning be able to provide sub-second responses, a flat file can quickly become a serious performance bottleneck. Bear in mind that a flat file is just a bunch of data plopped on disk – there is no concept of indices, blocks, partitions — all the good stuff that makes databases able to do responsive ad-hoc querying on selections of data.

If you’ve got a 100MB Excel file with thousands of cells, and want to report on just a few of them, you might find it laggy – because whether you want to report on them on or not, at some point OBIEE is going to have to read all of them regardless. We can see how OBIEE is handling XSA under the covers by examining the query log. This used to be called nqquery.log in OBIEE 11g (and before), and in OBIEE 12c has been renamed obis1-query.log.

In this example here I’m using an Excel worksheet with 140,000 rows and 78 columns. Total filesize of the source XLSX on disk is ~55Mb.

First up, I’ll build a query in Answers with a couple of the columns:

The logical query uses the new XSA syntax:

SELECT
   0 s_0,
   XSA('prodney'.'MOCK_DATA_bigger_55Mb')."Columns"."first_name" s_1,
   XSA('prodney'.'MOCK_DATA_bigger_55Mb')."Columns"."foo" s_2
FROM XSA('prodney'.'MOCK_DATA_bigger_55Mb')
ORDER BY 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

The query log shows

Rows 144000, bytes 13824000 retrieved from database query
Rows returned to Client 200

So of the 55MB of data, we’re pulling all the rows (144,000) back to the BI Server for it to then perform the aggregation on it, resulting in the 200 rows returned to the client (Presentation Services). Note though that the byte count is lower (13Mb) than the total size of the file (55Mb).

As well as aggregation, filtering on XSA data also gets done by the BI Server. Consider this example here, where we add a predicate:

xsa14

In the query log we can see that all the data has to come back from DSS to the BI Server, in order for it to filter it:

Rows 144000, bytes 23040000 retrieved from database
Physical query response time 24.195 (seconds),
Rows returned to Client 0

Note the time taken by DSS — nearly 25 seconds. Compare this later on to when we see the XSA data served from a database, via the XSA Cache.

In terms of BI Server (not XSA) caching, the query log shows that a cache entry was written for the above request:

Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736113_56359_0.TBL'

If I refresh the query in Answers, the data is fetched anew (per this changed behaviour in OBIEE 12c), and the cache repopulated. If I clear the Presentation Services cache and re-open the analysis, I get the results from the BI Server cache, and it doesn’t have to refetch the data from the Data Set Service.

Since the cache has two columns in, an attribute and a measure, I wondered if running a query with just the fact rolled up might hit the cache (since it has all the data there that it needs)

Unfortunately it didn’t, and to return a single row of data required BI Server to fetch all the rows again – although looking at the byte count it appears it does prune the columns required since it’s now just over 2Mb of data returned this time:

Rows 144000, bytes 2304000 retrieved from database
Rows returned to Client 1

Interestingly if I build an analysis with several more of the columns from the file (in this example, ten of a total of 78), the data returned from the DSS to BI Server (167Mb) is greater than that of the original file (55Mb).

Rows 144000, bytes 175104000
Rows returned to Client 1000

And this data coming back from the DSS to the BI Server has to go somewhere – and if it’s big enough it’ll overflow to disk, as we can see when I run the above:

$ ls -l /app/oracle/biee/user_projects/domains/bi/servers/obis1/tmp/obis_temp
[...]
-rwxrwx--- 1 oracle oinstall 2910404 2016-06-01 14:08 nQS_AG_22345_7503_7c9c000a_50906091.TMP
-rwxrwx--- 1 oracle oinstall   43476 2016-06-01 14:08 nQS_AG_22345_7504_7c9c000a_50906091.TMP
-rw------- 1 oracle oinstall 6912000 2016-06-01 14:08 nQS_AG_22345_7508_7c9c000a_50921949.TMP
-rw------- 1 oracle oinstall  631375 2016-06-01 14:08 nQS_EX_22345_7506_7c9c000a_50921652.TMP
-rw------- 1 oracle oinstall 3670016 2016-06-01 14:08 nQS_EX_22345_7507_7c9c000a_50921673.TMP
[...]

You can read more about BI Server’s use of temporary files and the impact that it can have on system performance and particularly I/O bandwidth in this OTN article here.

So – as the expression goes – “buyer beware”. XSA is an excellent feature, but used in its default configuration with files stored on disk it has the potential to wreak havoc if abused.

XSA Caching

If you’re planning to use XSA seriously, you should set up the database-based XSA Cache. This is described in detail in the PDF document attached to My Oracle Support note OBIEE 12c: How To Configure The External Subject Area (XSA) Cache For Data Blending| Mashup And Performance (Doc ID 2087801.1).

In a proper implementation you would follow in full the document, including provisioning a dedicated schema and tablespace for holding the data (to make it easier to manage and segregate from other data), but here I’m just going to use the existing RCU schema (BIPLATFORM), along with the Physical mapping already in the RPD (10 - System DB (ORCL)):

In NQSConfig.INI, under the XSA_CACHE section, I set:

ENABLE = YES;

# The schema and connection pool where the XSA data will be cached.
PHYSICAL_SCHEMA = "10 - System DB (ORCL)"."Catalog"."dbo";
CONNECTION_POOL = "10 - System DB (ORCL)"."UT Connection Pool";

And restart the BI Server:

/app/oracle/biee/user_projects/domains/bi/bitools/bin/stop.sh -i obis1 && /app/oracle/biee/user_projects/domains/bi/bitools/bin/start.sh -i obis1

Per the document, note that in the BI Server log there’s an entry indicating that the cache has been successfully started:

[101001] External Subject Area cache is started successfully using configuration from the repository with the logical name ssi.
[101017] External Subject Area cache has been initialized. Total number of entries: 0 Used space: 0 bytes Maximum space: 107374182400 bytes Remaining space: 107374182400 bytes. Cache table name prefix is XC2875559987.

Now when I re-run the test XSA analysis from above, returning three columns, the BI Server goes off and populates the XSA cache table:

-- Sending query to database named 10 - System DB (ORCL) (id: <<79879>> XSACache Create table Gateway), connection pool named UT Connection Pool, logical request hash b4de812e, physical request hash 5847f2ef:
CREATE TABLE dbo.XC2875559987_ZPRODNE1926129021 ( id3209243024 DOUBLE PRECISION, first_n[..]

Or rather, it doesn’t, because PHYSICAL_SCHEMA seems to want the literal physical schema, rather than the logical physical one (?!) that the USAGE_TRACKING configuration stanza is happy with in referencing the table.

Properties: description=<<79879>> XSACache Create table Exchange; producerID=0x1561aff8; requestID=0xfffe0034; sessionID=0xfffe0000; userName=prodney;
[nQSError: 17001] Oracle Error code: 1918, message: ORA-01918: user 'DBO' does not exist

I’m trying to piggyback on SA511’s existing configruation, which uses catalog.schema notation:

Instead of the more conventional approach to have the actual physical schema (often used in conjunction with ‘Require fully qualified table names’ in the connection pool):

So now I’ll do it properly, and create a database and schema for the XSA cache – I’m still going to use the BIPLATFORM schema though…

Updated NQSConfig.INI:

[ XSA_CACHE ]

ENABLE = YES;

# The schema and connection pool where the XSA data will be cached.
PHYSICAL_SCHEMA = "XSA Cache"."BIEE_BIPLATFORM";
CONNECTION_POOL = "XSA Cache"."XSA CP";

After refreshing the analysis again, there’s a successful creation of the XSA cache table:

-- Sending query to database named XSA Cache (id: <<65685>> XSACache Create table Gateway), connection pool named XSA CP, logical request hash 9a548c60, physical request hash ccc0a410: [[
CREATE TABLE BIEE_BIPLATFORM.XC2875559987_ZPRODNE1645894381 ( id3209243024 DOUBLE PRECISION, first_name2360035083 VARCHAR2(17 CHAR), [...]

as well as a stats gather:

-- Sending query to database named XSA Cache (id: <<65685>> XSACache Collect statistics Gateway), connection pool named XSA CP, logical request hash 9a548c60, physical request hash d73151bb:
BEGIN DBMS_STATS.GATHER_TABLE_STATS(ownname => 'BIEE_BIPLATFORM', tabname => 'XC2875559987_ZPRODNE1645894381' , estimate_percent => 5 , method_opt => 'FOR ALL COLUMNS SIZE AUTO' ); END;

Although I do note that it is used a fixed estimate_percent instead of the recommended AUTO_SAMPLE_SIZE. The table itself is created with a fixed prefix (as specified in the obis1-diagnostic.log at initialisation), and holds a full copy of the XSA (not just the columns in the query that triggered the cache creation):

With the dataset cached, the query is then run and the query log shows a XSA cache hit

External Subject Area cache hit for 'prodney'.'MOCK_DATA_bigger_55Mb'/Columns :
Cache entry shared_cache_key = 'prodney'.'MOCK_DATA_bigger_55Mb',
table name = BIEE_BIPLATFORM.XC2875559987_ZPRODNE2128899357,
row count = 144000,
entry size = 201326592 bytes,
creation time = 2016-06-01 20:14:26.829,
creation elapsed time = 49779 ms,
descriptor ID = /app/oracle/biee/user_projects/domains/bi/servers/obis1/xsacache/NQSXSA_BIEE_BIPLATFORM.XC2875559987_ZPRODNE2128899357_2.CACHE

with the resulting physical query fired at the XSA cache table (replacing what would have gone against the DSS web service):

-- Sending query to database named XSA Cache (id: <<65357>>), connection pool named XSA CP, logical request hash 9a548c60, physical request hash d3ed281d: [[
WITH
SAWITH0 AS (select T1000001.first_name2360035083 as c1,
     T1000001.last_name3826278858 as c2,
     sum(T1000001.foo2363149668) as c3
from
     BIEE_BIPLATFORM.XC2875559987_ZPRODNE1645894381 T1000001
group by T1000001.first_name2360035083, T1000001.last_name3826278858)
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3, D1.c4 as c4 from ( select 0 as c1,
     D102.c1 as c2,
     D102.c2 as c3,
     D102.c3 as c4
from
     SAWITH0 D102
order by c2, c3 ) D1 where rownum <= 5000001

It’s important to point out the difference of what’s happening here: the aggregation has been pushed down to the database, meaning that the BI Server doesn’t have to. In performance terms, this is a Very Good Thing usually.

Rows 988, bytes 165984 retrieved from database query
Rows returned to Client 988

Whilst it doesn’t seem to be recorded in the query log from what I can see, the data returned from the XSA Cache also gets inserted into the BI Server cache, and if you open an XSA-based analysis that’s not in the presentation services cache (a third cache to factor in!) you will get a cache hit on the BI Server cache. As discussed earlier in this article though, if an analysis is built against an XSA for which a BI Server cache entry exists that with manipulation could service it (eg pruning columns or rolling up), it doesn’t appear to take advantage of it – but since it’s hitting the XSA cache this time, it’s less of a concern.

If you change the underlying data in the XSA

The BI Server does pick this up and repopulates the XSA Cache.

The XSA cache entry itself is 192Mb in size – generated from a 55Mb upload file. The difference will be down to data types and storage methods etc. However, that it is larger in the XSA Cache (database) than held natively (flat file) doesn’t really matter, particularly if the data is being aggregated and/or filtered, since the performance benefit of pushing this work to the database will outweigh the overhead of storage space. Consider this example here, where I run an analysis pulling back 44 columns (of the 78 in the spreadsheet) and hit the XSA cache, it runs in just over a second, and transfers from the database a total of 5.3Mb (the data is repeated, so rolls up):

Rows 1000, bytes 5576000 retrieved from database
Rows returned to Client 1000

If I disable the XSA cache and run the same query, we see this:

Rows 144000, bytes 801792000 Retrieved from database
Physical query response time 22.086 (seconds)
Rows returned to Client 1000

That’s 764Mb being sent back for the BI Server to process, which it does by dumping a whole load to disk in temporary work files:

$  ls -l /app/oracle/biee/user_projects/domains/bi/servers/obis1/tmp/obis_temp
[...]]
-rwxrwx--- 1 oracle oinstall 10726190 2016-06-01 21:04 nQS_AG_29733_261_ebd70002_75835908.TMP
-rwxrwx--- 1 oracle oinstall   153388 2016-06-01 21:04 nQS_AG_29733_262_ebd70002_75835908.TMP
-rw------- 1 oracle oinstall 24192000 2016-06-01 21:04 nQS_AG_29733_266_ebd70002_75862509.TMP
-rw------- 1 oracle oinstall  4195609 2016-06-01 21:04 nQS_EX_29733_264_ebd70002_75861716.TMP
-rw------- 1 oracle oinstall 21430272 2016-06-01 21:04 nQS_EX_29733_265_ebd70002_75861739.TMP

As a reminder – this isn’t “Bad”, it’s just not optimal (response time of 50 seconds vs 1 second), and if you scale that kind of behaviour by many users with many datasets, things could definitely get hairy for all users of the system. Hence – use the XSA Cache.

As a final point, with the XSA Cache being in the database the standard range of performance optimisations are open to us – indexing being the obvious one. No indexes are built against the XSA Cache table by default, which is fair enough since OBIEE has no idea what the key columns on the data are, and the point of mashups is less to model and optimise the data but to just get it up there in front of the user. So you could index the table if you knew the key columns that were going to be filtered against, or you could even put it into memory (assuming you’ve licensed the option).

The MoS document referenced above also includes further performance recommendations for XSA, including the use of RAM Disk for XSA cache metadata files, as well as the managed server temp folder

Summary

External Subject Areas are great functionality, but be aware of the performance implications of not being able to push down common operations such as filtering and aggregation. Set up XSA Caching if you are going to be using XSA properly.

If you’re interested in the direction of XSA and the associated Data Set Service, this slide deck from Oracle’s Socs Cappas provides some interesting reading. Uploading Excel files into OBIEE looks like just the beginning of what the Data Set Service is going to enable!

The post OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

Rittman Mead Consulting - Wed, 2016-06-01 14:06
OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

One of the big changes in OBIEE 12c for end users is the ability to upload their own data sets and start analysing them directly, without needing to go through the traditional data provisioning and modelling process and associated leadtimes. The implementation of this is one of the big architectural changes of OBIEE 12c, introducing the concept of the Extended Subject Areas (XSA), and the Data Set Service (DSS).

In this article we'll see some of how XSA and DSS work behind the scenes, providing an important insight for troubleshooting and performance analysis of this functionality.

What is an XSA?

An Extended Subject Area (XSA) is made up of a dataset, and associated XML data model. It can be used standalone, or "mashed up" in conjunction with a "traditional" subject area on a common field

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

How is an XSA Created?

At the moment the following methods are available:

  1. "Add XSA" in Visual Analzyer, to upload an Excel (XLSX) document OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

  2. CREATE DATASET logical SQL statement, that can be run through any interface to the BI Server, including 'Issue Raw SQL', nqcmd, JDBC calls, and so on OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

  3. Add Data Source in Answers. Whilst this option shouldn't actually be present according to a this doc, it will be for any users of 12.2.1 who have uploaded the SampleAppLite BAR file so I'm including it here for completeness. OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

Under the covers, these all use the same REST API calls directly into datasetsvc. Note that these are entirely undocumented, and only for internal OBIEE component use. They are not intended nor supported for direct use.

How does an XSA work?

External Subject Areas (XSA) are managed by the Data Set Service (DSS). This is a java deployment (datasetsvc) running in the Managed Server (bi_server1), providing a RESTful API for the other OBIEE components that use it.

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

The end-user of the data, whether it's Visual Analyzer or the BI Server, send REST web service calls to DSS, storing and querying datasets within it.

Where is the XSA Stored?

By default, the data for XSA is stored on disk in SINGLETONDATADIRECTORY/components/DSS/storage/ssi, e.g. /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi

[oracle@demo ssi]$ ls -lrt /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi|tail -n5  
-rw-r----- 1 oracle oinstall    8495 2015-12-02 18:01 7e43a80f-dcf6-4b31-b898-68616a68e7c4.dss
-rw-r----- 1 oracle oinstall  593662 2016-05-27 11:00 1beb5e40-a794-4aa9-8c1d-5a1c59888cb4.dss
-rw-r----- 1 oracle oinstall  131262 2016-05-27 11:12 53f59d34-2037-40f0-af21-45ac611f01d3.dss
-rw-r----- 1 oracle oinstall 1014459 2016-05-27 13:04 a4fc922d-ce0e-479f-97e4-1ddba074f5ac.dss
-rw-r----- 1 oracle oinstall 1014459 2016-05-27 13:06 c93aa2bd-857c-4651-bba2-a4f239115189.dss

They're stored using the format in which they were created, which is XLSX (via VA) or CSV (via CREATE DATASET)

[oracle@demo ssi]$ head 53f59d34-2037-40f0-af21-45ac611f01d3.dss  
"7 Megapixel Digital Camera","2010 Week 27",44761.88
"MicroPod 60Gb","2010 Week 27",36460.0
"MP3 Speakers System","2010 Week 27",36988.86
"MPEG4 Camcorder","2010 Week 28",32409.78
"CompCell RX3","2010 Week 28",33005.91

There's a set of DSS-related tables installed in the RCU schema BIPLATFORM, which hold information including the XML data model for the XSA, along with metadata such as the user that uploaded the file, when they uploaded, and then name of the file on disk:

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

How Can the Data Set Service be Configured?

The configuration file, with plenty of inline comments, is at ORACLEHOME/bi/endpointmanager/jeemap/dss/DSSREST_SERVICE.properties. From here you con update settings for the data set service including upload limits as detailed here.

XSA Performance

Since XSA are based on flat files stored in disk, we need to be very careful in their use. Whilst a database may hold billions of rows in a table with with appropriate indexing and partitioning be able to provide sub-second responses, a flat file can quickly become a serious performance bottleneck. Bear in mind that a flat file is just a bunch of data plopped on disk - there is no concept of indices, blocks, partitions -- all the good stuff that makes databases able to do responsive ad-hoc querying on selections of data.

If you've got a 100MB Excel file with thousands of cells, and want to report on just a few of them, you might find it laggy - because whether you want to report on them on or not, at some point OBIEE is going to have to read all of them regardless. We can see how OBIEE is handling XSA under the covers by examining the query log. This used to be called nqquery.log in OBIEE 11g (and before), and in OBIEE 12c has been renamed obis1-query.log.

In this example here I'm using an Excel worksheet with 140,000 rows and 78 columns. Total filesize of the source XLSX on disk is ~55Mb.

First up, I'll build a query in Answers with a couple of the columns:

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

The logical query uses the new XSA syntax:

SELECT  
   0 s_0,
   XSA('prodney'.'MOCK_DATA_bigger_55Mb')."Columns"."first_name" s_1,
   XSA('prodney'.'MOCK_DATA_bigger_55Mb')."Columns"."foo" s_2
FROM XSA('prodney'.'MOCK_DATA_bigger_55Mb')  
ORDER BY 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY
The query log shows
Rows 144000, bytes 13824000 retrieved from database query
Rows returned to Client 200
So of the 55MB of data, we're pulling all the rows (144,000) back to the BI Server for it to then perform the aggregation on it, resulting in the 200 rows returned to the client (Presentation Services). Note though that the byte count is lower (13Mb) than the total size of the file (55Mb). As well as aggregation, filtering on XSA data also gets done by the BI Server. Consider this example here, where we add a predicate: OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service In the query log we can see that all the data has to come back from DSS to the BI Server, in order for it to filter it:
Rows 144000, bytes 23040000 retrieved from database  
Physical query response time 24.195 (seconds),  
Rows returned to Client 0
Note the time taken by DSS -- nearly 25 seconds. Compare this later on to when we see the XSA data served from a database, via the XSA Cache. In terms of BI Server (not XSA) caching, the query log shows that a cache entry was written for the above request:
Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736113_56359_0.TBL'
If I refresh the query in Answers, the data is fetched anew (per this changed behaviour in OBIEE 12c), and the cache repopulated. If I clear the Presentation Services cache and re-open the analysis, I get the results from the BI Server cache, and it doesn't have to refetch the data from the Data Set Service. Since the cache has two columns in, an attribute and a measure, I wondered if running a query with just the fact rolled up might hit the cache (since it has all the data there that it needs) OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service Unfortunately it didn't, and to return a single row of data required BI Server to fetch all the rows again - although looking at the byte count it appears it does prune the columns required since it's now just over 2Mb of data returned this time:
Rows 144000, bytes 2304000 retrieved from database
Rows returned to Client 1
Interestingly if I build an analysis with several more of the columns from the file (in this example, ten of a total of 78), the data returned from the DSS to BI Server (167Mb) is greater than that of the original file (55Mb). OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service
Rows 144000, bytes 175104000
Rows returned to Client 1000
And this data coming back from the DSS to the BI Server has to go somewhere - and if it's big enough it'll overflow to disk, as we can see when I run the above:
  
$ ls -l /app/oracle/biee/user_projects/domains/bi/servers/obis1/tmp/obis_temp
[...]
-rwxrwx--- 1 oracle oinstall 2910404 2016-06-01 14:08 nQS_AG_22345_7503_7c9c000a_50906091.TMP
-rwxrwx--- 1 oracle oinstall   43476 2016-06-01 14:08 nQS_AG_22345_7504_7c9c000a_50906091.TMP
-rw------- 1 oracle oinstall 6912000 2016-06-01 14:08 nQS_AG_22345_7508_7c9c000a_50921949.TMP
-rw------- 1 oracle oinstall  631375 2016-06-01 14:08 nQS_EX_22345_7506_7c9c000a_50921652.TMP
-rw------- 1 oracle oinstall 3670016 2016-06-01 14:08 nQS_EX_22345_7507_7c9c000a_50921673.TMP
[...]

You can read more about BI Server's use of temporary files and the impact that it can have on system performance and particularly I/O bandwidth in this OTN article here.

So - as the expression goes - "buyer beware". XSA is an excellent feature, but used in its default configuration with files stored on disk it has the potential to wreak havoc if abused.

XSA Caching

If you're planning to use XSA seriously, you should set up the database-based XSA Cache. This is described in detail in the PDF document attached to My Oracle Support note OBIEE 12c: How To Configure The External Subject Area (XSA) Cache For Data Blending| Mashup And Performance (Doc ID 2087801.1).

In a proper implementation you would follow in full the document, including provisioning a dedicated schema and tablespace for holding the data (to make it easier to manage and segregate from other data), but here I'm just going to use the existing RCU schema (BIPLATFORM), along with the Physical mapping already in the RPD (10 - System DB (ORCL)):

In NQSConfig.INI, under the XSA_CACHE section, I set:

ENABLE = YES;

# The schema and connection pool where the XSA data will be cached.
PHYSICAL_SCHEMA = "10 - System DB (ORCL)"."Catalog"."dbo";
CONNECTION_POOL = "10 - System DB (ORCL)"."UT Connection Pool";
And restart the BI Server:
  
/app/oracle/biee/user_projects/domains/bi/bitools/bin/stop.sh -i obis1 && /app/oracle/biee/user_projects/domains/bi/bitools/bin/start.sh -i obis1

Per the document, note that in the BI Server log there's an entry indicating that the cache has been successfully started:

[101001] External Subject Area cache is started successfully using configuration from the repository with the logical name ssi.
[101017] External Subject Area cache has been initialized. Total number of entries: 0 Used space: 0 bytes Maximum space: 107374182400 bytes Remaining space: 107374182400 bytes. Cache table name prefix is XC2875559987.
Now when I re-run the test XSA analysis from above, returning three columns, the BI Server goes off and populates the XSA cache table:
-- Sending query to database named 10 - System DB (ORCL) (id: <<79879>> XSACache Create table Gateway), connection pool named UT Connection Pool, logical request hash b4de812e, physical request hash 5847f2ef:  
CREATE TABLE dbo.XC2875559987_ZPRODNE1926129021 ( id3209243024 DOUBLE PRECISION, first_n[..]  

Or rather, it doesn't, because PHYSICALSCHEMA seems to want the literal physical schema, rather than the logical physical one (?!) that the USAGETRACKING configuration stanza is happy with in referencing the table.

Properties: description=<<79879>> XSACache Create table Exchange; producerID=0x1561aff8; requestID=0xfffe0034; sessionID=0xfffe0000; userName=prodney;  
[nQSError: 17001] Oracle Error code: 1918, message: ORA-01918: user 'DBO' does not exist

I'm trying to piggyback on SA511's existing configruation, which uses catalog.schema notation:

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

Instead of the more conventional approach to have the actual physical schema (often used in conjunction with 'Require fully qualified table names' in the connection pool):

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

So now I'll do it properly, and create a database and schema for the XSA cache - I'm still going to use the BIPLATFORM schema though...

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

Updated NQSConfig.INI:

[ XSA_CACHE ]

ENABLE = YES;

# The schema and connection pool where the XSA data will be cached.
PHYSICAL_SCHEMA = "XSA Cache"."BIEE_BIPLATFORM";
CONNECTION_POOL = "XSA Cache"."XSA CP";
After refreshing the analysis again, there's a successful creation of the XSA cache table:
-- Sending query to database named XSA Cache (id: <<65685>> XSACache Create table Gateway), connection pool named XSA CP, logical request hash 9a548c60, physical request hash ccc0a410: [[  
CREATE TABLE BIEE_BIPLATFORM.XC2875559987_ZPRODNE1645894381 ( id3209243024 DOUBLE PRECISION, first_name2360035083 VARCHAR2(17 CHAR), [...]  

as well as a stats gather:

-- Sending query to database named XSA Cache (id: <<65685>> XSACache Collect statistics Gateway), connection pool named XSA CP, logical request hash 9a548c60, physical request hash d73151bb:  
BEGIN DBMS_STATS.GATHER_TABLE_STATS(ownname => 'BIEE_BIPLATFORM', tabname => 'XC2875559987_ZPRODNE1645894381' , estimate_percent => 5 , method_opt => 'FOR ALL COLUMNS SIZE AUTO' ); END;  

Although I do note that it is used a fixed estimatepercent instead of the recommended AUTOSAMPLE_SIZE. The table itself is created with a fixed prefix (as specified in the obis1-diagnostic.log at initialisation), and holds a full copy of the XSA (not just the columns in the query that triggered the cache creation):

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

With the dataset cached, the query is then run and the query log shows a XSA cache hit

External Subject Area cache hit for 'prodney'.'MOCK_DATA_bigger_55Mb'/Columns :
Cache entry shared_cache_key = 'prodney'.'MOCK_DATA_bigger_55Mb',
table name = BIEE_BIPLATFORM.XC2875559987_ZPRODNE2128899357,
row count = 144000,
entry size = 201326592 bytes,
creation time = 2016-06-01 20:14:26.829,
creation elapsed time = 49779 ms,
descriptor ID = /app/oracle/biee/user_projects/domains/bi/servers/obis1/xsacache/NQSXSA_BIEE_BIPLATFORM.XC2875559987_ZPRODNE2128899357_2.CACHE
with the resulting physical query fired at the XSA cache table (replacing what would have gone against the DSS web service):
-- Sending query to database named XSA Cache (id: <<65357>>), connection pool named XSA CP, logical request hash 9a548c60, physical request hash d3ed281d: [[  
WITH  
SAWITH0 AS (select T1000001.first_name2360035083 as c1,  
     T1000001.last_name3826278858 as c2,
     sum(T1000001.foo2363149668) as c3
from  
     BIEE_BIPLATFORM.XC2875559987_ZPRODNE1645894381 T1000001
group by T1000001.first_name2360035083, T1000001.last_name3826278858)  
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3, D1.c4 as c4 from ( select 0 as c1,  
     D102.c1 as c2,
     D102.c2 as c3,
     D102.c3 as c4
from  
     SAWITH0 D102
order by c2, c3 ) D1 where rownum <= 5000001  

It's important to point out the difference of what's happening here: the aggregation has been pushed down to the database, meaning that the BI Server doesn't have to. In performance terms, this is a Very Good Thing usually.

Rows 988, bytes 165984 retrieved from database query
Rows returned to Client 988
Whilst it doesn't seem to be recorded in the query log from what I can see, the data returned from the XSA Cache also gets inserted into the BI Server cache, and if you open an XSA-based analysis that's not in the presentation services cache (a third cache to factor in!) you will get a cache hit on the BI Server cache. As discussed earlier in this article though, if an analysis is built against an XSA for which a BI Server cache entry exists that with manipulation could service it (eg pruning columns or rolling up), it doesn't appear to take advantage of it - but since it's hitting the XSA cache this time, it's less of a concern. If you change the underlying data in the XSA OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service The BI Server does pick this up and repopulates the XSA Cache. The XSA cache entry itself is 192Mb in size - generated from a 55Mb upload file. The difference will be down to data types and storage methods etc. However, that it is larger in the XSA Cache (database) than held natively (flat file) doesn't really matter, particularly if the data is being aggregated and/or filtered, since the performance benefit of pushing this work to the database will outweigh the overhead of storage space. Consider this example here, where I run an analysis pulling back 44 columns (of the 78 in the spreadsheet) and hit the XSA cache, it runs in just over a second, and transfers from the database a total of 5.3Mb (the data is repeated, so rolls up):
Rows 1000, bytes 5576000 retrieved from database
Rows returned to Client 1000
If I disable the XSA cache and run the same query, we see this:
Rows 144000, bytes 801792000 Retrieved from database
Physical query response time 22.086 (seconds)
Rows returned to Client 1000
That's 764Mb being sent back for the BI Server to process, which it does by dumping a whole load to disk in temporary work files:
$  ls -l /app/oracle/biee/user_projects/domains/bi/servers/obis1/tmp/obis_temp  
[...]]
-rwxrwx--- 1 oracle oinstall 10726190 2016-06-01 21:04 nQS_AG_29733_261_ebd70002_75835908.TMP
-rwxrwx--- 1 oracle oinstall   153388 2016-06-01 21:04 nQS_AG_29733_262_ebd70002_75835908.TMP
-rw------- 1 oracle oinstall 24192000 2016-06-01 21:04 nQS_AG_29733_266_ebd70002_75862509.TMP
-rw------- 1 oracle oinstall  4195609 2016-06-01 21:04 nQS_EX_29733_264_ebd70002_75861716.TMP
-rw------- 1 oracle oinstall 21430272 2016-06-01 21:04 nQS_EX_29733_265_ebd70002_75861739.TMP

As a reminder - this isn't "Bad", it's just not optimal (response time of 50 seconds vs 1 second), and if you scale that kind of behaviour by many users with many datasets, things could definitely get hairy for all users of the system. Hence - use the XSA Cache.

As a final point, with the XSA Cache being in the database the standard range of performance optimisations are open to us - indexing being the obvious one. No indexes are built against the XSA Cache table by default, which is fair enough since OBIEE has no idea what the key columns on the data are, and the point of mashups is less to model and optimise the data but to just get it up there in front of the user. So you could index the table if you knew the key columns that were going to be filtered against, or you could even put it into memory (assuming you've licensed the option).

The MoS document referenced above also includes further performance recommendations for XSA, including the use of RAM Disk for XSA cache metadata files, as well as the managed server temp folder

Summary

External Subject Areas are great functionality, but be aware of the performance implications of not being able to push down common operations such as filtering and aggregation. Set up XSA Caching if you are going to be using XSA properly.

If you're interested in the direction of XSA and the associated Data Set Service, this slide deck from Oracle's Socs Cappas provides some interesting reading. Uploading Excel files into OBIEE looks like just the beginning of what the Data Set Service is going to enable!

Categories: BI & Warehousing

Announcing the Dodeca Spreadsheet Management System, Version 7 and the Dodeca Excel Add-In for Essbase

Tim Tow - Tue, 2016-05-31 14:54
After 18 months of hard work, Applied OLAP is proud to announce the general availability of the Dodeca Spreadsheet Management System, version 7, and the all-new Dodeca Excel Add-In for Essbase.

The Dodeca Spreadsheet Management System provides customers the ability to automate spreadsheet functionality, reducing the risk of spreadsheet errors while increasing productivity.  It combines unprecedented ease-of-use for business users using spreadsheets for planning, budgeting, forecasting, reporting and analysis tasks.  It also provides a robust, programmable development environment enabling companies to create spreadsheet applications tailored to their specific needs.

The new Dodeca Excel Add-In for Essbase was created as a drop-in replacement for the classic Essbase add-in and is the world’s only Excel add-in focused exclusively on Essbase.  The new add-in supports the most common actions used by Essbase traditionalists, supports the corresponding VBA functions, and includes a built-in Excel ribbon.  Early adopters have also been impressed by the speed of retrieving data, commenting they found the Dodeca Excel Add-In as fast as, or even faster, than the classic Excel Add-In for Essbase.  It is supported for Excel 2010, 2013, and 2016 and for Essbase 9.3.1 and higher.




Dodeca 7 includes new features and functionality for security, selectors, logging, and enhanced workbook scripting.

The enhanced security features add the ability to more easily manage users, roles, and permissions to access a Dodeca application.  The new security features include:
  • User Management – You can now track, monitor, and control user access to a Dodeca application and more easily monitor your Dodeca user base.  This feature enables customers to control individual user access to a specific application, enable users for admin access, and manage user mapping to roles while also tracking metrics such as the first, last, and count of user logins.  This feature also logs metrics on the users system to enable Dodeca administrators to more easily support their users.



    Here is an example of the metrics stored for each user record.

  • User Roles – In addition to provisioning roles via Essbase, Microsoft Active Directory, or LDAP groups using Dodeca authentication services, you can now create your own groups directly in Dodeca and map them to users.  In addition, these new roles can be configured to be additive to the roles provided by other authentication services.  
  • Application Session Timeout – You can now prevent users from keeping a Dodeca session open indefinitely by specifying an inactivity timeout or by specifying a designated shutdown time.

The new logging features provide the ability for customers to both easily track what their users are running in the system and assist our support team if and when support is needed.  The new logging features include:
  • View Usage Logging – View usage is now automatically logged in the server.  The logs include not only identifying information for the view and the user, but also include performance information, error information, and tokens used in the view generation.

  • Client-side Request and Response XML Logging – The XML traffic traveling between the client and the server may now be logged to a directory on the client machine.  This logging expands on the Server-side Request and Response XML Logging to make it easier to gather the XML traffic for a single user.  In turn, the XML captured can be used by our support team to easily replicate transactions for our developers if and when necessary.

The selector enhancements include improvements to both view selectors and member selectors.  The improvements include:
  • View Selector Relational Hierarchy Generation – You may now populate the entire View Hierarchy, or alternatively, populate one or more branches, based on the results of a relational query.

  • View Selector Hierarchy Item Access Filters – Previously, you could control the View Hierarchies available to a given user.  This new filter provides the ability to control which view hierarchy items are presented in the view selector based on the current user’s roles.

  • Relational Treeview Point-of-View Selector List– You may now configure a hierarchy of point-of-view items based on the results of a relational query.

  • Enhanced Essbase Treeview Point-of-View Selector List Expansion and Selection Filtering – You can now filter Essbase member content and selectable status based on one or more specified filters including generation number, level number, member name, alias, or shared status.
View editing enhancements make it even easier to create and modify Views in Dodeca:
  • Enhanced SQLPassthroughDataSet Query Editing – You can now define token values to use for testing tokenized SQL queries in the Test Data Set utility.
  • Improved Token Editor – You can now view and edit tokens and their values in an improved grid layout.

Workbook scripting functionality adds more flexibility to Dodeca via 108 events that provide the opportunity for customers to extend Dodeca, 116 configurable methods (actions that can be taken in response to the events), and 138 functions that provide extended information to the workbook script.

The new workbook script functionality includes:
  • New Events
    • BeforeBuildExecute - Allows a workbook script to cancel the build before the view is covered.
    • BeforeRefreshExecute - Allows a workbook script to cancel the refresh before the view is covered.
    • Shown - Raised when the view is initially shown.  The event is raised before the framework applies the view’s AutoBuildOnOpen property, which allows a workbook script to set the property dynamically.
  • New Methods
    • ExportToExcel – Provides the ability to export view and point-of-view information in an exported Excel file.  This information is used to regenerate the view upon Excel file import which enables off-line data entry in Excel with subsequent database updates in Dodeca.
    • SetSelectorConfiguration - Provides the ability to add or remove a point-of-view selector to/from a view dynamically.
  • Enhanced Methods
    • CallWebService – Added a new RESTRequest overload, which allows web service calls to made to RESTful web services.
    • SendEmail - Added a new ServletSMTP overload, which sends email from the Dodeca server rather than from the client.  This is useful in circumstances in which SMTP mail must come from an approved IP address.
    • SetEntry – Added a new Added FormulaArray overload which allows Excel array formulas to be entered programmatically.
    • SetFill - Added a new Clear overload, which clears the fill color and pattern from the specified range.
  • New Functions
    • ColumnWidth - Returns the column width based on the active cell or a given cell.
    • RowHeight - Returns the row height based on the active cell or a given cell.
    • DataPointHasCellNote - Returns a boolean indicating if the active or specified data cell has Essbase LRO cell notes associated with it.
    • IsInCharacterRange - Returns a boolean indicating whether the specified string is limited to the specified character range.  This function can be used to detect multi-byte characters in a string.
  • Enhanced Functions
    • SheetCount - Added an optional IncludeHiddenSheets argument, which controls whether the returned count includes both visible and hidden sheets or only visible sheets.
For more information, we recommend you download and read the latest Dodeca release notes from the registered section of our website.  As always, you may also reach out to us via email or call us at
256.885.4371.

Categories: BI & Warehousing

Changes in BI Server Cache Behaviour in OBIEE 12c : OBIS_REFRESH_CACHE

Rittman Mead Consulting - Tue, 2016-05-31 13:59
 OBIS_REFRESH_CACHE

The OBIEE BI Server cache can be a great way of providing a performance boost to response times for end users - so long as it's implemented carefully. Done wrong, and you're papering over the cracks and heading for doom; done right, and it's the 'icing on the cake'. You can read more about how to use it properly here, and watch a video I did about it here. In this article we'll see how the BI Server cache has changed in OBIEE 12c in a way that could prove somewhat perplexing to developers used to OBIEE 11g.

The BI Server cache works by inspecting queries as they are sent to the BI Server, and deciding if an existing cache entry can be used to provide the data. This can include direct hits (i.e. the same query being run again), or more advanced cases, where a subset or aggregation of an existing cache entry could be used. If a cache entry is used then a trip to the database is avoided and response times will typically be better - particularly if more than one database query would have been involved, or lots of additional post-processing on the BI Server.

When an analysis or dashboard is run, Presentation Services generates the necessary Logical SQL to return the data needed, and sends this to the BI Server. It's at this point that the cache will, or won't, kick in. The BI Server will accept Logical SQL from other sources than Presentation Services - in fact, any JDBC or ODBC client. This is useful as it enables us to validate behaviour that we're observing and see how it can apply elsewhere.

When you build an Analysis in OBIEE 11g (and before), the cache will be used if applicable. Each time you add a column, or hit refresh, you'll get an entry back from the cache if one exists. This has benefits - speed - but disadvantages too. When the data in the database changes, you will still get a cache hit, regardless. The only way to force OBIEE to show you the latest version of the data is to purge the cache first. You can target cache purges based on databases, tables, or even specific queries - but you do need to purge it.

What's changed in OBIEE 12c is that when you click "Refresh" on an Analysis or Dashboard, the query is re-run against the source and the cache re-populated. Even if you have an existing cache entry, and even if the underlying data has not changed, if you hit Refresh, the cache will not be used. Which kind of makes sense, since "refresh" probably should indeed mean that.

Digging into OBIEE Cache Behaviour

Let's prove this out. I've got SampleApp v506 (OBIEE 11.1.1.9), and SampleApp v511 (OBIEE 12.2.1). First off, I'll clear the cache on each, using call saPurgeAllCache();, run via Issue SQL:

 OBIS_REFRESH_CACHE

 OBIS_REFRESH_CACHE

Then I can use another BI Server procedure call to view the current cache contents (new in 11.1.1.9), call NQS_GetAllCacheEntries(). For this one particularly make sure you've un-ticked "Use Oracle BI Presentation Services Cache". This is different from the BI Server cache which is the subject of this article, and as the name implies is a cache that Presentation Services keeps.

 OBIS_REFRESH_CACHE

I've confirmed that the BI Server cache is enabled on both servers, in NQSConfig.INI

###############################################################################
#
#  Query Result Cache Section
#
###############################################################################


[CACHE]

ENABLE = YES;  # This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control
Now I create a very simple analysis in both 11g and 12c, showing a list of Airline Carriers and their Codes:  OBIS_REFRESH_CACHE After clicking Results, a cache entry is inserted on each respective system:  OBIS_REFRESH_CACHE  OBIS_REFRESH_CACHE Of particular interest is the create time, last used time, and number of times used:  OBIS_REFRESH_CACHE If I now click Refresh in the Analysis window:  OBIS_REFRESH_CACHE We see this happen to the caches:  OBIS_REFRESH_CACHE In OBIEE 11g the cache entry is used - but in OBIEE 12c it's not. The CreatedTime is evidently not populated correctly, so instead let's dive over to the query log (nqquery/obis1-query in 11g/12c respectively). In OBIEE 11g we've got:
-- SQL Request, logical request hash:  
7c365697  
SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT  
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"  
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY

-- Cache Hit on query: [[
Matching Query: SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT  
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"  
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY
Whereas 12c is:
-- SQL Request, logical request hash:  
d53f813c  
SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT  
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"  
ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST  
FETCH FIRST 5000001 ROWS ONLY

-- Sending query to database named X0 - Airlines Demo Dbs (ORCL) (id: <<320369>>), connection pool named Aggr Connection, logical request hash d53f813c, physical request hash a46c069c: [[
WITH  
SAWITH0 AS (select T243.CODE as c1,  
     T243.DESCRIPTION as c2
from  
     BI_AIRLINES.UNIQUE_CARRIERS T243 /* 30 UNIQUE_CARRIERS */ )
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select 0 as c1,  
     D1.c1 as c2,
     D1.c2 as c3
from  
     SAWITH0 D1
order by c3, c2 ) D1 where rownum <= 5000001

-- Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736117_52416_27.TBL'.
Looking closely at the 12c output shows three things:
  1. OBIEE has run a database query for this request, and not hit the cache
  2. A cache entry has clearly been created again as a result of this query
  3. The Logical SQL has a request variable set: OBIS_REFRESH_CACHE=1 This is evidently added it by Presentation Services at runtime, since the Advanced tab of the analysis shows no such variable being set:  OBIS_REFRESH_CACHE
Let's save the analysis, and experiment further. Evidently, the cache is being deliberately bypassed when the Refresh button is clicked when building an analysis - but what about when it is opened from the Catalog? We should see a cache hit here too:  OBIS_REFRESH_CACHE Nope, no hit.  OBIS_REFRESH_CACHE But, in the BI Server query log, no entry either - and the same on 11g. The reason being .... Presentation Service's cache. D'oh! From Administration > Manage Sessions I select Close All Cursors which forces a purge of the Presentation Services cache. When I reopen the analysis from the Catalog view, now I get a cache hit, in both 11g and 12c:  OBIS_REFRESH_CACHE The same happens (successful cache hit) for the analysis used in a Dashboard being opened, having purged the Presentation Services cache first.  OBIS_REFRESH_CACHE So at this point, we can say that OBIEE 11g and 12c both behave the same with the cache when opening analyses/dashboards, but differ when refreshing the analysis. In OBIEE 12c when an analysis is refreshed the cache is deliberately bypassed. Let's check on refreshing a dashboard:  OBIS_REFRESH_CACHE Same behaviour as with analyses - in 11g the cache is hit, in 12c the cache is bypassed and repopulated  OBIS_REFRESH_CACHE To round this off, let's doublecheck the behaviour of the new request variable that we've found, OBIS_REFRESH_CACHE. Since it appears that Presentation Services is adding it in at runtime, let's step over to a more basic way of interfacing with the BI Server - nqcmd. Whilst we could probably use Issue SQL (as we did above for querying the cache) I want to avoid any more behind-the-scenes funny business from Presentation Services. In OBIEE 12c, I run nqcmd:
/app/oracle/biee/user_projects/domains/bi/bitools/bin/nqcmd.sh -d AnalyticsWeb -u prodney -p Admin123
Enter Q to enter a query, as follows:
SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY
In `obis1-query.log' there's the cache bypass and populate:
Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736117_53779_29.TBL'.
If I run it again without the OBIS_REFRESH_CACHE variable:
SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY
We get the cache hit as expected:
-------------------- Cache Hit on query: [[
Matching Query: SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY
Created by:     prodney
Out of interest I ran the same two tests on 11g -- both resulted in a cache hit, since it presumably ignores the unrecognised variable. Summary In OBIEE 12c, if you click "Refresh" on an analysis or dashboard, OBIEE Presentation Services forces a cache-bypass and cache-reseed, ensuring that you really do see the latest version of the data from source. It does this using the request variable, new in OBIEE 12c, OBIS_REFRESH_CACHE. Footnote Courtesy of Steve Fitzgerald: As per the presentation server xsd, you can also revert the behavior (not sure why one would, but you can) in the instanceconfig.xml
<Cache>  
<Query>
<RefreshIncludeBIServerCache>false</RefreshIncludeBIServerCache>
</Query>
</Cache>

Categories: BI & Warehousing

Changes in BI Server Cache Behaviour in OBIEE 12c : OBIS_REFRESH_CACHE

Rittman Mead Consulting - Tue, 2016-05-31 10:36

The OBIEE BI Server cache can be a great way of providing a performance boost to response times for end users – so long as it’s implemented carefully. Done wrong, and you’re papering over the cracks and heading for doom; done right, and it’s the ‘icing on the cake’. You can read more about how to use it properly here, and watch a video I did about it here. In this article we’ll see how the BI Server cache has changed in OBIEE 12c in a way that could prove somewhat perplexing to developers used to OBIEE 11g.

The BI Server cache works by inspecting queries as they are sent to the BI Server, and deciding if an existing cache entry can be used to provide the data. This can include direct hits (i.e. the same query being run again), or more advanced cases, where a subset or aggregation of an existing cache entry could be used. If a cache entry is used then a trip to the database is avoided and response times will typically be better – particularly if more than one database query would have been involved, or lots of additional post-processing on the BI Server.

When an analysis or dashboard is run, Presentation Services generates the necessary Logical SQL to return the data needed, and sends this to the BI Server. It’s at this point that the cache will, or won’t, kick in. The BI Server will accept Logical SQL from other sources than Presentation Services – in fact, any JDBC or ODBC client. This is useful as it enables us to validate behaviour that we’re observing and see how it can apply elsewhere.

When you build an Analysis in OBIEE 11g (and before), the cache will be used if applicable. Each time you add a column, or hit refresh, you’ll get an entry back from the cache if one exists. This has benefits – speed – but disadvantages too. When the data in the database changes, you will still get a cache hit, regardless. The only way to force OBIEE to show you the latest version of the data is to purge the cache first. You can target cache purges based on databases, tables, or even specific queries – but you do need to purge it.

What’s changed in OBIEE 12c is that when you click “Refresh” on an Analysis or Dashboard, the query is re-run against the source and the cache re-populated. Even if you have an existing cache entry, and even if the underlying data has not changed, if you hit Refresh, the cache will not be used. Which kind of makes sense, since “refresh” probably should indeed mean that.

Digging into OBIEE Cache Behaviour

Let’s prove this out. I’ve got SampleApp v506 (OBIEE 11.1.1.9), and SampleApp v511 (OBIEE 12.2.1). First off, I’ll clear the cache on each, using call saPurgeAllCache();, run via Issue SQL:

Then I can use another BI Server procedure call to view the current cache contents (new in 11.1.1.9), call NQS_GetAllCacheEntries(). For this one particularly make sure you’ve un-ticked “Use Oracle BI Presentation Services Cache”. This is different from the BI Server cache which is the subject of this article, and as the name implies is a cache that Presentation Services keeps.

I’ve confirmed that the BI Server cache is enabled on both servers, in NQSConfig.INI

###############################################################################
#
#  Query Result Cache Section
#
###############################################################################


[CACHE]

ENABLE = YES;  # This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control

Now I create a very simple analysis in both 11g and 12c, showing a list of Airline Carriers and their Codes:

After clicking Results, a cache entry is inserted on each respective system:

Of particular interest is the create time, last used time, and number of times used:

If I now click Refresh in the Analysis window:

We see this happen to the caches:

In OBIEE 11g the cache entry is used – but in OBIEE 12c it’s not. The CreatedTime is evidently not populated correctly, so instead let’s dive over to the query log (nqquery/obis1-query in 11g/12c respectively). In OBIEE 11g we’ve got:

-- SQL Request, logical request hash:
7c365697
SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

-- Cache Hit on query: [[
Matching Query: SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

Whereas 12c is:

-- SQL Request, logical request hash:
d53f813c
SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT
   0 s_0,
   "X - Airlines Delay"."Carrier"."Carrier Code" s_1,
   "X - Airlines Delay"."Carrier"."Carrier" s_2
FROM "X - Airlines Delay"
ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

-- Sending query to database named X0 - Airlines Demo Dbs (ORCL) (id: <<320369>>), connection pool named Aggr Connection, logical request hash d53f813c, physical request hash a46c069c: [[
WITH
SAWITH0 AS (select T243.CODE as c1,
     T243.DESCRIPTION as c2
from
     BI_AIRLINES.UNIQUE_CARRIERS T243 /* 30 UNIQUE_CARRIERS */ )
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3 from ( select 0 as c1,
     D1.c1 as c2,
     D1.c2 as c3
from
     SAWITH0 D1
order by c3, c2 ) D1 where rownum <= 5000001

-- Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736117_52416_27.TBL'.

Looking closely at the 12c output shows three things:

  1. OBIEE has run a database query for this request, and not hit the cache
  2. A cache entry has clearly been created again as a result of this query
  3. The Logical SQL has a request variable set: OBIS_REFRESH_CACHE=1

    This is evidently added it by Presentation Services at runtime, since the Advanced tab of the analysis shows no such variable being set:

Let’s save the analysis, and experiment further. Evidently, the cache is being deliberately bypassed when the Refresh button is clicked when building an analysis – but what about when it is opened from the Catalog? We should see a cache hit here too:

Nope, no hit.

15239r

But, in the BI Server query log, no entry either – and the same on 11g. The reason being …. Presentation Service’s cache. D’oh!

From Administration > Manage Sessions I select Close All Cursors which forces a purge of the Presentation Services cache. When I reopen the analysis from the Catalog view, now I get a cache hit, in both 11g and 12c:

The same happens (successful cache hit) for the analysis used in a Dashboard being opened, having purged the Presentation Services cache first.

So at this point, we can say that OBIEE 11g and 12c both behave the same with the cache when opening analyses/dashboards, but differ when refreshing the analysis. In OBIEE 12c when an analysis is refreshed the cache is deliberately bypassed. Let’s check on refreshing a dashboard:

Same behaviour as with analyses – in 11g the cache is hit, in 12c the cache is bypassed and repopulated

To round this off, let’s doublecheck the behaviour of the new request variable that we’ve found, OBIS_REFRESH_CACHE. Since it appears that Presentation Services is adding it in at runtime, let’s step over to a more basic way of interfacing with the BI Server – nqcmd. Whilst we could probably use Issue SQL (as we did above for querying the cache) I want to avoid any more behind-the-scenes funny business from Presentation Services.

In OBIEE 12c, I run nqcmd:

/app/oracle/biee/user_projects/domains/bi/bitools/bin/nqcmd.sh -d AnalyticsWeb -u prodney -p Admin123

Enter Q to enter a query, as follows:

SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY

In `obis1-query.log’ there’s the cache bypass and populate:

Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736117_53779_29.TBL'.

If I run it again without the OBIS_REFRESH_CACHE variable:

SET VARIABLE QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY

We get the cache hit as expected:

-------------------- Cache Hit on query: [[
Matching Query: SET VARIABLE OBIS_REFRESH_CACHE=1,QUERY_SRC_CD='Report',PREFERRED_CURRENCY='USD';SELECT    0 s_0,    "X - Airlines Delay"."Carrier"."Carrier Code" s_1,    "X - Airlines Delay"."Carrier"."Carrier" s_2 FROM "X - Airlines Delay" ORDER BY 3 ASC NULLS LAST, 2 ASC NULLS LAST FETCH FIRST 5000001 ROWS ONLY
Created by:     prodney

Out of interest I ran the same two tests on 11g — both resulted in a cache hit, since it presumably ignores the unrecognised variable.

Summary

In OBIEE 12c, if you click “Refresh” on an analysis or dashboard, OBIEE Presentation Services forces a cache-bypass and cache-reseed, ensuring that you really do see the latest version of the data from source. It does this using the request variable, new in OBIEE 12c, OBIS_REFRESH_CACHE.

Footnote

Courtesy of Steve Fitzgerald:

As per the presentation server xsd, you can also revert the behavior (not sure why one would, but you can) in the instanceconfig.xml

<Cache>
<Query>
<RefreshIncludeBIServerCache>false</RefreshIncludeBIServerCache>
</Query>
</Cache>

The post Changes in BI Server Cache Behaviour in OBIEE 12c : OBIS_REFRESH_CACHE appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Best Practices for the cloud !!

Tim Dexter - Mon, 2016-05-30 05:19

Greetings!!

Wish everyone a very happy Memorial Day !! 

Last week we published in OTN a white paper with title "Oracle BI Publisher Best Practices for SaaS Environments". As the title suggests, this white paper is a compilation of all the best practices for BI Publisher that are relevant to a Cloud Platform, especially for Fusion Applications. You can find the link to this white paper here. We will soon have an updated best practices guide for on-premise installations, so stay tuned.

Have a nice day !! 

Categories: BI & Warehousing

#IHateYourFace, a BizIntel Love Story

Rittman Mead Consulting - Wed, 2016-05-25 13:53
#IHateYourFace, a BizIntel Love Story

Sometimes love isn't fair

A teenager, just emerging from an awkward growing phase drives a tendency to spew non sensical permutations of varying cringe worthy affections toward their love target.  The data points behind the intent are solid.  Let’s go Boolean now:

Intent of good = 1 ;

attracted to target = 1;

ability to provide for target = 1;

time available  = 1;

prior or current committed relationship = 0.

So here we are with all the solid intentions in the world and the variables all align to an optimal outcome.  Except one thing:

love target hates your face = 1;

Exit(0)

Assuming that this is a hierarchical value chain with face hating as the parent, we are at an impasse.

The above is not intended as an affront to teenagers spitting game at their prospective boo.  It is though, an analogy to the fantastic time and effort spent in integrating, transforming and presenting Business Intelligence to the enterprise with all of the proper best practices and technology.  Only to have the user subtly tell us that the information is “great but… I don’t know what I am looking at and I was looking for it to be in 3D” (AKA, I hate your face).  There have been some great books out there about visualization and we have read a LOT of them.  Best practices aside, visualization is about taste.  The love target above may well have a completely different opinion of facial beauty than Mr. or Ms. hottie-mc-toddy that sits next to you in your advanced Kafka course (seriously, it might happen).  Love at Kafka training happens, I swear it does, really and when it does, it involves unicorns and ends of rainbows.

So now I have to figure out my end user’s tastes in visualization?

Yep, accept and move on.  

You wouldn’t saddle up with a love interest if you hated their face.  Why would an end user sign up for something they don’t want to use.  Of course, you being the BI practitioner that you are know your BI stack up and down and are well aware of all of the available visualization options.  These end users don’t know what they are talking about…right?

—they don’t have to.  

Lots of the big BI players have been losing ground in the industry of viz because new competitors are building attractive viz capabilities.  The new guys in BI are going to the end users with something the end users want…pretty and flexible visualizations.  Both the viz provider and end user need not worry themselves with data integrity, governance, or how they got the data to start.

Welcome to Enterprise BI 2.0

What if I could use my existing enterprise platform and allow users to integrate more data sources, mash it up and then have an almost story telling approach to BI?  You can and the next releases from the big BI vendors are all focused on that.  Clearly at Rittman Mead we have an affinity for OBIEE and with 12c, the dream of viz has become a an out of the box reality.  Data visualization is core to new functionality with more on the way.  So my advice is upgrade to 12c ASAP to get your new face on.

Rittman Mead has less invasive facelift versus replace options with viz for 12c and 11g.  Visual Plug in Pack (VPP) extends the native charting in OBI into dynamic attractive hollywood level viz.  We also have an User Engagement offering  that comes with that visual facelift all the girls at the salon are talking about.  Shoot me an email if you are interested in our streamlined OBI 12c upgrade service or VPP. Jason.davis@rittmanmead.com

Categories: BI & Warehousing

A Business Intelligence Love Story: #IHateYourFace

Rittman Mead Consulting - Wed, 2016-05-25 05:00

Sometimes love isn’t fair

A teenager, just emerging from an awkward growing phase drives a tendency to spew non sensical permutations of varying cringe worthy affections toward their love target. The data points behind the intent are solid. Let’s go Boolean now:

Intent of good = 1 ;

attracted to target = 1;

ability to provide for target = 1;

time available = 1;

prior or current committed relationship = 0.

So here we are with all the solid intentions in the world and the variables all align to an optimal outcome. Except one thing:

love target hates your face = 1;

Exit(0)

Assuming that this is a hierarchical value chain with face hating as the parent, we are at an impasse.

The above is not intended as an affront to teenagers spitting game at their prospective boo. It is though, an analogy to the fantastic time and effort spent in integrating, transforming and presenting Business Intelligence to the enterprise with all of the proper best practices and technology. Only to have the user subtly tell us that the information is “great but… I don’t know what I am looking at and I was looking for it to be in 3D” (AKA, I hate your face). There have been some great books out there about visualization and we have read a LOT of them. Best practices aside, visualization is about taste. The love target above may well have a completely different opinion of facial beauty than Mr. or Ms. hottie-mc-toddy that sits next to you in your advanced Kafka course (seriously, it might happen). Love at Kafka training happens, I swear it does, really and when it does, it involves unicorns and ends of rainbows.

So now I have to figure out my end user’s tastes in visualization?

Yep, accept and move on.

You wouldn’t saddle up with a love interest if you hated their face. Why would an end user sign up for something they don’t want to use. Of course, you being the BI practitioner that you are know your BI stack up and down and are well aware of all of the available visualization options. These end users don’t know what they are talking about…right?

—they don’t have to.

Lots of the big BI players have been losing ground in the industry of viz because new competitors are building attractive viz capabilities. The new guys in BI are going to the end users with something the end users want…pretty and flexible visualizations. Both the viz provider and end user need not worry themselves with data integrity, governance, or how they got the data to start.

Welcome to Enterprise BI 2.0

What if I could use my existing enterprise platform and allow users to integrate more data sources, mash it up and then have an almost story telling approach to BI? You can and the next releases from the big BI vendors are all focused on that. Clearly at Rittman Mead we have an affinity for OBIEE and with 12c, the dream of viz has become a an out of the box reality. Data visualization is core to new functionality with more on the way. So my advice is upgrade to 12c ASAP to get your new face on.

Rittman Mead has less invasive facelift versus replace options with viz for 12c and 11g. Visual Plug in Pack (VPP) extends the native charting in OBI into dynamic attractive hollywood level viz. We also have an User Engagement offering that comes with that visual facelift all the girls at the salon are talking about. Shoot me an email if you are interested in our streamlined OBI 12c upgrade service or VPP. Jason.davis@rittmanmead.com

The post A Business Intelligence Love Story: #IHateYourFace appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

7 Signs Your EPM is Lagging Behind Your Competition

Look Smarter Than You Are - Mon, 2016-05-23 13:06
Regardless of industry, regardless of size, regardless of duration, all companies have similar issues in their financial analysis, planning, and consolidation areas. From building budgets to financial reporting, how can CFOs, VPs of Finance, Directors of FP&A and Controllers tell if their FP&A teams are falling behind their competitors? Here are seven signs that your Enterprise Performance Management (EPM) environments are stuck in the last decade:
  1. Strategy is planned verbally or in spreadsheets. While the majority of strategic CFO’s agree that Finance should be looking forward and not backward, most strat planning is done in Excel or worse, out loud in various meetings. There is no modeling unless someone comes up with a bunch of linked spreadsheet formulas. Strategies are agreed to in conference rooms and conveyed at a high-level via email (or they aren’t communicated at all). Strategies are evaluated by whomever has the best anecdote: “well, the last time that happened, we did this…” The only thing worse than not having a solution for strategic planning is not doing strategic planning at all. Speaking of spreadsheets…
  2. Excel is the key enabling technology in your FP&A department. One sure way to tell if your EPM function is falling behind is to ask “what is the single most important tool your department uses when running reports? Performing analysis? Coming up with a strategic plan? Preparing the budget? Modeling business changes?” If the answer to four-out-of-five of those is “Microsoft Excel”, ask yourself if that was by design or if people just used Excel because they didn’t have a better system. Excel is a wonderful tool (I open it every morning and don’t close it until I leave), but it was meant to be a way to look at grids of data. It was not meant to store business logic and it was never meant to be a database. Force your FP&A group to do everything with Excel and expect to be waiting for every answer… and then praying everyone got their formulas right when you make business decisions based on those answers.
  3. There is only one version of the budget. No one really thinks that there’s only one way that the year will end up, but most companies insist on a single version of a budget (and not even a range, but a specific number). Not only are EPM Laggards (companies with EPM trailing behind their peer groups) not planning multiple scenarios, they’re insisting that the whole company come up with a single number and then stick to it no matter what external factors are at play. Ron Dimon refers to scenario plans as “ready at hand plans” waiting to be used once we see how our strategic initiatives are enacted. EPM Laggards not only don’t have additional plans ready, they insist on holding everyone in the organization accountable to one single number, outside world be damned.
  4. Budgets favor precision over timeliness. Your competition realizes that a forecast that’s 95% accurate delivered today is more helpful than a budget that was 98% accurate 6 months ago. Yet EPM Laggards spend months coming up with a budget that’s precise to the dollar and then updating it periodically at a high level. It’s amazing how often FP&A groups end up explaining away budget vs. actual discrepancies by saying “the budget was accurate at the start of the year, but then things happened.” Budgets should be reforecasted continuously whenever anything material changes. Think about it: if you had one mapping app that gave you an estimate of your arrival time to the 1/100th of a second at the time you departed and another mapping app that constantly refined your arrival time as you drove, which one would you choose?
  5. No one takes actions on the reports. Edward’s Rule of Reporting: every report should either lead to a better question or a physical action. If your department is producing a report that doesn’t lead someone to ask a bigger, better, bolder question and doesn’t lead someone to take a physical action, change the report. Or stop producing the report entirely. EPM Laggards spend an inordinate amount of time collecting data and generating reports that don’t lead to any change in behavior. EPM Leaders periodically stop and ask themselves “if I arrived today, is this what I would build?” Half the time, the answer is “no,” and the other half the time, the answer is “if I arrived today, I actually wouldn’t build this report at all.”
  6. Most time is spent looking backwards. Imagine you’re driving a car. Put your hands on the wheel and look around. Notice that most of your visual space is the front windshield which shows you what’s coming up ahead of you. Some of what you see is taken up by the dashboard so you can get a real-time idea of where you are right now. And if you glance up, there’s a small rear-view mirror that tells you what’s behind you. A combination of all three of these (windshield, dashboard, and rearview mirror) gives you some idea of when you should steer right or left, brake, or accelerate. In a perfect EPM world, your time would be divided the same way: most would be spent looking ahead (budgeting and forecasting), some time would be spent glancing down to determine where you are at the moment, and very little would be spent looking backwards since, let’s face it, the past is really difficult to change. In your car, you’d only look at the mirror if you were changing lanes or you were worried about being hit from behind, and business is similar yet most EPM Laggards drive their cars by looking backwards.
  7. Labor is devoted to collecting & reporting and not planning & analyzing. If you spend all of your time gathering data, reconciling data, and reporting on data, you’re answering the question “what happened?” Your competition is spending their time analyzing (“why did this happen?”) and then planning to take action (“what should I do next?”). There is a finite amount of time in the world and sadly, that holds true in our FP&A departments too. If your EPM system is focused on collecting, consolidating, & reporting and your competition has their EPM focused on analyzing, modeling, & planning, who do you think will win in the long run?


What You Can Do
If you look at those seven top signs you’re lagging in your EPM functions and wonder how to improve, the first step is to stop building anything new. While this seems counterintuitive, if you take a tactical approach to solving any one area, you’re going to put in place a single point solution that will need to be thrown away or redone as you get closer to your overall vision for EPM. So what’s step 1? Have an EPM vision. Ask yourself where you want your company to be in three years. What do you want out of consolidation, reporting, analysis, modeling, and planning and how will all of those functions be integrated?

You are not alone. I have seen hundreds of FP&A departments in my time struggle with having a vision for just one area let alone a long-range vision. Even when leadership has a vision, it quite often focuses on system improvements (we’re not sure what to do, so let’s throw technology at it!) rather than try to improve processes too. Thankfully, there is hope and as my good friends at G.I. Joe always say, knowing is half the battle.

More Information
Wednesday, May 25, at 1PM Eastern, I’m holding a webcast to share lessons I’ve learned over the years on how to turn EPM Laggards into EPM Leaders. If you want help coming up with your three year EPM Roadmap, visit http://bit.ly/StrategyWC to sign up. It’s free and you’ll come away with some hopefully valuable ideas on where to go with performance management at your company.

If you have any questions, ask them in the comments or tweet them to me @ERoske.
Categories: BI & Warehousing

ORA-56841: Master Diskmon cannot connect to a CELL

Amardeep Sidhu - Thu, 2016-05-19 11:45

Faced this error while querying v$asm_disk after adding new storage cell IPs to cellip.ora on DB nodes of an existing cluster on Exadata. Query ends with ORA-03113 end-of-file on communication channel and ORA-56841 is reported in $ORA_CRS_HOME/log/<hostname>/diskmon/diskmon.log. Reason in my case was that the new cell was using different subnet for IB. It was pingable from the db nodes but querying v$asm_disk wasn’t working. Changing the subnet for IB on new cell to the one on existing cells fixed the issue.

 

 

Categories: BI & Warehousing

Under the Covers of OBIEE 12c Configuration with sysdig

Rittman Mead Consulting - Wed, 2016-05-18 13:50
Under the Covers of OBIEE 12c Configuration with sysdig

OBIEE 12c has changed quite a lot in how it manages configuration. In OBIEE 11g configuration was based around system MBeans and the biee-domain.xml as the master copy of settings - and if you updated a configuration directly that was centrally managed, it would get reverted back. Now in OBIEE 12c configuration can be managed directly in text files again - but also through EM still (not to mention WLST). Confused? Yep, I was.

In the configuration files such as NQSConfig.INI there are settings still marked with the ominous comment:

# This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control 

In 11g this meant - dragons be here; turn back all ye who don't want to have your configuration settings wiped next time the stack boots.

Now in 12c, I can make a configuration change (such as enabling BI Server caching), restart the affected component, and the change will take affect -- and persist through a restart of the whole OBIEE stack. All good.

Under the Covers of OBIEE 12c Configuration with sysdig
But ... the fly in the ointment. If I restart just the affected component (for example, BI Server for an NQSConfig.INI change), since I don't want to waste time bouncing the whole stack if I don't need to, then Enterprise Manager will continue to show the old setting:

Under the Covers of OBIEE 12c Configuration with sysdig

So even though in fact the cache is enabled (and I can see entries being populated in it), Enterprise Manager suggests that it's not. Confusing.

So ... if we're going to edit configuration files by hand (and personally I prefer to, since it saves firing up a web browser), we need to know how to make sure Enterprise Manager will to reflect the change too. Does EM poll the file whilst running? Or something direct to each component to request the configuration? Or maybe it just reads the file on startup only?

Enter sysdig! What I'm about to use it for is pretty darn trivial (and could probably be done with other standard *nix tools), but is still a useful example. What we want to know is which process reads NQSConfig.INI, and from there isolate the particular component that we need to restart to get it to trigger a re-read of the file and thus correctly show the value in Enterprise Manager.

I ran sysdig with a filter for filename and custom output format to include the process PID:

sudo sysdig -A -p "%evt.num %evt.time %evt.cpu %proc.name (%proc.pid) %evt.dir %evt.info" "fd.filename=NQSConfig.INI and evt.type=open"

Nothing was written (i.e. nothing was polling the file), until I bounced the full OBIEE stack ($DOMAINHOME/bitools/bin/stop.sh && $DOMAINHOME/bitools/bin/start.sh). During the startup of the AdminServer, sysdig showed:


32222110 12:00:49.912132008 3 java (10409) < fd=874(/app/oracle/biee/user_projects/domains/bi/config/fmwconfig/biconfig/OBIS/NQSConfig.INI) name=/app/oracle/biee/user_projects/domains/bi/config/fmwconfig/biconfig/OBIS/NQSConfig.INI flags=1(O_RDONLY) mode=0

So - it's the java process that reads it, PID 10409. Which is that?

$ ps -ef|grep 10409
oracle   10409 10358 99 11:59 ?        00:03:54 /usr/java/jdk1.8.0_51/bin/java -server -Xms512m -Xmx1024m -Dweblogic.Name=AdminServer [...]

It's AdminServer -- which makes sense, because Enterprise Manager is a java deployment hosted in AdminServer.

So, if you want to hack the config files by hand, restart either the whole OBIEE stack, or the affected component plus AdminServer in order for Enterprise Manager to pick up the change.

Categories: BI & Warehousing

Under the Covers of OBIEE 12c Configuration with sysdig

Rittman Mead Consulting - Wed, 2016-05-18 11:57

OBIEE 12c has changed quite a lot in how it manages configuration. In OBIEE 11g configuration was based around system MBeans and the biee-domain.xml as the master copy of settings – and if you updated a configuration directly that was centrally managed, it would get reverted back. Now in OBIEE 12c configuration can be managed directly in text files again – but also through EM still (not to mention WLST). Confused? Yep, I was.

In the configuration files such as NQSConfig.INI there are settings still marked with the ominous comment:

# This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control

In 11g this meant – dragons be here; turn back all ye who don’t want to have your configuration settings wiped next time the stack boots.

Now in 12c, I can make a configuration change (such as enabling BI Server caching), restart the affected component, and the change will take affect — and persist through a restart of the whole OBIEE stack. All good.

1__oracle_demo____ssh__and_training-material-obiee__Git_
But … the fly in the ointment. If I restart just the affected component (for example, BI Server for an NQSConfig.INI change), since I don’t want to waste time bouncing the whole stack if I don’t need to, then Enterprise Manager will continue to show the old setting:

54_170_157_117

So even though in fact the cache is enabled (and I can see entries being populated in it), Enterprise Manager suggests that it’s not. Confusing.

So … if we’re going to edit configuration files by hand (and personally I prefer to, since it saves firing up a web browser), we need to know how to make sure Enterprise Manager will to reflect the change too. Does EM poll the file whilst running? Or something direct to each component to request the configuration? Or maybe it just reads the file on startup only?

Enter sysdig! What I’m about to use it for is pretty darn trivial (and could probably be done with other standard *nix tools), but is still a useful example. What we want to know is which process reads NQSConfig.INI, and from there isolate the particular component that we need to restart to get it to trigger a re-read of the file and thus correctly show the value in Enterprise Manager.

I ran sysdig with a filter for filename and custom output format to include the process PID:

sudo sysdig -A -p "%evt.num %evt.time %evt.cpu %proc.name (%proc.pid) %evt.dir %evt.info" "fd.filename=NQSConfig.INI and evt.type=open"

Nothing was written (i.e. nothing was polling the file), until I bounced the full OBIEE stack ($DOMAIN_HOME/bitools/bin/stop.sh && $DOMAIN_HOME/bitools/bin/start.sh). During the startup of the AdminServer, sysdig showed:

32222110 12:00:49.912132008 3 java (10409) < fd=874(<f>/app/oracle/biee/user_projects/domains/bi/config/fmwconfig/biconfig/OBIS/NQSConfig.INI) name=/app/oracle/biee/user_projects/domains/bi/config/fmwconfig/biconfig/OBIS/NQSConfig.INI flags=1(O_RDONLY) mode=0

So – it’s the java process that reads it, PID 10409. Which is that?

$ ps -ef|grep 10409
oracle   10409 10358 99 11:59 ?        00:03:54 /usr/java/jdk1.8.0_51/bin/java -server -Xms512m -Xmx1024m -Dweblogic.Name=AdminServer [...]

It’s AdminServer — which makes sense, because Enterprise Manager is a java deployment hosted in AdminServer.

So, if you want to hack the config files by hand, restart either the whole OBIEE stack, or the affected component plus AdminServer in order for Enterprise Manager to pick up the change.

The post Under the Covers of OBIEE 12c Configuration with sysdig appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Data Integration Tips: ODI 12c – Who Changed My Table Names?

Rittman Mead Consulting - Mon, 2016-05-16 03:05

It’s Sunday night (well, technically Monday morning now), and we have just enough time for another one of my Data Integration Tips. This one, revolving around Oracle Data Integrator 12c, has been on my mind for some time now, so I figured we better just get it out there. Imagine this; you’ve upgraded from Oracle Data Integrator (ODI) 11g to ODI 12c and executed the first test Mapping. But hey…what happened to my C$ table names? And wait a minute, the I$ tables look a bit different as well! Let’s dive in and uncover the truth, shall we?

The Scenario

In the 11g version of Oracle Data Integrator, we could only load one single target table per mapping (or Interface, as they were called way back then). Now, in ODI 12c, we have the new flow-based mapping paradigm, allowing us to choose our sources, apply different components (joins, filters, pivot, aggregates, etc), and load as many targets as we like. Quite an upgrade, if you ask me! But with this redesign comes some minor, albeit important, change under the covers. The temporary tables used to store data that is loaded from a source, across the network, and into a target, known as C$ or Loading tables, are generated by the ODI Substitution API called from within a Knowledge Module step. The underlying code that creates the temp tables has changed to output a different format for the table names. What exactly does that mean for our C$ tables? And why do we care?

In the beginning, the C$ tables were named for the target table. If there were multiple source tables, the C$ name would be indexed – C$_0, C$_1, etc. For example, if your source to target mapping looked like this: F0010 —> ACCOUNT_MASTERS, then the loading table was named C$_0ACCOUNT_MASTERS. If there was a join between two tables executed on staging, then the second loading table would be named C$_1ACCOUNT_MASTERS.

So…what changed in ODI 12c? Let’s take a look at a few mapping examples.

one-to-one-mapping

In this mapping, the C$ table is now named after the source datastore. Instead of C$_0ACCOUNT_MASTERS, we have C$_0F0010. That can be an interesting challenge for data warehouse teams who rely on specific naming conventions for debugging, monitoring, etc. Ok, so let’s take a look at another example.

dataset-mapping

Ok, so normally I wouldn’t work with a Dataset component, but this is a look at the Mapping after an upgrade from ODI 11g. I could use the Convert to Flow feature, but as you’ll find out by reading on, it wouldn’t help with our temp table naming issues. In this example, the loading table is named C$_0DEFAULT. What’s this “default” business all about? Well, that is derived from the Dataset Component name. I must say, that’s much worse than just switching from the target table name to the source name. Yikes! Ok, one final test…

filter-mapping

Oh boy. In this case, the resulting table is called C$_0FILTER. The name? It’s based on the Filter Component name. I’m sensing a pattern here. Basically the name of any component that is mapped to the target table, and in the physical design mapped to an access point, will be used to generate the C$ loading table name.

Digging a bit deeper into the Knowledge Modules, we find that the create loading object step of the KMs invokes the following method.

<%=odiRef.getTable("L", "COLL_NAME", "W")%>

The COLL_NAME refers to the loading table name, while the other options “L” & “W” refer to the format and source of the schema name that will be prefixed to the resulting table name. As mentioned previously, this method would return the target table name with the C$ prefix. Now, it returns the source table or component name for the specific source dataset that is being extracted and loaded into the target work schema. Here’s another way to show these differences in naming conventions:

c$-table-naming

This image is based on a specific use case in which the Oracle Data Integrator customer was using the C$ tables in debugging. As you can see, the naming really doesn’t lend itself to understanding which target the C$ table was created to load.

Here’s the Tip…

Now that we understand what drives the C$ table name, we can workaround the issue. While the use case above is somewhat unique to folks who have upgraded from Oracle Data Integrator 11g, the use of components rather than tables in the naming of temporary objects can be quite confusing. We can easily change the output of <%=odiRef.getTable(“L”, “COLL_NAME”, “W”)%> by changing the component alias, or name, within the mapping. That’s an easy enough task for just a few mappings, but when you’ve upgraded hundreds, or even thousands, to ODI 12c – you’re in for some serious manual labor. Unless, of course, you dive into some Groovy script and the ODI SDK.

In the code snippet below, we first find the mapping we’re interested in editing. Then, work our way through the different components that may exist on the mapping and need a name change. This code was written specifically to handle Dataset, Filter, and source Datastore components only. Any additional components would need to be added to the list or, better yet, a different approach written in Groovy to find the last component before the final target Datastore. Hmm, next DI Tip?

Mapping mapToEdit = mapfinder.findByName(folder, mapName)

try {
 //fix filter name.
 filterComp = mapToEdit.findComponent("Filter") //find the filter named Filter.

 if(filterComp != null) {
  filterComp.setName(targName)
  out.println(mapName + " filter renamed.")
 } else {
  //fix dataset name.
  datasetComp = mapToEdit.findComponent("Default_DS") //find the dataset named Default.

  if(datasetComp != null) {
   datasetComp.setName(targName)
   out.println(mapName + " dataset renamed.")
  } else {
   //fix source datastore name.
   sources = mapToEdit.getSources()
   for(sourceComp in sources) {
    datastoreComp = sourceComp
   }

   datastoreComp.setName(targName)
   out.println(mapName + " source datastore renamed.")
  }
 }
} catch(MapComponentException e) {
 out.println e.toString()
}
tme.persist(mapToEdit)

The “targName” variable in this snippet is set to the target datastore name concatenated with the target data server name. That’s a specific use case, but the key takeaway is that the component name cannot be set to the target datastore name exactly. There must be a slight difference, since components cannot have the exact same name within a single mapping. Another caveat, if we had multiple target tables, this approach may not work out so well. But, again, coming from ODI 11g that’s a non-issue. This code can be run against a project, project folder, or even individual mappings, making it an easy way to change thousands of objects in seconds. Man I love Groovy and the ODI SDK!

That seems to solve our naming issue by modifying our loading table name into something more meaningful than C$_0FILTER. Groovy has come to the rescue and allowed us to batch change mappings in an instant. It seems we’ve completed this Data Integration Tip successfully.

But Wait, There’s More

I did mention earlier that the I$ table had an issue as well. Oh brother. The I$, or integration table, is the result of the mapping logic stored as a dataset in the I$ table just prior to loading into the final target. There is only a slight change to the ODI Substitution API method used in generating the integration table name, but again, just slight enough to bother processes built around the naming conventions.

In the past, the integration table name was based on the target table alias. But now in the latest version of ODI, the I$ table name is built based on the target datastore resource name. Again, this could potentially be problematic for those customers interested in using a different logical name for a physical target table. Something more readable, perhaps. Or maybe removing redundant characters that exist in all tables. Either way, we have to deal with a slight change in the code.

In researching a way to modify the way the I$ table is created, I came across an interesting issue. The call to odiRef.getTableName(“INT_SHORT_NAME”) is supposed to return the integration table name alone, without any schema prefix attached to it. So in the previous example, when our target table was named ACCOUNT_MASTERS, the resulting table should have been I$_ACCOUNT_MASTERS. The original call to odiRef.getTable(“L”, “INT_NAME”, “W”) actually returns ODISTAGE.I$_JDE_ACCOUNT_MASTERS, based on the resource name of the datastore object and prepending the work schema name. Using the INT_SHORT_NAME, we expected a different result. But instead, the code generated a name like this: %INT_PRFJDE_ACCOUNT_MASTERS. This must be a bug in ODI 12.2.1, but I haven’t found it yet in My Oracle Support.

To work around this whole mess, we just searched for the work schema name and removed it from the table name, while replacing the unnecessary characters as well. All of this was completed using Java within the Knowledge Module steps. In the “Define Java Variable” step, which was custom added to setup Java variables in the KM, the below function was included. It lets you perform a substring while specifying length as a parameter. Found and repurposed from here.

String mySubString(String myString, int start, int length) {
 return myString.substring(start, Math.min(start + length, myString.length()));

Then, in the “Set Java Variable” task, again custom, the code below was added to create the integration table name:

ITABLENAME ="<%=odiRef.getTable("L", "INT_NAME", "W")%>".replace("_JDE_","_");
ITABLENAME = mySubString(BMINAME, BMINAME.indexOf(".") + 1, 26);

The end result was a temporary integration table named I$_ACCOUNT_MASTERS, just as we were planning.

So there you have it, another Data Integration Tip shared with the ODI public. Hopefully this, or one of the other many DITips shared by Rittman Mead, can help you solve one of your challenging problems with Oracle Data Integration solutions. Please let me know of any Data Integration Tips you may need solved by commenting below or emailing me at michael.rainey@rittmanmead.com. And if you need a bit more help with your Data Integration setup, feel free to reach out and the experts at Rittman Mead will be glad to help!

The post Data Integration Tips: ODI 12c – Who Changed My Table Names? appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Oracle’s New Data Visualization Desktop

Rittman Mead Consulting - Tue, 2016-05-03 05:00

A recent addition to the Oracle lineup of visualization tools is the Oracle Data Visualization Desktop. Described by Oracle as a “single user desktop application that provides Oracle Data Visualization functionality to business users,” DVD is an easy-to-install data visualization tool for Windows 7, 8 or 10 that packs some very powerful features.

I recently had a chance to sit down and explore DVD and wanted to share some of my first impressions.

At its core, DVD is a stand-alone version of Oracle’s DVCS. If you are at all familiar with Visual Analyzer, you will feel right at home.

Screen Shot 2016-04-29 at 9.34.25 AM

Installation was a breeze on my Windows 10 VM and only took about 5 minutes and required no additional software or plugins for the standard VA functionality.

After installation, launching DVD is as easy as clicking on the desktop icon like any other stand-alone application.

Screen Shot 2016-04-29 at 2.05.26 PM

Screen Shot 2016-04-29 at 10.02.43 AM

After the ODV startup, I was presented with a home screen which contains a search field for finding projects, a list of user folders and a main window to select individual visualizations that have been created.

Screen Shot 2016-05-02 at 11.15.59 AM

Clicking on the hamburger Screen Shot 2016-04-29 at 2.24.46 PM icon in the top left corner brings up a menu where I can choose to start Visual Analyzer with the last data source selected, select new Data Sources or create a new VA Project.

Screen Shot 2016-04-29 at 2.02.58 PM

I chose to create a new VA project and selected the sample data from Oracle (the sample data is an optional install chosen during the DVD install process). Creating a dashboard was a fairly straightforward process. Following Visual Analyzer’s functionality of dragging and dropping columns, I was able to put together a simple sales and profit dashboard in a few minutes.

Screen Shot 2016-04-29 at 2.39.44 PM

While creating my dashboard, I noticed that Oracle has included some new visualization types. You can now choose Scatter (Cat.), Stacked Scatter (Cat.), Donut or Sunburst visualizations.

Screen Shot 2016-04-29 at 1.50.12 PM

Screen Shot 2016-04-29 at 10.15.46 AM

One other feature that Oracle added to DVD is the ability to insert images onto the dashboards. You can choose to upload your own image or link to a URL to pull images from the web.

Screen Shot 2016-05-02 at 8.21.19 AM

Screen Shot 2016-05-02 at 8.25.25 AM

I uploaded an image and changed the canvas layout to freeform, which allowed me to move the image anywhere on the dashboard. By adjusting the transparency it is possible to have the image underlay the entire dashboard and still be able to see the visualizations. This example is pretty extreme, and in a real world scenario, caution should be used as to not obstruct the visualizations.

Screen Shot 2016-05-02 at 8.33.38 AM

Next I decided to try to connect to my Oracle 12c sample database to pull in some new data to work with. Selecting “Create New Datasource” from the menu prompted me with three options: create from a file, from an existing app or from a database.

Screen Shot 2016-04-29 at 11.24.23 AM

Clicking on the “From Database” option, I was presented with a connection screen.

Screen Shot 2016-05-01 at 1.15.42 PM

On this screen I discovered one of the most impressive things about DVD. Clicking on “Database Type” reveals a dropdown menu which you can choose from a variety of database formats, including Spark, Hive and Mongo DB, among others.
Screen Shot 2016-05-01 at 1.16.15 PM

That’s awesome.

Because I already had 12c DB installed, I selected the Oracle Database Type and entered all my connection information.

Once a connection to the database is made, it shows up in the available connections list. Clicking on my sample database brought up a list of available schemas to choose from. In this case, I chose the sample HR schema which then brings up a list of tables available to add as data sources for visualizations.

Screen Shot 2016-04-29 at 3.11.22 PM

Screen Shot 2016-04-29 at 3.04.32 PM

I chose to add EMPLOYEES, JOBS and LOCATIONS and then started a new VA project. The HR tables now show up in the list of available data sources.

Screen Shot 2016-04-29 at 3.17.22 PM

I selected EMPLOYEES and JOBS and, within seconds, was able to create a simple table showing a list of employee names, their job titles, salaries and commission percentages.

Screen Shot 2016-04-29 at 11.35.09 AM

As you can see, adding new data sources is quick and easy and allows users to explore their data and create meaningful visualizations from that data in a very short amount of time.

Another feature is the Advanced Analytics portion of Oracle Data Visualization Desktop. This feature, which uses R, gives users the ability to do things like highlight outliers or show trend lines with a click of a button.

Screen Shot 2016-04-29 at 3.43.43 PM

Screen Shot 2016-04-29 at 3.48.50 PM

This feature does require an optional install located within the DVD application folder. The install process proved once again to be very quick and easy and completed in about 5 minutes.

Screen Shot 2016-04-29 at 10.02.23 AM

Screen Shot 2016-04-29 at 9.58.38 AM

After the installation was complete, I created a new VA project. Choosing the sample data provided by Oracle for DVD, I created a quick scatter chart and then, by right clicking anywhere on the visualization, clicked “Add Outliers.”

Screen Shot 2016-05-02 at 9.29.19 AM

Screen Shot 2016-05-02 at 9.29.39 AM

As you can see, outliers and non-outliers are easily distinguishable by the color key that DVD assigned automatically.

Next, I wanted to see how if I could change some of the colors in my visualization. DVD allows you to do this under the visualization menu.

Screen Shot 2016-05-02 at 9.35.17 AM

Screen Shot 2016-05-02 at 9.38.43 AM

Screen Shot 2016-05-02 at 9.39.33 AM

As with OBIEE, entering specific hex values is supported as well as selecting from pre-made color pallets is possible with DVD.

Using the same right-click functionality that I used for adding outliers, I was able to additionally add a polynomial trend line to show a gains and losses.

Screen Shot 2016-05-02 at 9.43.58 AM

Next, I decided to see if I could export this data and import it into Excel. Choosing export from the visualization menu, I was able to easily export the data as a .CSV and upload it into Excel.

Screen Shot 2016-05-02 at 9.54.35 AM

Overall, Oracle Data Visualization Desktop is a very impressive new addition to the to the DVCS lineup. The ability to collect data from multiple sources, its native adaptors for a variety of popular databases, and the ability to manipulate visualizations to convey the data in creative ways make it a strong contender against Tableau and Wave. It requires no remote server infrastructure and is a solid business solution for users Oracle Data Visualization functionality in a small and easily accessible package.

I feel as though I have just cracked the surface of everything this tool can do. Check back for future blogs and articles as we at Rittman Mead continue to explore the possibilities of DVD. The future of data visualization may be closer than we think.

If you would like more information about Visual Analyzer or the Oracle Cloud Service, see this blog post by Mark Rittman.

If you would like to watch the official Tech Demo of DVD, you can find it here.

Rittman Mead also offers in depth professional training courses for OBIEE 12c and Visual Analyzer.

The post Oracle’s New Data Visualization Desktop appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Experiments with Elastic’s Graph Tool

Rittman Mead Consulting - Fri, 2016-04-29 08:17

Elastic announced their Graph tool at ElastiCON 2016 (see presentation here). It’s part of the forthcoming X-Pack which bundles Graph along with other helper tools such as Shield and Marvel. Graph itself is two things; an extension of Elasticsearch’s capabilities, enabling the user to explore how items indexed in Elasticsearch are related, and a plugin for Kibana that acts as an optional front-end for this new functionality.

You can find a good introduction to Graph and the purpose and theory behind it in the documentation here. The installation of the components themselves is simple and documented here.

First Graph

To use Graph, you just point it at your existing data in Elasticsearch. The first data set I’m going to explore is one of the standard ones that everyone uses; Twitter. I’m streaming it in through Logstash (via Kafka for flexibility), but if you wanted you could ship it in via JDBC from any RDBMS, or from HDFS too.
See an important note at the end of this article about the slice of data within it, because it affects how the relationships visualised here should be viewed. 

On launching Kibana’s Graph plugin (http://localhost:5601/app/graph) I choose the index (note that index patterns, e.g. when partitioning by date, are not supported yet), and the field in the data that I want to use as my vertices. A point to note here – “vertices” are usually called “nodes” in Graph terminology, but since Elasticsearch already uses “nodes” as part of its infrastructure topology terminology, they had to pick a different term.

In the search box, I can put my search term from which I’m interested to see the related ‘vertices’.

Sounds baffling? It is, kinda – right up until you run it (hit enter from the search box or click the magnifying glass search icon) and see what happens:

Here we’re seeing the hashtags used in tweets that mention Kibana. The “connections” (Elastic term) or “edges” (general Graph term) show which vertices (nodes) are related, and the width indicates the strength of that relationship (based on Elasticsearch’s significant terms and scoring algorithm). For more details, see the “Behind the Scenes” section towards the end of this article.

We can add in a second set of vertices by running a second search (“Elasticsearch”) – the results for these are, in effect, appended to the existing ones:

Add Links

Since we’ve pulled back an additional set of vertices, it could be that there’s overlap between these and the first set (you’d kinda of expect it, Elasticsearch and Kibana being related). To visualise this, use the Add Links button

Note how the graph redraws itself with additional connections:

Blinked and you missed it? Use the Undo button to step back, and Redo button to re-apply.

Grouping Vertices

If you look closely at the graph you’ll see that Elasticsearch, ElasticSearch, and elasticsearch are all there as separate vertices. This is because I’m using a non-analyzed index field, so the strings are treated literally, case included. In this specific example, we’d probably re-run the graph using the analysed version of the field, which following the same two searches as above gives this:

But, sticking with our non-analysed example, we can use it to demonstrate Graph’s ability to group multiple terms together into a single vertex. Switch to Advanced Mode:

and then select the three vertices and click the group option

Now all three, and their connections, are as one:

Whilst the above analysed/non-analysed difference gave me excuse to show the group function (can you tell I’ve done many-a-failed-live-demo? ;-) ), I’m now going to switch over to a graph built on the analysed version of the hashtag field, as we saw briefly above:

Tidying up the Graph – Delete and Blacklist

There’s a few straglers on the Graph that are making it less easy to comprehend. We can temporarily remove them, or even blacklist them from appearing again in this session:

Expand Selection

One of the points of Graph analysis is visualising the relationships in your data in a way that standard relational methods may not lend themselves to so easily. We can now start to explore this further, by digging into the Graph that we’ve got so far. This process, along with the add links seen above, is often called “spidering“. By selecting the elasticsearch node and clicking on Expand selection we can see additional (by default, five) vertices related to this one:

So we see that kafka is related to Elasticsearch (in the view of the twitterati, at least), and let’s expand that Kafka vertex too:

By clicking the Expand selection button again for the same vertex we get further results added:

We can select one node (e.g. realtime) an using the Add Link see additional relationships:

But, there are many nodes, and we want to see any relationships. So, switch to Advanced Mode, select All

…add Add Link again:

Knob Twiddling

Let’s start with a blank canvas, in basic mode, showing hashtags related to … me (@rmoff)!

But, surely I do more than talk about OBIEE and ODI? Like, Elasticsearch? Let’s relax the Graph selection criteria, under Settings:

and run the search again (on top of the existing results):

There’s more results … but I know how much I tweet and it feels like I’m only seeing a part of the picture. By switching over to Advanced Mode, we can refine how many results each field returns:

I reset the workspace (undo to blank, or just reload), and run the search again, this time with a greater number of hashtag field values shown, and with the same relaxed search settings as shown above:

At this point I’m into “fiddling” territory, twiddling with the ‘Number of terms’, ‘Significant’ and ‘Certainty’ knobs to see how the results vary. You can read more about the algorithm behind the Significance setting here, and more about the Graph API here. The certainty setting is simply “The min number of documents that are required as evidence before introducing a related term”, so by lowering it we see more links, but potentially with more “noise” too, of terms that aren’t really related.

An important point to note here is the dataset that I’m using is already biased because of the terms I’m including in my twitter feed search, therefore I’d expect to see this skew in the results below. See the section at the end of this article for more details of the dataset.

  • 50 terms, significant unticked, certainty 1 (as above)
  • 50 terms, significant ticked, certainty 1
  • 50 terms, significant ticked, certainty 3
  • 20 terms, significant ticked, certainty 1
  • 20 terms, significant unticked, certainty 1

Based on the above, “Significant” seems to reduce the number of relationships discovered, but increase the level of weight shown in those that are there.

Adding Additional Vertex Fields

So we’ve seen a basic overview of how to generate Graphs, expand selections, and add relationships to those additional selections. Let’s look now at how multiple fields can be added to a Graph.

Starting with a blank workspace, I switched to Advanced Mode and added two fields from my twitter data:

  • user.screen_name
  • in_reply_to_screen_name

Note that you can customise the colour and icon of different fields.

Under Options I’ve left Significant Links enabled, and set Certainty to 1.

Let’s see who’s been interacting about the recent E4 summit:

Whilst it looks like Mark Rittman is the centre of everything, this is actually highlighting a skew in the source dataset – which includes everything Mark tweets but not all tweets about E4. See the section at the end of this article for more details of the dataset.

The lower cluster is Mark as the addressee of tweets (i.e. he is the in_reply_to_screen_name), whilst the upper cluster is tweets that Mark has sent addressing others (i.e. he is the user.screen_name).

If we click on Add Links a couple of times we can see that there’s other connections here – for example, Mark replies to Stewart (@stewartbryson), who Christian Berg (@Nephentur) talks to, who in turn talks to Mark.

This being twitter and the age of narcissism, I’ll click on my vertex and click Expand Selection to see the people who in turn talk to me:

And by using Add Link see how they relate to those already shown in the Graph:

Viewing Associated Records

Within Graph there’s the option to view the data associated with one or more vertices. We do this by selecting a vertex and clicking on View Example Docs (in Elasticsearch parlance, a document is akin to a ‘row’ as traditional RDBMS folk would know it). From here select the field – for twitter the text field has the contents of the tweet:

Adding Even more Vertex Fields

So, we’ve got a bit of a picture of who talks to whom, but can we see what they’re talking about? We could use the text field shown above to see the contents of tweets but that’s down in the weeds of individual tweets – we want to step back a notch and get a summarised view.

First I add in the hashtag field:

And then deselect the two username fields. This is so that I can expand existing vertices, and instead of showing related hashtags and users, instead I only expand it to show hashtags – and not additional users.

Now I select Mark as the orinator of a tweet, and Expand Selection followed by Add Links on all vertices until I get this:

The number of values selected is key in getting a representative Graph. Above I used a value of 10. Compare that to instead running the same process but with 50. Under Options I’ve left Significant Links enabled, and set Certainty to 1:

One interesting point we can see from this is that the user “itknowingness” in the cluster on the left seems to use all the hashtags, but doesn’t interact with anyone – from the Graph it’s easy to see, and a great example of where Graph gives you the answer to a question you didn’t necessarily know that you had, and which to get the answer out through a traditional RDBMS query would need a very specific query to do so. Looking at the source data via Kibana’s Discover panel shows that it is indeed a bot auto-retweeting anything and everything:

Building a Graph from Scratch

Now that we’ve seen all the salient functions, let’s start with a blank canvas, and see where we get.
The setttings I’m using are:

  • Significant Links unticked
  • Certainty = 1
  • Field entities.hashtags.text.analyzed max terms = 10
  • Field user.screen_name max terms = 10
  • Initial search term rmoff

Then I click on markrittman and Expand Selection, the same for mrainey, and also for the two hashtags e4 and hadoop:

Within the clusters, let’s see what links exist. With no vertices select I click on Add Links (which seems to be the same as selecting all vertices and doing the same). With each click additional links are added, all related to the hadoop/bigdata area:

I’m interested now in the E4 region of the Graph, and the vertices related to Mark Rittman. Clicking on his vertex and clicking “Select Neighbours” does exactly that:

Now I’m more interested in digging into the terms (hashtags) that are related that people, so I deselect the user.screen_name field, and then Expand Selection and Add Links again.

Note the width of the connections – a strong relationship between Mark Rittman, “Hadoop” and “SQL”, which is presumably from the tweets around the presentation he did recently on the subject of… SQL on Hadoop. Other terms, including Hive and Impala, are also related, as you’d expect.

Graphing Tweet Text Contents

By making sure that the tweet text is available as an analysed field we can produce a Graph based on the ‘tokens’ within the tweet, rather than the literal 140 characters. Whilst hashtags are there deliberately to help with the classification and grouping of tweets (so that other people can follow conversations on the same subject) there are two reasons why you’d want to look at the tweet text too:

  1. Not everyone uses hashtags
  2. Not all relationships are as boolean as a hashtag or not – maybe a general discussion in an area re-uses the same words which overall forms a relationship between the terms.

Here I’m going back to the default settings:

  • Significant Links ticked
  • Certainty = 3

And returning two fields – hashtag and tweet text

  • Field entities.hashtags.text.analyzed max terms = 20
  • Field text.analyzed max terms = 50
  • Initial search term kafka

I then tidy it up a bit :

  • Joining the same/near-same text and hashtags, such as “kafkasummit” hashtag and the same text. If you think about the contents of a tweet, hashtags are part of the text, therefore, there’s going to be a lot of this duplication.
  • Blacklisted text terms that are URL snippets. Here I’m using the Example Docs function to check the context of the term in the whole text field

    I also blacklisted common words (“the”, “of”, etc), and foreign ones (how British…).

Behind the Scenes

The Kibana Graph plugin is just a front-end for the Graph extension in Elasticsearch. It’s useful (and fun!) for exploring data, but in practice you’d be making direct REST API calls into Elasticsearch to retrieve a list of vertices and connections and relative weights for use in your application. You can see details of this from the Settings page and Last Request option

Looking at an example (the one used in the first example on this article), the request is pretty simple:

{
    "query": {
        "query_string": {
            "default_field": "_all",
            "query": "kibana"
        }
    },
    "controls": {
        "use_significance": true,
        "sample_size": 2000,
        "timeout": 5000
    },
    "connections": {
        "vertices": [
            {
                "field": "entities.hashtags.text.analyzed",
                "size": 5,
                "min_doc_count": 3
            }
        ]
    },
    "vertices": [
        {
            "field": "entities.hashtags.text.analyzed",
            "size": 5,
            "min_doc_count": 3
        }
    ]
}

and the response not too complex either, just long.

{
    "took": 201,
    "timed_out": false,
    "failures": [],
    "vertices": [
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "logstash",
            "weight": 0.1374238061561338,
            "depth": 0
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "timelion",
            "weight": 0.12719678206002483,
            "depth": 0
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "elasticsearch",
            "weight": 0.11733085557405047,
            "depth": 0
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "osdc",
            "weight": 0.00759026383038536,
            "depth": 1
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "letsencrypt",
            "weight": 0.006869972953128271,
            "depth": 1
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "kibana",
            "weight": 0.6699955212823048,
            "depth": 0
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "filebeat",
            "weight": 0.004700657388257993,
            "depth": 1
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "elk",
            "weight": 0.09717015256984456,
            "depth": 0
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "justsayin",
            "weight": 0.005724977460940227,
            "depth": 1
        },
        {
            "field": "entities.hashtags.text.analyzed",
            "term": "elasticsearch5",
            "weight": 0.004700657388257993,
            "depth": 1
        }
    ],
    "connections": [
        {
            "source": 0,
            "target": 3,
            "weight": 0.00759026383038536,
            "doc_count": 26
        },
        {
            "source": 7,
            "target": 5,
            "weight": 0.02004197094823259,
            "doc_count": 26
        },
        {
            "source": 5,
            "target": 4,
            "weight": 0.006869972953128271,
            "doc_count": 6
        },
        {
            "source": 5,
            "target": 0,
            "weight": 0.018289612748107368,
            "doc_count": 48
        },
        {
            "source": 0,
            "target": 6,
            "weight": 0.004700657388257993,
            "doc_count": 11
        },
        {
            "source": 7,
            "target": 0,
            "weight": 0.0038135609650491726,
            "doc_count": 10
        },
        {
            "source": 0,
            "target": 5,
            "weight": 0.0052711254217388415,
            "doc_count": 48
        },
        {
            "source": 0,
            "target": 9,
            "weight": 0.004700657388257993,
            "doc_count": 11
        },
        {
            "source": 5,
            "target": 1,
            "weight": 0.033204869273453314,
            "doc_count": 29
        },
        {
            "source": 1,
            "target": 5,
            "weight": 0.04492364819068228,
            "doc_count": 29
        },
        {
            "source": 5,
            "target": 8,
            "weight": 0.005724977460940227,
            "doc_count": 5
        },
        {
            "source": 2,
            "target": 5,
            "weight": 0.00015519515214322833,
            "doc_count": 80
        },
        {
            "source": 5,
            "target": 7,
            "weight": 0.022734810798933344,
            "doc_count": 26
        },
        {
            "source": 7,
            "target": 2,
            "weight": 0.0006823241440183544,
            "doc_count": 13
        }
    ]
}

Note how the connections are described using the relative (zero-based) instance number of the vertices. You can also see that the width of a connection is based on the weight (calculated from the significant terms algorithm), rather than document count. Compare the connection width of timelion/kibana (vertices 1 and 5 respectively), with a weighting of 0.33 (kibana -> timelion) and 0.045 (timelion -> kibana) but overlapping document count of 29:

with elasticsearch -> kibana that has an overlapping document count of 80 but only a weight of 0.0001.

Elasticsearch’s documentation describes the significant terms algorithm thus, using the example of suggesting “H5N1” when users search for “bird flu” in text:

In all these cases the terms being selected are not simply the most popular terms in a set. They are the terms that have undergone a significant change in popularity measured between a foreground and background set. If the term “H5N1” only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.

So from this, we can roughly say that Graph is looking at the number of documents in which timelion is mentioned as a proportion of the whole dataset, and then in the number of documents in which the hashtag Kibana exists and also timelion is mentioned. Since the former is a plugin of the latter, the close relationship would be expected. You can use Kibana to explore the significant terms concept further – for example, taking the same ‘seed’ as the original Graph query above, Kibana, gives a similar set of results as the Graph:

More information about the scoring can be found here, which includes the fact that the scoring is, in part, based on TF-IDF (Term Frequency-Inverse Document Frequency).

Licensing

Graph requires a licence – see here for details.

Conclusion

This tool is a great way to dip one’s toe into the waters of Graph analysis and visualisation. It’s another approach to consider in the data discovery phase of your analytics work, when you don’t even know the questions that you’ve got for the data in front of you. Your data can remain in Elasticsearch in the same format it’s always been, and the Graph function just runs on top of it.

I’ll not profess to be a Graph theory expert, so can’t pass much comment on the theoretical rigour of the results and techniques seen. One thing that struck me with it was that there’s no (apparent) way to manually influence the weight of connections and vertices – for example, based on the number of followers someone has one twitter consider them more (or less) relevant when determining relationships.

For a well-informed view on Graph theory and Social Network Analysis (SNA), see Jordan Meyer’s presentation here (and associated R code), as well as Mark Rittman’s presentation from BIWA this year.

Footnote: The Twitter Dataset

The dataset I’m using is a live stream from Twitter, via Logstash and Kafka, searching for a set of terms related to me and the field I work in. Therefore, there’s going to be a bunch of relationships missing (if I’ve not included the relevant term in my tweet search), and relationships over-stated (because as a proportion of all the records the terms I’ve selected will dominate).
An interesting use of Graph (or Elasticsearch’s significant terms aggregation in general) could be to identify all the relevant terms that I should be including in my twitter search, by sampling an ‘unpolluted’ feed for relationships. For example, if I’m interested in capturing Kafka tweets, perhaps I should also be capturing those related to Samza, Spark, and so on.

The post Experiments with Elastic’s Graph Tool appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Contemplating Upgrading to OBIEE 12c?

Rittman Mead Consulting - Thu, 2016-04-28 05:00
Where You Are Now

NewImage
OBIEE 12c has been out for some time, and it seems like most folks are delaying upgrading to OBIEE 12c until the very last minute. Or at least until Oracle decides to put out another major version change of OBIEE, which is understandable. You’ve already spent time and money and devoted hundreds of resource hours to system monitoring, maintenance, testing, and development. Maybe you’ve invested in staff training to try to maximize your ROI in your existing OBIEE purchase. And now, after all this time and effort, you and your team have finally gotten things just right. Your BI engine is humming along, user adoption and stickiness are up, and you don’t have a lot of dead objects clogging up the Web Catalog. Your report hacks and work-arounds have been worked and reworked to become sustainable and maintainable business solutions. Everyone is getting what they want.

Sure, this scenario is part fantasy, but it doesn’t mean that as a BI team lead or member, you’re not always working toward this end. It would be nice to think that the people designing the tools with which we do this work understood the daily challenges and processes we must undergo in order to maintain the precarious homeostasis of our BI ecosystems. That’s where Rittman Mead comes in. If you’re considering upgrading to OBIEE 12c, or are even curious, keep reading. We’re here to help.

So Why Upgrade

Let’s get right down to it. Shoot over here and here to check out what our very own Mark Rittman had to say about the good, the bad, and the ugly of 12c. Our Silvia Rauton did a piece on lots of the nuts and bolts of 12c’s new front-end features. They’re all worth a read. Upgrading to OBIEE 12c offers many exciting new features that shouldn’t be ignored.

Heat Map

How Rittman Mead Can Help

We understand what it is to be presented with so many project challenges. Do you really want to risk the potential perils and pitfalls presented by upgrading to OBIEE 12c? We work both harder and smarter to make this stuff look good. And we get the most out of strategy and delivery via a number of in-house tools designed to keep your OBIEE deployment in tip top shape.

Maybe you want to make sure all your Catalog and RPD content gets ported over without issue? Instead of spending hours on testing every dashboard, report, and other catalog content post-migration, we’ve got the Automated Regression Testing package in our tool belt. We deploy this series of proprietary scripts and dashboards to ensure that everything will work just the way it was, if not better, from one version to the next.

Maybe you’d like to make sure your system will fire on all cylinders or you’d like to proactively monitor your OBIEE implementation. For that we’ve got the Performance Analytics Dashboards, built on the open source ELK stack to give you live, active monitoring of critical BI system stats and the underlying database and OS.

OBIEE Performance Analytics

On top of these tools, we’ve got the strategies and processes in place to not only guarantee the success of your upgrade, but to ensure that you and your team remain active and involved in the process.

What to Expect

You might be wondering what kinds of issues you can expect to experience during upgrading to OBIEE 12c (which is to say, nothing’s going to break, right?). Are you going to have to go through a big training curve? Does upgrading to OBIEE 12c mean you’re going to experience considerable resource downtime as your team, or an even an outside company, manages this process? To answer this question, I’m reminded of a quote from the movie Fight Club: “Choose your level of involvement.”

While we always prefer to work alongside your BI or IT team to facilitate the upgrade process, we also know that resource time is valuable and that your crew can’t stop what they’re doing until things wraps up. We often find that the more clients are engaged with the process, however, the easier the hand-off is because clients better understand best practices, and IT and BI teams are more empowered for the future.

Learning More about OBIEE 12c

But if you’re like many organizations, maybe you have to stay more hands off and get training after the upgrade is complete. Check out the link here to look over the agenda of our OBIEE 12c Bootcamp training course. Like our hugely popular 11g course, this program is five days of back-to-front instruction taught via a selection of seminars and hands-on labs, designed to impart most everything your team will need to know to continue or begin their successful BI practice.

What we often find is that, in addition to being a thorough and informative course, the Bootcamp is a great way to bring together teams or team members, often dispersed among different offices, under one roof to gain common understanding about how each person plays an important role as a member of the BI process. Whether they handle the ETL, data modeling, or report development, everyone can benefit from what often evolves from a training session into some impromptu team building.

Feel Empowered

If you’re still on the fence about whether or not to upgrade, as I said before, you’re not alone. There are lots of things you need to consider, and rightfully so. You might be thinking, “What does this mean for extra work on the plates of my resources? How can I ensure the success of my project? Is it worth it to do it now, or should I wait for the next release?” Whatever you may be mulling over, we’ve been there, know how to answer the questions, and have some neat tools in our utility belt to move the process along. In the end, I hope to have presented you with some bits to aid you in making a decision about upgrading to OBIEE 12c, or at least the impetus to start thinking about it.

If you’d like any more information or just want to talk more about the ins and outs of what an upgrade might entail, send over an email or give us a call.

The post Contemplating Upgrading to OBIEE 12c? appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Data Integration Tips: Oracle Data Integrator 12c Password

Rittman Mead Consulting - Mon, 2016-04-25 00:12

Hey, everyone. It’s Sunday night, and we have just enough time for another Data Integration Tip from Rittman Mead. This one has originated from many years of Oracle Data Integrator experience—and a lost 12c password. Let me start first by stating there is never any blame placed when a password is lost, forgotten, or just never stored in a safe place. It happens more often than you might think! Unfortunately, there is no “Forgot password?” link in ODI 12c, which is why I wanted to share my approach to password recovery for these situations.

forgot-password

The Challenge: Lost 12c Password

There are typically two passwords used in Oracle Data Integrator 12c that are forgotten and difficult to recover:

  1. The Work Repository password, created during the setup of the ODI repositories.
  2. The SUPERVISOR user password.

Often there will be more than one ODI user with supervisor privileges, allowing the SUPERVISOR user account password to be reset and making everyone’s life a bit easier. With that, I’ll focus on the Work Repository password and a specific use case I ran into just recently. This approach will work for both lost 12c password instances and I have used it for each in the past.

work-repo-change-password

Now, yes, there is a feature that allows us to change the Work Repository password from within ODI Studio. But (assuming you do have the ability to edit the Work Repository object) as you can see in the image, you also need to know the “current password.” Therein lies the problem.

The Scenario

OK, here we go. The situation I ran into was related to an ODI 11g to 12c upgrade. During the upgrade, we cloned the master and work repositories and set them up on a new database instance in order to lessen the impact on the current 11g repositories. To make this work, a few modifications are required after cloning and before the ODI upgrade assistant can be run. Find more details on these steps in Brian Sauer’s post, Upgrade to ODI 12c: Repository and Standalone Agent.

  • Modify the Work repository connection from within the Master repository. The cloned Master repository is still pointed to the original ODI 11g Work Repository and the connection must be updated.
  • Update the SYSTEM.SCHEMA_VERSION_REGISTRY$ table to add an entry for the cloned ODI repository in the new database instance.
  • Detach the Work Repository from the original Master Repository.

Easy enough. The upgrade assistant completed successfully and everything was working great during testing, until we attempted to open the Work Repository object in ODI:

“Work repository is already attached to another master repository”

Uh-oh. It seems the last bullet point above was skipped. No worries. We have a simple solution to this problem. We can detach the Work Repository from the Master, then attach it once again. Interestingly enough, the action of detaching the repository cleans up the metadata and allows the Work Repository to be added to the cloned master with no problem.

Detaching is easy. Just confirm that you want to remove the Work Repository and poof, it’s gone. It’s the reattaching where we run into an issue…our lost 12c password issue (you knew I was going to bring that up, didn’t you?). Adding a Work repository requires a JDBC connection to a new or existing repository. In this case, we choose the existing repository in our cloned database. The same one we just detached from the Master. Just make sure that you choose to keep the repository contents or you’ll have a much bigger challenge ahead of you.

But then, out of nowhere, we’re prompted for the Work Repository password.

work-repo-password

Hmm…well, we set the ODI 11g repository up in 2011. Jim, who installed it for us, doesn’t work here any longer. “Hmm” is right!

Here’s the Tip

Before we go any further, full disclosure—this is most likely not considered a supported action in the eyes of Oracle, so beware. Warning SignAlso, I haven’t attempted to use the ODI SDK and a Groovy script to update a password, so that might be the way to go if you’re concerned about this being a hack. But desperate times require desperate measures, as they say.

In order to “recover” a password for the Work Repository, we must actually change it behind the scenes in the repository tables. There’s a great deal of metadata we can access via the repository schema, and the modification of this data via the schema is not typical nor recommended, but sometimes necessary.

Oracle Support has a Knowledge Base document, Oracle Data Integrator 11g and 12c Repository Description (Doc ID 1903225.1), which provides a nice data dictionary for the repositories. Looking at the ODI 12.2.1 version of the repository definition, we find that the table SNP_LOC_REPW in the Work Repository stores the value for the repository password in the column REP_PASSW. Now the password must be encoded to match the repository and environment, so it cannot simply be added to the table in plain text.

Encoding a password is something that Oracle Data Integrator developers and admins have been doing for years, most often when setting up a Standalone agent. As a part of the agent installation, there is a script called encode.sh (or encode.bat for Windows) that will accept a plain text password as a parameter and output the encoded string. Brilliant! Let’s try it out.

Browse to the ODI agent domain home and drill into the bin directory. From there, we can execute the encode command. A quick look at the script shows us the expected input parameters.

encode-syntax

The instance name is actually the Agent name. Ensure the agent is running and fire off the script:

[oracle@ODIGettingStarted bin]$ ./encode.sh -INSTANCE=OGG_ODI_AGENT
2016-04-24 22:00:50.791 TRACE JRFPlatformUtil:Unable to obtain JRF server platform. Probably because you are in JSE mode where oracle.jrf.ServerPlatformSupportFactory is not available which is expected.
2016-04-24 22:00:56.855 NOTIFICATION New data source: [OGG_ODI_REPO/*******@jdbc:oracle:thin:@//localhost:1521/ORCL]
2016-04-24 22:01:01.931 NOTIFICATION Created OdiInstance instance id=1
Enter password to encode:

Now you can enter a password to encode, hit return and boom! Here’s your encoded string.

Enter password to encode:
ejjYhIeqYp4xBWNUooF31Q==

Let’s take the entire string and write a quick update statement for the Work Repository SNP_LOC_REPW table. Even though I know there is only one Work Repository, I still use a where clause to ensure I’m updating the correct row.

update SNP_LOC_REPW
set REP_PASSW = 'ejjYhIeqYp4xBWNUooF31Q=='
where REP_NAME = ‘OGG_ODI_WREP’;

Commit the transaction and Bob’s your uncle! Now we can continue on with adding the Work Repository through ODI Studio. Just enter the password used in the encode.sh command and you’re in!

As I mentioned earlier, this same approach can be used to update the SUPERVISOR user password, or really any ODI user password (if they are stored in the repository). In this case, the use of encode.sh is the same, but this time we update the SNP_USER table in the Master repository. The column PASS stores the encoded password for each user. Just remember to change the password everywhere that the user is set to access ODI (agents, etc).

So there you have it. A quick, simple way to “recover” a lost ODI 12c password. Just be sure that this information doesn’t fall into the wrong hands. Lock down your ODI agent file directory to only those administrators who require access. Same goes for the repository schemas. And finally, use this approach in only the most dire situation of a completely lost 12c password. Thanks for reading and look here if you want more DI Tips. Enjoy your week!

The post Data Integration Tips: Oracle Data Integrator 12c Password appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Pages

Subscribe to Oracle FAQ aggregator - BI &amp; Warehousing