Rittman Mead Consulting

Subscribe to Rittman Mead Consulting feed
Rittman Mead consults, trains, and innovates within the world of Oracle Business Intelligence, data integration, and analytics.
Updated: 5 hours 23 min ago

Becky's BI Apps Corner: OBIA Back-to-Beginnings - Naming Conventions and Jargon

Wed, 2016-08-24 08:30

It's easy to talk about a technology using only jargon. It's much harder to talk about a technology without using jargon. I have seen many meetings between business and IT break down because of this communication barrier. I find it more discouraging when I see this communication breakdown happen between advanced IT staff and new IT staff. For those of us in any technological field, it's easy to forget how long it took to learn all of the ins and outs, the terminology and jargon.

During a recent project, I had another consultant shadowing me to get experience with OBIA. (Hi, Julia!) I was 'lettering' a lot so I decided it was time to diagram my jargon. My scribbles on a whiteboard gave me the idea that it might be helpful to do a bit of connecting the dots between OBIA and data warehousing jargon and naming conventions used in OBIA.

BI Applications Load Plan phases: SDE - Source Dependent Extract

SDE is the first phase in the ETL process that loads source data into the staging area. SDE tasks are source database specific. SDE mappings that run in the load plan will load staging tables. These tables end with _DS and _FS among others.

SIL - Source Independent Load

SIL is the second phase in the ETL process that takes the staged data from the staging tables and loads or transforms them into the target tables. SILOS mappings that run in the load plan will load dimension and fact tables. These tables end with _D and _F among others.

PLP - Post Load Process

This third and final phase in the ETL process occurs after the target tables have been loaded and is commonly used for loading aggregate fact tables. PLP mappings that run in the load plan will load aggregate tables ending with _A. Aggregate tables are often fact table data that has been summed up by a common dimension. For example, a common report might look at finance data by the month. Using the aggregate tables by fiscal period would help improve reporting response time.

For further information about any of the other table types, be sure to read Table Types for Oracle Business Analytics Warehouse. Additionally, this page has probably the best explanation for staging tables and incremental loads.

Source System Acronyms

Since the SDE tasks are source database specific, the SDE mappings' names also include an acronym for the source system in the mapping name. Below are the supported source database systems and the acronyms used in the names and an example for each.

  • Oracle E-Business Suite - ORA

    • SDE_ORA_DomainGeneral_Currency
  • Oracle Siebel - SBL

    • SDE_SBL_DOMAINGENERAL_CURRENCY
  • JD Edwards Enterprise One - JDEE

    • SDE_JDE_DomainGeneral_Currency
  • PeopleSoft - PSFT

    • SDE_PSFT_DomainGeneral_Currency_FINSCM
  • Oracle Fusion Applications - FUSION

    • SDE_FUSION_DomainGeneral_Currency
  • Taleo - TLO

    • SDE_TLO_DomainGeneral_Country
  • Oracle Service Cloud - RNCX

    • SDE_RNCX_DomainGeneral
  • Universal - Universal

    • SDE_Universal_DomainGeneral

This wraps up our quick "Back-to-Beginnings" refresher on naming conventions and the jargon used in relation to ETL and mappings. Let me know in the comments below if there are other topics you would like me to cover in my "Back-to-Beginnings" series. As always, be sure to check out our available training, which now includes remote training options, and our On Demand Training Beta Program. For my next post I'll be covering two new features in OBIA 11.1.1.10.2, Health Check and ETL Diagnostics, which are the missing pieces you didn't know you've been waiting for.

Categories: BI & Warehousing

How To Poke Around OBIEE on Linux with strace (Working with Unsupported ODBC Sources in OBIEE 12c)

Tue, 2016-08-23 08:30

OBIEE 12c (and 11g and 10g before it) supports three primary ways of connecting to data sources:

  1. Native Gateway, such as OCI for Oracle. This is always the preferred option as it gives the greatest support and performance.
  2. Data Direct ODBC Drivers, a set of which are bundled with OBIEE for enabling connectivity to sources such as SQL Server, MySQL, Hive, and Impala. The configuration of these is documented in the OBIEE manuals and is generally a supported configuration.
  3. Native ODBC Drivers.

There's also kind of a fourth option, which is JDBC, as explained by Fiston in this excellent post. It's OBIEE 12c only and not fully documented/supported.

In this article we're going to dig into the third option (native ODBC) and look at how it can be used, and also how to troubleshoot it (and the OBIEE stack in general) on Linux.

ODBC (Open Database Connectivity) is a documented API designed to enable applications to work with databases, without one explicitly supporting the other. A comparable protocol is JDBC, which is also widely used (but as yet not fully supported/documented within OBIEE). To use native ODBC drivers with OBIEE on *nix, you install the driver on your OBIEE server and then configure OBIEE to use it. Mark Rittman wrote an example of how to do this with the original Hive ODBC drivers here (before they were formally bundled with OBIEE), and the manual shows how to use it with a native driver for Teradata.

My interest in this is the Apache Drill tool, which enables querying with SQL against a multitude of datasources, including things like JSON files, Hive tables, RDBMS, and more. MapR have released an ODBC driver for it, and I thought it'd be neat to get Apache Drill to work with OBIEE. It evidently does work for OBIEE running on Windows, but I wanted to be able to use it on my standard environment, Linux. This blog post is a warts-and-all exploration of the process I went through to get it to work, since I thought it might be of interest to see some of the forensic methods available when trying to get things to work. For just the headlines, see Using Apache Drill with OBIEE 12c.

First Things First - Setting Up Apache Drill

Drill can be deployed in distributed configuration (with all the parallel processing goodness which that brings), but also run as a single instance locally. For the sake of simplicity that's what I'm going to do here. It's rather easy to do:

# Download Apache Drill
wget http://www.apache.org/dyn/closer.cgi/drill/drill-1.7.0/apache-drill-1.7.0.tar.gz
# Unpack
tar -xvf apache-drill-1.7.0.tar.gz
# Run
cd /opt/apache-drill-1.7.0/ && bin/sqlline -u jdbc:drill:zk=local

You need to make sure you've got a recent JDK available, and if you're running it on SampleApp, watch out for this odd problem that I had which was related to classpaths and manifested itself with the error java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.JavaType.isReferenceType()Z.

All being well, you'll now have a Drill prompt:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0  
Aug 09, 2016 5:51:43 AM org.glassfish.jersey.server.ApplicationHandler initialize  
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...  
apache drill 1.7.0  
"say hello to my little drill"
0: jdbc:drill:zk=local>  

From here you can run a simple query to check the version:

0: jdbc:drill:zk=local> SELECT version FROM sys.version;  
+----------+
| version  |
+----------+
| 1.7.0    |
+----------+
1 row selected (0.392 seconds)  

or query one of the built-in sample data sets:

0: jdbc:drill:zk=local> select count(*) from cp.`employee.json`;  
+---------+
| EXPR$0  |
+---------+
| 1155    |
+---------+
1 row selected (0.977 seconds)  

For more examples of Drill, see the tutorials.

Setting up Drill ODBC on Linux

With Drill setup and running locally, let's now install the ODBC driver. This is all on SampleApp v511 / Oracle Linux 6.7.

sudo rpm -i http://package.mapr.com/tools/MapR-ODBC/MapR_Drill/MapRDrill_odbc_v1.2.1.1000/MapRDrillODBC-1.2.1.x86_64.rpm

Per the system requirements, we need to make sure that we're using one of the supported ODBC Driver Managers, so we'll install iODBC to start with (the other option being unixODBC):

sudo yum install -y unixodbc

Now follow the configuration instructions. To start with we'll do this in isolation of OBIEE to check that it works, and then bring it into OBIEE's world (for example, odbc.ini already exists in OBIEE).

  1. Set environment variables

    export ODBCINI=~/.odbc.ini
    export MAPRDRILLINI=~/.mapr.drillodbc.ini
    export LD_LIBRARY_PATH=/opt/mapr/drillodbc/lib/64
    
  2. Set up ODBC DSN in ~/.odbc.ini

    [ODBC Data Sources]
    DrillDSN=MapR Drill ODBC Driver 64-bit
    
    
    [DrillDSN]
    Driver=/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so
    Description=Drill ODBC Driver
    ConnectionType=Direct
    HOST=localhost
    PORT=31010
    AuthenticationType=No Authentication
    
  3. Copy the default ODBC driver manager config

    cp /opt/mapr/drillodbc/Setup/odbcinst.ini ~/.odbcinst.ini
    
  4. Configure the Drill ODBC driver in ~/.mapr.drillodbc.ini

    Here I've set the log level to Trace, so that we can see what's going on in depth—in practice this would generate huge amounts of unnecessary log data so set it to a lower value (e.g. 0) for actual use.

    [Driver]
    DisableAsync=0
    DriverManagerEncoding=UTF-32
    ErrorMessagesPath=/opt/mapr/drillodbc/ErrorMessages
    LogLevel=6
    LogPath=/tmp/odbc.log
    SwapFilePath=/tmp
    
    
    ODBCInstLib=libiodbcinst.so.2
    # This is the ODBC Driver Manager library
    # Note the documentation - the DriverManagerEncoding (above) will vary depending
    # on the driver manager in use.
    # See https://drill.apache.org/docs/configuring-odbc-on-linux/#configuring-.mapr.drillodbc.ini
    

Having set this up, we'll now test it:

[oracle@demo ~]$ iodbctest "DSN=DrillDSN"

You should see:

iODBC Demonstration program  
This program shows an interactive SQL processor  
Driver Manager: 03.52.0709.0909  
Driver: 1.2.1.1001 (MapR Drill ODBC Driver)

SQL>  

From where you can enter a command (don't include the ; suffix as this is added automatically)

SQL> SELECT version FROM sys.version

version  
--------
1.7.0

result set 1 returned 1 rows.  
  • If you get Unable to locate SQLGetPrivateProfileString function then check that your LD_LIBRARY_PATH includes the location of the Driver Manager (libiodbc.so.2), and that the exact library specified by ODBCInstLib exists—in my installation it was called libiodbcinst.so.2 rather than libiodbc.so as shown in the docs. Also check that you've set MAPRDRILLINI environment variable.
    • libiodbc installs into /usr/lib64 which is one of the default paths checked for library files, hence not including it explicitly in the LD_LIBRARY_PATH environment variable.
  • If you just get Have a nice day. with no error but no SQL> prompt, check your ODBC DSN configuration. I hit this issue when inadvertently omitting the AuthenticationType parameter.

If you head over to /tmp/odbc.log/ you should see a file called driver.log with a bunch of trace data in it. Here's the data from the above session, excluding TRACE logs:

Aug 09 06:23:35.701 INFO  1148090112 Driver::LogVersions: SDK Version: 09.05.07.1008  
Aug 09 06:23:35.701 INFO  1148090112 Driver::LogVersions: DSII Version: 1.2.1.1001  
Aug 09 06:23:35.701 INFO  1148090112 Driver::Initialize: Database CHAR Encoding: UTF-8  
Aug 09 06:23:35.701 INFO  1148090112 Driver::Initialize: Database WCHAR Encoding: UTF-8  
Aug 09 06:23:35.701 INFO  1148090112 Driver::Initialize: ANSI CHAR Encoding: UTF-8  
Aug 09 06:23:35.701 INFO  1148090112 Driver::Initialize: Driver Manager WCHAR Encoding: UTF-32LE  
Aug 09 06:23:35.702 INFO  1148090112 Driver::Initialize: Detected Driver Manager: iODBC, 3.52.7  
Aug 09 06:23:35.702 INFO  1148090112 Driver::Initialize: Locale name: en_GB  
Aug 09 06:23:35.702 INFO  1148090112 Driver::Initialize: Bitness: 64-bit  
Aug 09 06:23:35.702 INFO  1148090112 CInterface::SQLAllocHandle: Allocating environment handle.  
Aug 09 06:23:35.702 INFO  1148090112 Environment::SQLSetEnvAttr: Attribute: SQL_ATTR_ODBC_VERSION (200)  
Aug 09 06:23:35.702 INFO  1148090112 EnvironmentAttributes::SetAttribute: Setting ODBC version to: 3  
Aug 09 06:23:35.702 INFO  1148090112 CInterface::SQLAllocHandle: Allocating connection handle.  
Aug 09 06:23:35.702 INFO  1148090112 Connection::SQLSetConnectAttr: Attribute: Unknown Attribute (1051)  
Aug 09 06:23:35.702 INFO  1148090112 ConnectionAttributes::SetAttribute: Invalid attribute: 1051  
Aug 09 06:23:35.704 ERROR 1148090112 Connection::SQLSetConnectAttr: [MapR][ODBC] (10210) Attribute identifier invalid or not supported: 1051  
Aug 09 06:23:35.708 INFO  1148090112 DiracClient::DiracClient: Create a new Dirac Client [194bd90] (handshakeTimeout = 5, queryTimeout = 180)  
Aug 09 06:23:35.708 INFO  1148090112 DiracClient::connect: Connection String 'local=localhost:31010' & Default Schema ''  
Aug 09 06:23:35.712 DEBUG 1148090112 DiracClient::connect: Connection successfully.  
Aug 09 06:23:35.713 INFO  1148090112 Connection::SQLGetInfoW: InfoType: SQL_CURSOR_COMMIT_BEHAVIOR (23)  
Aug 09 06:23:35.713 INFO  1148090112 Connection::SQLGetInfoW: InfoType: SQL_CURSOR_ROLLBACK_BEHAVIOR (24)  
Aug 09 06:23:35.713 INFO  1148090112 Connection::SQLGetInfoW: InfoType: SQL_DRIVER_VER (7)  
Aug 09 06:23:35.713 INFO  1148090112 Connection::SQLGetInfoW: InfoType: SQL_DRIVER_NAME (6)  
Aug 09 06:23:35.713 INFO  1148090112 CInterface::SQLAllocHandle: Allocating statement handle.  
Aug 09 06:23:35.714 INFO  1148090112 Statement::SQLGetStmtAttrW: Attribute: SQL_ATTR_APP_ROW_DESC (10010)  
Aug 09 06:23:35.714 INFO  1148090112 Statement::SQLGetStmtAttrW: Attribute: SQL_ATTR_APP_PARAM_DESC (10011)  
Aug 09 06:23:35.714 INFO  1148090112 Statement::SQLGetStmtAttrW: Attribute: SQL_ATTR_IMP_ROW_DESC (10012)  
Aug 09 06:23:35.714 INFO  1148090112 Statement::SQLGetStmtAttrW: Attribute: SQL_ATTR_IMP_PARAM_DESC (10013)  
Aug 09 06:25:39.409 INFO  1148090112 StatementState::InternalPrepare: Preparing query: SELECT version FROM sys.version  
Aug 09 06:25:39.412 INFO  1148090112 DiracClient::ExecuteStatementDirect: DRResultSet [19494c0] executes query 'SELECT * FROM (SELECT version FROM sys.version) T LIMIT 0'  
Aug 09 06:25:39.412 DEBUG 1148090112 DRResultSet::GetNextRecordBatch: START DEQUEUE. Current ResultQueue Size = 0  
Aug 09 06:25:40.219 DEBUG 979547904 DiracClient::QueryResultsListener: DRResultSet [19494c0]: Handle RecordBatch [34001c40], DrillClientError [0]'  
Aug 09 06:25:40.219 DEBUG 1148090112 DRResultSet::GetNextRecordBatch: END DEQUEUE. Get RecordBatch 34001c40 [m_recordCount = 0 ]. Now ResultQueue Size = 0  
Aug 09 06:25:40.219 DEBUG 1148090112 DRResultSet::InitializeColumns: Column 1 version: {DataMode: OPTIONAL, MinorType: VARCHAR, Precision = 1, Scale = 0}  
Aug 09 06:25:40.219 DEBUG 1148090112 DRResultSet::GetNextRecordBatch: START DEQUEUE. Current ResultQueue Size = 0  
Aug 09 06:25:40.276 DEBUG 979547904 DiracClient::QueryResultsListener: DRResultSet [19494c0]: Handle RecordBatch [0], DrillClientError [0]'  
Aug 09 06:25:40.276 DEBUG 1148090112 DRResultSet::GetNextRecordBatch: status = QRY_COMPLETED.  
Aug 09 06:25:40.276 INFO  1148090112 DiracClient::CloseStatement: Close statement and free query resources  
Aug 09 06:25:40.276 INFO  1148090112 DiracClient::ExecuteStatementDirect: DRResultSet [19494c0] executes query 'SELECT version FROM sys.version'  
Aug 09 06:25:40.277 DEBUG 1148090112 DRResultSet::GetNextRecordBatch: START DEQUEUE. Current ResultQueue Size = 0  
Aug 09 06:25:40.748 DEBUG 979547904 DiracClient::QueryResultsListener: DRResultSet [19494c0]: Handle RecordBatch [3400cb60], DrillClientError [0]'  
Aug 09 06:25:40.750 DEBUG 1148090112 DRResultSet::GetNextRecordBatch: END DEQUEUE. Get RecordBatch 3400cb60 [m_recordCount = 1 ]. Now ResultQueue Size = 0  
Aug 09 06:25:40.750 DEBUG 1148090112 DRResultSet::InitializeColumns: Column 1 version: {DataMode: OPTIONAL, MinorType: VARCHAR, Precision = 1, Scale = 0}  
Aug 09 06:25:40.751 DEBUG 1148090112 DRResultSet::GetNextRecordBatch: START DEQUEUE. Current ResultQueue Size = 0  
Aug 09 06:25:40.985 DEBUG 979547904 DiracClient::QueryResultsListener: DRResultSet [19494c0]: Handle RecordBatch [0], DrillClientError [0]'  
Aug 09 06:25:40.985 DEBUG 1148090112 DRResultSet::GetNextRecordBatch: status = QRY_COMPLETED.  
Aug 09 06:25:40.986 INFO  1148090112 DiracClient::CloseStatement: Close statement and free query resources  
Aug 09 06:25:40.991 ERROR 1148090112 Statement::SQLCloseCursor: [MapR][ODBC] (10510) Invalid cursor state.  
Aug 09 06:27:19.748 ERROR 1148090112 Statement::SQLCloseCursor: [MapR][ODBC] (10510) Invalid cursor state.  
Aug 09 06:27:19.748 INFO  1148090112 CInterface::SQLFreeHandle: Freeing statement handle.  
Aug 09 06:27:19.748 INFO  1148090112 DiracClient::~DiracClient: Close the Dirac Client [194bd90]  
Aug 09 06:27:19.748 INFO  1148090112 CInterface::SQLFreeHandle: Freeing connection handle.  
Aug 09 06:27:19.749 INFO  1148090112 CInterface::SQLFreeHandle: Freeing environment handle.  
Hooking it up to OBIEE

We've shown that we can connect to Apache Drill using ODBC and query it. Let's see if we get it to work for OBIEE. Our starting point is the 12c docs, Configuring Database Connections Using Native ODBC Drivers.

  1. Environment variables are defined per-component in a .properties file, which for the BI Server (OBIS / nqsserver) is BI_DOMAIN/config/fmwconfig/bienv/obis/obis.properties, so on SampleApp /app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/OBIS/obis.properties. To this file (which in 12.2.1 is empty by default) we add:

    MAPRDRILLINI=/home/oracle/.mapr.drillodbc.ini
    LD_LIBRARY_PATH=/opt/mapr/drillodbc/lib/64
    
  2. Add the Drill DSN to odbc.ini which for OBIEE already exists and is populated with other ODBC configurations. You'll find the file in BI_DOMAIN/config/fmwconfig/bienv/core, which on SampleApp is /app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/odbc.ini.

    1. Add to the [ODBC Data Sources] section

      DrillDSN=MapR Drill ODBC Driver 64-bit
      
    2. Add a section to the bottom of the file:

      [DrillDSN]
      Driver=/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so
      AuthenticationType=No Authentication
      Description=Drill ODBC Driver
      ConnectionType=Direct
      HOST=localhost
      PORT=31010
      

Now restart the BI Server:

/app/oracle/biee/user_projects/domains/bi/bitools/bin/stop.sh -i obis1 && /app/oracle/biee/user_projects/domains/bi/bitools/bin/start.sh -i obis1

We can check that the BI Server has picked up our new environment variables by using the /proc pseudo file-system (you can also see an environment variable dump as part of the obis1.out logfile during startup):

[oracle@demo ~]$ strings /proc/$(pgrep nqsserver)/environ|grep mapr
LD_LIBRARY_PATH=/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib:/app/oracle/biee/bi/bifoundation/server/bin:/app/oracle/biee/bi/bifoundation/web/bin:/app/oracle/biee/bi/clients/Essbase/EssbaseRTC/bin:/app/oracle/biee/bi/lib:/app/oracle/biee/lib:/app/oracle/biee/oracle_common/adr:/app/oracle/biee/oracle_common/lib:/usr/lib:/lib:/opt/mapr/drillodbc/lib/64:::/app/oracle/app/oracle/product/12.1.0/dbhome_1/lib:/app/oracle/endeca/olt/bin:/app/oracle/biee/wlserver/server/native/linux/x86_64:/app/oracle/biee/wlserver/server/native/linux/x86_64/oci920_8:/app/oracle/biee/wlserver/server/native/linux/x86_64:/app/oracle/biee/wlserver/server/native/linux/x86_64/oci920_8
MAPRDRILLINI=/home/oracle/.mapr.drillodbc.ini

Over in the Administration Tool I set up a new Database (type: ODBC Basic) and associated connection pool (ODBC 2.0) pointing to DrillDSN.

Now to try it out! In Answers I build a Direct Database Request:

But...oh no!

Odbc driver returned an error (SQLExecDirectW).
State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. (HY000)
State: HY000. Code: 12010. [nQSError: 12010] Communication error connecting to remote end point: address = demo.us.oracle.com; port = 7792. (HY000)
SQL Issued: {call NQSGetQueryColumnInfo('EXECUTE PHYSICAL CONNECTION POOL "Apache Drill"."Apache Drill" SELECT version FROM sys.version;')}

Let's pick through this, because it's a common error and easily misinterpreted.

  1. Odbc driver returned an error (SQLExecDirectW).

    • When you run an analysis/dashboard, Presentation Services (OBIPS/sawserver) connects to the BI Server (OBIS/nqsserver) to send the request as Logical SQL, and it connects to the BI Server using ODBC. Therefore any kind of issue running the request will always show as an "ODBC error".

    • Learning: nothing, other than that Presentation Services hit an error.

  2. State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. (HY000)

    • Learning: nothing. Something went wrong, somewhere, and OBIEE was involved (nQSError).
  3. State: HY000. Code: 12010. [nQSError: 12010] Communication error connecting to remote end point: address = demo.us.oracle.com; port = 7792. (HY000)

    • Now we're getting somewhere. Port 7792 is, on this server, the BI Server (OBIS / nqsserver). And, there was an error connecting to it.
    • Learning: We failed to successfully connect to the BI Server. Is it running? Was it running but crashed? Is there a network problem? Lots to investigate.
  4. SQL Issued: {call NQSGetQueryColumnInfo('EXECUTE PHYSICAL CONNECTION POOL "Apache Drill"."Apache Drill" SELECT version FROM sys.version;')}

    • Learning: This is the logical SQL that Presentation Services was trying to run. Useful to know.

Let's go and have a look at the BI Server log, because something's evidently not right. We know that it was running, because otherwise we'd not have been able to login to OBIEE in the first place.

In /app/oracle/biee/user_projects/domains/bi/servers/obis1/logs/obis1.out there's some bad news:

<NodeManager ComponentManager> <The server 'obis1' with process id 7059 is no longer alive; waiting for the process to die.>
<NodeManager ComponentManager> <Process died.>
<NodeManager ComponentManager> <get latest startup configuration before deciding/trying to restart the server>
<NodeManager ComponentManager> <Server failed so attempting to restart (restart count = 1)>

So it looks like the BI Server crashed. In the folder above where obis1.out is kept, there's a corresponding crash report (note that the process ID matches the log message above), nqsserver_7059_crashreport.txt. Some bits of interest from it:

[...]
Beginning of crash dump...  
Signal: 11  
================================================================
[...]
Activity type: ExecutePhysical GatewayDbGateway Prepare  
DSN:Apache Drill  
User Name:weblogic  
SQL:SELECT version FROM sys.version;  
================================================================
Activity #2 ECID: 005EPt5qYm2Fw000jzwkno0001iJ0001KT,0  
================================================================
Activity type: Producer Executing Query  
Repository Name:ssi;Subject Area Name:;User Name:weblogic  
Logical Hash of SQL: 0x3d5c4ef2SQL:{call NQSGetQueryColumnInfo('EXECUTE PHYSICAL  CONNECTION POOL "Apache Drill"."Apache Drill" SELECT version FROM sys.version;')}  
[...]

BACKTRACE:  
/app/oracle/biee/bi/bifoundation/server/bin/libnqutilityserver64.so(+0x12a829)[0x7f554995b829]
/lib64/libpthread.so.0(+0xf790)[0x7f5547cde790]
/lib64/libc.so.6(cfree+0x1c)[0x7f554721693c]
/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbc.so(+0xa5c55)[0x7f550f2efc55]
/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbc.so(+0xa4fc0)[0x7f550f2eefc0]
/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbc.so(+0xa53f0)[0x7f550f2ef3f0]
/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbc.so(SQLDriverConnect+0x149)[0x7f550f2d8337]
/app/oracle/biee/bi/bifoundation/server/bin/libnqsdbgatewayodbc64.so(_ZN8NQSQLAPI18NQSQLDriverConnectEPvS0_RKN3sup6StringIwN7_SASSTL9allocatorIwEEEES8_S8_tbS8_+0x1c8)[0x7f54abfb5078]
/app/oracle/biee/bi/bifoundation/server/bin/libnqsdbgatewayodbc64.so(_ZN24OdbcDataSourceConnection7ConnectERK6SignonbjN10ServerInfo17TxnIsolationLevelEi+0x275)[0x7f54abfcbb45]
/app/oracle/biee/bi/bifoundation/server/bin/libnqutilityserver64.so(_ZN20DataSourceConnection7ConnectERK6SignonbRK14GatewayOptions+0x79)[0x7f55498d6819]
/app/oracle/biee/bi/bifoundation/server/bin/libnqsdbconnection64.so(_ZN12DbConnection9FixDamageERK14GatewayOptions+0x7c)[0x7f55524e12cc]
/app/oracle/biee/bi/bifoundation/server/bin/libnqsdbconnection64.so(_ZN20SingleThreadedDbPool22GetAvailableConnectionERK6SignonR17DbConnectionGuardRbRK14GatewayOptions+0x45e)[0x7f555250513e]
/app/oracle/biee/bi/bifoundation/server/bin/libnqsdbgateway64.so(_ZN11NQDbGateway7PrepareEv+0xb14)[0x7f5552294fa4]
/app/oracle/biee/bi/bifoundation/server/bin/libnqsexecutionlist64.so(_ZN8Producer11ExecuteOnceEv+0xb4c)[0x7f555146192c]
/app/oracle/biee/bi/bifoundation/server/bin/libnqutilitygeneric64.so(_ZN16NQExecutionState17ExecuteSystemMainEv+0x82)[0x7f5549c92032]
/app/oracle/biee/bi/bifoundation/server/bin/libnqutilitygeneric64.so(_ZN15NQThreadJobBase17ExecuteSystemMainEv+0x51)[0x7f5549cd9dd1]
/app/oracle/biee/bi/bifoundation/server/bin/libnqsdbconnection64.so(_ZN18NQThreadServerBaseI17ManagedJobFunctorE11ScheduleJobEv+0x40)[0x7f5552509650]
/app/oracle/biee/bi/bifoundation/server/bin/libnqutilitygeneric64.so(_ZN19NQThreadServerNoJob15ExecuteUserMainEv+0x16)[0x7f5549cda256]
/app/oracle/biee/bi/bifoundation/server/bin/libnqutilitygeneric64.so(_ZN16NQExecutionState17ExecuteSystemMainEv+0x82)[0x7f5549c92032]
/app/oracle/biee/bi/bifoundation/server/bin/libnqutilitygeneric64.so(_ZN8NQThread17ExecuteSystemMainEv+0x5b)[0x7f5549cd897b]
/app/oracle/biee/bi/bifoundation/server/bin/libnqutilitygeneric64.so(_ZN8NQThread15ThreadMainEntryEPv+0x2e)[0x7f5549cd8c9e]
/lib64/libpthread.so.0(+0x7a51)[0x7f5547cd6a51]
/lib64/libc.so.6(clone+0x6d)[0x7f554728393d]

Signal 11 is "SIGSEGV", or segmentation fault, which is bad, m'kay?

Tracing OBIEE's ODBC Connectivity with strace

Looking in at the ODBC driver.log we can see that something happened, but no error logged. Here's everything, minus TRACE logs:

Aug 09 07:13:43.283 INFO  2896611072 Driver::LogVersions: SDK Version: 09.05.07.1008  
Aug 09 07:13:43.288 INFO  2896611072 Driver::LogVersions: DSII Version: 1.2.1.1001  
Aug 09 07:13:43.288 INFO  2896611072 Driver::Initialize: Database CHAR Encoding: UTF-8  
Aug 09 07:13:43.288 INFO  2896611072 Driver::Initialize: Database WCHAR Encoding: UTF-8  
Aug 09 07:13:43.288 INFO  2896611072 Driver::Initialize: ANSI CHAR Encoding: UTF-8  
Aug 09 07:13:43.288 INFO  2896611072 Driver::Initialize: Driver Manager WCHAR Encoding: UTF-32LE  
Aug 09 07:13:43.289 INFO  2896611072 Driver::Initialize: Detected Driver Manager: DataDirect Driver Manager, 07.14.0119 (U0115)  
Aug 09 07:13:43.289 INFO  2896611072 Driver::Initialize: Locale name: en_US  
Aug 09 07:13:43.289 INFO  2896611072 Driver::Initialize: Bitness: 64-bit  
Aug 09 07:13:43.290 INFO  2896611072 CInterface::SQLAllocHandle: Allocating environment handle.  
Aug 09 07:13:43.293 INFO  2896611072 CInterface::SQLAllocHandle: Allocating connection handle.  
Aug 09 07:13:43.298 ERROR 2896611072 CInterface::SQLGetConnectAttr: [MapR][ODBC] (10210) Attribute identifier invalid or not supported: 1063  
Aug 09 07:13:43.300 INFO  2896611072 CInterface::SQLFreeHandle: Freeing connection handle.  
Aug 09 07:13:43.301 INFO  2896611072 CInterface::SQLFreeHandle: Freeing environment handle.  
Aug 09 07:13:43.312 INFO  2896611072 CInterface::SQLAllocHandle: Allocating environment handle.  
Aug 09 07:13:43.315 INFO  2896611072 Environment::SQLSetEnvAttr: Attribute: SQL_ATTR_ODBC_VERSION (200)  
Aug 09 07:13:43.315 INFO  2896611072 EnvironmentAttributes::SetAttribute: Setting ODBC version to: 2  
Aug 09 07:13:43.315 INFO  2896611072 CInterface::SQLAllocHandle: Allocating connection handle.  
Aug 09 07:13:43.317 INFO  2896611072 Environment::SQLGetEnvAttr: Attribute: Unknown Attribute (1065)  
Aug 09 07:13:43.317 INFO  2896611072 Connection::SQLGetInfoW: InfoType: SQL_DRIVER_ODBC_VER (77)  

One thing that stands out (with the benefit of hindsight because I've been hacking away at this for a while) is

Driver::Initialize: Detected Driver Manager: DataDirect Driver Manager, 07.14.0119 (U0115)

Whereas in my successful standalone test above it was

Driver::Initialize: Detected Driver Manager: iODBC, 3.52.7

So now let's do some digging. Using strace (which I've written about previously) we can see every system call that OBIEE makes, letting us audit the files that it reads and libraries that it loads. You can invoke strace at process startup or, easier, attach it to a running one. Since the BI Server also creates child processes, and also may crash (as here), we'll use a loop to make sure we've always got a strace running against it until we opt to cancel it:

while [ 1 -eq 1 ]; do strace -f -p $(pgrep nqsserver) -tt -y -s 4096 -o ~/nqs_trace_$(date +%Y%m%dT%H%M%S%N).out;done

If the BI Server isn't running, you'll just see

strace: Invalid process id: '-tt'

But once it starts, you'll get

Process 8457 attached

The output is written to the home folder, in files named nqs_trace_ followed by the timestamp at which the trace started. To cancel it, hit Ctrl-C several times to break out of strace and the while loop. If you leave this running for a long period of time you'll likely fill the filesystem and possibly make the server unusable…

Using your text editor of choice, you can search through the strace output to try and pick apart what's happening. Even if you don't find the cause of the error, you can at least rule other causes out this way. For example, whether a configuration file that you think is being used actually is, or where a library is actually being read from.

First up in searching the trace for odbc is the BI Server's own ODBC gateway library being opened and read:

7222  07:13:43.080212 open("/app/oracle/biee/bi/bifoundation/server/bin/libnqsdbgatewayodbc64.so", O_RDONLY) = 31

Next it checks for the odbc.ini configuration file and gets a success return code (0), but at this point it doesn't read it

8382  07:13:43.097901 stat("/app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/odbc.ini", {st_mode=S_IFREG|0640, st_size=2613, ...}) = 0

and then looks for the ODBC trace library, and this is interesting because it can't find it. One of the very useful things with strace is to be able to see all the paths that a program tries to read for a given file. Here you see a selection of the paths it's trying, and the failure (-1 ENOENT (No such file or directory)) each time:

8382  07:13:43.098135 stat("/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/odbctrac.so", 0x7f54aca68b90) = -1 ENOENT (No such file or directory)
8382  07:13:43.098190 stat("/usr/local/lib//app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/odbctrac.so", 0x7f54aca68b90) = -1 ENOENT (No such file or directory)
8382  07:13:43.098230 stat("/usr/local/lib/..//app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/odbctrac.so", 0x7f54aca68b90) = -1 ENOENT (No such file or directory)
8382  07:13:43.098274 stat("/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib//app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/odbctrac.so", 0x7f54aca68b90) = -1 ENOENT (No such file or directory)
8382  07:13:43.098312 stat("/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/..//app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/odbctrac.so", 0x7f54aca68b90) = -1 ENOENT (No such file or directory)

We can use one of my favourite linux utilities, locate, to find where this file actually is:

[oracle@demo ~]$ locate odbctrac.so
/app/oracle/biee/bi/bifoundation/odbc/lib/odbctrac.so
/app/oracle/biee/bi/common/ODBC/Merant/5.3/lib/odbctrac.so

You can install this on your system with sudo yum install -y mlocate, and if you've recently changed files (e.g. installed a new package) refresh the database with sudo updatedb before running locate.

This is one interesting point of note here, then, that the ODBC trace library isn't loading properly. Let's note this for now, and carry on through the trace file. Next up it opens and reads the odbc.ini file:

8382  07:13:43.102664 open("/app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/odbc.ini", O_RDONLY) = 31
8382  07:13:43.102786 read(31</app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/odbc.ini>, "[ODBC]\nTrace=1\nTraceFile=/app/oracle/biee/user_projects/domains/bi/odbctrace.out\nTraceDll=/app/oracle/biee/bi/common/ODBC/M [...]

Now we get into the meat of it. The Apache Drill ODBC driver gets checked that it exists:

8382  07:13:43.109475 stat("/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so", {st_mode=S_IFREG|0645, st_size=38794216, ...}) = 0

and then opened and read. Next OBIEE tries to read the libdrillClient.so library, starting here:

8382  07:13:43.113071 open("/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)

Note that this fails (-1 ENOENT), and it works it's way through other paths until it succeeds:

8382  07:13:43.113071 open("/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113189 open("/app/oracle/biee/bi/bifoundation/server/bin/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113238 open("/app/oracle/biee/bi/bifoundation/web/bin/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113274 open("/app/oracle/biee/bi/clients/Essbase/EssbaseRTC/bin/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113309 open("/app/oracle/biee/bi/lib/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113341 open("/app/oracle/biee/lib/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113374 open("/app/oracle/biee/oracle_common/adr/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113408 open("/app/oracle/biee/oracle_common/lib/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113440 open("/usr/lib/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113484 open("/lib/libdrillClient.so", O_RDONLY) = -1 ENOENT (No such file or directory)
8382  07:13:43.113517 open("/opt/mapr/drillodbc/lib/64/libdrillClient.so", O_RDONLY) = 31

Where do these paths come from? LD_LIBRARY_PATH! And we can confirm that by comparing the above sequence of folders with the value of LD_LIBRARY_PATH that we saw above, split out onto lines here with a bit of sed magic:

[oracle@demo ~]$  strings /proc/$(pgrep nqsserver)/environ|grep LD_LIBRARY_PATH|sed 's/:/\n/g'|sed 's/=/\n/g'
LD_LIBRARY_PATH
/app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib
/app/oracle/biee/bi/bifoundation/server/bin
/app/oracle/biee/bi/bifoundation/web/bin
/app/oracle/biee/bi/clients/Essbase/EssbaseRTC/bin
/app/oracle/biee/bi/lib
/app/oracle/biee/lib
/app/oracle/biee/oracle_common/adr
/app/oracle/biee/oracle_common/lib
/usr/lib
/lib
/opt/mapr/drillodbc/lib/64

So...we've read the odbc.ini config file, loaded the Apache Drill ODBC driver (libmaprdrillodbc64.so) and associated library (libdrillClient.so). Now a further library is opened and read:

8382  07:13:43.271710 open("/opt/mapr/drillodbc/lib/64/SimbaDrillODBC.did", O_RDONLY) = 31

and then the Apache Drill ODBC driver configuration file that we created earlier (and referenced in the MAPRDRILLINI environment variable) is opened and read:

8382  07:13:43.281356 open("/home/oracle/.mapr.drillodbc.ini", O_RDONLY) = 31

Because we've enabled logging in the driver configuration, what happens next in the strace is that log file being created and written to:

8382  07:13:43.283527 mkdir("/tmp/odbc.log/", 0755) = -1 EEXIST (File exists)
8382  07:13:43.283834 open("/tmp/odbc.log/driver.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 31
8382  07:13:43.284037 fstat(31</tmp/odbc.log/driver.log>, {st_mode=S_IFREG|0640, st_size=174018, ...}) = 0
8382  07:13:43.284220 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f54ad460000
8382  07:13:43.284338 fstat(31</tmp/odbc.log/driver.log>, {st_mode=S_IFREG|0640, st_size=174018, ...}) = 0
8382  07:13:43.284389 lseek(31</tmp/odbc.log/driver.log>, 174018, SEEK_SET) = 174018
8382  07:13:43.284438 write(31</tmp/odbc.log/driver.log>, "Aug 09 07:13:43.283 INFO  2896611072 Driver::LogVersions: SDK Version: 09.05.07.1008\n", 85) = 85
[...]

After a bunch of trace logging is written, the ODBC messages translation file is read:

8382  07:13:43.290888 read(34</opt/mapr/drillodbc/ErrorMessages/en-US/ODBCMessages.xml>, "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!DOCTYPE Messages [\n    <!ELEMENT Messages (Package*)>\n\n    <!ELEMENT Package (Error*)>\n    <!ATTLI

The next file read is odbc.ini again

8382  07:13:43.304659 read(35</app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/odbc.ini>, "", 4096) = 0

followed by odbc.ini in the default /etc/ path (which exists but is zero bytes):

8382  07:13:43.306727 open("/etc/odbc.ini", O_RDONLY) = 35

The odbcinst.ini file is then read, from /etc/odbcinst.ini (which is created by unixODBC on installation, and has entries for non-existant mysql/postgresql drivers):

8382  07:13:43.307417 read(35</etc/odbcinst.ini>, "# Example driver definitions\n\n# Driver from the postgresql-odbc package\n# Setup from the unixODBC package\n[PostgreSQL]\nDescription\t= ODBC for PostgreSQL\nDriver\t\t= /usr/lib/psqlod

and then the version for Drill:

8382  07:13:43.308040 read(35</home/oracle/.odbcinst.ini>, "[ODBC Drivers]\nMapR Drill ODBC Driver 64-bit=Installed\n\n[MapR Drill ODBC Driver 64-bit]\nDescription=MapR Drill ODBC Driver(64-bit)\nDriver=/opt/mapr/drillodbc/lib/64/libmaprd

Back to the odbc.ini file, both OBIEE and default (empty) system:

8382  07:13:43.309065 open("/app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/odbc.ini", O_RDONLY) = 35
8382  07:13:43.309619 read(35</etc/odbc.ini>, "", 4096) = 0

After another read of the two odbcinst.ini files (per above), there's logging information written

8382  07:13:43.312345 write(31</tmp/odbc.log/driver.log>, "Aug 09 07:13:43.312 INFO  2896611072 CInterface::SQLAllocHandle: Allocating environment handle.\n", 96) = 96
[...]
8382  07:13:43.317106 write(31</tmp/odbc.log/driver.log>, "Aug 09 07:13:43.317 INFO  2896611072 Connection::SQLGetInfoW: InfoType: SQL_DRIVER_ODBC_VER (77)\n", 97) = 97
8382  07:13:43.317141 stat("/tmp/odbc.log/driver.log", {st_mode=S_IFREG|0640, st_size=179550, ...}) = 0

And then, bang.

8382  07:13:43.317198 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x1f} ---

After which, the crashreport is written:

8382  07:13:43.317249 open("/app/oracle/biee/user_projects/domains/bi/servers/obis1/nqsserver_7059_crashreport.txt", O_WRONLY|O_CREAT|O_APPEND, 0600) = 35
[...]

So from this we can see that the ODBC driver is being loaded from the path we expected, and the configuration files that we expected are being read too. So no smoking gun - but a couple of possible problems ruled out.

Reproducing the Crash on Demand

With this kind of problem it's often a case of working through lots of different configuration settings to obtain further diagnostic evidence - which requires reproducing the problem each time. Instead of manually running the query through Answers which triggers the problem, I moved this part of the diagnostics to the command line, with nqcmd. I copied the logical SQL to a file drill-test.lsql:

EXECUTE PHYSICAL CONNECTION POOL "Apache Drill"."Apache Drill" SELECT version FROM sys.version;

and then run with:

/app/oracle/biee/user_projects/domains/bi/bitools/bin/nqcmd.sh -d AnalyticsWeb -u weblogic -p Admin123 -s ~/drill-test.lsql

gives the same result (error) but easier to do.

Aug 09, 2016 7:59:53 PM oracle.bi.endpointmanager.impl.bienv.BIEnvComponents loadFromInputStream  
INFO: Reading components from: /app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/bienv-components.xml ...

-------------------------------------------------------------------------------
          Oracle BI ODBC Client
          Copyright (c) 1997-2015 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------


EXECUTE PHYSICAL CONNECTION POOL "Apache Drill"."Apache Drill" SELECT version FROM sys.version  
EXECUTE PHYSICAL CONNECTION POOL "Apache Drill"."Apache Drill" SELECT version FROM sys.version  
[unixODBC][nQSError: 12010] Communication error connecting to remote end point: address = demo.us.oracle.com; port = 7792.[unixODBC][NQODBC] [SQL_STATE: S1000] [nQSError: 10058] A general error has occurred.

Statement preparation failed


Processed: 1 queries  
Encountered 1  errors  
Stuff Ruled Out
  • The odbcinst.ini file doesn't have to be present - strace showed it is looked for, but the result is the same even if it's not (or is but doesn't have an entry for the Drill drivers):

    4270  20:00:06.692891 open("/home/oracle/.odbcinst.ini", O_RDONLY) = -1 ENOENT (No such file or directory)
    
  • Changing the connection pool call interface to ODBC 3.5 doesn't fix things, nor does changing the database type to Apache Hadoop (grasping at straws...)

Driver Manager

One of the routes of investigation is the Driver Manager. This is for two reasons; one is that on Windows ODBC is native, the driver manager is built in. As mentioned, @cfiston has got this to work - but on Windows. So platform differences aside (he says casually), this could be something. The second reason is the difference in the driver manager that shows in the ODBC log between a successful (via iodbctest) request

Driver::Initialize: Detected Driver Manager: iODBC, 3.52.7

versus an unsuccessful one (in OBIEE):

Driver::Initialize: Detected Driver Manager: DataDirect Driver Manager, 07.14.0119 (U0115)

Before we dig into this further let's take a look at the driver manager libraries that we've got on the system. DataDirect and unixODBC are both libodbcinst.so whilst iODBC is libiodbc.so. We'll use a bit of bash magic to find all the files, and list their sizes and checksums so we can spot which are the same:

[oracle@demo odbc.log]$ locate libodbcinst|xargs -Ifoo ls -l foo
-rw-r----- 1 oracle oinstall 1380087 2015-10-13 03:04 /app/oracle/biee/bi/bifoundation/odbc/lib/libodbcinst.so
-rw-r----- 1 oracle oinstall 1380087 2015-10-03 02:59 /app/oracle/biee/bi/common/ODBC/Merant/5.3/lib/libodbcinst.so
-rw-r----- 1 oracle oinstall 1203032 2015-10-03 02:59 /app/oracle/biee/bi/common/ODBC/Merant/7.0.1/lib/libodbcinst.so
-rw-r----- 1 oracle oinstall 1239728 2015-10-03 02:59 /app/oracle/biee/bi/common/ODBC/Merant/7.1.2/lib/libodbcinst.so
-rw-r----- 1 oracle oinstall 1244304 2015-10-03 02:59 /app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbcinst.so
lrwxrwxrwx 1 root root 20 2016-08-08 11:55 /usr/lib64/libodbcinst.so -> libodbcinst.so.2.0.0  
lrwxrwxrwx 1 root root 20 2016-08-08 11:55 /usr/lib64/libodbcinst.so.2 -> libodbcinst.so.2.0.0  
-rwxr-xr-x 1 root root 68928 2014-08-16 19:58 /usr/lib64/libodbcinst.so.2.0.0

[oracle@demo odbc.log]$ locate libodbcinst|xargs -Ifoo md5sum foo|sort
0bfd147ff6b41daee527861143040f1b  /app/oracle/biee/bi/common/ODBC/Merant/7.1.2/lib/libodbcinst.so  
7eaee346f92169fc2e2ba5900dceefa3  /app/oracle/biee/bi/common/ODBC/Merant/5.3/lib/libodbcinst.so  
b340968ee0a2188427a66203fb0a56b7  /app/oracle/biee/bi/bifoundation/odbc/lib/libodbcinst.so  
dadcb67d26d42b0c2535a9be44d2b46f  /usr/lib64/libodbcinst.so  
dadcb67d26d42b0c2535a9be44d2b46f  /usr/lib64/libodbcinst.so.2  
dadcb67d26d42b0c2535a9be44d2b46f  /usr/lib64/libodbcinst.so.2.0.0  
eccb81df3cdaaeb83faa86dfc6187844  /app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbcinst.so  
ed27493fd52534e181e0e6cd29c6a48a  /app/oracle/biee/bi/common/ODBC/Merant/7.0.1/lib/libodbcinst.so

[oracle@demo odbc.log]$ locate libiodbcinst|xargs -Ifoo ls -l foo
lrwxrwxrwx 1 root root 22 2016-08-08 11:24 /usr/lib64/libiodbcinst.so.2 -> libiodbcinst.so.2.1.19  
-rwxr-xr-x 1 root root 72896 2010-06-23 11:07 /usr/lib64/libiodbcinst.so.2.1.19

[oracle@demo odbc.log]$ locate libiodbcinst|xargs -Ifoo md5sum foo|sort
44a432e25d176079cf30e805c648fc86  /usr/lib64/libiodbcinst.so.2  
44a432e25d176079cf30e805c648fc86  /usr/lib64/libiodbcinst.so.2.1.19  

What's interesting from this is that the libodbcinst.so installed within OBIEE (bifoundation/odbc/lib/) is the same size as the DataDirect 5.3 one, but a different checksum.

OBIEE uses ODBC for its own internal connectivity between Presentation Services (OBIPS) and BI Server (OBIS). We saw in the strace output above that odbctrac.so wasn't successfully loaded, so let's fix that in /app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/odbc.ini:

[ODBC]
Trace=1  
TraceFile=/tmp/odbctrace.out  
TraceDll=/app/oracle/biee/bi/bifoundation/odbc/lib/odbctrac.so  
[...]

With this enabled there's now debug information written - although in this case it's about OBIEE using ODBC to connect to its internal MDS schemas (which in 12c it looks to use ODBC for):

ppid=24928:pid= 3bd8:754f8720   ENTER SQLConnect  
                HDBC                0x01385750
                UCHAR *             0x012fbfb0 [      15] "opss_datasource"
                SWORD                       15
                UCHAR *             0x012fcba0 [       9] "BIEE_OPSS"
                SWORD                        9
                UCHAR *             0x54665b84 [4294967293] "******"
                SWORD                       -3

ppid=24928:pid= 3bd8:754f8720   EXIT  SQLConnect  with return code 0 (SQL_SUCCESS)  
                HDBC                0x01385750
                UCHAR *             0x012fbfb0 [      15] "opss_datasource"
                SWORD                       15
                UCHAR *             0x012fcba0 [       9] "BIEE_OPSS"
                SWORD                        9
                UCHAR *             0x54665b84 [4294967293] "******"
                SWORD                       -3

So back to driver managers. The DataDirect manager is not listed in the system requirements - perhaps it's not supported, perhaps it's not been tested. Let's see if our iodbctest works when we force it to use the DataDirect driver. We'll do this by setting the ODBCInst parameter in ~/.mapr.drillodbc.ini, along with the accompanying LDLIBRARYPATH. The latter is necessary for dependent libraries used by the driver manager. We can use strace to verify the paths being picked up.

First with the unixODBC driver:

[oracle@demo odbc.log]$ export MAPRDRILLINI=~/.mapr.drillodbc.ini
[oracle@demo odbc.log]$ export LD_LIBRARY_PATH=/opt/mapr/drillodbc/lib/64
[oracle@demo odbc.log]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64
[oracle@demo odbc.log]$ grep ODBCInstLib ~/.mapr.drillodbc.ini
ODBCInstLib=libodbcinst.so
[oracle@demo odbc.log]$ iodbctest "DSN=DrillDSN"
iODBC Demonstration program
This program shows an interactive SQL processor
Driver Manager: 03.52.0709.0909
Driver: 1.2.1.1001 (MapR Drill ODBC Driver)
SQL>SELECT version FROM sys.version

version
---------
1.7.0

result set 1 returned 1 rows.

strace shows the libodbcinst.so is loaded from where we'd expect (/usr/lib64), but that two other folders are checked first.

open("/opt/mapr/drillodbc/lib/64/libodbcinst.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/libodbcinst.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/libodbcinst.so", O_RDONLY) = 5

As a debug aside, you can also use ltrace to show the library calls being made:

[oracle@demo ~]$ ltrace --demangle -s 128 -f --indent 2 iodbctest "DSN=DrillDSN"
(0, 0, 502528, -1, 0x1f25bc2)                                                                                                                      = 0x7fb61240f160
__libc_start_main(0x402110, 2, 0x7ffd6920ba08, 0x4021f0, 0x4021e0 <unfinished ...>  
  setlocale(6, "")                                                                                                                                 = "en_GB.UTF-8"
  __printf_chk(1, 0x4025ed, 0, 0x7fb60bba0174, 0x7fb611d92ee8iODBC Demonstration program
)                                                                                     = 28
  __printf_chk(1, 0x4023f8, 0x7fb611d92e10, 0x402609, 0xffffffffThis program shows an interactive SQL processor
)                                                                                  = 48
  SQLAllocHandle(1, 0, 0x602eb8, 0x402428, 0xa726f737365636f)                                                                                      = 0
  SQLSetEnvAttr(0xbdcfe0, 200, 3, 0xfffffffb, 0x7ffd6920ac10)                                                                                      = 0
  SQLAllocHandle(2, 0xbdcfe0, 0x602ec0, 0xfffffffb, 0x7ffd6920ac10)                                                                                = 0
  SQLSetConnectOption(0xbdd220, 1051, 0x402593, 273, 0x7fb611d92ee8)                                                                               = 0
  SQLGetInfo(0xbdd220, 171, 0x7ffd6920b6a0, 255, 0x7ffd6920b29c)                                                                                   = 0
  __printf_chk(1, 0x40259c, 0x7ffd6920b6a0, 0, 0Driver Manager: 03.52.0709.0909
)                                                                                                  = 32
  __strcpy_chk(0x7ffd6920b2a0, 0x7ffd6920be7a, 1024, 0x4025af, 1)                                                                                  = 0x7ffd6920b2a0
  SQLDriverConnect(0xbdd220, 0, 0x7ffd6920b2a0, 0xfffffffd, 0x602ee0 <unfinished ...>

From the above invocation, in which it seems the unixODBC driver is being used, the ODBC driver.log records otherwise:

Aug 09 21:30:32.112 INFO  1444292352 Driver::LogVersions: SDK Version: 09.05.07.1008
Aug 09 21:30:32.112 INFO  1444292352 Driver::LogVersions: DSII Version: 1.2.1.1001
Aug 09 21:30:32.112 INFO  1444292352 Driver::Initialize: Database CHAR Encoding: UTF-8
Aug 09 21:30:32.112 INFO  1444292352 Driver::Initialize: Database WCHAR Encoding: UTF-8
Aug 09 21:30:32.112 INFO  1444292352 Driver::Initialize: ANSI CHAR Encoding: UTF-8
Aug 09 21:30:32.112 INFO  1444292352 Driver::Initialize: Driver Manager WCHAR Encoding: UTF-32LE
Aug 09 21:30:32.112 INFO  1444292352 Driver::Initialize: Detected Driver Manager: iODBC, 3.52.7
Aug 09 21:30:32.112 INFO  1444292352 Driver::Initialize: Locale name: en_GB
Aug 09 21:30:32.112 INFO  1444292352 Driver::Initialize: Bitness: 64-bit

If I use unixODBC's test tool, from the same environment as above:

[oracle@demo bin]$ isql DrillDSN
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL> SELECT version FROM sys.version
+----------
| version
+----------
| 1.7.0
+----------
SQLRowCount returns -1
1 rows fetched

Then driver.log shows:

Aug 09 21:38:25.098 INFO  1851778816 Driver::LogVersions: SDK Version: 09.05.07.1008
Aug 09 21:38:25.098 INFO  1851778816 Driver::LogVersions: DSII Version: 1.2.1.1001
Aug 09 21:38:25.099 INFO  1851778816 Driver::Initialize: Database CHAR Encoding: UTF-8
Aug 09 21:38:25.099 INFO  1851778816 Driver::Initialize: Database WCHAR Encoding: UTF-8
Aug 09 21:38:25.099 INFO  1851778816 Driver::Initialize: ANSI CHAR Encoding: UTF-8
Aug 09 21:38:25.099 INFO  1851778816 Driver::Initialize: Driver Manager WCHAR Encoding: UTF-32LE
Aug 09 21:38:25.099 INFO  1851778816 Driver::Initialize: Detected Driver Manager: unixODBC, 2.2.14 or 2.3.x branch
Aug 09 21:38:25.099 INFO  1851778816 Driver::Initialize: Locale name: en_GB
Aug 09 21:38:25.099 INFO  1851778816 Driver::Initialize: Bitness: 64-bit

So ODBCInstLib is used, but the calling application also plays a role. We can see from strace that both applications are loading libodbcinst.so, but still recording different Driver Managers in the driver.log

[oracle@demo bin]$ strace isql DrillDSN 2>&1|grep odbcinst.so
open("/opt/mapr/drillodbc/lib/64/libodbcinst.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/libodbcinst.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/libodbcinst.so", O_RDONLY) = 5
[oracle@demo bin]$ ^C

[oracle@demo bin]$ strace iodbctest "DSN=DrillDSN" 2>&1|grep odbcinst.so
open("/opt/mapr/drillodbc/lib/64/libodbcinst.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/libodbcinst.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/libodbcinst.so", O_RDONLY) = 5
[oracle@demo bin]$ ^C

For the two above invocations, the driver is recorded thus:

[oracle@demo bin]$ grep "Driver::Initialize: Detected Driver Manager:" /tmp/odbc.log/driver.log
[...]
Aug 09 21:48:47.814 INFO  1812629248 Driver::Initialize: Detected Driver Manager: unixODBC, 2.2.14 or 2.3.x branch
Aug 09 21:49:08.840 INFO  1312765696 Driver::Initialize: Detected Driver Manager: iODBC, 3.52.7

Let's branch out a bit. nqcmd, as well as being useful for invoking OBIEE logical SQL against the BI Server, is also an ODBC client (in the same way that iodbctest and isql are).

[oracle@demo bin]$ /app/oracle/biee/user_projects/domains/bi/bitools/bin/nqcmd.sh -d DrillDSN -u foo -p bar
Aug 09, 2016 9:53:00 PM oracle.bi.endpointmanager.impl.bienv.BIEnvComponents loadFromInputStream  
INFO: Reading components from: /app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/bienv-components.xml ...

-------------------------------------------------------------------------------
          Oracle BI ODBC Client
          Copyright (c) 1997-2015 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------



        [T]able info
        [C]olumn info
        [D]ata type info
        [F]oreign keys info
        [P]rimary key info
        [K]ey statistics info
        [S]pecial columns info
        [Q]uery statement
Select Option: Q  
Give SQL Statement: SELECT version FROM sys.version  
SELECT version FROM sys.version  
--------------------------------
version  
-------
1.7.0  
------
Row count: 1  

Interesting. Now we're moving into the OBIEE stack, and a step closer to the BI Server itself. As I know you're dying to find out, the Driver Manager recorded by the above nqcmd call is:

Aug 09 21:53:03.064 INFO  706402080 Driver::Initialize: Detected Driver Manager: unixODBC, 2.2.14 or 2.3.x branch

Let's switch ODBCInstLib to iODBC now:

[oracle@demo bin]$ grep ODBCInstLib ~/.mapr.drillodbc.ini
ODBCInstLib=libiodbcinst.so.2
[oracle@demo bin]$ echo $LD_LIBRARY_PATH
/opt/mapr/drillodbc/lib/64:/usr/lib64
[oracle@demo bin]$ echo $MAPRDRILLINI
/home/oracle/.mapr.drillodbc.ini

Running the same strace tests as above we confirm that the iODBC library is being used:

[oracle@demo bin]$ strace iodbctest "DSN=DrillDSN" 2>&1|grep odbcinst.so
open("/opt/mapr/drillodbc/lib/64/libiodbcinst.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/libiodbcinst.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/libiodbcinst.so.2", O_RDONLY) = 5
^C
[oracle@demo bin]$ strace isql DrillDSN 2>&1|grep odbcinst.so
open("/opt/mapr/drillodbc/lib/64/libiodbcinst.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/tls/libiodbcinst.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib64/libiodbcinst.so.2", O_RDONLY) = 5
^C

and the same driver manager pattern recorded for the two above invocations respectively:

Aug 09 21:57:02.826 INFO  3571013376 Driver::Initialize: Detected Driver Manager: iODBC, 3.52.7
Aug 09 21:57:07.875 INFO  3464189696 Driver::Initialize: Detected Driver Manager: unixODBC, 2.2.14 or 2.3.x branch

Running nqcmd again works as before, including with the same driver manager recorded in driver.log

Looking at the dependencies for these programs with ldd shows:

  • nqcmd

    [oracle@demo ~]$ ldd /app/oracle/biee/bi/bifoundation/server/bin/nqcmd
    linux-vdso.so.1 =>  (0x00007fffef5b2000)
    libodbc.so => /usr/lib64/libodbc.so (0x00007f8ed3248000)
    libodbcinst.so => /usr/lib64/libodbcinst.so (0x00007f8ed3036000)
    libARicu27.so => not found
    libnqsgenericodbcinterface64.so => not found
    libnqportable64.so => not found
    libnqutilitygeneric64.so => not found
    libnqutilityserver64.so => not found
    libnqsfileutility64.so => not found
    libnqutilityclient64.so => not found
    libnqsclusterutility64.so => not found
    libxerces-c.so.28 => not found
    libnqstlport64.so => not found
    libnqsserventry64.so => not found
    libnqsobjectmodel64.so => not found
    libsamemoryallocator864.so => not found
    libz.so.1 => /lib64/libz.so.1 (0x00007f8ed2e06000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f8ed2be9000)
    libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f8ed28e3000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f8ed265f000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f8ed2448000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f8ed20b4000)
    libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x00007f8ed1eab000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f8ed34b0000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f8ed1ca6000)
    
  • isql

    [oracle@demo ~]$ ldd /usr/bin/isql
    linux-vdso.so.1 =>  (0x00007ffc8278f000)
    libodbc.so.2 => /usr/lib64/libodbc.so.2 (0x00007f214b44e000)
    libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x00007f214b244000)
    libreadline.so.6 => /lib64/libreadline.so.6 (0x00007f214afec000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f214adcf000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f214aa3a000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f214a836000)
    libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00007f214a615000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f214b6b6000)
    
  • iodbctest

    [oracle@demo ~]$ ldd /usr/bin/iodbctest
    linux-vdso.so.1 =>  (0x00007ffdf9103000)
    libiodbc.so.2 => /usr/lib64/libiodbc.so.2 (0x00007ff5ef388000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007ff5ef16e000)
    libc.so.6 => /lib64/libc.so.6 (0x00007ff5eedda000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ff5ef5dc000)
    

Of note here is that only nqcmd depends on a driver manager (libodbcinst.so) - so at a guess the other two tools interact with the ODBC drivers directly? Although the strace above did show each one loading the library, so ... ?

At the moment libodbcinst.so is a soft link to the .2.0.0 unixODBC version

[oracle@demo lib64]$ ls -l /usr/lib64/libodbc.so
lrwxrwxrwx 1 root root 16 2016-08-08 11:55 /usr/lib64/libodbc.so -> libodbc.so.2.0.0

What happens if we change this to point to the iODBC one?

[oracle@demo lib64]$ sudo ln -sf /usr/lib64/libiodbcinst.so.2 /usr/lib64/libodbcinst.so
[oracle@demo lib64]$ ls -l /usr/lib64/libodbcinst.so
lrwxrwxrwx 1 root root 28 2016-08-09 22:20 /usr/lib64/libodbcinst.so -> /usr/lib64/libiodbcinst.so.2

Same behaviour from isql and iodbctest as above (both work, both log the same driver manager as before). But this time, nqcmd reports as a different driver manager in driver.log:

Aug 09 22:24:20.514 INFO  3097786112 Driver::Initialize: Detected Driver Manager: iODBC, 3.52.7

Now let's see if we can force nqcmd to use the DataDirect manager that nqsserver is doing when it crashes:

[oracle@demo lib64]$ sudo ln -sf /app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbcinst.so /usr/lib64/libodbcinst.so
[oracle@demo lib64]$ ls -l /usr/lib64/libodbcinst.so
lrwxrwxrwx 1 root root 63 2016-08-09 22:27 /usr/lib64/libodbcinst.so -> /app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbcinst.so

The DataDirect Driver Manager library is loaded according to strace:

30100 22:30:42.267851 open("libodbcinst.so", O_RDONLY) = 3
30100 22:30:42.267898 read(3</app/oracle/biee/bi/common/ODBC/Merant/7.1.4/lib/libodbcinst.so>, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\353\6\0\0\0\0\0@\0\0\0\0\0\0\0\220\365\22\0\0\0\0\0\0\0\0\0@\0008\0\5\0@\0\34\0\33\0\1\0\0\0\5

But the driver.log still shows unixODBC:

Aug 09 22:30:43.529 INFO  181618464 Driver::Initialize: Detected Driver Manager: unixODBC, 2.2.14 or 2.3.x branch

Hmmm. Not much conclusive from this.

DriverManagerEncoding

From https://drill.apache.org/docs/configuring-odbc-on-linux/#configuring-.mapr.drillodbc.ini:

Edit the DriverManagerEncoding setting if necessary. The value is typically UTF-16 or UTF-32, but depends on the driver manager used. iODBC uses UTF-32 and unixODBC uses UTF-16. Review your ODBC Driver Manager documentation for the correct setting.

Changing it to UTF-16 in ~/.mapr.drillodbc.ini has the following effect:

  • isql works as before
  • iodbctest fails

    [oracle@demo lib64]$ iodbctest "DSN=DrillDSN"
    iODBC Demonstration program
    This program shows an interactive SQL processor
    Driver Manager: 03.52.0709.0909
    1: SQLDriverConnect = ����� (10360) SQLSTATE=̭
    1: ODBC_Connect = ����� (10360) SQLSTATE=0�
    
    
    Have a nice day.[oracle@demo lib64]$
    

    This failure is to be expected per the documentation, since iodbctest is using the iODBC driver manager which requires UTF-32

  • nqcmd connecting directly to DrillDSN works as before

  • nqcmd running the Logical SQL now works!

    [oracle@demo lib64]$ /app/oracle/biee/user_projects/domains/bi/bitools/bin/nqcmd.sh -d AnalyticsWeb -u weblogic -p Admin123 -s ~/drill-test.lsql
    Aug 09, 2016 10:47:45 PM oracle.bi.endpointmanager.impl.bienv.BIEnvComponents loadFromInputStream
    INFO: Reading components from: /app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/bienv-components.xml ...
    
    
    -------------------------------------------------------------------------------
    Oracle BI ODBC Client
    Copyright (c) 1997-2015 Oracle Corporation, All rights reserved
    -------------------------------------------------------------------------------
    
    
    EXECUTE PHYSICAL CONNECTION POOL "Apache Drill"."Apache Drill" SELECT version FROM sys.version
    EXECUTE PHYSICAL CONNECTION POOL "Apache Drill"."Apache Drill" SELECT version FROM sys.version
    -----------------------------------------------------------------------------------------------
    version
    -----------------------------------------------------------------------------------------------
    1.7.0
    -----------------------------------------------------------------------------------------------
    Row count: 1
    -----------------------------------------------------------------------------------------------
    
    
    Processed: 1 queries
    

    The Driver Manager is DataDirect as before

    Aug 09 22:48:20.825 INFO  3402434304 Driver::Initialize: Detected Driver Manager: DataDirect Driver Manager, 07.14.0119 (U0115)
    

So after all that - an incorrect configuration of the MapR ODBC driver was the culprit.

Conclusion

Hindsight is a wonderful thing - at the end of an article like this it's easy enough to wonder if strace and other tools were really necessary to get to the bottom of the issue. However, for me, using these tools is a great way of ruling out what the problem isn't.

The trick is to judge when to jump into the deep-end of system call tracing, and when to maybe take a step back and RTFM closely. Particularly if you're doing something that is supported, that is documented, then you probably wouldn't be reaching for strace. But edging away from what is supported or documented, these can be great tools to help explore how applications are running and can be made to work with other components (such as 3rd party ODBC drivers).

And if you do want the tl;dr of getting OBIEE to work with Apache Drill, it's written up using the ODBC driver over here, and with the JDBC driver in Fiston's blog post here.

Categories: BI & Warehousing

Analysing Social Media Data for the Lightyear Foundation - Part 1

Mon, 2016-08-22 08:49

Outside of my job at Rittman Mead, I'm fortunate enough to be involved with a project called the Lightyear Foundation. We are a charitable organisation which aims to promote the techniques and philosophies of science, working with children and teachers in the UK and Ghana. Particularly, we try to exemplify the fact that fundamental science principles can be demonstrated without the need for formal labs and expensive equipment. We believe that in increasing the accessibility to science, it will increase the prevalence of broadly applicable scientific skills such as critical thinking, communication of information, and the perception of failure.

Lightyear in Ghana

As a charity, we have need for marketing and donations and are thus required to understand our donor base, just as any business might try to understand their customers. Which brings us neatly to this blog. It is common for businesses to analyse the efficacy of marketing campaigns with regards to revenue, which in the modern world can range from TV to Twitter. The collection and analysis of this data has then formed a multi-million dollar industry in and of itself. Unfortunately as a charity, we don't have multi-millions to spend on this analysis. Fortunately, we do have a search engine, a bit of ingenuity, and some elbow grease. Also, Rittman Mead was kind enough to donate a server to tinker with.

Looking for a Wheel

Before re-inventing the wheel, it seemed like a good idea to see if any wheels existed and indeed what they looked like. As you might expect, there are a great deal of analytics tools aimed at social media. The problem is that most of them require some sort of payment, and the vast majority of the free ones only allow analysis of a specific source. For example, Facebook provide an extensive insights platform for your page data, but this does not access anything from Twitter.

A notable exception is Cyfe, an absolutely fantastic platform that is free to use if you don't require historic analysis (queries are limited to the last 30 days). This is a very impressive application, and I will be revisiting this in a later blog. However, historic data is essential for holistic analysis rather than simply reporting, so I'm going to take a look at how difficult it would be to build something to allow that.

Exploring the Data

With any analytical project, exploring, collecting, and processing the data is the most important and time consuming part. To start with, a few data sources were identified as being useful for the investigation:

  • Website
  • Facebook
  • Twitter
  • YouTube
  • MailChimp

This would give us information about our reach across the most important social media platforms as well as the website statistics and e-mail subscriptions. One notable omission from this blog is the donation system itself which is currently being changed. Each of the sources has an API for accessing various levels of information, with differing limitations. The first step was to try out each of the APIs and see what kind of data was retrievable.

Website Data

The easiest way to collect website data is to use Google Analytics, a free and excellent service that can be used to monitor and analyse traffic to our WordPress site. It works by embedding a small amount of JavaScript to each page, which then executes Google's tracking function, sending a variety of information about the session.

Google Analytics

The above image is an excerpt of the kind of data provided by Google Analytics, showing the number of sessions activated on the website over a period of a few months. The platform has an incredibly detailed and vast store of data, including (among others):

  • Session
  • Clicks
  • Geographic
  • User Information (although this is often missing or incomplete)
  • Referrals
  • Network
  • Browser

Naturally it also has an API, allowing online access in real time or offline access secured by a key on the server. For my initial testing, I setup an API key and wrote a python script to query the service:

>>> import lyf
>>> service = lyf.google_api('analytics', 'v3', ['https://www.googleapis.com/auth/analytics.readonly'])
>>> results = lyf.ga.query(service, '2016-07-21', '2016-07-24', 'ga:sessions', 'ga:date')
>>> for result in results:
        print(result)
...
[u'20160721', u'6']
[u'20160722', u'7']
[u'20160723', u'26']

The authentication is performed using OAuth2 and is well documented. There is also a web interface for the API complete with handy auto-complete fields for each of the parameters.

While the platform is very fast and detailed, there is one limitation, which is that you can only choose up to 8 dimensions at a time for a query. For most use cases this is perfectly acceptable, but is worth noting.

Facebook

Facebook provides their Graph API for querying data in this fashion as well as an explorer tool for testing. Authorisation can be granted and stored in the form of an API key. There is a permissions system of which the most important features are Manage Pages and Read Insights which give access to Facebook page information and reach data respectively. Some insights data is publicly available but appears to be limited to viewing a page's fanbase by country. Also, they store your data for two years, purging the rest. This means if you wish to store longer historic trends, you will need to siphon off the desired data and store it manually (we'll get to this later). Result sets are paginated, but feature a useful key-based system which allows easy retrieval of subsequent pages. As with Google, the performance of querying is very impressive, returning detailed data about 500 posts made over the last two years in under 30 seconds.

Once again, it was simple to produce the API HTTP requests using python's requests library. This library is able to automatically parse the returned JSON payload into a useful Python object. Below is an example of some simple queries run once the API was configured:

>>> results = lyf.fb_insights_query(['page_impressions'], 'day', since='2016-07-26')
>>> print(results['data'][0]['values'][0]['value'])
764  
Twitter

Twitter is a little trickier to work with. They have a public API which can be used to interface with Twitter quite extensively, but is not necessarily appropriate for analysis. Specifically, I was looking for metrics about reach and the number of impressions, similar to those obtained from Facebook's Insights API. These are available from Twitter's website, but only for a given time frame, as seen below.

Twitter Analytics

They do provide an analytical API for their Ads Service, but this does not appear to be publicly or freely available. Additionally, Gnip provides the official commercial outlet for analytical Twitter data.

Nevertheless, it was quite easy to get a Python integration set up for the public API by storing the consumer keys and access tokens for our Twitter account. Then I used Tweepy as a wrapper for Twitter's REST API. At the very least, follower and tweet count information is accessible, which can give some indication of popularity. Also, we'll be able to use this API to automate some tweet management if we need to at a later date.

YouTube

YouTube data is expectedly accessible using the same Google framework as before. This allows us to access publicly available YouTube data with only a small amount of additional set up. YouTube Analytics also has a very rich data platform similar to Google Analytics which can be seen below.

YouTube Analytics

However, while it has an API, it cannot be used with a service account, the server-to-server setup I used earlier. So for now, we'll have to settle with collecting simple statistics (views and likes per video), but if we want the richer data set (watch time, demographics, audience retention) we will need to set up a live OAuth2 connection.

MailChimp

Last but not least, MailChimp also provides some analytics capability from within the site itself.

MailChimp Analytics

In addition, there is also a well documented API for looking at your subscription lists and campaigns. This can be simply queried using the requests library and supplying an API key.

Once again, as well as the ability to perform maintenance tasks, a decent amount of information is available:

  • Click Rates
  • Subscription Numbers
  • Open Rates
  • Geographical Information
  • Recipients
  • Devices
  • Revenue
  • Monthly History
Summary

That wraps up this first exploratory part, just dipping my toes into the various data streams out there. It's clear that there's a lot of data available but there still lies some complexity on tying it together for meaningful analysis. The next part of this blog series will look at trying to collect some of this data on a single platform so they can be queried together.

If you'd like to have a go at using the code from this blog, it's all on GitHub. Bear in mind that you will need to fill in a config.ini file from the sample, and configure it for authentication with your own social media accounts.

If you'd like to know more about how Rittman Mead can help your organisation make the most of this kind of data, please get in touch via the website.

Categories: BI & Warehousing

Using Apache Drill with OBIEE 12c

Thu, 2016-08-11 11:25

Apache Drill enables querying with SQL against a multitude of datasources including things like JSON files, Parquet and Avro, Hive tables, RDBMS and more. MapR have released an ODBC driver for it, and I thought it'd be neat to get it to work with OBIEE. It evidently does work for OBIEE running on Windows, but I wanted to be able to use it on my standard environment, Linux.

For more information on Apache Drill, see my previous post, Introduction to Apache Drill.

OBIEE 12c (and 11g and 10g before it) supports three primary ways of connecting to data sources:

  1. Native Gateway, such as OCI for Oracle. This is always the preferred option as it gives the greatest support and performance.
  2. Data Direct ODBC Drivers, a set of which are bundled with OBIEE for enabling connectivity to sources such as SQL Server, MySQL, Hive, and Impala. The configuration of these is documented in the OBIEE manuals, and is generally a supported configuration.
  3. Native ODBC Drivers.

To get OBIEE to work with Apache Drill we'll use the third option - native ODBC drivers. I'm doing this on SampleApp v511.

First Things First - Setting up Apache Drill

Drill can be deployed in distributed configuration (with all the parallel processing goodness which that brings), but also run as a single instance locally. For the sake of simplicity that's what I'm going to do here. It's rather easy to do:

# Download Apache Drill
wget http://www.apache.org/dyn/closer.cgi/drill/drill-1.7.0/apache-drill-1.7.0.tar.gz
# Unpack
tar -xvf apache-drill-1.7.0.tar.gz
# Run
cd /opt/apache-drill-1.7.0/ && bin/sqlline -u jdbc:drill:zk=local

You need to make sure you've got a recent JDK available, and if you're running it on BigDataLite VM watch out for this odd problem that I had which was related to classpaths and maniested itself with the error java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.JavaType.isReferenceType()Z.

All being well, you'll now have a Drill prompt:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0  
Aug 09, 2016 5:51:43 AM org.glassfish.jersey.server.ApplicationHandler initialize  
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...  
apache drill 1.7.0  
"say hello to my little drill"
0: jdbc:drill:zk=local>  

From here you can run a simple query to check the version:

0: jdbc:drill:zk=local> SELECT version FROM sys.version;  
+----------+
| version  |
+----------+
| 1.7.0    |
+----------+
1 row selected (0.392 seconds)  

or query one of the built-in sample data sets:

0: jdbc:drill:zk=local> select count(*) from cp.`employee.json`;  
+---------+
| EXPR$0  |
+---------+
| 1155    |
+---------+
1 row selected (0.977 seconds)  

For more examples of Drill, see the tutorials.

Setting up Drill ODBC on Linux with OBIEE

With Drill setup and running locally, let's now install the ODBC driver. This is all on SampleApp v511 / Oracle Linux 6.7.

sudo rpm -i http://package.mapr.com/tools/MapR-ODBC/MapR_Drill/MapRDrill_odbc_v1.2.1.1000/MapRDrillODBC-1.2.1.x86_64.rpm

Next, create the MapR Drill ODBC driver configuration file. A sample one is provided, which you can copy from the default installation path of /opt/mapr/drillodbc/Setup/mapr.drillodbc.ini, or create new. I put it in the default path (~/.mapr.drillodbc.ini).

[Driver]
DisableAsync=0  
DriverManagerEncoding=UTF-16  
ErrorMessagesPath=/opt/mapr/drillodbc/ErrorMessages  
LogLevel=2  
LogPath=/tmp/odbc.mapr  
SwapFilePath=/tmp

ODBCInstLib=libodbcinst.so  

In the above I've changed a few things:

  • The most important is DriverManagerEncoding. If you leave this as the default of UTF-32 OBIEE will crash (SIGSEGV) when you try to query the data in Apache Drill. You can read all about my trials and tribulations trying to figure this out in a separate blog post coming soon.
  • I've set LogLevel to 2 and LogPath to a valid path, so that there's some log files to check if things go wrong
  • Set the ODBCInstLib to libodbcinst.so which matches the built in DataDirect ODBC Driver Manager library file.

Following the documentation, Configuring Database Connections Using Native ODBC Drivers:

  1. Add the necessary environment variables to BI Server. This is done per-component in a .properties file, which for the BI Server (OBIS / nqsserver) is BI_DOMAIN/config/fmwconfig/bienv/obis/obis.properties, so on SampleApp /app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/OBIS/obis.properties. To this file (which in 12.2.1 is empty by default) we add:

    MAPRDRILLINI=/home/oracle/.mapr.drillodbc.ini
    LD_LIBRARY_PATH=/opt/mapr/drillodbc/lib/64
    
  2. Add the Drill DSN to odbc.ini which for OBIEE already exists and is populated with other ODBC configurations. You'll find the file in BI_DOMAIN/config/fmwconfig/bienv/core, which on SampleApp is /app/oracle/biee/user_projects/domains/bi/config/fmwconfig/bienv/core/odbc.ini.

    1. Add to the [ODBC Data Sources] section

      DrillDSN=MapR Drill ODBC Driver 64-bit
      
    2. Add a section to the bottom of the file:

      [DrillDSN]
      Driver=/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so
      AuthenticationType=No Authentication
      Description=Drill ODBC Driver
      ConnectionType=Direct
      HOST=localhost
      PORT=31010
      

Now restart the BI Server:

/app/oracle/biee/user_projects/domains/bi/bitools/bin/stop.sh -i obis1 && /app/oracle/biee/user_projects/domains/bi/bitools/bin/start.sh -i obis1

On a Windows machine I installed the MapR Drill ODBC driver too and created a DSN of the same name as in my odbc.ini file above. In the Administration Tool I set up a new Database (type: ODBC Basic) and associated connection pool (ODBC 2.0) pointing to DrillDSN.

Now to try it out! In Answers I build a Direct Database Request:

and run it

Nice. We can query data held in HDFS too, again with a Direct Database Request:

Exploring Drill data with OBIEE

So the above DDR prove the connectivity works. But as any ful kno, DDR is at best a 'tactical' solution, at worst, a complete hack and maintenance nightmare. Let's use the force luke, or at least, the RPD. The first obvious thing to do is Import Metadata from the connection pool that we've defined. But doing this, there's no objects shown:

That's because 'tables' in Drill are not quite as clearly defined as in a standard RDBMS. A table could be a single file, multiple files matching a pattern, or even literally a table if connecting Drill to an RDBMS. So to expose a set of data through Drill, we define a view. This is where Drill Explorer comes in as it gives a simple GUI over the available files

from where you can use the SQL tab and Create As option to create a view

Having done this, launch the Import Metadata dialog again (right click the Connection Pool and select Import Metadata), and make sure you tick Views on the Metadata types to view. Now you'll see the object. Unfortunately, it just has a single column - *. I've not figured out yet how - if if it's possible - to get a view to explode out all columns in the underlying select clause. Import the view:

You'll get an error about the * column name, but the table and schema still get brought across.

Now the slightly tedious bit - define each physical column, and define the physical and logical model, done very simplistically here:

A simple query:

Aaaaaand a simple error:

State: HY000. Code: 16001. [nQSError: 16001] ODBC error state: S1000 code: 1040 message:
[MapR][Drill] (1040) Drill failed to execute the query:
select avg(T29568."stars") as c1, T29568."city" as c2, T29568."full_address" as c3, T29568."name" as c4 from "DRILL"."dfs.tmp"."yelp_business" T29568 group by T29568."city", T29568."full_address", T29568."name" order by 2, 4, 3
[30027]Query execution error. Details:[
PARSE ERROR: Encountered ". \"" at line 1, column 33.
Was expecting one of: ")" ... "ORDER" ... "LIMIT" ... "OFFSET" ... "FETCH" ... "," .... (HY000)

Looking at the query being run, OBIEE is using double quotation marks (") to quote identifiers, but Drill requires backtick (`) instead. Heading over to DB Features can fix this:

And refreshing the report gives:

Conclusion

This is the very basics necessary to get up and running with OBIEE and Apache Drill. It would be good to see if there's an optimal, least-friction, way for getting tables in Drill exposed to OBIEE without needing to enter each physical column.

One of the many powerful features of Drill is being able to access nested and array JSON values, which I've discussed in my Introduction to Apache Drill post. The above examples just use root-level attributes, and could easily be expanded out to process some of the nested fields (such as hours in the business data above). For the time being this would be done with DDR, or a Drill view wrapped around it imported into the Physical layer of the RPD.

Categories: BI & Warehousing

An Introduction to Apache Drill

Thu, 2016-08-11 11:16

Apache Drill is an engine that can connect to many different data sources, and provide a SQL interface to them. It's not just a wanna-be SQL interface that trips over at anything complex - it's a hugely functional one including support for many built in functions as well as windowing functions. Whilst it can connect to standard data sources that you'd be able to query with SQL anyway, like Oracle or MySQL, it can also work with flat files such as CSV or JSON, as well as Avro and Parquet formats. It's this capability to run SQL against files that first piqued my interest in Apache Drill. I've been spending a lot of time looking at Big Data architectures and tools, including Big Data Discovery. As part of this, and experimenting with data pipeline options one of the gaps that I've found is the functionality to dig through files in their raw state, before they've been brought into something like Hive which would enable their exploration through BDD and other tools.

In this article I'll walk through getting started with Apache Drill, and show some of the types of queries that I think are a great example of how useful it can be.

Getting Started

It's very simple to get going with Apache Drill - just download and unpack it, and run. Whilst it can run distributed across machines for performance, it can also run standalone on a laptop.

To launch it

cd /opt/apache-drill-1.7.0/
bin/sqlline -u jdbc:drill:zk=local

If you get No current connection or com.fasterxml.jackson.databind.JavaType.isReferenceType()Z then you have a conflicting JAR problem (e.g. I encountered this on Oracle's BigDataLite VM), and should launch it with a clean environment

env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /opt/apache-drill-1.7.0/bin/drill-embedded

There's a built in dataset that you can use for testing:

USE cp;
SELECT employee_id, first_name FROM `employee.json` limit 5;

This should return five rows, in a very familiar environment if you're used to using SQL*Plus and similar tools:

0: jdbc:drill:zk=local> USE cp;
+-------+---------------------------------+
|  ok   |             summary             |
+-------+---------------------------------+
| true  | Default schema changed to [cp]  |
+-------+---------------------------------+
1 row selected (1.776 seconds)
0: jdbc:drill:zk=local>     SELECT employee_id, first_name FROM `employee.json` limit 5;
+--------------+-------------+
| employee_id  | first_name  |
+--------------+-------------+
| 1            | Sheri       |
| 2            | Derrick     |
| 4            | Michael     |
| 5            | Maya        |
| 6            | Roberta     |
+--------------+-------------+
5 rows selected (3.624 seconds)

So far, so SQL, so relational - so familiar, really. Where Apache Drill starts to deviate from the obvious is its use of storage handlers. In the above query cp is the 'database' that we're running our query against, but this is in fact a "classpath" (hence "cp") storage handler that's defined by default. Within a 'database' there are 'schemas' which are sub-configurations of the storage handler. We'll have a look at viewing and defining these later on. For now, it's useful to know that you can also list out the available databases:

0: jdbc:drill:zk=local> show databases;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| sys                 |
+---------------------+

Note databases command is a synonym for schemas; it's the <database>.<schema> that's returned for both. In Apache Drill the backtick is used to enclose identifiers (such as schema names, column names, and so on), and it's quite particular about it. For example, this is valid:

0: jdbc:drill:zk=local> USE `cp.default`;
+-------+-----------------------------------------+
|  ok   |                 summary                 |
+-------+-----------------------------------------+
| true  | Default schema changed to [cp.default]  |
+-------+-----------------------------------------+
1 row selected (0.171 seconds)

whilst this isn't:

0: jdbc:drill:zk=local> USE cp.default;
Error: PARSE ERROR: Encountered ". default" at line 1, column 7.
Was expecting one of:
<EOF>
"." <IDENTIFIER> ...
"." <QUOTED_IDENTIFIER> ...
"." <BACK_QUOTED_IDENTIFIER> ...
"." <BRACKET_QUOTED_IDENTIFIER> ...
"." <UNICODE_QUOTED_IDENTIFIER> ...
"." "*" ...

SQL Query USE cp.default

This is because default is a reserved word, and hence must be quoted. Hence, you can also use

0: jdbc:drill:zk=local> use cp.`default`;

but not

0: jdbc:drill:zk=local> use `cp`.default;
Querying JSON data

On the Apache Drill website there's some useful tutorials, including one using data provided by Yelp . This was the dataset that originally got me looking at Drill, since I was using it as an input to Big Data Discovery (BDD) but struggling on two counts. First up was how best to define a suitable Hive table over it in order to ingest it to BDD. Following from this was trying to understand what value there might be in the data which would drive how long to spend perfecting the way in which I exposed the data in Hive. The examples below show the kind of complications that complex JSON can introduce when queried in a tabular fashion.

First up, querying a JSON file, with the schema inferred automagically. Pretty cool.

0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` limit 5;
+---------+------+-------------+-------+------+------+
| user_id | text | business_id | likes | date | type |
+---------+------+-------------+-------+------+------+
| -6rEfobYjMxpUWLNxszaxQ | Don't waste your time. | cE27W9VPgO88Qxe4ol6y_g | 0 | 2013-04-18 | tip |
| EZ0r9dKKtEGVx2CdnowPCw | Your GPS will not allow you to find this place. Put Rankin police department in instead. They are directly across the street. | mVHrayjG3uZ_RLHkLj-AMg | 1 | 2013-01-06 | tip |
| xb6zEQCw9I-Gl0g06e1KsQ | Great drink specials! | KayYbHCt-RkbGcPdGOThNg | 0 | 2013-12-03 | tip |
| QawZN4PSW7ng_9SP7pjsVQ | Friendly staff, good food, great beer selection, and relaxing atmosphere | KayYbHCt-RkbGcPdGOThNg | 0 | 2015-07-08 | tip |
| MLQre1nvUtW-RqMTc4iC9A | Beautiful restoration. | 1_lU0-eSWJCRvNGk78Zh9Q | 0 | 2015-10-25 | tip |
+---------+------+-------------+-------+------+------+
5 rows selected (2.341 seconds)

We can use standard SQL aggregations such as COUNT:

0: jdbc:drill:zk=local> select count(*) from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json`;
+---------+
| EXPR$0  |
+---------+
| 591864  |
+---------+
1 row selected (4.495 seconds)

as well as GROUP BY operation:

0: jdbc:drill:zk=local> select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;
+-------------+------------+
|    date     | tip_count  |
+-------------+------------+
| 2012-07-21  | 719        |
| 2012-05-19  | 718        |
| 2012-08-04  | 699        |
| 2012-06-23  | 690        |
| 2012-07-28  | 682        |
+-------------+------------+
5 rows selected (7.111 seconds)

Digging into the data a bit, we can see that it's not entirely flat - note, for example, the hours column, which is a nested JSON object:

0: jdbc:drill:zk=local> select full_address,city,hours from `/user/oracle/incoming/yelp/business_json` b limit 5;
+--------------+------+-------+
| full_address | city | hours |
+--------------+------+-------+
| 4734 Lebanon Church Rd
Dravosburg, PA 15034 | Dravosburg | {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} |
| 202 McClure St
Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 1 Ravine St
Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 1530 Hamilton Rd
Bethel Park, PA 15234 | Bethel Park | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 301 South Hills Village
Pittsburgh, PA 15241 | Pittsburgh | {"Friday":{"close":"17:00","open":"10:00"},"Tuesday":{"close":"21:00","open":"10:00"},"Thursday":{"close":"17:00","open":"10:00"},"Wednesday":{"close":"21:00","open":"10:00"},"Monday":{"close":"21:00","open":"10:00"},"Sunday":{"close":"18:00","open":"11:00"},"Saturday":{"close":"21:00","open":"10:00"}} |
+--------------+------+-------+
5 rows selected (0.721 seconds)
0: jdbc:drill:zk=local>

With Apache Drill we can simply use dot notation to access nested values. It's necessary to alias the table (b in this example) when you're doing this:

0: jdbc:drill:zk=local> select b.hours from `/user/oracle/incoming/yelp/business_json` b limit 1;
+-------+
| hours |
+-------+
| {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} |
+-------+

Nested objects can themselves be nested - not a problem with Apache Drill, we just chain the dot notation further:

0: jdbc:drill:zk=local> select b.hours.Friday from `/user/oracle/incoming/yelp/business_json` b limit 1;
+-----------------------------------+
|              EXPR$0               |
+-----------------------------------+
| {"close":"21:00","open":"11:00"}  |
+-----------------------------------+
1 row selected (0.238 seconds)

Note the use of backtick (`) to quote the reserved open and close keywords:

0: jdbc:drill:zk=local> select b.hours.Friday.`open`,b.hours.Friday.`close` from `/user/oracle/incoming/yelp/business_json` b limit 1;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| 11:00   | 21:00   |
+---------+---------+
1 row selected (0.58 seconds)

Nested columns are proper objects in their own right in the query, and can be used as predicates too:

0: jdbc:drill:zk=local> select b.name,b.full_address,b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;
+------------------------+------------------------------------------------+---------+
|          name          |                  full_address                  | EXPR$2  |
+------------------------+------------------------------------------------+---------+
| Mr Hoagie              | 4734 Lebanon Church Rd
Dravosburg, PA 15034    | 11:00   |
| Alexion's Bar & Grill  | 141 Hawthorne St
Greentree
Carnegie, PA 15106  | 11:00   |
| Rocky's Lounge         | 1201 Washington Ave
Carnegie, PA 15106         | 11:00   |
| Papa J's               | 200 E Main St
Carnegie
Carnegie, PA 15106      | 11:00   |
| Italian Village Pizza  | 2615 Main St
Homestead, PA 15120               | 11:00   |
+------------------------+------------------------------------------------+---------+
5 rows selected (0.404 seconds)

You'll notice in the above output that the full_address field has line breaks in -- we can just use a SQL Function to replace line breaks with commas:

0: jdbc:drill:zk=local> select b.name,regexp_replace(b.full_address,'\n',','),b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;
+------------------------+------------------------------------------------+---------+
|          name          |                     EXPR$1                     | EXPR$2  |
+------------------------+------------------------------------------------+---------+
| Mr Hoagie              | 4734 Lebanon Church Rd,Dravosburg, PA 15034    | 11:00   |
| Alexion's Bar & Grill  | 141 Hawthorne St,Greentree,Carnegie, PA 15106  | 11:00   |
| Rocky's Lounge         | 1201 Washington Ave,Carnegie, PA 15106         | 11:00   |
| Papa J's               | 200 E Main St,Carnegie,Carnegie, PA 15106      | 11:00   |
| Italian Village Pizza  | 2615 Main St,Homestead, PA 15120               | 11:00   |
+------------------------+------------------------------------------------+---------+
5 rows selected (1.346 seconds)
Query Federation

So Apache Drill enables you to run SQL queries against data in a multitude of formats and locations, which is rather useful in itself. But even better than that, it lets you federate these sources in a single query. Here's an example of joining between data in HDFS and Oracle:

0: jdbc:drill:zk=local> select X.text,
. . . . . . . . . . . > Y.NAME
. . . . . . . . . . . > from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X
. . . . . . . . . . . > inner join ora.MOVIEDEMO.YELP_BUSINESS Y
. . . . . . . . . . . > on X.business_id = Y.BUSINESS_ID
. . . . . . . . . . . > where Y.NAME = 'Chick-fil-A'
. . . . . . . . . . . > limit 5;
+--------------------------------------------------------------------+--------------+
|                                text                                |     NAME     |
+--------------------------------------------------------------------+--------------+
| It's daddy daughter date night here and they go ALL OUT!           | Chick-fil-A  |
| Chicken minis!  The best part of waking up Saturday mornings.  :)  | Chick-fil-A  |
| Nice folks as always unlike those ghetto joints                    | Chick-fil-A  |
| Great clean and delicious chicken sandwiches!                      | Chick-fil-A  |
| Spicy Chicken with lettuce, tomato, and pepperjack cheese FTW!     | Chick-fil-A  |
+--------------------------------------------------------------------+--------------+
5 rows selected (3.234 seconds)

You can define a view over this:

0: jdbc:drill:zk=local> create or replace view dfs.tmp.yelp_tips as select X.text as tip_text, Y.NAME as business_name from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X inner join ora.MOVIEDEMO.YELP_BUSINESS Y on X.business_id = Y.BUSINESS_ID ;
+-------+-------------------------------------------------------------+
|  ok   |                           summary                           |
+-------+-------------------------------------------------------------+
| true  | View 'yelp_tips' replaced successfully in 'dfs.tmp' schema  |
+-------+-------------------------------------------------------------+
1 row selected (0.574 seconds)
0: jdbc:drill:zk=local> describe dfs.tmp.yelp_tips;
+----------------+--------------------+--------------+
|  COLUMN_NAME   |     DATA_TYPE      | IS_NULLABLE  |
+----------------+--------------------+--------------+
| tip_text       | ANY                | YES          |
| business_name  | CHARACTER VARYING  | YES          |
+----------------+--------------------+--------------+
2 rows selected (0.756 seconds)

and then query it as any regular object:

0: jdbc:drill:zk=local> select tip_text,business_name from dfs.tmp.yelp_tips where business_name like '%Grill' limit 5;
+------+------+
| text | NAME |
+------+------+
| Great drink specials! | Alexion's Bar & Grill |
| Friendly staff, good food, great beer selection, and relaxing atmosphere | Alexion's Bar & Grill |
| Pretty quiet here... | Uno Pizzeria & Grill |
| I recommend this location for quick lunches. 10 min or less lunch menu. Soup bar ( all you can eat)  the broccoli cheddar soup is delicious. | Uno Pizzeria & Grill |
| Instead of pizza, come here for dessert. The deep dish sundae is really good. | Uno Pizzeria & Grill |
+------+------+
5 rows selected (3.272 seconds)
Querying Twitter JSON data

Here's an example of using Drill to query a local file holding some Twitter data. You can download the file here if you want to try querying it yourself.

To start with I switched to using the dfs storage plugin:

0: jdbc:drill:zk=local> use dfs;
+-------+----------------------------------+
|  ok   |             summary              |
+-------+----------------------------------+
| true  | Default schema changed to [dfs]  |
+-------+----------------------------------+

And then tried a select against the file. Note the limit 5 clause - very useful when you're just examining the structure of a file.

0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/twitter/geo_tweets.json` limit 5;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input within/between OBJECT entries

File  /user/oracle/incoming/twitter/geo_tweets.json
Record  2819
Column  3503
Fragment 0:0

An error? That's not supposed to happen. I've got a JSON file, right? It turns out the JSON file is one complete JSON object per line. Except that it's not on the last record. Note the record count given in the error above - 2819:

[oracle@bigdatalite ~]$ wc -l geo_tweets.json
2818 geo_tweets.json

So the file only has 2818 complete lines. Hmmm. Let's take a look at that record, using a head/tail bash combo :

[oracle@bigdatalite ~]$ head -n 2819 geo_tweets.json |tail -n1
{"created_at":"Sun Jul 24 21:00:44 +0000 2016","id":757319630432067584,"id_str":"757319630432067584","text":"And now @HillaryClinton hires @DWStweets: Honorary Campaign Manager across the USA #corruption #hillarysamerica  https://t.co/8jAGUu6w2f","source":"<a href=\"http://www.handmark.com\" rel=\"nofollow\">TweetCaster for iOS</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2170786369,"id_str":"2170786369","name":"Patricia Weber","screen_name":"InnieBabyBoomer","location":"Williamsburg, VA","url":"http://lovesrantsandraves.blogspot.com/","description":"Baby Boomer, Swing Voter, Conservative, Spiritual, #Introvert, Wife, Grandma, Italian, ♥ Books, Cars, Ferrari, F1 Race♥  #tcot","protected":false,"verified":false,"followers_count":861,"friends_count":918,"listed_count":22,"favourites_count":17,"statuses_count":2363,"created_at":"Sat Nov 02 19:13:06 +0000 2013","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000107659131/3589f

That's the complete data in the file - so Drill is right - the JSON is corrupted. If we drop that last record and create a new file (geo_tweets.fixed.json)

head -n2818 geo_tweets.json > geo_tweets.fixed.json

and query it again, we get something!

0: jdbc:drill:zk=local>  select text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
+------+
| text |
+------+
| Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim |
| We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
| Donald Trump accepted the Republican nomination last night. Isis claimed responsibility. |
| Obama: "We must stand together and stop terrorism"
Trump: "We don't want these people in our country"
� |
| Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw |
+------+
5 rows selected (0.246 seconds)

text here being one of the json fields. I could do a select * but it's not so intelligable:

0: jdbc:drill:zk=local>  select * from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
+------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+
| created_at | id | id_str | text | source | truncated | user | geo | coordinates | place | is_quote_status | retweet_count | favorite_count | entities | favorited | retweeted | possibly_sensitive | filter_level | lang | timestamp_ms | @version | @timestamp | user_name | user_screen_name | user_followers_count | user_friends_count | user_listed_count | user_favourites_count | user_statuses_count | user_created_at | place_name | place_country | hashtags_list | urls_array | urls_list | user_mentions_screen_name_list | longitude | latitude | hashtags_array | extended_entities | user_mentions_screen_name_array | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | retweeted_status | retweeted_screen_name | quoted_status_id | quoted_status_id_str | quoted_status |
+------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+
| Fri Jul 22 19:37:11 +0000 2016 | 756573827589545984 | 756573827589545984 | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim | <a href="http://dlvr.it" rel="nofollow">dlvr.it</a> | false | {"id":67898674,"id_str":"67898674","name":"Vancouver Press","screen_name":"Vancouver_CP","location":"Vancouver, BC","url":"http://vancouver.cityandpress.com/","description":"Latest news from Vancouver. Updates are frequent.","protected":false,"verified":false,"followers_count":807,"friends_count":13,"listed_count":94,"favourites_count":1,"statuses_count":131010,"created_at":"Sat Aug 22 14:25:37 +0000 2009","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"FFFFFF","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"8A1C3B","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"FFFFFF","profile_text_color":"2A2C31","profile_use_background_image":false,"profile_image_url":"http://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png","profile_image_url_https":"https://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png","profile_banner_url":"https://pbs.twimg.com/profile_banners/67898674/1411821103","default_profile":false,"default_profile_image":false} | {"type":"Point","coordinates":[49.2814375,-123.12109067]} | {"type":"Point","coordinates":[-123.12109067,49.2814375]} | {"id":"1e5cb4d0509db554","url":"https://api.twitter.com/1.1/geo/id/1e5cb4d0509db554.json","place_type":"city","name":"Vancouver","full_name":"Vancouver, British Columbia","country_code":"CA","country":"Canada","bounding_box":{"type":"Polygon","coordinates":[[[-123.224215,49.19854],[-123.224215,49.316738],[-123.022947,49.316738],[-123.022947,49.19854]]]},"attributes":{}} | false | 0 | 0 | {"urls":[{"url":"https://t.co/joI9GMfRim","expanded_url":"http://toplocalnow.com/ca/vancouver?section=trends","display_url":"toplocalnow.com/ca/vancouver?s…","indices":[70,93]}],"hashtags":[],"user_mentions":[],"media":[],"symbols":[]} | false | false | false | low | en | 1469216231616 | 1 | 2016-07-22T19:37:11.000Z | Vancouver Press | Vancouver_CP | 807 | 13 | 94 | 1 | 131010 | Sat Aug 22 14:25:37 +0000 2009 | Vancouver | Canada |  | ["toplocalnow.com/ca/vancouver?s…"] | toplocalnow.com/ca/vancouver?s… |  | -123.12109067 | 49.2814375 | [] | {"media":[]} | [] | null | null | null | null | null | {"user":{},"entities":{"user_mentions":[],"media":[],"hashtags":[],"urls":[]},"extended_entities":{"media":[]},"quoted_status":{"user":{},"entities":{"hashtags":[],"user_mentions":[],"media":[],"urls":[]},"extended_entities":{"media":[]}}} | null | null | null | {"user":{},"entities":{"user_mentions":[],"media":[],"urls":[],"hashtags":[]},"extended_entities":{"media":[]},"place":{"bounding_box":{"coordinates":[]},"attributes":{}},"geo":{"coordinates":[]},"coordinates":{"coordinates":[]}} |

Within the twitter data there's root-level fields, such as text, as well as nested ones such as information about the tweeter in the user field. As we saw above you reference nested fields using dot notation. Now's a good time to point out a couple of common mistakes that you may encounter. The first is not quoting reserved words, and is the first thing to check for if you get an error such as Encountered ".":

0: jdbc:drill:zk=local>  select user.screen_name,text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
Error: PARSE ERROR: Encountered "." at line 1, column 12.
[...]

Second is declaring the table alias when using dot notation - if you don't then Apache Drill thinks that the parent column is actually the table name (VALIDATION ERROR: [...] Table 'user' not found):

0: jdbc:drill:zk=local>  select `user`.screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` limit 5;
Aug 10, 2016 11:16:45 PM org.apache.calcite.sql.validate.SqlValidatorException <init>
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'user' not found
Aug 10, 2016 11:16:45 PM org.apache.calcite.runtime.CalciteException <init>
SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 13: Table 'user' not found
Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 13: Table 'user' not found

SQL Query null

[Error Id: 1427fd23-e180-40be-a751-b6f1f838233a on 192.168.56.1:31010] (state=,code=0)

With those mistakes fixed, we can see the user's screenname:

0: jdbc:drill:zk=local>  select tweets.`user`.`screen_name` as user_screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 2;
+------------------+------+
| user_screen_name | text |
+------------------+------+
| Vancouver_CP     | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim |
| tmj_TUC_skltrd   | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
+------------------+------+
2 rows selected (0.256 seconds)
0: jdbc:drill:zk=local>

As well as nested objects, JSON supports arrays. An example of this in twitter data is hashtags, or URLs, both of which there can be zero, one, or many of in a given tweet.

0: jdbc:drill:zk=local>  select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5;
+--------+
| EXPR$0 |
+--------+
| [] |
| [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] |
| [] |
| [] |
| [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] |
+--------+
5 rows selected (0.286 seconds)

Using the FLATTEN function each array entry becomes a new row, thus:

0: jdbc:drill:zk=local>  select flatten(tweets.entities.hashtags) from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5;
+----------------------------------------------+
|                    EXPR$0                    |
+----------------------------------------------+
| {"text":"hiring","indices":[6,13]}           |
| {"text":"Job","indices":[98,102]}            |
| {"text":"SkilledTrade","indices":[103,116]}  |
| {"text":"Tucson","indices":[117,124]}        |
| {"text":"Jobs","indices":[129,134]}          |
+----------------------------------------------+
5 rows selected (0.139 seconds)

Note that the limit 5 clause is showing only the first five array instances, which is actually just hashtags from the first tweet in the above list.

To access the text of the hashtag we use a subquery and the dot notation to access the text field:

0: jdbc:drill:zk=local> select ent_hashtags.hashtags.text from (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) as ent_hashtags limit 5;
+---------------+
|    EXPR$0     |
+---------------+
| hiring        |
| Job           |
| SkilledTrade  |
| Tucson        |
| Jobs          |
+---------------+
5 rows selected (0.168 seconds)

This can be made more readable by using Common Table Expressions (CTE, also known as subquery factoring) for the same result:

0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets)
. . . . . . . . . . . > select ent_hashtags.hashtags.text from ent_hashtags
. . . . . . . . . . . > limit 5;
+---------------+
|    EXPR$0     |
+---------------+
| hiring        |
| Job           |
| SkilledTrade  |
| Tucson        |
| Jobs          |
+---------------+
5 rows selected (0.253 seconds)

Combining the flattened array with existing fields enables us to see things like a list of tweets with their associated hashtags:

0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp limit 10;
+------------------+------+---------+
| user_screen_name | text | hashtag |
+------------------+------+---------+
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | hiring |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | SkilledTrade |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Tucson |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Jobs |
| johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | lol |
| johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | nowthatsfunny |
| greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 | WinstonSalem |
| greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 | ws |
| trendinaliaSG | 6. Hit The Stage
7. TTTT
8. Demi Lovato
9. Beijing
10. Donald Trump

2016/7/23 03:36 SGT #trndnl https://t.co/psP0GzBgZB | trndnl |
+------------------+------+---------+
10 rows selected (0.166 seconds)

We can also filter based on hashtag:

0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp where tmp.hashtags.text = 'Job' limit 5;
+------------------+------+---------+
| user_screen_name | text | hashtag |
+------------------+------+---------+
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |
| tmj_VAL_health | Want to work at Genesis Rehab Services? We're #hiring in #Clinton, MD! Click for details: https://t.co/4lt7I4gMZk #Job #Healthcare #Jobs | Job |
| tmj_in_retail | Want to work in #Clinton, IN? View our latest opening: https://t.co/UiimnlubYs #Job #Retail #Jobs #Hiring #CareerArc | Job |
| tmj_la_hrta | Want to work at SONIC Drive-In? We're #hiring in #Clinton, LA! Click for details: https://t.co/aQ1FrWc7iR #Job #SONIC #Hospitality #Jobs | Job |
| tmj_ia_hrta | We're #hiring! Click to apply: Department Manager - https://t.co/SnoKcwwHFk #Job #Hospitality #Clinton, IA #Jobs #CareerArc | Job |
+------------------+------+---------+
5 rows selected (0.207 seconds)

as well as summarise hashtag counts:

0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets)
. . . . . . . . . . . > select ent_hashtags.hashtags.text,count(ent_hashtags.hashtags.text) from ent_hashtags
. . . . . . . . . . . > group by ent_hashtags.hashtags.text
. . . . . . . . . . . > order by 2 desc;
+-----------------------------+---------+
|           EXPR$0            | EXPR$1  |
+-----------------------------+---------+
| Trump                       | 365     |
| trndnl                      | 176     |
| job                         | 170     |
| Hiring                      | 127     |
| Clinton                     | 108     |
| Yorkshire                   | 100     |
| CareerArc                   | 100     |
[...]

To filter out records that may not have array values (such as hashtags, which not every tweet has) and without with the query may fail, use IS NOT NULL against an attribute of first index of the array:

0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;
+--------+
| EXPR$0 |
+--------+
| [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] |
| [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] |
| [{"text":"WinstonSalem","indices":[0,13]},{"text":"ws","indices":[92,95]}] |
| [{"text":"trndnl","indices":[89,96]}] |
| [{"text":"trndnl","indices":[92,99]}] |
+--------+
5 rows selected (0.187 seconds)

If you try and compare the array itself, it doesn't work:

0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags is not null limit 5;
Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema.  Errors:

Error in expression at index -1.  Error: Missing function implementation: [isnotnull(MAP-REPEATED)].  Full expression: --UNKNOWN EXPRESSION--..

Fragment 0:0

[Error Id: 99ac12aa-f6b4-4692-b815-8f483da682c4 on 192.168.56.1:31010] (state=,code=0)

The above example demonstrates using array indexing, which is an alternative to FLATTEN for accessing individual objects in the array if you know they're going to exist:

0: jdbc:drill:zk=local> select tweets.entities.hashtags[0].text as first_hashtag,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;
+---------------+------+
| first_hashtag | text |
+---------------+------+
| hiring | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle &amp; Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
| lol | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw |
| WinstonSalem | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 |
Querying CSV files

JSON files are relatively easy to interpret because they have a semi-defined schema within them, including column names. CSV (and character delimited files in general), on the other hand, are a bit more of a 'wild west' when it comes to reliably inferring column names. You can configure Apache Drill to ignore the first line of a CSV file (on the assumption that it's a header) if you want to, or to take them as column names. If you don't do this and query a CSV file that looks like this:

[oracle@bigdatalite ~]$ head nyc_parking_violations.csv
Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect    ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
1360858775,PHW9801,OH,PAS,07/01/2015,20,SUBN,HONDA,P,61490,26160,26190,0,0044,44,44,929822,0044,0000,0653P,,BX,O,651,RIVER AVE,,0,408,D,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,

You'll get two records, each one column wide, as an array:

0: jdbc:drill:zk=local> select *  from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` LIMIT 5;
+---------+
| columns |
+---------+
| ["Summons Number","Plate ID","Registration State","Plate Type","Issue Date","Violation Code","Vehicle Body Type","Vehicle Make","Issuing Agency","Street Code1","Street Code2","Street Code3","Vehicle Expiration Date","Violation Location","Violation Precinct","Issuer Precinct","Issuer Code","Issuer Command","Issuer Squad","Violation Time","Time First Observed","Violation County","Violation In Front Of Or Opposite","House Number","Street Name","Intersecting Street","Date First Observed","Law Section","Sub Division","Violation Legal Code","Days Parking In Effect    ","From Hours In Effect","To Hours In Effect","Vehicle Color","Unregistered Vehicle?","Vehicle Year","Meter Number","Feet From Curb","Violation Post Code","Violation Description","No Standing or Stopping Violation","Hydrant Violation","Double Parking Violation"] |
| ["1360858775","PHW9801","OH","PAS","07/01/2015","20","SUBN","HONDA","P","61490","26160","26190","0","0044","44","44","929822","0044","0000","0653P","","BX","O","651","RIVER AVE","","0","408","D","","BBBBBBB","ALL","ALL","","0","0","-","0","","","","",""] |

To access the actual columns in the CSV file you need to use columns[x] syntax to reference them. Watch out that columns is case-sensitive, and the numbering is zero-based:

0: jdbc:drill:zk=local> select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` limit 5;
+----------+--------------------+
| PlateID  | RegistrationState  |
+----------+--------------------+
| AR877A   | NJ                 |
| 73268ME  | NY                 |
| 2050240  | IN                 |
| 2250017  | IN                 |
| AH524C   | NJ                 |
+----------+--------------------+
5 rows selected (0.247 seconds)

To make it easier to work with the data on a repeated basis you can define a view over the data:

0: jdbc:drill:zk=local> create view dfs.tmp.NYC_Parking_01 as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`;
+-------+-----------------------------------------------------------------+
|  ok   |                             summary                             |
+-------+-----------------------------------------------------------------+
| true  | View 'NYC_Parking_01' created successfully in 'dfs.tmp' schema  |
+-------+-----------------------------------------------------------------+
1 row selected (0.304 seconds)

This is using the dfs storage plugin and the tmp schema within it, which has the following storage configuration - note that writeable is true

"tmp": {
  "location": "/tmp",
  "writable": true,
  "defaultInputFormat": null
}

(if you use the wrong database [storage plugin] or schema you'll get Schema [hdfs] is immutable.)

Query the new view

0: jdbc:drill:zk=local> select * from dfs.tmp.NYC_Parking_01 limit 5;
+-----------+---------------------+
|  PlateID  |  RegistrationState  |
+-----------+---------------------+
| Plate ID  | Registration State  |
| PHW9801   | OH                  |
| K8010F    | TN                  |
| GFG6211   | NY                  |
| GHL1805   | NY                  |
+-----------+---------------------+
5 rows selected (0.191 seconds)

Through the view, or direct against the CSV path, you can also run aggregates:

0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1;
+----------+---------+
| PlateID  | EXPR$1  |
+----------+---------+
| 2050240  | 4       |
+----------+---------+
1 row selected (15.983 seconds)

Although this isn't rerunnable for the same result - probably because of the limit clause

0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1;
+----------+---------+
| PlateID  | EXPR$1  |
+----------+---------+
| AR877A   | 3       |
+----------+---------+
1 row selected (12.881 seconds)

Under the covers the view definition is written to /tmp - you'll want to move this path if you're wanting to preserve this data past reboot:

    [oracle@bigdatalite parking]$ cat /tmp/NYC_Parking_01.view.drill
    {
      "name" : "NYC_Parking_01",
      "sql" : "SELECT `columns`[1] AS `PlateID`, `columns`[2] AS `RegistrationState`\nFROM `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`",
      "fields" : [ {
        "name" : "PlateID",
        "type" : "ANY",
        "isNullable" : true
      }, {
        "name" : "RegistrationState",
        "type" : "ANY",
        "isNullable" : true
      } ],
      "workspaceSchemaPath" : [ "hdfs" ]

You can also create an actual table using CTAS (Create Table As Select):

0: jdbc:drill:zk=local> create table dfs.tmp.parking as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_1       | 4471875                    |
| 1_0       | 4788421                    |
+-----------+----------------------------+
2 rows selected (42.913 seconds)

This is stored on disk (per the dfs config) and by default in Parquet format:

[oracle@bigdatalite parking]$ ls -l /tmp/parking/
total 76508
-rw-r--r--. 1 oracle oinstall 40623288 Aug 10 22:53 1_0_0.parquet
-rw-r--r--. 1 oracle oinstall 37717804 Aug 10 22:53 1_1_0.parquet
Drill's Web Interface

Drill comes with a web interface which you can access at http://<IP>:8047/ and is useful for

  • Issuing queries

  • Configuring additional storage plugins (e.g. database, hdfs, etc)

  • Metrics and debug

Defining Storage Plugins

From the Drill web interface you can view existing storage plugins, or define new ones. To create a new one, enter its name (for example, hdfs, but could be fred for all that it matters - it's just a label) under New Storage Plugin on the Storage page, and click on Create. Paste the necessary JSON definition in the Configuration box, and then click Create. If you don't want to use the GUI there's also a REST API.

Storage plugin configuration is stored either within Zookeeper (when running Drill distributed), or locally in the sys.store.provider.local.path path when running standalone. By default this is under /tmp which gets cleared down at server reboot. To persist custom storage configurations amend the sys.store.provider.local.path in drill-override.conf, for example:

drill.exec: {
    cluster-id: "drillbits1",
    zk.connect: "localhost:2181"
    sys.store.provider.local.path="/home/oracle/drill/"
}
Working with filesystem data

Here's an example of a storage configuration that enables Drill to access a CDH cluster's HDFS:

    {
      "type": "file",
      "enabled": true,
      "connection": "hdfs://cdh57-01-node-01:8020/",
      "config": null,
      "workspaces": {
        "root": {
          "location": "/",
          "writable": true,
          "defaultInputFormat": null
        }
      },
      "formats": {
        "csv": {
          "type": "text",
          "extensions": [
            "csv"
          ],
          "delimiter": ","
        },
        "json": {
          "type": "json",
          "extensions": [
            "json"
          ]
        }
      }
    }

As well as the connection parameter itself for HDFS, the important bit in this configuration is the formats section. This tells Drill how to interpet files that it finds, without the end-user having to explicitly declare their type.

For the filesystem-based plugin dfs (which can include local files, HDFS, even Amazon S3), you can browse the available "tables":

List the files in HDFS (previously selected with use hdfs;)

0: jdbc:drill:zk=local> show files;
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
|  name  | isDirectory  | isFile  | length  | owner  |    group    | permissions  |       accessTime       |     modificationTime     |
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
| hbase  | true         | false   | 0       | hbase  | supergroup  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-07-25 14:46:08.212  |
| share  | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwx    | 1969-12-31 19:00:00.0  | 2016-05-15 12:28:08.152  |
| solr   | true         | false   | 0       | solr   | solr        | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-06-01 18:34:50.716  |
| tmp    | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwt    | 1969-12-31 19:00:00.0  | 2016-06-24 04:54:41.491  |
| user   | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwx    | 1969-12-31 19:00:00.0  | 2016-06-21 15:55:59.084  |
| var    | true         | false   | 0       | hdfs   | supergroup  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-11 17:53:29.804  |
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
6 rows selected (0.145 seconds)

Show files in a given path:

0: jdbc:drill:zk=local> show files in `/user/oracle`;
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
|      name      | isDirectory  | isFile  | length  |  owner  |  group  | permissions  |       accessTime       |     modificationTime     |
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
| .Trash         | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-23 20:42:34.815  |
| .sparkStaging  | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-07-06 03:56:38.863  |
| .staging       | true         | false   | 0       | oracle  | oracle  | rwx------    | 1969-12-31 19:00:00.0  | 2016-06-01 18:37:04.005  |
| incoming       | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-08-03 05:34:12.38   |
| mediademo      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-06-01 18:59:45.653  |
| moviedemo      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:02:55.652  |
| moviework      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.497  |
| oggdemo        | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.552  |
| oozie-oozi     | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.651  |
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
9 rows selected (0.428 seconds)

You can also query across multiple files by specifying a wildcard match. Here's the truncated list of files available:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> show files in `hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/`;
+--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+
|           name           | isDirectory  | isFile  |  length  | owner  | group  | permissions  |        accessTime        |     modificationTime     |
+--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+
| FlumeData.1466176113171  | false        | true    | 1055675  | rmoff  | rmoff  | rw-r--r--    | 2016-08-10 21:28:27.072  | 2016-06-17 16:08:38.023  |
| FlumeData.1466176113172  | false        | true    | 1051411  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.756  | 2016-06-17 16:08:40.597  |
| FlumeData.1466176113173  | false        | true    | 1054734  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.752  | 2016-06-17 16:08:43.33   |
| FlumeData.1466176113174  | false        | true    | 1050991  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.743  | 2016-06-17 16:08:44.361  |
| FlumeData.1466176113175  | false        | true    | 1053577  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.748  | 2016-06-17 16:08:45.162  |
| FlumeData.1466176113176  | false        | true    | 1051965  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.752  | 2016-06-17 16:08:46.261  |
| FlumeData.1466176113177  | false        | true    | 1049555  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:47.425  |
| FlumeData.1466176113178  | false        | true    | 1050566  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:48.23   |
| FlumeData.1466176113179  | false        | true    | 1051751  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.756  | 2016-06-17 16:08:49.381  |
| FlumeData.1466176113180  | false        | true    | 1052249  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.757  | 2016-06-17 16:08:50.042  |
| FlumeData.1466176113181  | false        | true    | 1055002  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:50.896  |
| FlumeData.1466176113182  | false        | true    | 1050812  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:52.191  |
| FlumeData.1466176113183  | false        | true    | 1048954  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.757  | 2016-06-17 16:08:52.994  |
| FlumeData.1466176113184  | false        | true    | 1051559  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.773  | 2016-06-17 16:08:54.025  |
[...]

Count number of records in one file (FlumeData.1466176113171):

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 277     |
+---------+
1 row selected (0.798 seconds)

In several files (FlumeData.146617611317*):

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.146617611317*`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 2415    |
+---------+
1 row selected (2.466 seconds)

In all files in the folder (*):

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/*`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 7414    |
+---------+
1 row selected (3.867 seconds)

And even across multiple folders:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/06/*/*`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 206793  |
+---------+
1 row selected (87.545 seconds)
Querying data without an identifying extension

Drill relies on the format clause of the storage extension configurations in orer to determine how to interpret files based on their extensions. You won't always have that luxury of extensions being available, or being defined. If you try and query such data, you'll not get far. In this example I'm querying data on HDFS that's in JSON format but without the .json suffix:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM `hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171` limit 5;
Error: VALIDATION ERROR: From line 1, column 18 to line 1, column 23: Table 'hdfs./user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171' not found

SQL Query null

Fear not - you can declare them as part of the query syntax.

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json')) limit 5;
+------+
| text |
+------+
| RT @jjkukrl: susu bayi jg lagi mahal nih ugh ayah harus semangat cari duit ^^9 https://t.co/2NvTOShRbI |
| Oracle Java 1Z0-808 Web Exam Simulator https://t.co/tZ3gU8EMj3 |
| @TribuneSelatan ahaha kudu gaya atuh da arek lebarann ahahaha |
| Short impression of yesterday's speech. What a great day it was! #lifeatoracle #team #salesincentive #oracle https://t.co/SVK2ovOe3U |
| Want to work at Oracle? We're #hiring in New York! Click for details: https://t.co/NMTo1WMHVw #Sales #Job #Jobs #CareerArc |
+------+
5 rows selected (1.267 seconds)
Storage Configuration - Oracle

Per the documentation it's easy to query data residing in a RDBMS, such as Oracle. Simply copy the JDBC driver into Apache Drill's jar folder:

cp /u01/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc7.jar /opt/apache-drill-1.7.0/jars/3rdparty/

And then add the necessary storage configuration, which I called ora:

    {
      "type": "jdbc",
      "driver": "oracle.jdbc.OracleDriver",
      "url": "jdbc:oracle:thin:moviedemo/welcome1@localhost:1521/ORCL",
      "username": null,
      "password": null,
      "enabled": true
    }

If you get an error Please retry: error (unable to create/ update storage) then check that the target Oracle database is up, the password is correct, and so on.

You can then query the data within Hive:

0: jdbc:drill:zk=local> use ora.MOVIEDEMO;
+-------+--------------------------------------------+
|  ok   |                  summary                   |
+-------+--------------------------------------------+
| true  | Default schema changed to [ora.MOVIEDEMO]  |
+-------+--------------------------------------------+
1 row selected (0.205 seconds)

0: jdbc:drill:zk=local> show tables;
+----------------+-----------------------------+
|  TABLE_SCHEMA  |         TABLE_NAME          |
+----------------+-----------------------------+
| ora.MOVIEDEMO  | ACTIVITY                    |
| ora.MOVIEDEMO  | BDS_CUSTOMER_RFM            |
| ora.MOVIEDEMO  | BUSINESS_REVIEW_SUMMARY     |
[...]

0: jdbc:drill:zk=local> select * from ACTIVITY limit 5;
+--------------+---------+
| ACTIVITY_ID  |  NAME   |
+--------------+---------+
| 3.0          | Pause   |
| 6.0          | List    |
| 7.0          | Search  |
| 8.0          | Login   |
| 9.0          | Logout  |
+--------------+---------+
5 rows selected (1.644 seconds)

If you get Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query. then enable verbose errors in Apache Drill to see what the problem is:

0: jdbc:drill:zk=local> ALTER SESSION SET `exec.errors.verbose` = true;
+-------+-------------------------------+
|  ok   |            summary            |
+-------+-------------------------------+
| true  | exec.errors.verbose updated.  |
+-------+-------------------------------+
1 row selected (0.154 seconds)

0: jdbc:drill:zk=local> select * from ora.MOVIEDEMO.YELP_BUSINESS limit 1;
Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query.

sql SELECT *
FROM "MOVIEDEMO"."YELP_BUSINESS"
plugin ora
Fragment 0:0

[Error Id: 40343dd5-1354-48ed-90ef-77ae1390411b on bigdatalite.localdomain:31010]

(java.sql.SQLException) ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-11504: error from external driver: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused

Here the problem was with the external table that Oracle was querying (ORA-29913: error in executing ODCIEXTTABLEOPEN). It's actually an Oracle external table over a Hive table, which obviously Drill could be querying directly - but hey, we're just sandboxing here...

Query Execution

Just as Oracle has its Cost Based Optimiser (CBO) which helps it determine how to execute a query, and do so most efficiently, Apache Drill has an execution engine that determines how to actually execute the query you give it. This also includes how to split it up over multiple nodes ("drillbits") if available, as well as optimisations such as partition pruning in certain cases. You can read more about how the query execution works here, and view the explain plan for a query using explain plan :

0: jdbc:drill:zk=local> !set maxwidth 10000  
0: jdbc:drill:zk=local> explain plan for select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;  
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(date=[$0], tip_count=[$1])  
00-02        SelectionVectorRemover  
00-03          Limit(fetch=[5])  
00-04            SelectionVectorRemover  
00-05              TopN(limit=[5])  
00-06                HashAgg(group=[{0}], tip_count=[$SUM0($1)])  
00-07                  HashAgg(group=[{0}], tip_count=[COUNT()])  
00-08                    Scan(groupscan=[EasyGroupScan [selectionRoot=hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json, numFiles=1, columns=[`date`], files=[hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json]]])  
 | {
  "head" : {
    "version" : 1,
    "generator" : {
      "type" : "ExplainHandler",
      "info" : ""
    [...]

You can also use the Drill web interface to see information about how a query executed:


Drill Explorer

The MapR Drill ODBC driver comes with a tool called Drill Explorer. This is a GUI that enables you to explore the data by navigating the databases (==storage plugins) and folders/files within, previewing the data and even creating views on it.

Drill Client

Within the Drill client there are various settings available:

0: jdbc:drill:zk=local> !set  
autocommit          true  
autosave            false  
color               true  
fastconnect         true  
force               false  
headerinterval      100  
historyfile         /home/oracle/.sqlline/history  
incremental         true  
isolation           TRANSACTION_REPEATABLE_READ  
maxcolumnwidth      15  
maxheight           56  
maxwidth            1000000  
numberformat        default  
outputformat        table  
propertiesfile      /home/oracle/.sqlline/sqlline.properties  
rowlimit            0  
showelapsedtime     true  
showheader          true  
shownestederrs      false  
showwarnings        true  
silent              false  
timeout             -1  
trimscripts         true  
verbose             false  

To change one, such as the width of output displayed:

0: jdbc:drill:zk=local> !set maxwidth 10000  

To connect to remote Drill specify the Zookeeper node(s) that store the Drillbit connection information:

rmoff@asgard-3:apache-drill-1.7.0> bin/sqlline -u jdbc:drill:zk=cdh57-01-node-01.moffatt.me:2181,cdh57-01-node-02.moffatt.me:2181,cdh57-01-node-03.moffatt.me:2181  
Conclusion

Apache Drill is a powerful tool for using familiar querying language (SQL) against different data sources. On a small scale, simply being able to slice and dice through structured files like JSON is a massive win. On a larger scale, it will be interesting to experiment with how Apache Drill compares when querying larger volumes of data across a cluster of machines, maybe compared to a tool such as Impala.

For more information about Apache Drill see how to access Drill from within OBIEE, as well as bonus geeky blog coming soon explaining the debug tools I used to try and figure out why it wouldn't initially work...

Categories: BI & Warehousing

OBIEE 12.2.1.1.0 - New Feature Guide

Thu, 2016-07-28 04:00
OBIEE 12.2.1.1.0 - New Feature Guide

Oracle has recently released version 12.2.1.1.0 for OBIEE 12c, which has a variety of exciting new features for Data Visualization, BI Publisher, Dashboards in OBIEE, and ODBC connections. It can be performed as an in-place upgrade (performed by the Upgrade Assistant) or as a fresh install. The upgrade process is explained at length here. There is also an example Linux in-place upgrade from 12.2.1.0 to 12.2.1.1 which can be found here. Let's take an in-depth look at some of the new features.

Data Visualization (aka Visual Analyzer)

A variety of features have been added to Data Visualization, which brings it closer to its standalone counterpart, Data Visualization Desktop.

Visualize Data from Oracle Applications

One of the most powerful exploratory features of Data Visualization Desktop is the ability to use your OBIEE analyses as direct datasources. This allows user to "revisualize" and explore the data contained within one or more analyses in different ways. Oracle has added this same feature in data visualization. To access this feature from the DV Home Page, click on Data Sources > Create New Data Source.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

You are then presented with three choices: From a File, From Oracle Applications or From Database (more on this later).

OBIEE 12.2.1.1.0 - New Feature Guide

Selecting From Oracle Applications opens a new window where you enter your connection information.

OBIEE 12.2.1.1.0 - New Feature Guide

Once you enter the connection information, you are able to view all of the folders you have access to within the OBIEE catalog. Selecting an analysis and clicking ok brings you to another screen where you can specify between measures and attributes for your columns and change the specified aggregation for your measure columns.

OBIEE 12.2.1.1.0 - New Feature Guide

Clicking Add Data Source adds the analyses to the Data Source page

.OBIEE 12.2.1.1.0 - New Feature Guide

Visualize Data from Databases

Another feature added to DV is the ability to connect directly to a database. It allows you to add tables directly from a database schema or write a SQL statement to select the data you want.

To connect to a database as a data source, select Data Sources from the Home Page then Create New Data Source > From Database. This brings up a connection window to enter your connection details.

OBIEE 12.2.1.1.0 - New Feature Guide

From here, you can also select from a large selection of database types other than the default Oracle Type. Clicking on Oracle brings up the list which matches the selections available in the latest release of Data Visualization Desktop.

OBIEE 12.2.1.1.0 - New Feature Guide

Choosing to add tables from a database schema allows you to add data sources, one table at a time.

OBIEE 12.2.1.1.0 - New Feature Guide

In some situations, you might want to create one datasource selection from a database from multiple tables. You can achieve this by clicking on the SQL tab after making your database connection and writing a custom sql statement.

OBIEE 12.2.1.1.0 - New Feature Guide

Clicking OK brings you to a results screen with all of the columns previously defined in your sql statement and also give you the ability to rename your datasource and to change the aggregation type for your measure columns.

OBIEE 12.2.1.1.0 - New Feature Guide

Modifying Uploaded Data Sources

Also known as "Data Wrangling", Oracle has added the ability to manipulate a dataset depending on the column data type. By invoking a logical SQL function, Data Visualization does the work for you and can create a new column or edit an existing one. To show an example of this, I created a new VA Project and uploaded a sample Excel file.

OBIEE 12.2.1.1.0 - New Feature Guide

In the top left menu bar, there is an option called Stage. Clicking on it opens the datasource and shows all of the columns present and the first 100 records.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

Note: If you have more than one file, clicking on the name of the file in the top left will reveal a dropdown menu where you can choose between them.

To edit or add a column, navigate to the right of the column you wish to edit and click on the options icon OBIEE 12.2.1.1.0 - New Feature Guide.

This brings up the options menu where you can select from a variety of different options for editing or adding column depending on the datatype.

OBIEE 12.2.1.1.0 - New Feature Guide

Using the Concatenate function, I was able to create a new column that showed the full name of each customer combined with data in the Age Group column separated by a colon.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

If you chose to create a new column, a Create New Data Element window opens allowing you to input logical SQL functions similar to the edit column formula feature in OBIEE Analytics.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

New and Enhanced Visualization Types

DV now includes new visualizations including donut charts, text boxes, sunburst, combo, scatter (cat.) and stacked scatter (cat.), which brings it in line with the offerings from Data Visualization Desktop.

OBIEE 12.2.1.1.0 - New Feature Guide

Also included (assuming you have installed and configured Oracle's R distribution) is the ability to add clusters, outliers, reference lines, trend lines and forecast. There are two ways to add these to your visualizations. The first is by selecting the Analytics option in the horizontal menu bar on the far right side and then choose the desired function. Double clicking on a function automatically adds it to your visualization.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

Another option is to add them directly from Menu > Properties > Analytics within your visualization.  You can add analytic functions and, depending on the function, there are a variety of different options to change how the function is displayed.

OBIEE 12.2.1.1.0 - New Feature Guide

You can also add URLs or links to insights within Tiles, Text Boxes and Image visualizations and, if you use Chrome for Windows or Android, there is a dictation option within properties that you can use to add descriptions.

OBIEE 12.2.1.1.0 - New Feature Guide

Customize Color Schemes

One of the more frustrating absences from DV up to this point was the ability to customize colors in your visualizations. Oracle has not only added this feature but given you the ability to customize and save these customizations, making color conformance for a group of users or across an entire organization a breeze.

Let's start by looking at an example of applying color to an entire project. To access the project color properties, click on Canvas Settings > Project Properties. By clicking on the color selection in the Color Series section, you can choose from one of four default selections or create your own custom palette.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

Notice how each measure is assigned a color and that color is continuous across the entire project.

You can also manage colors from inside of an individual visualization. To do this, click on Menu > Color > Manage Color Assignments. From here you can assign colors to each measure individually and it is then displayed across the entire project.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

Another option is to change a color by right clicking on an individual data point.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

In this example, I right clicked on Technology and changed its color which then changed it for every visualization in the project.

OBIEE 12.2.1.1.0 - New Feature Guide

Data Blending

Oracle added the ability to specify which data source takes precedence over the other when blending two data sources together. This can be very useful when you have a project that includes data from two data sources and there are match values in one data source that are not in the second. Let's look at an example.

I have two data sources. One contains population information and one contains country name and they are matched on the country codes.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

Notice in the example that there are countries that have no population facts and and population facts that have no countries.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

So that we can see only the countries which have population information, click on one of the columns in the report, right click and select Properties > Data Sets. Here you are presented with an option for each source: All Rows or Matching Rows.

OBIEE 12.2.1.1.0 - New Feature Guide

Changing the source that contains country information to Matching Rows will keep only the countries that have corresponding population information.

OBIEE 12.2.1.1.0 - New Feature Guide

Share Reports as Read-Only

It is now possible to share a report with other users as read-only. They will still be able to interact and edit filters, but won't be able to see the authoring content controls. They are also able to go back an forth between presentation mode.

Other Features 

Two other features for DV introduced in 12.2.1.1 is the ability to upload data files up to 50mb in size. Also you can hide the filter panel from view to increase canvas space for your projects.

BI Publisher

Oracle has also updated BI Publisher 12c to version 12.2.1.1. You can see the official document here.

Deliver documents to Oracle Document Cloud Service

You are now able to deliver BI Publisher reports to the Oracle Documents Cloud service via a delivery channel in BI Publisher. You can deliver reports to cloud storage from both on-premise or a cloud deployment of BI Publisher.

You can set up this feature under Administration > Delivery > Document Cloud Services within BI Publisher 12.2.1.1.

OBIEE 12.2.1.1.0 - New Feature Guide

Here, under the Document Cloud Services tab, you can add the connection details to your cloud server.

OBIEE 12.2.1.1.0 - New Feature Guide

OBIEE 12.2.1.1.0 - New Feature Guide

Use WebCenter Content as a Data Source

You can now read text data files from WebCenter Content. You can create a Content Server data source under Administration > Delivery > Content Server

OBIEE 12.2.1.1.0 - New Feature Guide

After you create your Content Server as a data source, you can create Content Server as a dataset in a BI Publisher data model and retrieve text data files stored in Web Center Content by Document ID.

OBIEE 12.2.1.1.0 - New Feature Guide

Attach PDF to Your Invoices

You can now send PDF attachments along with invoices with BI Publisher 12.2.1.1. You can attach PDFs along with the invoice while bursting. The process is explained in more detail in the BI Publisher 12.2.1.1 documents here.

Integrate Using RESTful APIs

BI Publisher 12.2.1.1 introduces a set of REST APIs that allow you to view reports in an application. It connects to the BI Publisher Server through a URL and uses JSON objects to contain data. You can read more about RESTful APIs in BI Publisher in the BI Publisher 12.2.1.1 new features guide here.

Deliver Email Body in HTML Format along with Report as Attachment

You can use standard HTML4 formatting tags to create the email body, include a logo or images, add hyperlinks and more. Reports will be generated as attachments to the email.

Manage Custom Fonts

In BI Publisher 12.2.1.1, you can manage custom fonts from the Administration page, and once uploaded, these custom fonts can be displayed in reports.

OBIEE 12.2.1.1.0 - New Feature Guide

Manage PGP Keys for FTP Delivery Encryption

In BI Publisher 12.2.1.1, Administrators can upload and manage PGP Keys from the Administration page. FTP delivery encryption is now a self-service feature. More details can be found in the BI Publisher 12.2.1.1 documents.

Dynamic Memory Guard

You can now separately configure limits for online and offline reports. The limits can be made dynamic by using variables, system defined functions and operators and, when the data changes, the dynamic memory guard will adjust.

You can access the Memory Guard in the Administration page under Runtime Configuration > Properties.

OBIEE 12.2.1.1.0 - New Feature Guide

More information can be found in the BI Publisher 12.2.1.1 documents.

Enabling Encryption for File Data Security on Cloud

You can now enable File Data Encryption in the Administration page and your data files with be encrypted at the time of uploading them on the server.

OBIEE Dashboard Subpages

A new feature for OBIEE 12.2.1.1 is the ability to add subpages within dashboards. This feature will allow dashboard designers to create multiple subpages for each dashboard page. To add a subpage, click on the desired dashboard page and click the "Add Dashboard Page" icon and select "Add Subpage."

OBIEE 12.2.1.1.0 - New Feature Guide

The Add Subpage window will then be displayed where you can name your subpage and add a description.

OBIEE 12.2.1.1.0 - New Feature Guide

Once you click OK, the subpage will be added to the dashboard page and adding content to the subpage is the exact same process as adding content to a dashboard page.

OBIEE 12.2.1.1.0 - New Feature Guide

RPD - Enable Data Driven Fragment Selection

A new feature for RPD developers is the ability to improve the performance of fragmented logical table sources by using the Enable Data Driven Fragment Selection feature. More information on how to enable this feature can be found in the 12.2.1.1 documents here.

Additional Data Sources Supported

New data sources are supported for 12.2.1.1 using DataDirect 7.1.5. These include Amazon RedShift, Oracle Service Cloud, Greenplum, Salesforce, and Teradata. Connections to Apache Spark and MongoDB are also available using 8.0.

Conclusion

All in all, the 12.2.1.1 update for OBIEE 12c adds some very powerful features. Data Visualization is now a very complete product and it's new features and native connectors make it a powerful addition to the new era of data visualization. For Oracle Cloud users, the integration for BI Publisher and Data Visualization makes the "report and share" process easier than ever.

Categories: BI & Warehousing

Stream Analytics and Processing with Kafka and Oracle Stream Analytics

Tue, 2016-07-26 09:00

In my previous post I looked the latest release of Oracle Stream Analytics (OSA), and saw how it provided a graphical interface to "Fast Data". Users can analyse streaming data as it arrives based on conditions and rules. They can also transform the stream data, publishing it back out as a stream in its own right. In this article we'll see how OSA can be used with Kafka.

Kafka is one of the foremost streaming technologies nowadays, for very good reasons. It is highly scalable and flexible, supporting multiple concurrent consumers. Oracle Streaming Analytics supports Kafka as both a source and target. To set up an inbound stream from Kafka, first we define the Connection:

Once the Connection is defined, we can create a Stream for a given Kafka topic: Oracle_Stream_Analytics If you get an error at this point of Unable to deploy OEP application then check the OSA log - it could be a connectivity issue to Zookeeper.
Exception in thread "SpringOsgiExtenderThread-286" org.springframework.beans.FatalBeanException:  
Error in context lifecycle initialization; nested exception is com.bea.wlevs.ede.api.EventProcessingException:  
org.I0Itec.zkclient.exception.ZkTimeoutException:  
Unable to connect to zookeeper server within timeout: 6000  
Assuming that the Stream is saved with no errors, you can then create an Exploration based on the stream and all being well, the live tweets are soon shown. Unlike the example at the top of this article, these tweets are coming in via Kafka, rather than the built-in OSA Twitter Stream. This is partly to demonstrate the Kafka capabilities, but also because the built-in OSA Twitter Stream only includes a subset of the available twitter data fields. Avro? Nope.

Data in Kafka can be serialised in many formats, including Avro - which OSA doesn’t seem to like. No error is thrown to the GUI but the exploration remains blank.

Looking in the OSA log file there’s a whole lot of errors recorded similar to this:
  
line 1:0 no viable alternative at character '?'  
line 1:1 no viable alternative at character '?'  
line 1:2 no viable alternative at character '?'  
line 1:3 no viable alternative at character '?'  
line 1:4 no viable alternative at character '?'  
line 1:5 no viable alternative at character '?'  
line 1:6 no viable alternative at character '?'  
line 1:7 no viable alternative at character '?'  
line 1:8 no viable alternative at character ''  
JSON? Kinda. One of the challenges that I found working with OSA was defining the “Shape” (data model) of the inbound stream data. JSON is a format used widely as a technology-agnostic data interchange format, including for the twitter data that I was working with. You can see a sample record here. One of the powerful features of JSON is its ability to nest objects in a record, as well as create arrays of them. You can read more about this detail in a recent article I wrote here. Unfortunately it seems that OSA does not support flattening out JSON, meaning that only elements in the root of the model are accessible. For twitter, that means we can see the text, and who it was in reply to, but not the user who tweeted it, since the latter is a nested element (along with many other fields, including hashtags which are also an array):
root
|-- created_at: string (nullable = true)
|-- entities: struct (nullable = true)
|    |-- hashtags: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- indices: array (nullable = true)
|    |    |    |    |-- element: long (containsNull = true)
|    |    |    |-- text: string (nullable = true)
|    |-- user_mentions: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- id: long (nullable = true)
|    |    |    |-- id_str: string (nullable = true)
|    |    |    |-- indices: array (nullable = true)
|    |    |    |    |-- element: long (containsNull = true)
|    |    |    |-- name: string (nullable = true)
|    |    |    |-- screen_name: string (nullable = true)
|-- source: string (nullable = true)
|-- text: string (nullable = true)
|-- timestamp_ms: string (nullable = true)
|-- truncated: boolean (nullable = true)
|-- user: struct (nullable = true)
|    |-- followers_count: long (nullable = true)
|    |-- following: string (nullable = true)
|    |-- friends_count: long (nullable = true)
|    |-- name: string (nullable = true)
|    |-- screen_name: string (nullable = true)
So what to do if the inbound streaming data is in nested-JSON format? It seems to me the only option is to pre-process it to flatten it. There are a variety of tools that could be used here - in the first instance I’d generally reach for Logstash, it being the one I’m most familiar with. To get an idea of the schema of a JSON record you can use jsonschema.net. Funnily enough when I was researching this blog post I came across the exact same problem on a forum posted by … me! Early last year I was working with the same dataset, and had the same issue with embedded arrays. The way to do it in Logstash is with a bit of Ruby code to flatten the arrays, and a standard mutate to bring nested objects up to the root level. Sample code:
mutate {  
    add_field => { "user_name" => "%{[user][name]}" }  
    add_field => { "user_screen_name" => "%{[user][screen_name]}" }  
}
ruby {  
    code => 'event["hashtags_array"] = event["[entities][hashtags]"].collect { |m| m["text"] } unless event["[entities][hashtags]"].nil?  
             event["hashtags_list"] = event["hashtags_array"].join(",")  unless event["[hashtags_array]"].nil?'  
}
You can find the full Logstash code on gist here. With this logstash code running I set up a new OSA Stream pointing to the new Kafka topic that Logstash was writing, and added the flattened fields to the Shape: We can then see in the Exploration the fields that we wanted to get at - user name, hashtags, and so on: Other Shape Gotchas

One of the fields in Twitter data is ‘source’ - which unfortunately is a reserved identifier in the CQL language that OSA uses behind the scenes.

Caused By: org.springframework.beans.FatalBeanException: Exception initializing channel; nested exception is com.bea.wlevs.ede.api.ConfigurationException: Event type [sx-10-16-Kafka_Technology_Tweets_JSON-1] of channel [channel] uses invalid or reserved CQL identifier = , source

It’s not clear how to define a shape in which the source data field is named after a reserved identifier.

Further Exploration of Twitter Streams with OSA

Using the flattened Twitter stream coming via Kafka that I demonstrated above, let’s now look at more OSA functionality.

Depending on the source of your data stream, and your purpose for analysing it, you may well want to filter out certain content. This can be done from the Exploration screen:

The Business Rules section of the Exploration enables you to define rules about the data and set field values based on it. This can be static values, or expressions based on data in the stream. There doens’t seem to be a way to add arbitrary fields via this, so I amended the Stream Shape to include a ‘spare’ field that I then populated: Kafka Stream Transformation with OSA

Here we’ll see how OSA can be used to ingest one Kafka topic, apply a transformation, and stream it to another Kafka topic.

The OSA exploration screen offers a basic aggregation (‘summary’) function, here showing the number of tweets per language:

Using the Windows icon to the right of the Sources box the time window can be defined, along with the refresh frequency:

This means that the count of tweets per language will be calculated looking at the data for the past 30 seconds, and this will be evaluated every five seconds. More complex functionality such as pivoting on the group-by column (so as to be able to chart out the number of tweets per language as separate metrics) doesn’t seem to be present in this release; arguably this is moving over into per analytics territory such as would be found in Oracle’s Big Data Discovery.

Taking the summarised stream (count of tweets, by language) I first Publish the exploration, making it available for use as the input to a subsequent exploration. Then from the Catalog page select a Pattern, which I’m going to use to build a stream showing the most common languages in the past five seconds. With the Top N pattern you specify the event stream (in this case, the summarised stream that I built above), and the metric by which to order the events which here is the count of tweets per language.

For completeness, I’m going to stream the output of this pattern exploration back to a Kafka topic

Note that I’ve defined a new Shape here based on the columns in the pattern. In the pattern itself I renamed the COUNT column to a clearer one (tweetcount5_sec). Renaming it wasn’t strictly necessary since it’s possible to define the field/shape mapping when you configure the Target:

For the target to take effect, I publish the pattern exploration, and then using kafka-console-consumer can see the topic being populated in realtime by OSA:

Being able to apply transformations to streams in realtime like this and stream the results is pretty useful. There are some limitations to the capabilities of OSA through the front end GUI. For example, support for nested json, and integration with the Kafka Schema Registry to automatically derive Shapes for inbound topics would both be great. Lower-level, the option to specify the consumer group id, as well as the start point for consumption (beginning of topic, or streaming at the end) are both things that would probably be necessary sooner or later using OSA for full-blown development.

OSA and Spatial

One of the Patterns that OSA provides is a Spatial one, which can be used to analyse source data that includes geo-location data. This could be to simply plot the occurrence of the data point (as we'll see shortly) on a map. It can also be used in a more sophisticated manner, to track a given entity's movements on a map. An example of this could be a fleet of trucks reporting their position back at regular intervals. Areas on a map can be defined and conditions triggered as the entity enters or leaves the area. For now though, we'll keep it simple. Using the flattened Twitter stream from Kafka that I produced from Logstash above, I'm going to plot Tweets in realtime on a map, along with a very simplistic tagging of the broad area in which they came from.

In my source Kafka topic I have two fields, latitude and longitude. I expose these as part of the 'flattening' of the JSON in this logstash script, since by default they're nested within the coordinates field and as an array too. When defining the Stream's Shape make sure you define the datatype correctly (Double) - OSA is not very forgiving of stupidity and I spent a frustrating time trying to work out why "-80.1422195" was coming through as zero - obviously defined as an Integer this was never going to work!

Not entirely necessary, but useful for debug purposes, I setup an exploration based on the flattened Twitter stream, with a filter to only include tweets that had geo-location data in them. This way I knew what tweets I should expect to be seeing in the next step. One of the things that I have found with OSA is that it has a tendency to fail silently; instead of throwing errors you'll just not get any data. By setting up the filter exploration I could at least debug things a bit more easily.

Oracle_Stream_Analytics

After this I created a new object, a Map. A Map object defines a set of named areas, which could be sourced from a database table, or drawn manually - which is what I did here by setting the Map Type to 'None (Create Manually)'. One thing to note about the maps is that they're sourced online (openstreetmap.org) so you'll need an internet connection to do this. Once the Map is open, click the Polygon Tool icon and click-drag a shape around the area that you want to "geo-fence". Each area is given a name, and this is what is used in the streaming data to label the event's geographical area.

Yorkshire - God's Own County

Having got our source data stream with geo-data in, and a Map on which to plot it and analyse the location of each event, we now use the Spatial General pattern to create an Exploration. The topology looks like this:

Oracle_Stream_Analytics

The fields in the Spatial General pattern are all pretty obvious. Object key is the field to use to track the same entity across multiple events, if you want to use the enter/exit/stay statuses. For tweets we just use 'Enter', but for people or vehicles, for example, you might get multiple status reports and want to track them on a map. For example, when a person is near a point of interest that you're tracking, or a vehicle has remained in a set area for too long.

Oracle_Stream_Analytics

If you let the Exploration now run, depending on the rate of event ingest, you'll sooner or later see points appearing on the map and event details underneath. The "status" column is populated (blank if the event is outside of the defined geo-fences), as is the "Place", based on the geo-fence names that you defined.

osa_ani_05

Summary

I can see OSA being used in two ways. The first as an ‘endpoint’ for streams with users taking actions based on the data, with some of the use cases listed here. The second is for prototyping transformations and analyses on streams prior to productionising them. The visual interface and immediacy of feedback on transformations applied means that users can quickly understand what further processing they may want to apply to the stream using actual streaming data to inform this.

This latter concept - that of prototyping - is similar to that which we see with another of Oracle’s products, Big Data Discovery. With BDD users can analyse data in the organisation’s data reservoir, as well as apply transformations to it (read more). Just as BDD doesn’t replace OBIEE or Visual Analyzer but enables users to understand how they do want to model the data in these tools, OSA wouldn’t replace “production grade” integration done by Oracle Data Integrator. What it would do is allow users to get a clearer idea of the transformations they would want performed in it.

OSA's user interface is easy to use and intuitive, and this is definitely a tool that you would put in front of technically minded business users. There are limitations to what can be achieved technically through the web GUI alone and something like Oracle Data Integrator (ODI) would still be a more appropriate fit for complex streaming work. At Oracle Open World last year it was announced (slides) that a beta would be starting for ODI using Spark Streaming for ETL and stream processing, so it'll be interesting to see this when it comes out.

Further Reading

Categories: BI & Warehousing

An Introduction to Oracle Stream Analytics

Mon, 2016-07-25 09:00
An Introduction to Oracle Stream Analytics

Oracle Stream Analytics (OSA) is a graphical tool that provides “Business Insight into Fast Data”. In layman terms, that translates into an intuitive web-based interface for exploring, analysing, and manipulating streaming data sources in realtime. These sources can include REST, JMS queues, as well as Kafka. The inclusion of Kafka opens OSA up to integration with many new-build data pipelines that use this as a backbone technology.

An Introduction to Oracle Stream AnalyticsPreviously known as Oracle Stream Explorer, it is part of the SOA component of Fusion Middleware (just as OBIEE and ODI are part of FMW too). In a recent blog it was positioned as “[…] part of Oracle Data Integration And Governance Platform.”. Its Big Data credentials include support for Kafka as source and target, as well as the option to execute across multiple nodes for scaling performance and capacity using Spark. I’ve been exploring OSA from the comfort of my own Mac, courtesy of Docker and a Docker image for OSA created by Guido Schmutz. The benefits of Docker are many and covered elsewhere, but what I loved about it in this instance was that I didn’t have to download a VM that was 10s of GB. Nor did I have to spend time learning how to install OSA from scratch, which whilst interesting wasn’t a priority compared to just trying to tool out and seeing what it could do. [Update] it turns out that installation is a piece of cake, and the download is less than 1Gb … but in general the principle still stands - Docker is a great way to get up and running quickly with something In this article we’ll take OSA for a spin, looking at some of the functionality and terminology, and then real examples of use with live Twitter data. To start with, we sign in to Oracle Stream Analytics: An Introduction to Oracle Stream AnalyticsFrom here, click on the Catalog link, where a list of all the resources are listed. Some of these resource types include:
  • Streams - definitions of sources of data such as Kafka, JMS, and a dummy data generator (event generator)
  • Connections - Servers etc from which Streams are defined
  • Explorations - front-end for seeing contents of Streams in realtime, as well as applying light transformations
  • Targets - destination for transformed streams
Viewing Realtime Twitter Data with OSA The first example I’ll show is the canonical big data/streaming example everywhere – Twitter. Twitter is even built into OSA as a Stream source. If you go to https://dev.twitter.com you can get yourself a set of credentials enabling you to query the live Twitter firehose for given hashtags or users. With my twitter dev credentials, I create a new Connection in OSA: An Introduction to Oracle Stream AnalyticsAn Introduction to Oracle Stream AnalyticsAn Introduction to Oracle Stream AnalyticsNow we have an entry in the Catalog, for the Twitter connection: An Introduction to Oracle Stream Analyticsfrom which we can create a Stream, using the connection and a set of hashtags or users for whom we want to stream tweets: An Introduction to Oracle Stream AnalyticsAn Introduction to Oracle Stream AnalyticsThe Shape is basically the schema or data model that is applied for the stream. There is one built-in for Twitter, which we’ll use here: An Introduction to Oracle Stream AnalyticsWhen you click Save, if you get an error Unable to deploy OEP application then check the OSA log file for errors such as unable to reach Twitter, or invalid credentials. Assuming the Stream is created successfully you are then prompted to create an Exploration from where you can see the Stream in realtime: An Introduction to Oracle Stream AnalyticsExplorations can have multiple stream sources, and be used to transform the contents, which we’ll see later. For now, after clicking Create, we get our Exploration window, which shows the contents of the stream in realtime: An Introduction to Oracle Stream AnalyticsAt the bottom of the screen there’s the option to plot one or more charts showing the value of any numeric values in the stream, as can be seen in the animation above. I’ll leave this example here for now, but finish by using the Publish option from the Actions menu, which makes it available as a source for subsequent analyses. Adding Lookup Data to Streams Let's look now at some more of the options available for transforming and 'wrangling' streaming data with OSA. Here I’m going to show how two streams can be joined together (but not crossed) based on a common field, and the resulting stream used as the input for a subsequent process. The data is simulated, using a CSV file (read by OSA on a loop) and OSA's Event Generator. From the Catalog page I create a new Stream, using Event Generator as the Type: An Introduction to Oracle Stream AnalyticsOn the second page of the setup I define how frequently I want the dummy events to be generated, and the specification for the dummy data: An Introduction to Oracle Stream AnalyticsThe last bit of setup for the stream is to define the Shape, which is the schema of data that I’d like generated: An Introduction to Oracle Stream AnalyticsThe Exploration for this stream shows the dummy data: An Introduction to Oracle Stream AnalyticsThe second stream is going to be sourced from a very simple key/value CSV file:
attr_id,attr_value  
1,never  
2,gonna  
3,give  
4,you  
5,up  
The stream type is CSV, and I can configure how often OSA reads from it, as well as telling OSA to loop back to the beginning when it's read to the end, thus simulating a proper stream. The ‘shape’ is picked up automatically from the file, based on the first row (headers) and then inferred data types: An Introduction to Oracle Stream AnalyticsThe Exploration for the stream shows the five values repeatedly streamed through (since I ticked the box to ‘loop’ the CSV file in the stream): An Introduction to Oracle Stream AnalyticsBack on the Catalog page I’m going to create a new Exploration, but this time based on a Pattern. Patterns are pre-built templates for stream manipulation and processing. Here we’ll use the pattern for a “left outer join” between streams. An Introduction to Oracle Stream AnalyticsAn Introduction to Oracle Stream AnalyticsThe Pattern has a set of pre-defined fields that need to be supplied, including the stream names and the common field with which to join them. Note also that I’ve increased the Window Range. This is necessary so that a greater range of CSV stream events are used for the lookup. If the Range is left at the default of 1 second then only events from both streams occurring in the same second that match on attr_id would be matched. Unless both streams happen to be in sync on the same attr_id from the outset then this isn’t going to happen that often, and certainly wouldn’t in a real-life data stream. So now we have the two joined streams: An Introduction to Oracle Stream AnalyticsWithin an Exploration it is possible to do light transformation work. By right-clicking on a column you can rename or remove it, which I’ve done here for the duplicated attr_id (duplicated since it appears in both streams), as well as renamed the attr_value: An Introduction to Oracle Stream Analytics Daisy-Chaining, Targets, and Topology

Once an Exploration is Published it can be used as the Source for subsequent Explorations, enabling you to map out a pipeline based on multiple source streams and transformations. Here we're taking the exploration created just above that joined the two streams together, and using the output as the source for a new Exploration:

An Introduction to Oracle Stream AnalyticsSince the Exploration is based on a previous one, the same stream data is available, but with the joins and transformations already applied

An Introduction to Oracle Stream AnalyticsFrom here another transformation could be applied, such as replacing the value of one column conditionally based on that of another

An Introduction to Oracle Stream AnalyticsWhilst OSA enables rapid analysis and transformation of inbound streams, it also lets you stream the transformed results outside of OSA, to a Target as we saw in the Kafka example above. As well as Kafka other technologies are supported as targets, including a REST endpoint, or a simple CSV file.

An Introduction to Oracle Stream AnalyticsWith a target configured, as well as an Exploration based on the output of another, the Topology comes in handy for visualising the flow of data. You can access this from the Topology icon in an Exploration page, or from the dropdown menu on the Catalog page against a given object

An Introduction to Oracle Stream Analytics

In the next post I will look at how Oracle Stream Analytics can be used to analyse, enrich, and publish data to and from Kafka. Stay tuned!

Categories: BI & Warehousing

Announcing OBI Remote Training

Wed, 2016-07-20 04:00
Announcing OBI Remote Training

Announcing OBI Remote Training

Since the release of OBIEE 12c in 2015, we have received countless inquiries about how we would be offering our training. Our customers are familiar with our ability to provide on-site private training for a team and we are well known for hosting training classes in our offices in the UK and the US. But what most people aren’t aware of is that we now offer OBI remote training.

Our public training schedule offers a variety of courses monthly, some of which are offered exclusively as remote classes. And for any one of our public classes that is hosted in our U.S. offices, we also offer a limited number of seats to remote attendees. What does this mean for you? This means you have options!

One of our goals here at Rittman Mead is to provide unhindered access to the great wealth of information our team has accumulated through their extensive real-world experience. Now we've translated this goal into more accessible training. We understand budgets can be tight and travel may not always be an option for you or your team, but we don’t want that to be the reason you can’t attend our training.

In mid-2015 we started testing our ability to deliver remote training. Our main concern as we began testing was whether we’d be able to deliver the same value to our customers in a digital classroom that we've traditionally been able to deliver in a physical classroom. Our fear was that when you lost the face-to-face interaction between the instructor and students, you would also lose some of the rhythm and chemistry of the training, and, consequently, our students would feel less engaged. Other more technical concerns were on our minds, ranging from sound and video quality to connectivity. Much to our surprise and satisfaction, however, our concerns quickly dissolved as, time after time, we were able to deliver the training without issue.

So after plenty of testing, we are pleased to offer remote training as a regular option in our training schedule.

We are aware that remote training (or online training) has been around for some timewe are not claiming to be innovators in the ways of online learning—but we feel that the platform for online learning has finally reached a level that is in line with the quality we demand for our training.

In fact, we have consistently received high marks from customers who have attended our remote training, solidifying our confidence that it does in fact live up to our standards. We invite you to check out our training options. Whether it be on-site training (public or private) or remote training, rest assured that you will be receiving expert-level training from Rittman Mead’s best.

For a full list of our scheduled trainings, see our US or UK calendars.

Categories: BI & Warehousing

System Metrics Collectors

Tue, 2016-07-19 02:18
System Metrics Collectors

The need to monitor and control the system performances is not new. What is new is the trend of clever, lightweight, easy to setup, open source metric collectors in the market, along with timeseries databases to store these metrics, and user friendly front ends through which to display and analyse the data.

In this post I will compare Telegraf, Collectl and Topbeat as lightweight metric collectors. All of them do a great job of collecting variety of useful system and application statistic data with minimal overhead to the servers.  Each has the strength of easy configuration and accessible documentation but still there are some differences around range of input and outputs; how they extract the data, what metrics they collect and where they store them.

  • Telegraf is part of the Influx TICK stack, and works with a vast variety of useful input plugins such as Elasticsearch, nginx, AWS and so on. It also supports a variety of outputs, obviously InfluxDB being the primary one. (Find out more...)
  • Topbeat is a new tool from Elastic, the company behind Elasticsearch, Logstash, and Kibana. The Beats platform is evolving rapidly, and includes topbeat, winlogbeat, and packetbeat. In terms of metric collection its support for detailed metrics such as disk IO is relatively limited currently. (Find out more...)
  • Collectl is a long-standing favourite of systems performance analysts, providing a rich source of data. This depth of complexity comes at a bit of a cost when it comes to the documentation’s accessibility, it being aimed firmly at a systems programmer! (Find out more...)

In this post I have used InfluxDB as the backend for storing the data, and Grafana as the front end visualisation tool. I will explain more about both tools later in this post.

In the screenshot below I have used Grafana dashboards to show  "Used CPU", "Used Memory" and "Network Traffic" stats from the mentioned collectors. As you can see the output of all three is almost the same. What makes them different is:

    • What your infrastructure can support? For example, you cannot install Telegraf on old version of X Server.
    • What input plugins do you need? The current version of Topbeat doesn’t support more detailed metrics such as disk IO and network stats.
    • What storage do you want/need to use for the outputs? InfluxDB works as the best match for Telegraf data, whilst Beats pairs naturally with Elasticsearch
    • What is your visualisation tool and what does it work with best. In all cases the best front end should natively support time series visualisations.

System Metrics Collectors

Next I am going to provide more details on how to download/install each of the mentioned metrics collector services, example commands are written for a linux system.

Telegraf "An open source agent written in Go for collecting metrics and data on the system it's running on or from other services. Telegraf writes data it collects to InfluxDB in the correct format."
  1. Download and install InfluxDB: sudo yum install -y https://s3.amazonaws.com/influxdb/influxdb-0.10.0-1.x86_64.rpm
  2. Start the InfluxDB service: sudo service influxdb start
  3. Download Telegraf: wget http://get.influxdb.org/telegraf/telegraf-0.12.0-1.x86_64.rpm
  4. Install Telegraf: sudo yum localinstall telegraf-0.12.0-1.x86_64.rpm
  5. Start the Telegraf service: sudo service telegraph start
  6. Done!

The default configuration file for Telegraf sits in /etc/telegraf/telegraf.conf or a new config file can be generated using the -sample-config flag on the location of your choice:  telegraf -sample-config > telegraf.conf .  Update the config file to enable/disable/setup different input or outputs plugins e.g. I enabled network inputs: [[inputs.net]]. Finally to test the config files and to verify the output metrics run: telegraf -config telegraf.conf -test

Once all ready and started, a new database called 'telegraf' will be added to the InfluxDB storage which you can connect and query. You will read more about InfluxDB in this post.

 

Collectl Unlike most monitoring tools that either focus on a small set of statistics, format their output in only one way, run either interactively or as a daemon but not both, collectl tries to do it all. You can choose to monitor any of a broad set of subsystems which currently include buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp.
  • Install collectl: sudo yum install collectl
  • Update the Collectl config file at /etc/collectl.conf to turn on/off different switches and also to write the Collectl's output logs to a database, i.e. InfluxDB
  • Restart Collectl service  sudo service collectl restart
  • Collectl will write its log in a new InfluxDB database called “graphite”.

 

Topbeat Topbeat is a lightweight way to gather CPU, memory, and other per-process and system wide data, then ship it to (by default) Elasticsearch to analyze the results.
  • Download Topbeat: wget https://download.elastic.co/beats/topbeat/topbeat-1.2.1-x86_64.rpm
  • Install: sudo yum local install topbeat-1.2.1-x86_64.rpm
  • Edit the topbeat.yml configuration file at /etc/topbeat and set the output to elasticsearch or logstash.
  • If choosing elasticsearch as output, you need to load the index template, which lets Elasticsearch know which fields should be analyzed in which way. The recommended template file is installed by the Topbeat packages. You can either configure Topbeat to load the template automatically, Or you can run a shell script to load the template: curl -XPUT 'http://localhost:9200/_template/topbeat -d@/etc/topbeat/topbeat.template.json
  • Run topbeat: sudo /etc/init.d/topbeat start
  • To test your Topbeat Installation try: curl -XGET 'http://localhost:9200/topbeat-*/_search?pretty'
  • TopBeat logs are written at /var/log
  • Reference to output fields 

 

Why write another metrics collector?

From everything that I have covered above, it is obvious that there is no shortage of open source agents for collecting metrics. Still you may come across a situation that none of the options could be used e.g. specific operating system (in this case, MacOS on XServe) that can’t support any of the options above. The below code is my version of light metric collector, to keep track of Disk IO stats, network, CPU and memory of the host where the simple bash script will be run.

The code will run through an indefinite loop until it is forced quit. Within the loop, first I have used a CURL request (InfluxDB API Reference) to create a database called OSStat, if the database name exists nothing will happen. Then I have used a variety of built-in OS tools to extract the data I needed. In my example sar -u for cpu, sar -n for network, vm_stat for memory, iotop for diskio could return the values I needed. With a quick search you will find many more options. I also used a combinations of awk, sed and grep to transform the values from these tools to the structure that I was easier to use on the front end. Finally I pushed the results to InfluxDB using the curl requests.

#!/bin/bash  
export INFLUX_SERVER=$1  
while [ 1 -eq 1 ];  
do

#######CREATE DATABASE ########
curl -G http://$INFLUX_SERVER:8086/query  -s --data-urlencode "q=CREATE DATABASE OSStat" > /dev/null

####### CPU  #########
sar 1 1 -u | tail -n 1 | awk -v MYHOST=$(hostname)   '{  print "cpu,host="MYHOST"  %usr="$2",%nice="$3",%sys="$4",%idle="$5}' | curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null

####### Memory ##########
FREE_BLOCKS=$(vm_stat | grep free | awk '{ print $3 }' | sed 's/\.//')  
INACTIVE_BLOCKS=$(vm_stat | grep inactive | awk '{ print $3 }' | sed 's/\.//')  
SPECULATIVE_BLOCKS=$(vm_stat | grep speculative | awk '{ print $3 }' | sed 's/\.//')  
WIRED_BLOCKS=$(vm_stat | grep wired | awk '{ print $4 }' | sed 's/\.//')

FREE=$((($FREE_BLOCKS+SPECULATIVE_BLOCKS)*4096/1048576))  
INACTIVE=$(($INACTIVE_BLOCKS*4096/1048576))  
TOTALFREE=$((($FREE+$INACTIVE)))  
WIRED=$(($WIRED_BLOCKS*4096/1048576))  
ACTIVE=$(((4096-($TOTALFREE+$WIRED))))  
TOTAL=$((($INACTIVE+$WIRED+$ACTIVE)))

curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary  "memory,host="$(hostname)" Free="$FREE",Inactive="$INACTIVE",Total-free="$TOTALFREE",Wired="$WIRED",Active="$ACTIVE",total-used="$TOTAL > /dev/null

####### Disk IO ##########
iotop -t 1 1 -P | head -n 2  | grep 201 | awk -v MYHOST=$(hostname)  
  '{ print "diskio,host="MYHOST" io_time="$6"read_bytes="$8*1024",write_bytes="$11*1024}'  | curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null

###### NETWORK ##########
sar -n DEV 1  |grep -v IFACE|grep -v Average|grep -v -E ^$ | awk -v MYHOST="$(hostname)" '{print "net,host="MYHOST",iface="$2" pktin_s="$3",bytesin_s="$4",pktout_s="$4",bytesout_s="$5}'|curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null

sleep 10;  
done

 

 

InfluxDB Storage "InfluxDB is a time series database built from the ground up to handle high write and query loads. It is the second piece of the TICK stack. InfluxDB is meant to be used as a backing store for any use case involving large amounts of timestamped data, including DevOps monitoring, application metrics, IoT sensor data, and real-time analytics."

InfluxDB's SQL-like query language is called InfluxQL, You can connect/query InfluxDB via Curl requests (mentioned above), command line or browser. The following sample InfluxQLs cover useful basic command line statements to get you started:

influx -- Connect to the database

SHOW DATABASES  -- Show existing databases, _internal is the embedded databased used for internal metrics

USE telegraf -- Make 'telegraf' the current database

SHOW MEASUREMENTS -- show all tables within current database

SHOW FIELD KEYS -- show tables definition within current database

InfluxDB also have a browser admin console that is by default accessible on port 8086. (Official Reference(Read more on RittmanMead Blog)

System Metrics Collectors

 

Grafana Visualisation "Grafana provides rich visualisation options best for working with time series data for Internet infrastructure and application performance analytics."

Best to use InfluxDB as datasource for Grafana as Elasticsearch datasources doesn't support all Grafana's features e.g. functions behind the panels. Here is a good introduction video to visualisation with Grafana.

System Metrics Collectors

Categories: BI & Warehousing

Connecting Oracle Data Visualization Desktop to OBIEE

Mon, 2016-07-18 04:00
Connecting Oracle Data Visualization Desktop to OBIEE

Recently at Rittman Mead we have been asked a lot of questions surrounding Oracle’s new Data Visualization Desktop tool and how it integrates with OBIEE. Rather than referring people to the Oracle docs on DVD, I decided to share with you my experience connecting to an OBIEE 12c instance and take you through some of the things I learned through the process.

In a previous blog, I went though database connections with Data Visualization Desktop and how to create reports using data pulled directly from the database. Connecting to DVD to OBIEE is largely the same process, but allows the user to pull in data at pre-existing report level. I decided to use our 12c ChitChat demo server as the OBIEE source and created some sample reports in answers to test out with DVD.

From the DVD Data Sources page, clicking "Create New Data Source" brings up a selection pane with the option to select “From Oracle Applications.”

Connecting Oracle Data Visualization Desktop to OBIEE

Clicking this option brings up a connection screen with options to enter a connection name, URL (location of the reports you want to pull in as a source), username, and password respectively. This seems like a pretty straightforward process. Reading the Oracle docs on connectivity to OBIEE with DVD say to navigate to the web catalog, select the folder containing the analysis you want to use as a source, and then copy and paste the URL from your browser into the URL connection in DVD. However, using this method will cause the connection to fail.

Connecting Oracle Data Visualization Desktop to OBIEE

To get Data Visualization Desktop to connect properly, you have to use the URL that you would normally use to log into OBIEE analytics with the proper username and password.

Connecting Oracle Data Visualization Desktop to OBIEE

Once connected, the web catalog folders are displayed.

Connecting Oracle Data Visualization Desktop to OBIEE

From here, you can navigate to the analyses you want to use for data sources.

Connecting Oracle Data Visualization Desktop to OBIEE

Selecting the analysis you want to use as your data source is the same process as selecting schemas and tables from a database source. Once the selection is made, a new screen is displayed with all of the tables and columns that were used for the analysis within OBIEE.

Connecting Oracle Data Visualization Desktop to OBIEE

From here you can specify each column as an attribute or measure column and change the aggregation for your measures to something other than what was imported with the analysis.

Clicking "Add to Project" loads all the data into DVD under Data Elements and is displayed on the right hand side just like subject area contents in OBIEE.

Connecting Oracle Data Visualization Desktop to OBIEE

The objective of pulling data in from existing analyses is described by Oracle as revisualization. Keep in mind that Data Visualization Desktop is meant to be a discovery tool and not so much a day-to-day report generator.

The original report was a pivot table with Revenue and Order information for geographical, product and time series dimensions. Let’s say that I just wanted to look at the revenue for all cities located in the Americas by a specific brand for the year 2012.

Dragging in the appropriate columns and adding filters took seconds and the data loaded almost instantaneously. I changed the view to horizontal bar and added a desc sort to Revenue and this was my result:

Connecting Oracle Data Visualization Desktop to OBIEE

Notice how the revenue for San Fransisco is much higher than any of the other states. Let’s say I want to get a closer look at all the other states without seeing the revenue data for San Fransisco. I could create a new filter for City and exclude San Fransisco from the list or I could just create a filter range for Revenue. Choosing the latter gave me the option of moving a slider to change my revenue value distribution and showed me the results in real time. Pretty cool, right?

Connecting Oracle Data Visualization Desktop to OBIEE

Connecting Oracle Data Visualization Desktop to OBIEE

Taking one report and loading it in can open up a wide range of data discovery opportunities but what if there are multiple reports I want to pull data from? You can do this and combine the data together in DVD as long as the two reports contain columns to join the two together.

Going back to my OBIEE connection, there are two reports I created on the demo server that both contain customer data.

Connecting Oracle Data Visualization Desktop to OBIEE

By pulling in both the Customer Information and Number of Customer Orders Per Year report, Data Visualization Desktop creates two separate data sources which show up under Data Elements.

Connecting Oracle Data Visualization Desktop to OBIEE

Inspecting one of the data sources shows the match between the two is made on both Customer Number and Customer Name columns.

Connecting Oracle Data Visualization Desktop to OBIEE

Note: It is possible to make your own column matches manually using the Add Another Match feature.

By using two data sets from two different reports, you can blend the data together to discover trends, show outliers and view the data together without touching the database or having to create new reports within OBIEE.

Connecting Oracle Data Visualization Desktop to OBIEE

The ability to connect directly to OBIEE with Data Visualization Desktop and pull in data from individual analyses is a very powerful feature that makes DVD’s that much greater. Combining data from multiple analyses blend them together internally creates some exciting data discovery possibilities for users with existing OBIEE implementations.

Categories: BI & Warehousing

Using R with Jupyter Notebooks and Oracle Big Data Discovery

Thu, 2016-07-14 05:00
Using R with Jupyter Notebooks and Oracle Big Data Discovery

Oracle's Big Data Discovery encompasses a good amount of exploration, transformation, and visualisation capabilities for datasets residing in your organisation’s data reservoir. Even with this though, there may come a time when your data scientists want to unleash their R magic on those same datasets. Perhaps the data domain expert has used BDD to enrich and cleanse the data, and now it's ready for some statistical analysis? Maybe you'd like to use R's excellent forecast package to predict the next six months of a KPI from the BDD dataset? And not only predict it, but write it back into the dataset for subsequent use in BDD? This is possible using BDD Shell and rpy2. It enables advanced analysis and manipulation of datasets already in BDD. These modified datasets can then be pushed back into Hive and then BDD.

BDD Shell provides a native Python environment, and you may opt to use the pandas library to work with BDD datasets as detailed here. In other cases you may simply prefer working with R, or have a particular library in mind that only R offers natively. In this article we’ll see how to do that. The "secret sauce" is rpy2 which enables the native use of R code within a python-kernel Jupyter Notebook.

As with previous articles I’m using a Jupyter Notebook as my environment. I’ll walk through the code here, and finish with a copy of the notebook so you can see the full process.

First we'll see how you can use R in Jupyter Notebooks running a python kernel, and then expand out to integrate with BDD too. You can view and download the first notebook here.

Import the RPY2 environment so that we can call R from Jupyter

import readline is necessary to workaround the error: /u01/anaconda2/lib/libreadline.so.6: undefined symbol: PC

import readline
%load_ext rpy2.ipython
Example usage Single inline command, prefixed with %R
%R X=c(1,4,5,7); sd(X); mean(X)
array([ 4.25])
R code block, marked by %%R
%%R
Y = c(2,4,3,9)
summary(lm(Y~X))
Call:  
lm(formula = Y ~ X)

Residuals:  
    1     2     3     4  
 0.88 -0.24 -2.28  1.64 

Coefficients:  
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.0800     2.3000   0.035    0.975  
X             1.0400     0.4822   2.157    0.164

Residual standard error: 2.088 on 2 degrees of freedom  
Multiple R-squared:  0.6993,    Adjusted R-squared:  0.549  
F-statistic: 4.651 on 1 and 2 DF,  p-value: 0.1638
Graphics plot, output to the notebook
%R plot(X, Y)
Using R with Jupyter Notebooks and Oracle Big Data Discovery Pass Python variable to R using -i
import numpy as np
Z = np.array([1,4,5,10])
%R -i Z mean(Z)
array([ 5.])
For more information see the documentation Working with BDD Datasets from R in Jupyter Notebooks Now that we've seen calling R in Jupyter Notebooks, let's see how to use it with BDD in order to access datasets. The first step is to instantiate the BDD Shell so that you can access the datasets in BDD, and then to set up the R environment using rpy2
execfile('ipython/00-bdd-shell-init.py')  
%load_ext rpy2.ipython

I also found that I had to make readline available otherwise I got an error (/u01/anaconda2/lib/libreadline.so.6: undefined symbol: PC)

import readline  

After this, we can import a BDD dataset, convert it to a Spark dataframe and then a pandas dataframe, ready for passing to R

ds = dss.dataset('edp_cli_edp_8d6fd230-8e99-449c-9480-0c2bddc4f6dc')  
spark_df = ds.to_spark()  
import pandas as pd  
pandas_df = spark_df.toPandas()  

Note that there is a lot of passing of the same dataframe into different memory structures here - from BDD dataset context to Spark to Pandas, and that’s before we’ve even hit R. It’s fine for ad-hoc wrangling but might start to be painful with very large datasets.

Now we use the rpy2 integration with Jupyter Notebooks and invoke R parsing of the cell’s contents, using the %%R syntax. Optionally, we can pass across variables with the -i parameter, which we’re doing here. Then we assign the dataframe to an R-notation variable (optional, but stylistically nice to do), and then use R's summary function to show a summary of each attribute:

%%R -i pandas_df  
R.df <- pandas_df  
summary(R.df)  
vendorid     tpep_pickup_datetime tpep_dropoff_datetime passenger_count  
 Min.   :1.000   Min.   :1.420e+12    Min.   :1.420e+12     Min.   :0.000  
 1st Qu.:1.000   1st Qu.:1.427e+12    1st Qu.:1.427e+12     1st Qu.:1.000  
 Median :2.000   Median :1.435e+12    Median :1.435e+12     Median :1.000  
 Mean   :1.525   Mean   :1.435e+12    Mean   :1.435e+12     Mean   :1.679  
 3rd Qu.:2.000   3rd Qu.:1.443e+12    3rd Qu.:1.443e+12     3rd Qu.:2.000  
 Max.   :2.000   Max.   :1.452e+12    Max.   :1.452e+12     Max.   :9.000  
 NA's   :12      NA's   :12           NA's   :12            NA's   :12     
 trip_distance      pickup_longitude  pickup_latitude    ratecodeid    
 Min.   :    0.00   Min.   :-121.93   Min.   :-58.43   Min.   : 1.000  
 1st Qu.:    1.00   1st Qu.: -73.99   1st Qu.: 40.74   1st Qu.: 1.000  
 Median :    1.71   Median : -73.98   Median : 40.75   Median : 1.000  
 Mean   :    3.04   Mean   : -72.80   Mean   : 40.10   Mean   : 1.041  
 3rd Qu.:    3.20   3rd Qu.: -73.97   3rd Qu.: 40.77   3rd Qu.: 1.000  
 Max.   :67468.40   Max.   : 133.82   Max.   : 62.77   Max.   :99.000  
 NA's   :12         NA's   :12        NA's   :12       NA's   :12      
 store_and_fwd_flag dropoff_longitude dropoff_latitude  payment_type 
 N   :992336        Min.   :-121.93   Min.   : 0.00    Min.   :1.00  
 None:    12        1st Qu.: -73.99   1st Qu.:40.73    1st Qu.:1.00  
 Y   :  8218        Median : -73.98   Median :40.75    Median :1.00  
                    Mean   : -72.85   Mean   :40.13    Mean   :1.38  
                    3rd Qu.: -73.96   3rd Qu.:40.77    3rd Qu.:2.00  
                    Max.   :   0.00   Max.   :44.56    Max.   :5.00  
                    NA's   :12        NA's   :12       NA's   :12    
  fare_amount          extra            mta_tax          tip_amount     
 Min.   :-170.00   Min.   :-1.0000   Min.   :-1.7000   Min.   :  0.000  
 1st Qu.:   6.50   1st Qu.: 0.0000   1st Qu.: 0.5000   1st Qu.:  0.000  
 Median :   9.50   Median : 0.0000   Median : 0.5000   Median :  1.160  
 Mean   :  12.89   Mean   : 0.3141   Mean   : 0.4977   Mean   :  1.699  
 3rd Qu.:  14.50   3rd Qu.: 0.5000   3rd Qu.: 0.5000   3rd Qu.:  2.300  
 Max.   : 750.00   Max.   :49.6000   Max.   :52.7500   Max.   :360.000  
 NA's   :12        NA's   :12        NA's   :12        NA's   :12       
  tolls_amount      improvement_surcharge  total_amount       PRIMARY_KEY     
 Min.   : -5.5400   Min.   :-0.3000       Min.   :-170.80   0-0-0   :      1  
 1st Qu.:  0.0000   1st Qu.: 0.3000       1st Qu.:   8.75   0-0-1   :      1  
 Median :  0.0000   Median : 0.3000       Median :  11.80   0-0-10  :      1  
 Mean   :  0.3072   Mean   : 0.2983       Mean   :  16.01   0-0-100 :      1  
 3rd Qu.:  0.0000   3rd Qu.: 0.3000       3rd Qu.:  17.80   0-0-1000:      1  
 Max.   :503.0500   Max.   : 0.3000       Max.   : 760.05   0-0-1001:      1  
 NA's   :12         NA's   :12            NA's   :12        (Other) :1000560  

We can use native R code and R libraries including the excellent dplyr to lightly wrangle and then chart the data:

%%R

library(dplyr)  
library(ggplot2)

R.df %>%  
    filter(fare_amount > 0) %>%  
    ggplot(aes(y=fare_amount, x=tip_amount,color=passenger_count)) +  
    geom_point(alpha=0.5 )
Using R with Jupyter Notebooks and Oracle Big Data Discovery

Finally, using the -o flag on the %%R invocation, we can pass back variables from the R context back to pandas :

%%R -o R_output  
R_output <-  
    R.df %>%  
    mutate(foo = 'bar')

and from there back to Spark and write the results to Hive:

spark_df2 = sqlContext.createDataFrame(R_output)  
spark_df2.write.mode('Overwrite').saveAsTable('default.updated_dataset')  

and finally ingest the new Hive table to BDD:

from subprocess import call  
call(["/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI","--table default.updated_dataset"])  

You can download the notebook here.

https://gist.github.com/6f7d3138efdbb322d8543d35912c99ab

Categories: BI & Warehousing

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

Wed, 2016-07-13 09:02
Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

Big Data Discovery (BDD) is a great tool for exploring, transforming, and visualising data stored in your organisation’s Data Reservoir. I presented a workshop on it at a recent conference, and got an interesting question from the audience that I thought I’d explore further here. Currently the primary route for getting data into BDD requires that it be (i) in HDFS and (ii) have a Hive table defined on top of it. From there, BDD automagically ingests the Hive table, or the dataprocessingCLI is manually called which prompts the BDD DGraph engine to go and sample (or read in full) the Hive dataset.

This is great, and works well where the dataset is vast (this is Big Data, after all) and needs the sampling that DGraph provides. It’s also simple enough for Hive tables that have already been defined, perhaps by another team. But - and this was the gist of the question that I got - what about where the Hive table doesn’t exist already? Because if it doesn’t, we now need to declare all the columns as well as choose the all-important SerDe in order to read the data.

SerDes are brilliant, in that they enable the application of a schema-on-read to data in many forms, but at the very early stages of a data project there are probably going to be lots of formats of data (such as TSV, CSV, JSON, as well as log files and so on) from varying sources. Choosing the relevant SerDe for each one, and making sure that BDD is also configured with the necessary jar, as well as manually listing each column to be defined in the table, adds overhead to the project. Wouldn’t it be nice if we could side-step this step somehow? In this article we’ll see how!

Importing Datasets through BDD Studio

Before we get into more fancy options, don’t forget that BDD itself offers the facility to upload CSV, TSV, and XLSX files, as well as connect to JDBC datasources. Data imported this way will be stored by BDD in a Hive table and ingested to DGraph.

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

This is great for smaller files held locally. But what about files on your BDD cluster, that are too large to upload from local machine, or in other formats - such as JSON?

Loading a CSV file

As we’ve just seen, CSV files can be imported to Hive/BDD directly through the GUI. But perhaps you’ve got a large CSV file sat local to BDD that you want to import? Or a folder full of varying CSV files that would be too time-consuming to upload through the GUI one-by-one?

For this we can use BDD Shell with the Python Pandas library, and I’m going to do so here through the excellent Jupyter Notebooks interface. You can read more about these here and details of how to configure them on BigDataLite 4.5 here. The great thing about notebooks, whether Jupyter or Zeppelin, is that I don’t need to write any more blog text here - I can simply embed the notebook inline and it is self-documenting:

https://gist.github.com/76b477f69303dd8a9d8ee460a341c445

(gist link)

Note that at end of this we call dataprocessingCLI to automatically bring the new table into BDD’s DGraph engine for use in BDD Studio. If you’ve got BDD configured to automagically add new Hive tables, or you don’t want to run this step, you can just comment it out.

Loading simple JSON data

Whilst CSV files are tabular by definition, JSON records can contain nested objects (recursively), as well as arrays. Let’s look at an example of using SparkSQL to import a simple flat JSON file, before then considering how we handle nested and array formats. Note that SparkSQL can read datasets from both local (file://) storage as well as HDFS (hdfs://):

https://gist.github.com/8b7118c230f34f7d57bd9b0aa4e0c34c

(gist link)

Once loaded into Hive, it can be viewed in Hue:

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery Loading nested JSON data

What’s been great so far, whether loading CSV, XLS, or simple JSON, is that we’ve not had to list out column names. All that needs modifying in the scripts above to import a different file with a different set of columns is to change the filename and the target tablename. Now we’re going to look at an example of a JSON file with nested objects - which is very common in JSON - and we’re going to have to roll our sleeves up a tad and start hardcoding some schema details.

First up, we import the JSON to a SparkSQL dataframe as before (although this time I’m loading it from HDFS, but local works too):

df = sqlContext.read.json('hdfs:///user/oracle/incoming/twitter/2016/07/12/')  

Then I declare this as a temporary table, which enables me to subsequently run queries with SQL against it

df.registerTempTable("twitter")  

A very simple example of a SQL query would be to look at the record count:

result_df = sqlContext.sql("select count(*) from twitter")  
result_df.show()

+----+  
| _c0|  
+----+  
|3011|  
+----+

The result of a sqlContext.sql invocation is a dataframe, which above I’m assigning to a new variable, but I could as easily run:

sqlContext.sql("select count(*) from twitter").show()  

for the same result.

The sqlContext has inferred the JSON schema automagically, and we can inspect it using

df.printSchema()  

The twitter schema is huge, so I’m just quoting a few choice sections of it here to illustrate subsequent points:

root  
|-- created_at: string (nullable = true)  
|-- entities: struct (nullable = true)  
|    |-- hashtags: array (nullable = true)  
|    |    |-- element: struct (containsNull = true)  
|    |    |    |-- indices: array (nullable = true)  
|    |    |    |    |-- element: long (containsNull = true)  
|    |    |    |-- text: string (nullable = true)  
|    |-- user_mentions: array (nullable = true)  
|    |    |-- element: struct (containsNull = true)  
|    |    |    |-- id: long (nullable = true)  
|    |    |    |-- id_str: string (nullable = true)  
|    |    |    |-- indices: array (nullable = true)  
|    |    |    |    |-- element: long (containsNull = true)  
|    |    |    |-- name: string (nullable = true)  
|    |    |    |-- screen_name: string (nullable = true)  
|-- source: string (nullable = true)  
|-- text: string (nullable = true)  
|-- timestamp_ms: string (nullable = true)  
|-- truncated: boolean (nullable = true)  
|-- user: struct (nullable = true)  
|    |-- followers_count: long (nullable = true)  
|    |-- following: string (nullable = true)  
|    |-- friends_count: long (nullable = true)  
|    |-- name: string (nullable = true)  
|    |-- screen_name: string (nullable = true)

Points to note about the schema:

  • In the root of the schema we have attributes such as text and created_at
  • There are nested elements (“struct”) such as user and within it screen_name, followers_count etc
  • There’s also array objects, where an attribute can occur more than one, such as hashtags, and user_mentions.

Accessing root and nested attributes is easy - we just use dot notation:

sqlContext.sql("SELECT created_at, user.screen_name, text FROM twitter").show()

+--------------------+--------------+--------------------+  
|          created_at|   screen_name|                text|  
+--------------------+--------------+--------------------+  
|Tue Jul 12 16:13:...|  Snehalstocks|"Students need to...|  
|Tue Jul 12 16:13:...|   KingMarkT93|Ga caya :( https:...|

We can save this as a dataframe that’s then persisted to Hive, for ingest into BDD:

subset02 = sqlContext.sql("SELECT created_at, user.screen_name, text FROM twitter")  
tablename = 'twitter_user_text'  
qualified_tablename='default.' + tablename  
subset02.write.mode('Overwrite').saveAsTable(qualified_tablename)  

Which in Hue looks like this:

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

Attributes in an array are a bit more tricky. Here’s an example tweet with multiple user_mentions and a hashtag too:

https://twitter.com/flederbine/status/752940179569115136

Here we use the LATERAL VIEW syntax, with the optional OUTER operator since not all tweets have these additional entities, and we want to make sure we show all tweets including those that don’t have these entities. Here’s the SQL formatted for reading:

SELECT id,  
created_at,  
user.screen_name,  
text as tweet_text,  
hashtag.text as hashtag,  
user_mentions.screen_name as mentioned_user  
from twitter  
LATERAL VIEW OUTER explode(entities.user_mentions) user_mentionsTable as user_mentions  
LATERAL VIEW OUTER explode(entities.hashtags) hashtagsTable AS hashtag  

Which when run as from sqlContext.sql() gives us:

+------------------+--------------------+---------------+--------------------+-------+---------------+  
|                id|          created_at|    screen_name|          tweet_text|hashtag|    screen_name|  
+------------------+--------------------+---------------+--------------------+-------+---------------+  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|      johnnyq72|  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|       orcldoug|  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|          rmoff|  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|    markrittman|  
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|     mikedurran|  
+------------------+--------------------+---------------+--------------------+-------+---------------+

and written back to Hive for ingest to BDD:

Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery

You can use these SQL queries both for simply flattening JSON, as above, or for building summary tables, such as this one showing the most common hashtags in the dataset:

sqlContext.sql("SELECT hashtag.text,count(*) as inst_count from twitter  LATERAL VIEW OUTER explode(entities.hashtags) hashtagsTable AS hashtag GROUP BY hashtag.text order by inst_count desc").show(4)

+-----------+----------+  
|       text|inst_count|  
+-----------+----------+  
|     Hadoop|       165|  
|     Oracle|       151|  
|        job|       128|  
|    BigData|       112|

You can find the full Jupyter Notebook with all these nested/array JSON examples here:

https://gist.github.com/a38e853d3a7dcb48a9df99ce1e3505ff

(gist link)

You may decide after looking at this that you’d rather just go back to Hive and SerDes, and as is frequently the case in ‘data wrangling’ there’s multiple ways to achieve the same end. The route you take comes down to personal preference and familiarity with the toolsets. In this particular case I'd still go for SparkSQL for the initial exploration as it's quicker to 'poke around' the dataset than with defining and re-defining Hive tables -- YMMV. A final point to consider before we dig in is that SparkSQL importing JSON and saving back to HDFS/Hive is a static process, and if your underlying data is changing (e.g. streaming to HDFS from Flume) then you would probably want a Hive table over the HDFS file so that it is live when queried.

Loading an Excel workbook with many sheets

This was the use-case that led me to researching programmatic import of datasets in the first place. I was doing some work with a dataset of road traffic accident data, which included a single XLS file with over 30 sheets, each a lookup table for a separate set of dimension attributes. Importing each sheet one by one through the BDD GUI was tedious, and being a lazy geek, I looked to automate it.

Using Pandas read_excel function and a smidge of Python to loop through each sheet it was easily done. You can see the full notebook here:

https://gist.github.com/rmoff/3fa5d857df8ca5895356c22e420f3b22

(gist link)

Categories: BI & Warehousing

Rittman Mead at KScope16

Tue, 2016-06-21 10:01

June is the perfect month: summer begins, major football (and futbol) tournaments are in full swing, and of course, KScope16 is on! Rittman Mead have participated in many of the past KScope conferences and will continue that tradition this year with a wide range of presentation and training topics across the Business Intelligence, Data Warehousing, Big Data, and Advanced Analytics subject areas. This year the event is held in Chicago at the Sheraton Grand Chicago, a great location right on the river and a short distance from sights such as the Navy Pier and Magnificent Mile. And, as always, there will be some great networking opportunities throughout the week.

chicago-kscope16

This year, we’re pleased to congratulate our own Becky Wagner (of Becky’s BI Apps Corner) on her graduation from the ODTUG Leadership Program. This is the third class to complete the program, which aims to build leadership skills, confidence, and work/life balance skills, amongst many other things, throughout the 9 month class duration. Well done, Becky! You’ll be able to find her and the rest of the 2016 Leadership class at the Career Community/Leadership Program Information Breakfast – Monday, June 27th 7:30AM at the Sheraton hotel main dining area. They will be kicking off the application season for the 2017 Leadership Program and providing further information about the new ODTUG career program as well.

The rest of the Rittman Mead attendees will be presenting one or more sessions during the week.

KScope16 Rittman Mead Schedule

The week kicks off with my talk Monday on GoldenGate and Kafka integration using the new Oracle GoldenGate for Big Data release 12.2. The topic of this session, real-time data streaming, provides a great solution for the challenging task of ingesting data from multiple different sources, and different structures, and making that data readily available for others in the organization to use. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution.

GoldenGate to Kafka logo

Tuesday morning, Charles Elliott, Managing Director of Rittman Mead in the US, will begin the first of his two-part hands on lab sessions on Free Form Data Visualizations. This hands-on introduction to all things visual in OBIEE’s new Visual Analyzer (VA) will allow attendees to see the top reporting features within the current release while also understanding the core differences between OBIEE’s previous reporting platform and what’s available now. In addition to creating some of the most eye-catching visualizations the new tool has to offer, this session will highlight new OBIEE 12c features such as mapping, advanced analytics, and an overview of the brand-new web catalog. Whether you are just interested in what’s new in free-form Oracle BI visualizations or preparing for an upgrade, this session is for you!

During the lunch and learn sessions you can find me moderating the BI and Data Warehousing panel and Mark Rittman will be moderating the Big Data and Advanced Analytics crew. Grab a lunch and join us!

The remainder of Rittman Mead sessions fall on Wednesday, with some interesting topics around Big Data Discovery, OBIEE branding and Essbase integration, and the Kimball ETL Subsystems. Also, the second day of the Free Form Data Visualizations HOL will be held on Wednesday afternoon.

It all starts with Mark Rittman presenting “OBIEE 12c: What’s New for Integration and Reporting Against EPM Sources”. He’ll dig into interesting new features around OBIEE 12c and Essbase integration, including improvements around using Essbase as a query acceleration layer and general improvements around usability and RPD/Outline integration. Come to this session to see what they are, how they affect OBIEE/Essbase integration, and where this initiative may be going over time.

Following Mark and his dive into OBIEE and Essbase on Wednesday will be my session, “A Walk Through the Kimball ETL Subsystems with Oracle Data Integration”, with a focus on how to implement the ETL Subsystems using Oracle Data Integration solutions. As you may know, Ralph Kimball, dimensional modeling and data warehousing expert and founder of The Kimball Group, spent much of his career working to build an enterprise data warehouse methodology that can meet analytical reporting needs. His book, “The Data Warehouse ETL Toolkit,” is a guide for many ETL developers. This session will walk you through his ETL Subsystem categories, describing how the Oracle Data Integration products (Oracle Data Integrator, GoldenGate, & Enterprise Data Quality) are perfectly suited for the Kimball approach.

After lunch, Andy Rocha and Pete Tamisin will dive into “How to Brand and Own Your OBIEE Interface: Past, Present, and Future”. In this presentation, they will explore the different techniques in branding and customizing your OBIEE environment in different versions: 11.1.1.7, 11.1.1.9, and 12c. What that really means is they will talk about how to brand the interface in a thoughtful way… not just throwing the company’s colors on the screen, but thinking about what will actually make the application more attractive and useful for the user community.

bdd-rittman

Closing out the Rittman Mead Wednesday sessions is Mark Rittman with his unique look at machine learning and advanced visualizations using Oracle Big Data Discovery (BDD). Mark will use the BDD Shell, pySpark, and Jupyter to ingest, process and run machine learning routines on IoT and smart home data, which he has been collecting over the past year. Oracle Big Data Discovery provides a graphical environment for cataloging, searching, and displaying data from Hadoop and other sources along with some basic data wrangling features and display options. In this session, he’ll look at how BDD can be extended using the aforementioned APIs to cover more advanced predictive modeling, machine learning, and data science tasks, and how the graphics API can be used to add custom D3 advanced visualizations for your datasets and analyses. I’m definitely looking forward to this one!

We’re all looking forward to attending the event in Chicago this year and can’t wait for next week! If you’d like to get together to discuss any of the above topics we’re presenting, or anything data integration or BI related, feel free to reach out via email (info@rittmanmead.com), twitter (@mRainey) or drop a note in the comments below. Hope to see you there!

The post Rittman Mead at KScope16 appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Rittman Mead at KScope16

Tue, 2016-06-21 09:01
Rittman Mead at KScope16

June is the perfect month: summer begins, major football (and futbol) tournaments are in full swing, and of course, KScope16 is on! Rittman Mead have participated in many of the past KScope conferences and will continue that tradition this year with a wide range of presentation and training topics across the Business Intelligence, Data Warehousing, Big Data, and Advanced Analytics subject areas. This year the event is held in Chicago at the Sheraton Grand Chicago, a great location right on the river and a short distance from sights such as the Navy Pier and Magnificent Mile. And, as always, there will be some great networking opportunities throughout the week.

Rittman Mead at KScope16

This year, we're pleased to congratulate our own Becky Wagner (of Becky's BI Apps Corner) on her graduation from the ODTUG Leadership Program. This is the third class to complete the program, which aims to build leadership skills, confidence, and work/life balance skills, amongst many other things, throughout the 9 month class duration. Well done, Becky! You'll be able to find her and the rest of the 2016 Leadership class at the Career Community/Leadership Program Information Breakfast - Monday, June 27th 7:30AM at the Sheraton hotel main dining area. They will be kicking off the application season for the 2017 Leadership Program and providing further information about the new ODTUG career program as well.

The rest of the Rittman Mead attendees will be presenting one or more sessions during the week.

Rittman Mead at KScope16

The week kicks off with my talk Monday on GoldenGate and Kafka integration using the new Oracle GoldenGate for Big Data release 12.2. The topic of this session, real-time data streaming, provides a great solution for the challenging task of ingesting data from multiple different sources, and different structures, and making that data readily available for others in the organization to use. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution.

Rittman Mead at KScope16

Tuesday morning, Charles Elliott, Managing Director of Rittman Mead in the US, will begin the first of his two-part hands on lab sessions on Free Form Data Visualizations. This hands-on introduction to all things visual in OBIEE’s new Visual Analyzer (VA) will allow attendees to see the top reporting features within the current release while also understanding the core differences between OBIEE’s previous reporting platform and what’s available now. In addition to creating some of the most eye-catching visualizations the new tool has to offer, this session will highlight new OBIEE 12c features such as mapping, advanced analytics, and an overview of the brand-new web catalog. Whether you are just interested in what’s new in free-form Oracle BI visualizations or preparing for an upgrade, this session is for you!

During the lunch and learn sessions you can find me moderating the BI and Data Warehousing panel and Mark Rittman will be moderating the Big Data and Advanced Analytics crew. Grab a lunch and join us!

The remainder of Rittman Mead sessions fall on Wednesday, with some interesting topics around Big Data Discovery, OBIEE branding and Essbase integration, and the Kimball ETL Subsystems. Also, the second day of the Free Form Data Visualizations HOL will be held on Wednesday afternoon.

Rittman Mead at KScope16

It all starts with Mark Rittman presenting "OBIEE 12c: What's New for Integration and Reporting Against EPM Sources". He'll dig into interesting new features around OBIEE 12c and Essbase integration, including improvements around using Essbase as a query acceleration layer and general improvements around usability and RPD/Outline integration. Come to this session to see what they are, how they affect OBIEE/Essbase integration, and where this initiative may be going over time.

Following Mark and his dive into OBIEE and Essbase on Wednesday will be my session, “A Walk Through the Kimball ETL Subsystems with Oracle Data Integration", with a focus on how to implement the ETL Subsystems using Oracle Data Integration solutions. As you may know, Ralph Kimball, dimensional modeling and data warehousing expert and founder of The Kimball Group, spent much of his career working to build an enterprise data warehouse methodology that can meet analytical reporting needs. His book, “The Data Warehouse ETL Toolkit,” is a guide for many ETL developers. This session will walk you through his ETL Subsystem categories, describing how the Oracle Data Integration products (Oracle Data Integrator, GoldenGate, & Enterprise Data Quality) are perfectly suited for the Kimball approach.

After lunch, Andy Rocha and Pete Tamisin will dive into "How to Brand and Own Your OBIEE Interface: Past, Present, and Future". In this presentation, they will explore the different techniques in branding and customizing your OBIEE environment in different versions: 11.1.1.7, 11.1.1.9, and 12c. What that really means is they will talk about how to brand the interface in a thoughtful way… not just throwing the company’s colors on the screen, but thinking about what will actually make the application more attractive and useful for the user community.

Rittman Mead at KScope16

Closing out the Rittman Mead Wednesday sessions is Mark Rittman with his unique look at machine learning and advanced visualizations using Oracle Big Data Discovery (BDD). Mark will use the BDD Shell, pySpark, and Jupyter to ingest, process and run machine learning routines on IoT and smart home data, which he has been collecting over the past year. Oracle Big Data Discovery provides a graphical environment for cataloging, searching, and displaying data from Hadoop and other sources along with some basic data wrangling features and display options. In this session, he'll look at how BDD can be extended using the aforementioned APIs to cover more advanced predictive modeling, machine learning, and data science tasks, and how the graphics API can be used to add custom D3 advanced visualizations for your datasets and analyses. I'm definitely looking forward to this one!

We're all looking forward to attending the event in Chicago this year and can't wait for next week! If you'd like to get together to discuss any of the above topics we're presenting, or anything data integration or BI related, feel free to reach out via email (info@rittmanmead.com), twitter (@mRainey) or drop a note in the comments below. Hope to see you there!

Categories: BI & Warehousing

Using Jupyter Notebooks with Big Data Discovery 1.2

Wed, 2016-06-15 10:00

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

Whilst BDD Shell is command-line based, there’s also the option to run Jupyter Notebooks (previous iPython Notebooks) which is a web-based interactive “Notebook”. This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a ‘Notebook’ enables you to modify and re-run commands, and then once correct retain them as a fully functioning script for future use.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you’d find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2.

For information how on to set up BDD Shell and Jupyter Notebooks, see this previous post. For the purpose of this article I’m running Jupyter on port 18888 so as not to clash with Hue:

cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It’s important that you run this from the bdd-shell folder, otherwise the BDD shell won’t initialise properly
  • Jupyter by default only listens locally, so you need to use a web browser local to the server, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and from the New menu select a Python 2 notebook:

You should then see an empty notebook, ready for use:

The ‘cell’ (grey box after the In [ ]:) is where you enter code to run – type in execfile('ipython/00-bdd-shell-init.py') and press shift-Enter. This will execute it – if you don’t press shift you just get a newline. Whilst it’s executing you’ll notice the line prefix changes from [ ] to [*], and in the terminal window from which you launched Jupyter you’ll see some output related to the BDD Shell starting

WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).
WARNING: Running spark-class from user-defined location.
spark.driver.cores is set but does not apply in client mode.

Now back in the Notebook, enter the following – use Enter, not Shift-enter, between lines:

dss = bc.datasets()
dss.count

Now press shift-enter to execute it. This uses the pre-defined bc BDD context to get the datasets object, and return a count from it.

BDD_Shell_example_04

By clicking the + button on the toolbar, using the up and down arrows on the toolbar, and the Code/Markdown dropdown, it’s possible to insert “cells” which are not code but instead commentary on what the code is. This way you can produce fully documented, but executable, code objects.

From the File menu give the notebook a name, and then Close and Halt, which destroys the Jupyter process (‘kernel’) that was executing the BDD Shell session. Back at the Jupyter main page, you’ll note that a ipynb file has been created, which holds the notebook definition and can be downloaded, sent to colleagues, uploaded to blogs to share, saved in source control, and so on. Here’s the file for the notebook above – note that it’s hosted on gist, which automagically previews it as a Notebook, so click on Raw to see the actual code behind it.

The fantastically powerful thing about the Notebooks is that you can modify and re-run steps as you go — but you never lose the history of how you got somewhere. Most people will be familar with learning or exploring a tool and its capabilities and eventually getting it to work – but no idea how they got there. Even for experienced users of a tool, being able to prove how to replicate a final result is important for (a) showing the evidence for how they got there and (b) enabling others to take that work and build on it.

With an existing notebook file, whether a saved one you created or one that someone sent you, you can reopen it in Jupyter and re-execute it, in order to replicate the results previously seen. This is an important tenet of [data] science in general – show your workings, and it’s great that Big Data Discovery supports this option. Obviously, showing the count of datasets is not so interesting or important to replicate. The real point here is being able to take datasets that you’ve got in BDD, done some joining and wrangling on already taking advantage of the GUI, and then dive deep into the data science and analytics world of things like Spark MLLib, Pandas, and so on. As a simple example, I can use a couple of python libraries (installed by default with Anaconda) to plot a correlation matrix for one of my BDD datasets:

jupyter_corr_matrix

As well as producing visualisations or calculations within BDD shell, the real power comes in being able to push the modified data back into Hive, and thus continue to work with it within BDD.

With Jupyter Notebooks not only can you share the raw notebooks for someone else to execute, you can export the results to HTML, PDF, and so on. Here’s the notebook I started above, developed out further and exported to HTML – note how you can see not only the results, but exactly the code that I ran in order to get them. In this I took the dataset from BDD, added a column into it using a pandas windowing function, and then saved it back to a new Hive table:
(you can view the page natively here, and the ipynb here)

.gist table { margin-bottom: 0; }

Once the data’s been written back to Hive from the Python processing, I ran BDD’s data_processing_CLI to add the new table back into BDD

/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI --table updated_accident_data

And once that’s run, I can then continue working with the data in BDD:

This workflow enables a continual loop of data wrangling, enrichment, advanced processing, and visualisation – all using the most appropriate tools for the job.

You can also use BDD Shell/Jupyter as another route for loading data into BDD. Whilst you can import CSV and XLS files into BDD directly through the web GUI, there are limitations – such as an XLS workbook with multiple sheets has to be imported one sheet at a time. I had a XLS file with over 40 sheets of reference data in it, which was not going to be time-efficient to load one at a time into BDD.

Pandas supports a lot of different input types – including Excel files. So by using Pandas to pull the data in, then convert it to a Spark dataframe I can write it to Hive, from where it can be imported to BDD. As before, the beauty of the Notebook approach is that I could develop and refine the code, and then simply share the Notebook here

.gist table { margin-bottom: 0; }

The post Using Jupyter Notebooks with Big Data Discovery 1.2 appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Using Jupyter Notebooks with Big Data Discovery 1.2

Wed, 2016-06-15 09:00
Using Jupyter Notebooks with Big Data Discovery 1.2

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

Whilst BDD Shell is command-line based, there's also the option to run Jupyter Notebooks (previous iPython Notebooks) which is a web-based interactive "Notebook". This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a 'Notebook' enables you to modify and re-run commands, and then once correct retain them as a fully functioning script for future use.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you'd find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2.

For information how on to set up BDD Shell and Jupyter Notebooks, see this previous post. For the purpose of this article I'm running Jupyter on port 18888 so as not to clash with Hue:

  
cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell  
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It's important that you run this from the bdd-shell folder, otherwise the BDD shell won't initialise properly
  • Jupyter by default only listens locally, so you need to use a web browser local to the server, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and from the New menu select a Python 2 notebook:

Using Jupyter Notebooks with Big Data Discovery 1.2

You should then see an empty notebook, ready for use:

Using Jupyter Notebooks with Big Data Discovery 1.2

The 'cell' (grey box after the In [ ]:) is where you enter code to run - type in execfile('ipython/00-bdd-shell-init.py') and press shift-Enter. This will execute it - if you don't press shift you just get a newline. Whilst it's executing you'll notice the line prefix changes from [ ] to [*], and in the terminal window from which you launched Jupyter you'll see some output related to the BDD Shell starting

  
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).  
WARNING: Running spark-class from user-defined location.  
spark.driver.cores is set but does not apply in client mode.  

Now back in the Notebook, enter the following - use Enter, not Shift-enter, between lines:

  
dss = bc.datasets()  
dss.count  

Using Jupyter Notebooks with Big Data Discovery 1.2

Now press shift-enter to execute it. This uses the pre-defined bc BDD context to get the datasets object, and return a count from it.

Using Jupyter Notebooks with Big Data Discovery 1.2

By clicking the + button on the toolbar, using the up and down arrows on the toolbar, and the Code/Markdown dropdown, it's possible to insert "cells" which are not code but instead commentary on what the code is. This way you can produce fully documented, but executable, code objects.

Using Jupyter Notebooks with Big Data Discovery 1.2

From the File menu give the notebook a name, and then Close and Halt, which destroys the Jupyter process ('kernel') that was executing the BDD Shell session. Back at the Jupyter main page, you'll note that a ipynb file has been created, which holds the notebook definition and can be downloaded, sent to colleagues, uploaded to blogs to share, saved in source control, and so on. Here's the file for the notebook above - note that it's hosted on gist, which automagically previews it as a Notebook, so click on Raw to see the actual code behind it.

The fantastically powerful thing about the Notebooks is that you can modify and re-run steps as you go -- but you never lose the history of how you got somewhere. Most people will be familar with learning or exploring a tool and its capabilities and eventually getting it to work - but no idea how they got there. Even for experienced users of a tool, being able to prove how to replicate a final result is important for (a) showing the evidence for how they got there and (b) enabling others to take that work and build on it.

With an existing notebook file, whether a saved one you created or one that someone sent you, you can reopen it in Jupyter and re-execute it, in order to replicate the results previously seen. This is an important tenet of [data] science in general - show your workings, and it's great that Big Data Discovery supports this option. Obviously, showing the count of datasets is not so interesting or important to replicate. The real point here is being able to take datasets that you've got in BDD, done some joining and wrangling on already taking advantage of the GUI, and then dive deep into the data science and analytics world of things like Spark MLLib, Pandas, and so on. As a simple example, I can use a couple of python libraries (installed by default with Anaconda) to plot a correlation matrix for one of my BDD datasets:

Using Jupyter Notebooks with Big Data Discovery 1.2

As well as producing visualisations or calculations within BDD shell, the real power comes in being able to push the modified data back into Hive, and thus continue to work with it within BDD.

With Jupyter Notebooks not only can you share the raw notebooks for someone else to execute, you can export the results to HTML, PDF, and so on. Here's the notebook I started above, developed out further and exported to HTML - note how you can see not only the results, but exactly the code that I ran in order to get them. In this I took the dataset from BDD, added a column into it using a pandas windowing function, and then saved it back to a new Hive table:
(you can view the page natively here, and the ipynb here)

https://gist.github.com/rmoff/f1024dd043565cb8a58a0c54e9a782f2

Once the data's been written back to Hive from the Python processing, I ran BDD's dataprocessingCLI to add the new table back into BDD

/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI --table updated_accident_data

And once that's run, I can then continue working with the data in BDD:

Using Jupyter Notebooks with Big Data Discovery 1.2

Using Jupyter Notebooks with Big Data Discovery 1.2

This workflow enables a continual loop of data wrangling, enrichment, advanced processing, and visualisation - all using the most appropriate tools for the job.

You can also use BDD Shell/Jupyter as another route for loading data into BDD. Whilst you can import CSV and XLS files into BDD directly through the web GUI, there are limitations - such as an XLS workbook with multiple sheets has to be imported one sheet at a time. I had a XLS file with over 40 sheets of reference data in it, which was not going to be time-efficient to load one at a time into BDD.

Pandas supports a lot of different input types - including Excel files. So by using Pandas to pull the data in, then convert it to a Spark dataframe I can write it to Hive, from where it can be imported to BDD. As before, the beauty of the Notebook approach is that I could develop and refine the code, and then simply share the Notebook here

https://gist.github.com/rmoff/3fa5d857df8ca5895356c22e420f3b22

Categories: BI & Warehousing

Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

Mon, 2016-06-13 09:19

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets, particularly when used in conjunction with Jupyter Notebooks. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you’d find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2. In this article we’ll see how to configure BDD Shell on Big Data Lite 4.5 (along with Jupyter Notebooks), and in a subsequent post dive into how to actually use them.

Setting up BDD Shell on Big Data Lite

You can find the BDD Shell installation document here.

Login to BigDataLite 4.5 (oracle/welcome1) and open a Terminal window. The first step is to download Anaconda, which is a distribution of Python that also includes “[…] over 100 of the most popular Python, R and Scala packages for data science” as well as Jupyter notebook, which we’ll see in a moment.

cd ~/Downloads/
wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh

Then install it: (n.b. bash is part of the command to enter)

bash Anaconda2-4.0.0-Linux-x86_64.sh

Accept the licence when prompted, and then select a install location – I used /u01/anaconda2 where the rest of the BigDataLite installs are

Anaconda2 will now be installed into this location:
/home/oracle/anaconda2

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/oracle/anaconda2] >>> /u01/anaconda2

After a few minutes of installation, you’ll be prompted to whether you want to prepend Anaconda’s location to the PATH environment variable. I opted not to (which is the default) since Python is used elsewhere on the system and by prepending it it’ll take priority and possibly break things.

Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /home/oracle/.bashrc ? [yes|no]
[no] >>> no

Now edit the BDD Shell configuration file (/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.conf) in your favourite text editor to add/amend the following lines:

SPARK_EXECUTOR_PYTHON=/u01/anaconda2/bin/python
LOCAL_PYTHON_HOME=/u01/anaconda2

Amend the path if you didn’t install Anaconda into /u01

In the same configuration file, add/amend:

SPARK_HOME=/usr/lib/spark/
SPARK_EXTRA_CLASSPATH=/usr/lib/oozie/oozie-sharelib-yarn/lib/spark/spark-avro_2.10-1.1.0-cdh5.7.0.jar

Now run the BDD Shell setup:

/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/setup.sh

This should succeed:

[bigdatalite.localdomain] Validating pre-requisites...
[bigdatalite.localdomain] Validation Success
[bigdatalite.localdomain] Setting up BDD Shell...
[bigdatalite.localdomain] Setup Success
[oracle@bigdatalite Downloads]$

Assuming it does, you can now launch the shell bdd-shell.sh:

[oracle@bigdatalite Downloads]$ /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.sh
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).
WARNING: Running spark-class from user-defined location.
spark.driver.cores is set but does not apply in client mode.
Welcome to
     ___   ___   ___       __   _     ____  _     _
    | |_) | | \ | | \     ( (` | |_| | |_  | |   | |
    |_|_) |_|_/ |_|_/     _)_) |_| | |_|__ |_|__ |_|__

SparkContext available as sc, HiveContext available as sqlContext.
BDD Context available as bc.

>>>

From the BDD Shell you can interact with BDD, for example to list out the datasets currently defined in the Catalog:

>>> bc.datasets().count
17
>>> for ds in bc.datasets():
...     print ds
...

media_demo_customer     edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        Hive    default.media_demo_customer

movie_genre     default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        Hive    default.movie_genre

media_demo_customer     default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        Hive    default.media_demo_customer

Whilst BDD Shell is command-line based, there’s also the option to run Jupyter Notebooks (previous iPython Notebooks) which is a web-based interactive “Notebook”. This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a ‘Notebook’ enables you to modify and re-run commands, and then once correct, retain them as a fully functioning script for future use.

To launch it, run:

cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It’s important that you run this from the bdd-shell folder, otherwise the BDD shell won’t initialise properly
  • By default Jupyter uses 8888, which is already in use on BigDataLite by Hue, so use a different one by specifying --port
  • Jupyter by default only listens locally, so you need to either be using BigDataLite desktop to run Firefox, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and you should see the default Jupyter screen with a list of files:

In the next article, we’ll see how to use Jupyter Notebooks with Big Data Discovery, and get an idea of just how powerful the combination can be.

The post Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5 appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

Mon, 2016-06-13 08:19
Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets, particularly when used in conjunction with Jupyter Notebooks. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you'd find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2. In this article we'll see how to configure BDD Shell on Big Data Lite 4.5 (along with Jupyter Notebooks), and in a subsequent post dive into how to actually use them.

Setting up BDD Shell on Big Data Lite

You can find the BDD Shell installation document here.

Login to BigDataLite 4.5 (oracle/welcome1) and open a Terminal window. The first step is to download Anaconda, which is a distribution of Python that also includes "[...] over 100 of the most popular Python, R and Scala packages for data science" as well as Jupyter notebook, which we'll see in a moment.

cd ~/Downloads/  
wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh
Then install it: (n.b. bash is part of the command to enter)
  
bash Anaconda2-4.0.0-Linux-x86_64.sh  

Accept the licence when prompted, and then select a install location - I used /u01/anaconda2 where the rest of the BigDataLite installs are

Anaconda2 will now be installed into this location:  
/home/oracle/anaconda2

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/oracle/anaconda2] >>> /u01/anaconda2

After a few minutes of installation, you'll be prompted to whether you want to prepend Anaconda's location to the PATH environment variable. I opted not to (which is the default) since Python is used elsewhere on the system and by prepending it it'll take priority and possibly break things.

Do you wish the installer to prepend the Anaconda2 install location  
to PATH in your /home/oracle/.bashrc ? [yes|no]  
[no] >>> no

Now edit the BDD Shell configuration file (/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.conf) in your favourite text editor to add/amend the following lines:

  
SPARK\_EXECUTOR\_PYTHON=/u01/anaconda2/bin/python  
LOCAL\_PYTHON\_HOME=/u01/anaconda2  

Amend the path if you didn't install Anaconda into /u01

In the same configuration file, add/amend:

  
SPARK_HOME=/usr/lib/spark/  
SPARK\_EXTRA\_CLASSPATH=/usr/lib/oozie/oozie-sharelib-yarn/lib/spark/spark-avro_2.10-1.1.0-cdh5.7.0.jar  

Now run the BDD Shell setup:

  
/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/setup.sh

This should succeed:

  
[bigdatalite.localdomain] Validating pre-requisites...
[bigdatalite.localdomain] Validation Success
[bigdatalite.localdomain] Setting up BDD Shell...
[bigdatalite.localdomain] Setup Success
[oracle@bigdatalite Downloads]$

Assuming it does, you can now launch the shell bdd-shell.sh:

  
[oracle@bigdatalite Downloads]$ /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.sh
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).  
WARNING: Running spark-class from user-defined location.  
spark.driver.cores is set but does not apply in client mode.  
Welcome to  
     ___   ___   ___       __   _     ____  _     _
    | |_) | | \ | | \     ( (` | |_| | |_  | |   | |
    |_|_) |_|_/ |_|_/     _)_) |_| | |_|__ |_|__ |_|__

SparkContext available as sc, HiveContext available as sqlContext.  
BDD Context available as bc.

>>>

From the BDD Shell you can interact with BDD, for example to list out the datasets currently defined in the Catalog:

>>> bc.datasets().count  
17  
>>> for ds in bc.datasets():
...     print ds
...

media\_demo\_customer     edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        Hive    default.media_demo_customer

movie_genre     default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        Hive    default.movie_genre

media\_demo\_customer     default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        Hive    default.media_demo_customer  

Whilst BDD Shell is command-line based, there's also the option to run Jupyter Notebooks (previous iPython Notebooks) which is a web-based interactive "Notebook". This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a 'Notebook' enables you to modify and re-run commands, and then once correct, retain them as a fully functioning script for future use.

To launch it, run:

  
cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell  
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It's important that you run this from the bdd-shell folder, otherwise the BDD shell won't initialise properly
  • By default Jupyter uses 8888, which is already in use on BigDataLite by Hue, so use a different one by specifying --port
  • Jupyter by default only listens locally, so you need to either be using BigDataLite desktop to run Firefox, or use port-forwarding if you want to access Jupyter from your local web browser.

Go to http://localhost:18888 in your web browser, and you should see the default Jupyter screen with a list of files:

Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

In the next article, we'll see how to use Jupyter Notebooks with Big Data Discovery, and get an idea of just how powerful the combination can be.

Categories: BI & Warehousing

OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service

Wed, 2016-06-01 18:17

One of the big changes in OBIEE 12c for end users is the ability to upload their own data sets and start analysing them directly, without needing to go through the traditional data provisioning and modelling process and associated leadtimes. The implementation of this is one of the big architectural changes of OBIEE 12c, introducing the concept of the Extended Subject Areas (XSA), and the Data Set Service (DSS).

In this article we’ll see some of how XSA and DSS work behind the scenes, providing an important insight for troubleshooting and performance analysis of this functionality.

What is an XSA?

An Extended Subject Area (XSA) is made up of a dataset, and associated XML data model. It can be used standalone, or “mashed up” in conjunction with a “traditional” subject area on a common field

How is an XSA Created?

At the moment the following methods are available:

  1. “Add XSA” in Visual Analzyer, to upload an Excel (XLSX) document

  2. CREATE DATASET logical SQL statement, that can be run through any interface to the BI Server, including ‘Issue Raw SQL’, nqcmd, JDBC calls, and so on

  3. Add Data Source in Answers. Whilst this option shouldn’t actually be present according to a this doc, it will be for any users of 12.2.1 who have uploaded the SampleAppLite BAR file so I’m including it here for completeness.

Under the covers, these all use the same REST API calls directly into datasetsvc. Note that these are entirely undocumented, and only for internal OBIEE component use. They are not intended nor supported for direct use.

How does an XSA work?

External Subject Areas (XSA) are managed by the Data Set Service (DSS). This is a java deployment (datasetsvc) running in the Managed Server (bi_server1), providing a RESTful API for the other OBIEE components that use it.

The end-user of the data, whether it’s Visual Analyzer or the BI Server, send REST web service calls to DSS, storing and querying datasets within it.

Where is the XSA Stored?

By default, the data for XSA is stored on disk in SINGLETON_DATA_DIRECTORY/components/DSS/storage/ssi, e.g. /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi

[oracle@demo ssi]$ ls -lrt /app/oracle/biee/user_projects/domains/bi/bidata/components/DSS/storage/ssi|tail -n5
-rw-r----- 1 oracle oinstall    8495 2015-12-02 18:01 7e43a80f-dcf6-4b31-b898-68616a68e7c4.dss
-rw-r----- 1 oracle oinstall  593662 2016-05-27 11:00 1beb5e40-a794-4aa9-8c1d-5a1c59888cb4.dss
-rw-r----- 1 oracle oinstall  131262 2016-05-27 11:12 53f59d34-2037-40f0-af21-45ac611f01d3.dss
-rw-r----- 1 oracle oinstall 1014459 2016-05-27 13:04 a4fc922d-ce0e-479f-97e4-1ddba074f5ac.dss
-rw-r----- 1 oracle oinstall 1014459 2016-05-27 13:06 c93aa2bd-857c-4651-bba2-a4f239115189.dss

They’re stored using the format in which they were created, which is XLSX (via VA) or CSV (via CREATE DATASET)

[oracle@demo ssi]$ head 53f59d34-2037-40f0-af21-45ac611f01d3.dss
"7 Megapixel Digital Camera","2010 Week 27",44761.88
"MicroPod 60Gb","2010 Week 27",36460.0
"MP3 Speakers System","2010 Week 27",36988.86
"MPEG4 Camcorder","2010 Week 28",32409.78
"CompCell RX3","2010 Week 28",33005.91

There’s a set of DSS-related tables installed in the RCU schema BIPLATFORM, which hold information including the XML data model for the XSA, along with metadata such as the user that uploaded the file, when they uploaded, and then name of the file on disk:

How Can the Data Set Service be Configured?

The configuration file, with plenty of inline comments, is at ORACLE_HOME/bi/endpointmanager/jeemap/dss/DSS_REST_SERVICE.properties. From here you can update settings for the data set service including upload limits as detailed here.

XSA Performance

Since XSA are based on flat files stored in disk, we need to be very careful in their use. Whilst a database may hold billions of rows in a table with with appropriate indexing and partitioning be able to provide sub-second responses, a flat file can quickly become a serious performance bottleneck. Bear in mind that a flat file is just a bunch of data plopped on disk – there is no concept of indices, blocks, partitions — all the good stuff that makes databases able to do responsive ad-hoc querying on selections of data.

If you’ve got a 100MB Excel file with thousands of cells, and want to report on just a few of them, you might find it laggy – because whether you want to report on them on or not, at some point OBIEE is going to have to read all of them regardless. We can see how OBIEE is handling XSA under the covers by examining the query log. This used to be called nqquery.log in OBIEE 11g (and before), and in OBIEE 12c has been renamed obis1-query.log.

In this example here I’m using an Excel worksheet with 140,000 rows and 78 columns. Total filesize of the source XLSX on disk is ~55Mb.

First up, I’ll build a query in Answers with a couple of the columns:

The logical query uses the new XSA syntax:

SELECT
   0 s_0,
   XSA('prodney'.'MOCK_DATA_bigger_55Mb')."Columns"."first_name" s_1,
   XSA('prodney'.'MOCK_DATA_bigger_55Mb')."Columns"."foo" s_2
FROM XSA('prodney'.'MOCK_DATA_bigger_55Mb')
ORDER BY 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

The query log shows

Rows 144000, bytes 13824000 retrieved from database query
Rows returned to Client 200

So of the 55MB of data, we’re pulling all the rows (144,000) back to the BI Server for it to then perform the aggregation on it, resulting in the 200 rows returned to the client (Presentation Services). Note though that the byte count is lower (13Mb) than the total size of the file (55Mb).

As well as aggregation, filtering on XSA data also gets done by the BI Server. Consider this example here, where we add a predicate:

xsa14

In the query log we can see that all the data has to come back from DSS to the BI Server, in order for it to filter it:

Rows 144000, bytes 23040000 retrieved from database
Physical query response time 24.195 (seconds),
Rows returned to Client 0

Note the time taken by DSS — nearly 25 seconds. Compare this later on to when we see the XSA data served from a database, via the XSA Cache.

In terms of BI Server (not XSA) caching, the query log shows that a cache entry was written for the above request:

Query Result Cache: [59124] The query for user 'prodney' was inserted into the query result cache. The filename is '/app/oracle/biee/user_projects/domains/bi/servers/obis1/cache/NQS__736113_56359_0.TBL'

If I refresh the query in Answers, the data is fetched anew (per this changed behaviour in OBIEE 12c), and the cache repopulated. If I clear the Presentation Services cache and re-open the analysis, I get the results from the BI Server cache, and it doesn’t have to refetch the data from the Data Set Service.

Since the cache has two columns in, an attribute and a measure, I wondered if running a query with just the fact rolled up might hit the cache (since it has all the data there that it needs)

Unfortunately it didn’t, and to return a single row of data required BI Server to fetch all the rows again – although looking at the byte count it appears it does prune the columns required since it’s now just over 2Mb of data returned this time:

Rows 144000, bytes 2304000 retrieved from database
Rows returned to Client 1

Interestingly if I build an analysis with several more of the columns from the file (in this example, ten of a total of 78), the data returned from the DSS to BI Server (167Mb) is greater than that of the original file (55Mb).

Rows 144000, bytes 175104000
Rows returned to Client 1000

And this data coming back from the DSS to the BI Server has to go somewhere – and if it’s big enough it’ll overflow to disk, as we can see when I run the above:

$ ls -l /app/oracle/biee/user_projects/domains/bi/servers/obis1/tmp/obis_temp
[...]
-rwxrwx--- 1 oracle oinstall 2910404 2016-06-01 14:08 nQS_AG_22345_7503_7c9c000a_50906091.TMP
-rwxrwx--- 1 oracle oinstall   43476 2016-06-01 14:08 nQS_AG_22345_7504_7c9c000a_50906091.TMP
-rw------- 1 oracle oinstall 6912000 2016-06-01 14:08 nQS_AG_22345_7508_7c9c000a_50921949.TMP
-rw------- 1 oracle oinstall  631375 2016-06-01 14:08 nQS_EX_22345_7506_7c9c000a_50921652.TMP
-rw------- 1 oracle oinstall 3670016 2016-06-01 14:08 nQS_EX_22345_7507_7c9c000a_50921673.TMP
[...]

You can read more about BI Server’s use of temporary files and the impact that it can have on system performance and particularly I/O bandwidth in this OTN article here.

So – as the expression goes – “buyer beware”. XSA is an excellent feature, but used in its default configuration with files stored on disk it has the potential to wreak havoc if abused.

XSA Caching

If you’re planning to use XSA seriously, you should set up the database-based XSA Cache. This is described in detail in the PDF document attached to My Oracle Support note OBIEE 12c: How To Configure The External Subject Area (XSA) Cache For Data Blending| Mashup And Performance (Doc ID 2087801.1).

In a proper implementation you would follow in full the document, including provisioning a dedicated schema and tablespace for holding the data (to make it easier to manage and segregate from other data), but here I’m just going to use the existing RCU schema (BIPLATFORM), along with the Physical mapping already in the RPD (10 - System DB (ORCL)):

In NQSConfig.INI, under the XSA_CACHE section, I set:

ENABLE = YES;

# The schema and connection pool where the XSA data will be cached.
PHYSICAL_SCHEMA = "10 - System DB (ORCL)"."Catalog"."dbo";
CONNECTION_POOL = "10 - System DB (ORCL)"."UT Connection Pool";

And restart the BI Server:

/app/oracle/biee/user_projects/domains/bi/bitools/bin/stop.sh -i obis1 && /app/oracle/biee/user_projects/domains/bi/bitools/bin/start.sh -i obis1

Per the document, note that in the BI Server log there’s an entry indicating that the cache has been successfully started:

[101001] External Subject Area cache is started successfully using configuration from the repository with the logical name ssi.
[101017] External Subject Area cache has been initialized. Total number of entries: 0 Used space: 0 bytes Maximum space: 107374182400 bytes Remaining space: 107374182400 bytes. Cache table name prefix is XC2875559987.

Now when I re-run the test XSA analysis from above, returning three columns, the BI Server goes off and populates the XSA cache table:

-- Sending query to database named 10 - System DB (ORCL) (id: <<79879>> XSACache Create table Gateway), connection pool named UT Connection Pool, logical request hash b4de812e, physical request hash 5847f2ef:
CREATE TABLE dbo.XC2875559987_ZPRODNE1926129021 ( id3209243024 DOUBLE PRECISION, first_n[..]

Or rather, it doesn’t, because PHYSICAL_SCHEMA seems to want the literal physical schema, rather than the logical physical one (?!) that the USAGE_TRACKING configuration stanza is happy with in referencing the table.

Properties: description=<<79879>> XSACache Create table Exchange; producerID=0x1561aff8; requestID=0xfffe0034; sessionID=0xfffe0000; userName=prodney;
[nQSError: 17001] Oracle Error code: 1918, message: ORA-01918: user 'DBO' does not exist

I’m trying to piggyback on SA511’s existing configruation, which uses catalog.schema notation:

Instead of the more conventional approach to have the actual physical schema (often used in conjunction with ‘Require fully qualified table names’ in the connection pool):

So now I’ll do it properly, and create a database and schema for the XSA cache – I’m still going to use the BIPLATFORM schema though…

Updated NQSConfig.INI:

[ XSA_CACHE ]

ENABLE = YES;

# The schema and connection pool where the XSA data will be cached.
PHYSICAL_SCHEMA = "XSA Cache"."BIEE_BIPLATFORM";
CONNECTION_POOL = "XSA Cache"."XSA CP";

After refreshing the analysis again, there’s a successful creation of the XSA cache table:

-- Sending query to database named XSA Cache (id: <<65685>> XSACache Create table Gateway), connection pool named XSA CP, logical request hash 9a548c60, physical request hash ccc0a410: [[
CREATE TABLE BIEE_BIPLATFORM.XC2875559987_ZPRODNE1645894381 ( id3209243024 DOUBLE PRECISION, first_name2360035083 VARCHAR2(17 CHAR), [...]

as well as a stats gather:

-- Sending query to database named XSA Cache (id: <<65685>> XSACache Collect statistics Gateway), connection pool named XSA CP, logical request hash 9a548c60, physical request hash d73151bb:
BEGIN DBMS_STATS.GATHER_TABLE_STATS(ownname => 'BIEE_BIPLATFORM', tabname => 'XC2875559987_ZPRODNE1645894381' , estimate_percent => 5 , method_opt => 'FOR ALL COLUMNS SIZE AUTO' ); END;

Although I do note that it is used a fixed estimate_percent instead of the recommended AUTO_SAMPLE_SIZE. The table itself is created with a fixed prefix (as specified in the obis1-diagnostic.log at initialisation), and holds a full copy of the XSA (not just the columns in the query that triggered the cache creation):

With the dataset cached, the query is then run and the query log shows a XSA cache hit

External Subject Area cache hit for 'prodney'.'MOCK_DATA_bigger_55Mb'/Columns :
Cache entry shared_cache_key = 'prodney'.'MOCK_DATA_bigger_55Mb',
table name = BIEE_BIPLATFORM.XC2875559987_ZPRODNE2128899357,
row count = 144000,
entry size = 201326592 bytes,
creation time = 2016-06-01 20:14:26.829,
creation elapsed time = 49779 ms,
descriptor ID = /app/oracle/biee/user_projects/domains/bi/servers/obis1/xsacache/NQSXSA_BIEE_BIPLATFORM.XC2875559987_ZPRODNE2128899357_2.CACHE

with the resulting physical query fired at the XSA cache table (replacing what would have gone against the DSS web service):

-- Sending query to database named XSA Cache (id: <<65357>>), connection pool named XSA CP, logical request hash 9a548c60, physical request hash d3ed281d: [[
WITH
SAWITH0 AS (select T1000001.first_name2360035083 as c1,
     T1000001.last_name3826278858 as c2,
     sum(T1000001.foo2363149668) as c3
from
     BIEE_BIPLATFORM.XC2875559987_ZPRODNE1645894381 T1000001
group by T1000001.first_name2360035083, T1000001.last_name3826278858)
select D1.c1 as c1, D1.c2 as c2, D1.c3 as c3, D1.c4 as c4 from ( select 0 as c1,
     D102.c1 as c2,
     D102.c2 as c3,
     D102.c3 as c4
from
     SAWITH0 D102
order by c2, c3 ) D1 where rownum <= 5000001

It’s important to point out the difference of what’s happening here: the aggregation has been pushed down to the database, meaning that the BI Server doesn’t have to. In performance terms, this is a Very Good Thing usually.

Rows 988, bytes 165984 retrieved from database query
Rows returned to Client 988

Whilst it doesn’t seem to be recorded in the query log from what I can see, the data returned from the XSA Cache also gets inserted into the BI Server cache, and if you open an XSA-based analysis that’s not in the presentation services cache (a third cache to factor in!) you will get a cache hit on the BI Server cache. As discussed earlier in this article though, if an analysis is built against an XSA for which a BI Server cache entry exists that with manipulation could service it (eg pruning columns or rolling up), it doesn’t appear to take advantage of it – but since it’s hitting the XSA cache this time, it’s less of a concern.

If you change the underlying data in the XSA

The BI Server does pick this up and repopulates the XSA Cache.

The XSA cache entry itself is 192Mb in size – generated from a 55Mb upload file. The difference will be down to data types and storage methods etc. However, that it is larger in the XSA Cache (database) than held natively (flat file) doesn’t really matter, particularly if the data is being aggregated and/or filtered, since the performance benefit of pushing this work to the database will outweigh the overhead of storage space. Consider this example here, where I run an analysis pulling back 44 columns (of the 78 in the spreadsheet) and hit the XSA cache, it runs in just over a second, and transfers from the database a total of 5.3Mb (the data is repeated, so rolls up):

Rows 1000, bytes 5576000 retrieved from database
Rows returned to Client 1000

If I disable the XSA cache and run the same query, we see this:

Rows 144000, bytes 801792000 Retrieved from database
Physical query response time 22.086 (seconds)
Rows returned to Client 1000

That’s 764Mb being sent back for the BI Server to process, which it does by dumping a whole load to disk in temporary work files:

$  ls -l /app/oracle/biee/user_projects/domains/bi/servers/obis1/tmp/obis_temp
[...]]
-rwxrwx--- 1 oracle oinstall 10726190 2016-06-01 21:04 nQS_AG_29733_261_ebd70002_75835908.TMP
-rwxrwx--- 1 oracle oinstall   153388 2016-06-01 21:04 nQS_AG_29733_262_ebd70002_75835908.TMP
-rw------- 1 oracle oinstall 24192000 2016-06-01 21:04 nQS_AG_29733_266_ebd70002_75862509.TMP
-rw------- 1 oracle oinstall  4195609 2016-06-01 21:04 nQS_EX_29733_264_ebd70002_75861716.TMP
-rw------- 1 oracle oinstall 21430272 2016-06-01 21:04 nQS_EX_29733_265_ebd70002_75861739.TMP

As a reminder – this isn’t “Bad”, it’s just not optimal (response time of 50 seconds vs 1 second), and if you scale that kind of behaviour by many users with many datasets, things could definitely get hairy for all users of the system. Hence – use the XSA Cache.

As a final point, with the XSA Cache being in the database the standard range of performance optimisations are open to us – indexing being the obvious one. No indexes are built against the XSA Cache table by default, which is fair enough since OBIEE has no idea what the key columns on the data are, and the point of mashups is less to model and optimise the data but to just get it up there in front of the user. So you could index the table if you knew the key columns that were going to be filtered against, or you could even put it into memory (assuming you’ve licensed the option).

The MoS document referenced above also includes further performance recommendations for XSA, including the use of RAM Disk for XSA cache metadata files, as well as the managed server temp folder

Summary

External Subject Areas are great functionality, but be aware of the performance implications of not being able to push down common operations such as filtering and aggregation. Set up XSA Caching if you are going to be using XSA properly.

If you’re interested in the direction of XSA and the associated Data Set Service, this slide deck from Oracle’s Socs Cappas provides some interesting reading. Uploading Excel files into OBIEE looks like just the beginning of what the Data Set Service is going to enable!

The post OBIEE 12c – Extended Subject Areas (XSA) and the Data Set Service appeared first on Rittman Mead Consulting.

Categories: BI & Warehousing

Pages