Feed aggregator

Making Hadoop easier

Pat Shuff - Mon, 2016-09-26 02:07
Last week we looked at provisioning a Hadoop server and realized that the setup was a little complex and somewhat difficult. This is what people typically do the first time when they want to provision a service. They download the binaries (or source if you are really crazy) and install everything from scratch. Our recommendation is to do everything this way the first time. It does help you get a better understanding of how the setup works and dependencies. For example, Hadoop 2.7.3 required Java 1.8 or greater. If we go with Hadoop 2.7.2 we can get by with Java 1.7.

Rather than going through all of the relationships, requirements, and libraries needed to get something working we are going to do what we would typically do to spin up a server if we suddenly need one up and running. We go to a service that provides pre-compiled and pre-configured public domain code sandboxes and get everything running that way. The service of choice for the Oracle Compute Cloud is Bitnami We can search for a Hadoop configuration and provision it into our IaaS foundation. Note that we could do the same using the Amazon EMR and get the same results. The key difference between the two are configurations, number of servers, and cost. We are going to go through the Bitnami deployment on the Oracle Cloud in this blog.

Step 1 Search for Hadoop on http://oracle.bitnami.com and launch the instance into your region of choice.

Step 2 Configure and launch the instance. We give the instance a name, we increase the default disk size from 10 GB to 60 GB to have room for data, we go with the hadoop 2.7.2-1 version, select Oracle Linux 6.7 as the OS (Ubuntu is an alternative), and go with a small OC3 footprint for the compute size. Don't change the security rules. A new one will be generated for you as well as the ssh keys when you provision through this service.

Step 3 Log into your instance. To do this you will need ssh and use the keys that bitnami generates for you. The instance creation takes 10-15 minutes and should show you a screen with the ip address and have links for you to download the keys.

Step 4 Once you have access to the master system you can execute the commands that we did last week. The only key difference with this implementation is that you will need to install java-1.8 with a yum install because by default the development kit is not installed and we need the jar functionality as part of configuration. The steps needed to repeat our tests from the previous blog entry.

 --- setup hdfs file system 
   hdfs namenode -format
   hdfs getconf -namenodes
   hdfs dfs -mkdir input
   cp /opt/bitnami/hadoop/etc/hadoop/*.xml input
   hdfs dfs -put input/*.xml input
 --- setup simple test with wordcount
   hdfs dfs -mkdir wordcount
   hdfs dfs -mkdir wordcount/input
   mkdir ~/wordcount
   mkdir ~/wordcount/input
   vi file01
   mv file01 ~/wordcount/input
   vi ~/wordcount/input/file02
   hdfs dfs -put ~/wordcount/input/* wordcount/input
   vi WordCount.java
 --- install java-1.8 to get all of the libraries
   sudo yum install java-1.8\*
 --- create ec.jar file
   export HADOOP_CLASSPATH=/opt/bitnami/java/lib/tools.jar
   hadoop com.sun.tools.javac.Main WordCount.java
   jar cf wc.jar WordCount*.class
   hadoop jar wc.jar WordCount wordcount/input wordcount/output
   hadoop fs -cat wordcount/output/part-r-00000
 --- download data and test pig
   mkdir data
   cd data
   w get http://stat-computing.org/dataexpo/2009/1987.csv.bz2
   w get http://stat-computing.org/dataexpo/2009/1988.csv.bz2
   bzip2 -d 1987.csv.bz2
   bzip2 -d 1988.csv.bz2
   hdfs dfs -mkdir airline
   hdfs dfs -copyFromLocal 19*.csv airline
   vi totalmiles.pig
   pig totalmiles.pig
   hdfs dfs -cat data/totalmiles/part-r-00000

Note that we can do the exact same thing using Amazon AWS. They have a MapReduce product called EMR. If you go to the main console, click on EMR at the bottom of the screen, you can create a Hadoop cluster. Once you get everything created and can ssh into the master you can repeat the steps above.

I had a little trouble with the WordCount.java program in that the library version was a little different. The JVM_1.7 libraries had a problem linking and adding the JVM_1.8 binaries did not properly work with the Hadoop binaries. You also need to change the HADOOP_CLASSPATH to point to the proper tools.jar file since it is in a different location from the Bitnami install. I think with a little tweaking it would all work. The pig sample code works with no problem so we were able to test that without changing anything.

In summary, provisioning a Hadoop server or cluster in the cloud is very easy if someone else has done the heavy lifting and pre-configured a server or group of servers for you. I was able to provision two clusters before lunch, run through the exercises, and still have time to go through it again to verify. Using a service like private Marketplaces, Bitnami, or the AWS Marketplace makes it much simpler to deploy sandbox images.

Oracle 12cR2: IS_ROLLING_INVALID in V$SQL

Yann Neuhaus - Sun, 2016-09-25 11:51

In a previous post I published a test case to show when a cursor is not shared anymore after a rolling invalidation. Basically the dbms_stats marks the cursor as ‘rolling invalid’ and the next execution marks it as ‘rolling invalid executed’. Looking at 12cR2 there is a little enhancement in V$SQL with an additional column displays those states.

Note that 12cR2 full documentation is not yet available, but you can test this on the Exadata Express Cloud Service.

I set the invalidation period to 5 seconds instead of 5 hours to show the behavior without waiting

17:43:52 SQL> alter system set "_optimizer_invalidation_period"=5;
System altered.

I’ll run a statement with dbms_sql in order to separate parse and execute phases

17:43:53 SQL> variable c number
17:43:53 SQL> exec :c := dbms_sql.open_cursor;
PL/SQL procedure successfully completed.
17:43:53 SQL> exec dbms_sql.parse(:c, 'select (cast(sys_extract_utc(current_timestamp) as date)-date''1970-01-01'')*24*3600 from DEMO' , dbms_sql.native );
PL/SQL procedure successfully completed.

Here is the cursor from V$SQL including the new IS_ROLLING_INVALID column:

17:43:53 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTIVE_TIME IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- ------------------ ------------------
0 1 1 0 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 17:43:52 N

Statement is parsed (one parse call + load) but IS_ROLLING_INVALID is N

Now I execute it:

17:43:53 SQL> exec dbms_output.put_line( dbms_sql.execute(:c) );
0
PL/SQL procedure successfully completed.
 
17:43:53 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTIVE_TIME IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- ------------------ ------------------
0 1 1 1 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 17:43:52 N

Statement has one execution.

I’m now gathering statistics with default rolling invalidation:

17:43:53 SQL> exec dbms_stats.gather_table_stats(user,'DEMO');
PL/SQL procedure successfully completed.
 
17:43:53 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTIVE_TIME IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- ------------------ ------------------
0 1 1 1 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 17:43:52 Y

The cursor is now marked as rolling invalid (IS_ROLLING_INVALID=”Y”) but wait, this is not a “Y”/”N” boolean, there’s another possible value.

I execute the statement again (no parse call, only execution):

17:43:53 SQL> exec dbms_output.put_line( dbms_sql.execute(:c) );
0
PL/SQL procedure successfully completed.
 
17:43:53 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTIVE_TIME IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- ------------------ ------------------
0 1 1 2 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 17:43:52 X

Cursor is now marked as rolling invalid executed (“X”) and this is where the rolling window starts (which I’ve set to 5 seconds instead of 5 hours)

I wait 5 seconds and the cursor has not changed:

17:43:58 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTIVE_TIME IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- ------------------ ------------------
0 1 1 2 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 17:43:52 X
 

I execute it again (no parse call, only re-execute the cursor):

17:43:58 SQL> exec dbms_output.put_line( dbms_sql.execute(:c) );
0
PL/SQL procedure successfully completed.

For this execution, a new child has been created:

17:43:58 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTIVE_TIME IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- ------------------ ------------------
0 1 1 2 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 17:43:52 X
0 1 0 1 2016-09-25/17:43:53 2016-09-25/17:43:57 25-SEP-16 17:43:57 N

So rolling invalidation do not require a parse call. Execution can start the rolling window and set the invalidation timestamp, and first execution after this timestamp creates a new child cursor.

I’ll now test what happens with parse calls only.

I set a longer rolling window (2 minutes) here:

17:43:58 SQL> exec dbms_stats.gather_table_stats(user,'DEMO');
PL/SQL procedure successfully completed.
 
17:43:58 SQL> alter system set "_optimizer_invalidation_period"=120;
System altered.

The last child has been marked as rolling invalid but not yet executed in this state:

17:43:58 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTIVE_TIME IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- ------------------ ------------------
0 1 1 2 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 17:43:52 X
0 1 0 1 2016-09-25/17:43:53 2016-09-25/17:43:57 25-SEP-16 17:43:57 Y

From a new session I open another cursor:

17:43:58 SQL> connect &_user./demo@&_connect_identifier
Connected.
17:43:58 SQL> exec :c := dbms_sql.open_cursor;
PL/SQL procedure successfully completed.

And run several parse calls without execute, one every 10 seconds:

17:43:58 SQL> exec for i in 1..12 loop dbms_sql.parse(:c, 'select (cast(sys_extract_utc(current_timestamp) as date)-date''1970-01-01'')*24*3600 from DEMO' , dbms_sql.native ); dbms_lock.sleep(10); end loop;
PL/SQL procedure successfully completed.

So two minutes later I see that I have a new child created during the rolling window:

17:45:58 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTI IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- --------- ------------------
0 1 1 2 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 X
0 1 3 1 2016-09-25/17:43:53 2016-09-25/17:43:57 25-SEP-16 Y
0 1 9 0 2016-09-25/17:43:53 2016-09-25/17:44:27 25-SEP-16 N

Here, at the third parse call (17:44:27) during the invalidation window, a new child cursor has been created. The old one is still marked as rolling invalid (“Y”), but not ‘rolling invalid executed’ (“X”) because it has not been executed.

So it seems that both parse or execute are triggering the rolling invalidation, and the IS_ROLLING_INVALID displays which one.

An execute will now execute the new cursor:

17:45:58 SQL> exec dbms_output.put_line( dbms_sql.execute(:c) );
 
PL/SQL procedure successfully completed.
 
17:45:58 SQL> select invalidations,loads,parse_calls,executions,first_load_time,last_load_time,last_active_time,is_rolling_invalid from v$sql where sql_id='61x2h0y9zv0r6';
 
INVALIDATIONS LOADS PARSE_CALLS EXECUTIONS FIRST_LOAD_TIME LAST_LOAD_TIME LAST_ACTI IS_ROLLING_INVALID
------------- ---------- ----------- ---------- ------------------- ------------------- --------- ------------------
0 1 1 2 2016-09-25/17:43:53 2016-09-25/17:43:53 25-SEP-16 X
0 1 3 1 2016-09-25/17:43:53 2016-09-25/17:43:57 25-SEP-16 Y
0 1 9 1 2016-09-25/17:43:53 2016-09-25/17:44:27 25-SEP-16 N

Of course, when new cursors have been created we can see the reason in V$SQL_SHARED_CURSOR:

17:45:58 SQL> select child_number,reason from v$sql_shared_cursor where sql_id='61x2h0y9zv0r6';
 
CHILD_NUMBER REASON
------------ --------------------------------------------------------------------------------
0 <ChildNode><ChildNumber>0</ChildNumber><ID>33</ID><reason>Rolling Invalidate Win
dow Exceeded(2)</reason><size>0x0</size><details>already_processed</details></Ch
ildNode><ChildNode><ChildNumber>0</ChildNumber><ID>33</ID><reason>Rolling Invali
date Window Exceeded(3)</reason><size>2x4</size><invalidation_window>1472658232<
/invalidation_window><ksugctm>1472658237</ksugctm></ChildNode>
 
1 <ChildNode><ChildNumber>1</ChildNumber><ID>33</ID><reason>Rolling Invalidate Win
dow Exceeded(2)</reason><size>0x0</size><details>already_processed</details></Ch
ildNode><ChildNode><ChildNumber>1</ChildNumber><ID>33</ID><reason>Rolling Invali
date Window Exceeded(3)</reason><size>2x4</size><invalidation_window>1472658266<
/invalidation_window><ksugctm>1472658268</ksugctm></ChildNode>
 
2

The last child cursor has been created at 5:44:28 (invalidation_window=1472658268) because invalidation timestamp (ksugctm=1472658266)

So what?

We love Oracle because it’s not a black box. And it’s good to see that they continue in this way by exposing in V$ views information that can be helpful for troubleshooting.

Rolling invalidation has been introduced for dbms_stats because we have to gather statistics and we don’t want hard parse storms after that.
But remember that invalidation can also occur with DDL such as create, alter, drop, comment, grant, revoke.

You should avoid running DDL when application is running. However, we may have to do some of those operations online. It would be nice to have the same rolling invalidation mechanisms and it seems that it will be possible:


SQL> show parameter invalid
 
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
cursor_invalidation string IMMEDIATE
 
SQL> alter session set cursor_invalidation=XXX;
ERROR:
ORA-00096: invalid value XXX for parameter cursor_invalidation, must be from among IMMEDIATE, DEFERRED

That’s interesting. I’ll explain which DDL can use that in a future blog post.

 

Cet article Oracle 12cR2: IS_ROLLING_INVALID in V$SQL est apparu en premier sur Blog dbi services.

Replacement of environment variables or properties in Bash

Darwin IT - Sun, 2016-09-25 10:47
Earlier I wrote about the automatic installation of Fusion Middleware components using response files. A thing that lacked in my scripts was that although I had a FMW_HOME variable set in my enviroment shell script, the response files had the location hard coded in them. At the time I hadn't had the chance to figure out how to do property/variable replacement in shell. I do know how to do it with ANT. But I figured that installing ANT for only this was a bit too much, since with the installation of FMW you already get ANT as a module.

For an upgrade of my scripts to FMW 12.2.1.1, I did a Google-search on it and found: http://stackoverflow.com/questions/415677/how-to-replace-placeholders-in-a-text-file. The top 2 suggestions were:

  1. sed -e "s/\${i}/1/" -e "s/\${word}/dog/" template.txt
  2. i=32 word=foo envsubst < template.txt
Although the first option was favoured by many and considered the answer on the querstion, I personally favour the second. It turns out that sed does not accept references to the environment variables as a replacement. And that makes the replacements hardcoded again. The second does accept environment variable references. Actually, if the variable-reference in the template file  is already present in the environment, no actual replacement assignment have to be provided.

So let's say my response file template looks like:

[ENGINE]

#DO NOT CHANGE THIS.
Response File Version=1.0.0.0.0

[GENERIC]

#Set this to true if you wish to skip software updates
DECLINE_AUTO_UPDATES=true

#
MOS_USERNAME=

#
MOS_PASSWORD=<SECURE VALUE>

#If the Software updates are already downloaded and available on your local system, then specify the path to the directory where these patches are available and set SPECIFY_DOWNLOAD_LOCATION to true
AUTO_UPDATES_LOCATION=

#
SOFTWARE_UPDATES_PROXY_SERVER=

#
SOFTWARE_UPDATES_PROXY_PORT=

#
SOFTWARE_UPDATES_PROXY_USER=

#
SOFTWARE_UPDATES_PROXY_PASSWORD=<SECURE VALUE>

#The oracle home location. This can be an existing Oracle Home or a new Oracle Home
ORACLE_HOME=${FMW_HOME}

#Set this variable value to the Installation Type selected. e.g. Fusion Middleware Infrastructure, Fusion Middleware Infrastructure With Examples.
INSTALL_TYPE=Fusion Middleware Infrastructure

#Provide the My Oracle Support Username. If you wish to ignore Oracle Configuration Manager configuration provide empty string for user name.
MYORACLESUPPORT_USERNAME=

#Provide the My Oracle Support Password
MYORACLESUPPORT_PASSWORD=<SECURE VALUE>

#Set this to true if you wish to decline the security updates. Setting this to true and providing empty string for My Oracle Support username will ignore the Oracle Configuration Manager configuration
DECLINE_SECURITY_UPDATES=true

#Set this to true if My Oracle Support Password is specified
SECURITY_UPDATES_VIA_MYORACLESUPPORT=false

#Provide the Proxy Host
PROXY_HOST=

#Provide the Proxy Port
PROXY_PORT=

#Provide the Proxy Username
PROXY_USER=

#Provide the Proxy Password
PROXY_PWD=<SECURE VALUE>

#Type String (URL format) Indicates the OCM Repeater URL which should be of the format [scheme[Http/Https]]://[repeater host]:[repeater port]
COLLECTOR_SUPPORTHUB_URL=



Saved as 'fmw_12.2.1.1.0_infrastructure.rsp.tpl'; note the reference ORACLE_HOME=${FMW_HOME}. And I have set FMW_HOME with an fmw12c_env.sh script, as described in former posts. Then I only have to do:
envsubst < fmw_12.2.1.1.0_infrastructure.rsp.tpl >>fmw_12.2.1.1.0_infrastructure.rsp
To have the file copied to fmw_12.2.1.1.0_infrastructure.rsp with a replaced FMW_HOME variable:

...
#
SOFTWARE_UPDATES_PROXY_PASSWORD=<SECURE VALUE>

#The oracle home location. This can be an existing Oracle Home or a new Oracle Home
ORACLE_HOME=/u01/app/oracle/FMW12211

#Set this variable value to the Installation Type selected. e.g. Fusion Middleware Infrastructure, Fusion Middleware Infrastructure With Examples.
INSTALL_TYPE=Fusion Middleware Infrastructure
...

Couldn't be more simple, I'd say. Nice thing is that this enables me to do more directives. So, learned something again, from a question dated 7,5 years ago...

Thoughts/Info on Essbase/EssCS after Oracle Open World 2016

Tim Tow - Sat, 2016-09-24 21:59
I was at Oracle Open World last week and have some notes to share on the upcoming EssCS (Essbase Cloud Service) product and on Essbase in general.

EssCS will be Infrastructure as a Service (IaaS) product and will be priced on the number of cores and memory.  I believe there will be a metered and an unmetered pricing as well (metered meaning pay per cycle or some other usage measure).  According to presentations at Open World, which did have safe harbor statements meaning "do not make decisions based on this info as it could change", there will be options for 1, 2, 4, or 16 physical cores ("OCPU's") and 7.5 to 15 Gb of RAM.  In addition, it will be an updated version of Essbase that is not the current on-prem version.  It will feature, among other things:


  • The new Java Agent running in Weblogic that moves security from the essbase.sec file to a relational database.
  • Simplified security
  • ASO / BSO / Hybrid
  • Sandboxing and scenario management - what if's without copying data with workflow for approval to merge into given scenarios
  • Cloud-based UI
  • Ability to specify/create an Essbase cube from Excel 
  • A cube template library ("app store" type of thing)
  • A web-based outline editor (though most editing should be done in Excel)
  • EssCLI (Essbase Command Line Interface) - a sort of EPMAutomate for Essbase
  • The Essbase Java API and a new REST API (which is currently being engineered)

I do not remember hearing any dollar amount for EssCS at Open World.  I expect availability in the next 3 to 6 months though it wouldn't surprise me if it were to slip further.

As far as on-prem Essbase updates, I would expect that the updates we see in EssCS will go on-prem as part of the EPM 2017 on-prem release which Oracle currently believes will be delivered late in 2017 (also subject to safe harbor, etc).

As far as how Oracle is selling Essbase, Essbase is now firmly in the BI organization and is being sold by the BI reps; EPM reps do not generally sell Essbase.  To the Essbase team, EPM is important as they are an internal customer, but EPM is not their only customer.  As such, I saw at least one presentation that promoted the idea of customers writing custom planning applications using Essbase.  While some people I talked with thought that approach muddled the EPM message for customers, I see it as a situation where if they don't compete in the custom market, then someone else will.  As someone who frequently is involved in complex applications where the EPM Planning product may not be a fit, I am thrilled to see that message from Oracle.




Categories: BI & Warehousing

Video Tutorial: XPLAN_ASH Active Session History - Part 12

Randolf Geist - Sat, 2016-09-24 17:27
The final part of the video tutorial explaining the XPLAN_ASH Active Session History functionality continuing the actual walk-through of the script output.


This is the final part of this tutorial, but there are more tutorials coming - about configuring the script, script internals and also the Rowsource Statistics mode of the script.

Modern software architecture – what is a database?

Yann Neuhaus - Sat, 2016-09-24 17:03

This blog post is focused at developers and software architects. I’m probably not writing at the right place. You’re on an infrastructure experts blog and the author is an Oracle DBA. So what can you learn from someone working on that 30 years old technology talking about that old SQL language ? You run with modern languages, powerful frameworks, multi-layer architecture, micro-services, distributed database and of course all open-source. You hate your DBA because he is the major slow-down for your agile development. You don’t want SQL. You don’t want databases. You don’t want DBA.

How can I encourage you to read this blog post? I was not always an DBA. I started as a developer, more than 20 years ago. And believe me, it was not prehistory at all. Object-Oriented design, Rapid Application Development, Automatic programming (remember C.A.S.E.?), visual programming (have you ever seen an IDE like IBM Visual Age?), query generation (early days of Business-Objects). All these evolved with more and more languages, frameworks, layers, micro-services, XML, SOA, JSON, REST,… but only one technology remained: the critial persistent data is still in a relational database and accessed by SQL.

What is a database

Most of developers think that a database is there to store and retrieve data. I’m sorry but that’s wrong. That may have been right a long time ago, with key-value storage and hierarchical databases, but that’s too old for me. When I started to work, databases were already doing far more than that. Let me explain. With those prehistoric databases, you retrieved data in the same way you stored it. You insert with a key, you fetch with that key. It is easy to explain to modern developers because they “invented” it few years ago, calling it CRUD (Create Read Update Delete). First argument of those CRUD methods is a key value. And you store unformatted data as XML or JSON associated to that value. If this is the only feature that you need, then for sure you don’t want a database.

Relational database management systems (RDBMS) are doing a lot more than that. First, you can query data in a completely different way than you inserted it. And this is the real life-cycle of data. For example, You take orders, one by one, with customer and product information for each of them. Of course you can update and read it with the order ID that has been generated, but that’s only a small use case and probably not the most critical. Warehouse users will query (and update) orders by product. Delivery users will query (and update) orders by customer. Marketing users will query by channels and many other dimensions. Finance will join with accounting. With a persistence only system, you have to code all that. Of course if you declared the mapping of associations, you can navigate through them. But the user requirement is to get a set of orders, or a quantity of products in stock, or a subset of customers, which is different from navigating through orders one by one. With a database, you dont need to code anything. With a proper data model what you inserted can be manipulated without its key value. All data that you have inserted can be accessed from any different point of view. And you don’t have to code anything for that. Imagine a Data Access Object with ‘QueryBy methods covering any combination of columns and operators.

A database system does not only store data, it processes data and provide a service to manipulate data.

SQL

SQL is not a language to code how to get the information. SQL only describes what you want. It’s a question you ask to a data service. Same idea as Uber where you enter your destination and desired car and the service manages everything for you: the path, the communication, the paiement, the security. You may not like the SQL syntax, but it can be generated. I’m not talking about generating CRUD statements here, but generating SQL syntax from a SQL semantic expressed in Java or example. There’s a very good example for that: jOOQ (look at the exemples there).

I understand that you can hate SQL for it’s syntax. SQL was build for pre-compilers, not for execution time parsing of text, and I’ll come back on that later with static SQL. But you can’t say that SQL semantic is not modern. It’s a 4th generation language that saves all the procedural coding you have to do with 3rd generation languages. SQL is a declarative language build on a mathematics theory. It goes far beyond the for() loops and if()else.

In SQL you describe the result that you want. How to retrieve the data is done by the database system. The optimizer builds the procedural code (know as the execution plan) and the execution engine takes care of everything (concurrency, maintaining redundant structures for performance, caching, multithreading, monitoring, debugging, etc). Do you really want to code all that or do you prefer to rely on a data service that does everything for you?

You know why developers don’t like SQL? Because SQL has not been designed for programmers. It was for users. The goal was that a non-programmer can ask its question to the system (such as “give me the country of the top customers having bought a specific product in last 3 months”) without the need of a developer. There was no GUI at that time, only Command Line Interface, and SQL was the User Friendly Interface to the database. Today we have GUIs and we don’t need SQL. But it is there so programmers build tools or framework to generate SQL from a programming language. Sure it is ridiculous and it would be better to have a programming language that directly calls the SQL semantic without generating plain old English text. We need a Structured Query Language (SQL) we just don’t need it to be in English.

Set vs loops

So why do people prefer to code everything in procedural language (3GL)? Because this is only what they learned. If at school you learned only loops and comparisons, then you are going to access data in loops. If you learned to think about data as sets, then you don’t need loops. Unfortunately, the set concepts are teached in mathematics classes but not in IT.

Imagine you have to print “Hello World” 5 times. Which pseudo-code so you prefer?


print("Hello World\n")
print("Hello World\n")
print("Hello World\n")
print("Hello World\n")
print("Hello World\n")

or


print ( "Hello World\n" + "Hello World\n" + "Hello World\n" + "Hello World\n" + "Hello World\n" )

I’ve put that in pseudo-code. I don’t want to play with String and StringBuffer here. But the idea is only to explain that if you have to process a set of things it is more efficient to process them as a set rather than one-by-one. That works for everything. And this is where databases rocks: they process sets of rows. If you have to increment the column N by one in every row of your table T, you don’t need to start a loop and increment the column row-by-row. Just ask your RDBMS data service to do it: ‘/* PLEASE */ UPDATE T set N=N+1′. The “please” is in comment because everything that is not there to describe the result is not part of SQL. You have to use hints to force the way to do it, but they are written as comments because SQL do not allow any way to tell how to do it. This was a joke of course, the “please” is not mandatory because we are talking to a machine.

ACID

I’m not sure you get all the magic that is behind:

UPDATE T set N=N+1;

it’s not a simple loop as:

for each row in T
set N=N+1

The RDBMS does more than that. Imagine that there is a unique index on the column N. How many lines of code do you need to do that N=N+1 row by row and be sure that at any point you don’t have duplicates? Imagine that after updating half of the rows you encounter someone else currently updating the same row. You have to wait for his commit. But then, if he updated the value of N, do you increment the past value or the new one? You can’t increment the old one or his modification will be lost. But if you increment the new one, your previous incremented rows are inconsistent because they were based on a previous state of data.
I was talking about having an index. You have to maintain this index as well. You have to be sure that what is in cache is consistent with what is in disk. That modifications made in the cache will not be lost in case of server failure. And if you run in a cluster, those caches must be synchronized.

Coding the same as this “UPDATE T set N=N+1″ in a procedural language is not easy and can become very complex in a multi-user environment.

Maybe you have all the tools you need to generate that code. But if you code it you have to test it. Are your tests covering all concurrency cases (sessions updating or reading same rows, or different rows from same table,…). What is already coded within the database has already been tested. It’s a service and you just have to use it.

Static SQL

I claimed above that SQL is there to be pre-compiled. Yes, SQL is witten in plain text, like most of programming languages, and must be parsed, optimized, compiled. It’s not only for performance. The main reason is that you prefer to get errors at compile time than at runtime. If you put SQL in text strings in your code it will remain text until execution time when it will be prepared. And only then you will get errors. The second reason is that when the SQL is parsed, it is easy to find the dependencies. Want to see all SQL statements touching to a specific column? Do you prefer to do guess on some text search or to methodically follow dependencies?

Yes, SQL is there to be static and not dynamic. That claim may look strange for an Oracle DBA because all statements are dynamic in Oracle. Even at the time of precompilers (such as Pro*C) the statements were parsed but were put as text in the binary. And at first execution, they are parsed again and optimized. If you want the execution plan to be defined at deployment time, you have to use Outlines or SQL Plan Baselines. There is no direct way to bind the execution plan at deployment time in Oracle. In my opinion the static SQL as it is known on DB2 for example is really missing in Oracle. OLTP Software Vendors would love to ship the optimized execution plans with their application. Imagine that all SQL statements in an OLTP application are parsed and optimized, compiled as bound procedures, similar to stored procedures, with procedural access (the execution plan) and you just have to call them. For reporting, DSS, BI you need the plans to adapt to the values and volume of data, but for OLTP you need stability. And from the application, you just call those static SQL like a data service.

Talking about procedural execution stored in the database, I’m coming to stored procedures and PL/SQL of course.

Stored Procedures

When you code in your 3GL language, do you have functions that update global variables (BASIC is the first language I learned and this was the idea) or do you define classes which encapsulate the function and the data definition? The revolution of Object Oriented concepts was to put data and logic at the same place. It’s better for code maintainability with direct dependency procedural code and data structures. It’s better for security because data is accessible only through provided methods. And it’s better for performance because procedural code access data at the same place.

Yes Object Oriented design rocks and this why you need to put business logic in the database. Putting the data on one component and running the code on another component of an information system is the worst you can do. Exactly as if in your Object Oriented application you store the object attributes on one node and run the methods on another one. And this is exactly what you do with the business logic outside of the database. Your DAO objects do not hold the data. The database does. Your objects can hold only a copy of the data, but the master copy where are managed concurrency management, high availability and persistance is in the database.

We will talk about the language later, this is only about the fact that the procedural code run in the same machine and the same processes than the data access.
There are a lot of myths about running business logic in the database. Most of them come from ignorance. Until last Monday I believed that one argument against running business logic in the database was unbeatable: You pay Oracle licences on the number of CPU, so you don’t want to use the database CPUs to run something that can run on a free server. I agreed with that without testing it, and this is where myths come from.

But Toon Koppelaars has tested it and he proved that you use more database CPU when you put the business logic outside of the database. I hope his presentation from Oak Table World 2016 will be available soon. He proved that by analyzing exactly what is running in the database, using linux perf and flame graphs: https://twitter.com/ChrisAntognini/status/778273744242352128

All those rountrips from remote compute server, all those row-by-row processing coming from that design have an higher footprint on the database CPUs that directly running the same on the database server.

PL/SQL

Running business logic on the database server can be done with any language. You can create stored procedures in Java. You can code external procedures in C. But those languages have not been designed for data manipulation. It is sufficient to call SQL statements but not when you need procedural data access. PL/SQL is a language made for data processing. It’s not only for stored procedure. But it’s the only language that is coupled with your data structure. As I said above, it’s better to think in sets with SQL. But it may be sometimes complex. With PL/SQL you have a procedural language which is intermediate between row-by-row and sets because it has some bulk processing capabilities.

In pseudo-code the Hello World above is something like that:


forall s in ["Hello World\n","Hello World\n","Hello World\n","Hello World\n","Hello World\n"] print(s)

It looks like a loop but it is not. The whole array is passed to the print() function and loop is done at lower level.

In PL/SQL you can also use pipeline functions where rows are processed with a procedural language but as a data stream (like SQL does) rather than loops and calls.

I’ll go to other advantages of PL/SQL stored procedures but here again there is one reason frequently raised to refuse PL/SQL. You can find more developers in .Net or Java than in PL/SQL. And because they are rare, they are more expensive. But there is a counter argument I heard this week at Oracle Open World (but I don’t remember who raised that point unfortunately). PL/SQL is easy to learn. Really easy. You have begin – exception – end blocks, you declare all variables, you can be modular with procedures and inline procedures, you separate signature and body, you have very good IDE, excellent debugger and easy profiler,… and you can run it on Oracle XE which is free. So, if you have a good Java developer he can write efficient PL/SQL in a few days. By good developer, I mean someone who understands multi-user concurrency problems, bulk processing, multi-threading, etc.

There are less PL/SQL developers than Java developers because you don’t use PL/SQL. It’s not the opposite. If you use PL/SQL you will find developers and there are many software vendors that code their application in PL/SQL. Of course PL/SQL is not free (except in Oracle XE) but it runs on all platforms and on all editions.

Continuous Integration and Deployment, dependency and versioning

I come back quickly to the advantages of using a language that is coupled with your data.

PL/SQL stored procedures are compiled and all dependencies are stored. With one query on DBA_DEPENDENCIES you can know which tables your procedure is using and which procedures use a specific table. If you change the data model, the procedures that have to be changed are immediately invalidated. I don’t know any other language that does that. You don’t want to break the continuous integration build every time you change something in a table structure? Then go to PL/SQL.

Let’s go beyond continuous integration. How do you manage database changes in continuous deployment? Do you know that with PL/SQL you can modify your data model online, with your application running and without breaking anything? I said above that procedures impacted by the change are invalidated and the must be adapted to be able to be compiled. But this is only for the new version. You can deploy a new version of those procedures while the previous version is running. You can test this new version and only when everything is ok you switch the application to the new version. The feature is called Edition Based Redefinition (EBR) it exists since 11g in all Oracle editions. It’s not known and used enough, but all people I know that use it are very happy with it.

In development environment and continuous integration, it is common to say that the database always cause problem. Yes it is true but it’s not inherent to the database but the data. Data is shared and durable and this is what makes it complex. The code can be deployed in different places, and can be re-deployed if lost. Data can be updated at only one place and visible to all users. Upgrading to a new version of application is easy: you stop the old version and start the new version. For data it is different: you cannot start from scratch and you must keep and upgrade the previous data.

Object-Relational impedance

I’m probably going too far in this blog post but the fact that data is shared and durable is the main reason why we cannot apply same concepts to data objects (business objects) and presentation objects (GUI components). Application objects are transient. When you restart the application, you create other objects. The identity of those objects is an address in memory: it’s different on other systems and it’s different once application is restarted. Business objects are different. When you manipulate a business entity, it must have the same identity for any users, and this identity do not change when application is restarted, not even when application is upgraded. All other points given as “object-relational impedance” are minor. But the sharing and durability of business object identity is the reason why you have to think differently.

Where to put business logic?

If you’re still there, you’ve probably understood that it makes sense to run the data logic in the database, with declarative SQL or procedural PL/SQL stored procedures, working in sets or at least in bulk, and with static SQL as much as possible, and versioned by EBR.

Where to put business logic then? Well, business logic is data logic for most of it. But you’ve always learned that business logic must be in the application tier. Rather than taking reasons given one by one and explain what’s wrong with them, let me tell you how came this idea of business logic outside of the database. The idea came from my generation: the awesome idea of client/server.

At first, data was processed on the servers and only the presentation layer was on the user side (for example ISAM was very similar with what we do with thin web pages). And this worked very well, but it was only green text terminals. Then came PCs and Windows 3.11 and we wanted graphical applications. So we built applications on our PCs. But that was so easy that we implemented all business logic there. Not because it’s a better architecture, but because anyone can build his application without asking to the sysops. This was heaven for developers and a nightmare for operations to deploy those applications on all the enterprise PCs.
But this is where offloading business logic started. Application written with nice IDEs (I did this with Borland Paradox and Delphi) connecting directly to the database with SQL. Because application was de-correlated from the database everything was possible. We even wanted to have applications agnostic of the database, running in any RDBMS. Using standard SQL and standard ODBC. Even better: full flexibility for the developer by using only one table with Entity-Value-Attribute.

Actually, the worst design anti-patterns have been invented at that time and we still see them in current applications – totally unscalable.

When finally the deployment of those client/server applications became a nightmare, and because internet was coming with http, html, java, etc. we went to 3-tier design. Unfortunately, the business logic remained offloaded in the application server instead of being part again of the database server.

I mentioned ODBC and it was another contributor to that confusion. ODBC looks like a logical separation of the application layer and the database layer. But that’s wrong. ODBC is not a protocol. ODBC is an API. ODBC do not offer a service: it is a driver running on both layers and that magically communicates through network: code to process data on one side and data begin on the other.

A real data service encapsulates many SQL statements and some procedural code. And it is exactly the purpose of stored procedures. This is how all data applications were designed before that client/server orgy and this is how they should be designed today when we focus on centralization and as micro-services applications.

So what?

This blog post is already too long. It comes from 20 years experience as developer, application DBA, and operation DBA. I decided to write this when coming back from the Oracle Open World where several people are still advocating for the right design, especially Toon Koppelaars about Thick Database at Oak Table World and the amazing panel about “Thinking clearly about application architecture” with Toon Koppelaars, Bryn Llewellyn, Gerald Venzl, Cary Millsap, Connor McDonald

The dream of every software architect should be to attend that panel w/ @ToonKoppelaars @BrynLite @GeraldVenzl @CaryMillsap @connor_mc_d pic.twitter.com/npLzpnktMK

— Franck Pachot (@FranckPachot) September 22, 2016

Beside the marketing stuff, I was really impressed by the technical content around the Oracle Database this year at OOW16.

 

Cet article Modern software architecture – what is a database? est apparu en premier sur Blog dbi services.

Wiederholung APEX 5.0 Best Practices

Denes Kubicek - Fri, 2016-09-23 10:30
Wegen der großen Nachfrage veranstalten wir jetzt einen Zusatztermin für den APEX 5.0 Best Practice Kurs am 21.Nov - 23. Nov. 2016.

Anmeldung zum Kurs

Categories: Development

#OOW16: Recap in Photos!

WebCenter Team - Fri, 2016-09-23 05:44

Even if you attended all 5 days of Oracle OpenWorld 2016 this past week, there was WAY more content and goings on over the span of 5 days than one could possibly hope to keep track of.  But if we had to sum it all up in one word for our customers and partners, it would be "Cloud". We want to sincerely thank everyone who attended and supported us before, during and after the event. We look forward to continuing to work together to expand on our Content and Experience Management solutions. Here are some photos to recap the week!

Executed PL/SQL kept in Stored procedure but while compile Compilation error is coming.

Tom Kyte - Fri, 2016-09-23 04:46
Hi Tom, Below PL/SQL is working fine. while same code of PL/SQL as below is used in Stored procedure compilation error is coming while compiling stored procedure. Could you please make a suggestion to keep below PL/SQL in below Stored Procedure so...
Categories: DBA Blogs

How to compare two tables of data????

Tom Kyte - Fri, 2016-09-23 04:46
Hi Tom, I have two tables of values(dept1 and dept2). How do I compare all the data in these two tables??? It will return true if both tables contain the same data & false for otherwise... Another thing is that I do not know how to use CREATE OPE...
Categories: DBA Blogs

Java Connection Pooling with Oracle VPD

Tom Kyte - Fri, 2016-09-23 04:46
Hi Tom, We have a 3-tier application that is built on Java and Oracle. In our application, we extensively make use of Oracle VPD policies for setting contexts and managing the data. Now, we are building in Java something on top of Oracle. We hit ...
Categories: DBA Blogs

PlSQL- Bulk Collect and Update (Better Approach)

Tom Kyte - Fri, 2016-09-23 04:46
Hi Tom, I am looking for a better coding approach than what I have in my current system. I have two tables dog_owner(16 Million Records) and dog_owner_stage(8 Million Records). In the current process. I usually insert based on a common owner_accou...
Categories: DBA Blogs

Performance issue in CLOB\BLOB data migration

Tom Kyte - Fri, 2016-09-23 04:46
(did not get any answer for https://asktom.oracle.com/pls/apex/f?p=100:24:0::NO::P24_ID:9531842300346462307 ) Hello Tom, First of all, i would like you to thank you for your immense support on Database issues.It helps us a lot !! Question : M...
Categories: DBA Blogs

Compile_Error when refreshing a Materialized View from a procedure

Tom Kyte - Fri, 2016-09-23 04:46
We have Materialized Views which reference tables in other schemas. We can refresh/compile the Materialized Views from the command line however when we refresh/compile the Materialized View from within a procedure the job immediately aborts with...
Categories: DBA Blogs

Practise question

Tom Kyte - Fri, 2016-09-23 04:46
Hi, I was practicing some question on sql challenge about dml operation using multiple tables and got some doubt. CREATE TABLE plch_departments ( department_id INTEGER PRIMARY KEY, department_name VARCHAR2(30) ) / CREATE TABLE p...
Categories: DBA Blogs

Optimiser Trace

Tom Kyte - Fri, 2016-09-23 04:46
Hi Tom, A tricky question, recently we upgraded our systems to 11.2.0.4 and started to observe some queries taking much longer (from mins to 10+ hours). On analysing the taces / explain we found the access path had changed from the previous one, so...
Categories: DBA Blogs

Converting an EE DB 12.1.0.2.0 to SE DB 12.1.0.1.0 - Version - empty table

Tom Kyte - Fri, 2016-09-23 04:46
Hi, Im trying to convert EE 12.1.0.2.0 to SE 12.1.0.1.0 with expdp/impdp. I've found example of converting EE 11.2.0.1 to SE 11.2.0.4 where it is stated: "During import to standard edition we must use keyword VERSION=11.1 to be able to import e...
Categories: DBA Blogs

Oracle Open World 2016 – Day 4 and 5

Yann Neuhaus - Fri, 2016-09-23 03:01

At the end of Oracle Open World my last BLOG concerning OOW 2016 covering day 4 and 5:

Wednesday is the day of the Party: Oracle’s appreciation event, a concert with Gwen Stefani and Sting at the AT&T Park (Stadium of the San Francisco Baseball team, the Giants). It was a great event with awesome musicians.

Before the party I visited the session “Oracle Active Data Guard: Power, Speed, Ease and Protection” provided by Larry M. Carpenter, the grandfather of Data Guard. Here a couple of nice new features of (Active) Data Guard in 12gR2:

  • Multi-Instance Redo Apply in RAC: Up until now the managed recovery processes (MRP) could only run on a single node on the standby RAC site and hence limited redo apply to the CPU and IO power of that one node. In the new release a coordinator process can distribute the redo data to MPR-processes on all nodes of the RAC cluster. This is called Multi-Instance Apply and is configured as follows:
    Without broker: recover managed standby database disconnect using instances 4;
    With broker: Through the ‘ApplyInstances’ property.
    Caveats in the first 12gR2 release:
    Using Multi-Instance Redo Apply disallows the use of the new feature In-Memory Column Store on Active Data Guard.
    RMAN block change tracking file is disabled.

 

  • Data Guard Broker enhancement for Multitenant: As Redo is generated at Container (CDB) level, switchover and failover will also happen for the whole CDB. In 12gR2 there is a new command when using the Data Guard Broker to migrate or failover a Pluggable DB (PDB) to another CDB on the same server:

    MIGRATE PLUGGABLE DATABASE PDBx TO CONTAINER CDB2 USING PDBx.xml CONNECT AS sys/mypassword@CDB2;

    Depending on what role the CDB I’m connected to has determines if a PDB is migrated to another CDB or is failed over to another CDB. So if e.g. a PDB has a failure on the primary site then I can “failover” its standby-equivalent to another Primary CDB on the standby machine and hence make the “standby PDB” a “primary PDB”. This works best when having 2 CDBs in 2 sites and replicate in opposite directions: CDB1 at site A replicates to CDB1 at site B. CDB2 at site B replicates to CDB2 at site A. So let’s assume PDBx in Primary DB CDB1 fails at site A. You can then migrate PDBx at site B to Primary DB CDB2. PDBx at site A in CDB1 will be dropped automatically, but the CDB2 at site A needs to be manually updated with the new data files of PDBx.

 

  • Use In-Memory Column Store on an Active Data Guard DB: As mentioned in my previous BLOG, in 12cR2 In-Memory can be used on an Active Data Guard Instance.
    Restrictions for In-Memory on Active Data Guard:
    In-Memory expressions are captured based on queries executed on the primary only. I.e. the expression statistics store (ESS) is maintained on the primary only.
    Automatic Data Optimization (ADO) policies are triggered only on access recorded on the primary database.
    In-Memory Fast-Start and In-Memory Join-Groups are not supported in an Active Data Guard

 

  • Diagnostics and Tuning for Active Data Guard: The Diagnostics Pack (AWR), the Tuning Pack features and SQL Plan Analyzer are supported in the new release on Active Data Guard.
    AWR: In an AWR catalog database the Active Data Guard DB is registered. From there remote snapshots can be taken from the Active Data Guard instance and stored in the AWR catalog: dbms_workload_repository.create_remote_snapshot("TYPICAL", ADG-id);
    SQL Tuning Advisor: All SQL Tuning Advisor Tasks are executed on the Active Data Guard instance. Necessary write activity are done through a DB-Link on the primary DB.

 

  • Repair blocks from NOLOGGING-operations: Blocks from NOLOGGING operations on primary can now be validated and repaired on Standby with rman commands:

    validate/recover ... nonlogged blocks;

    I.e. the primary DB does not necessarily need to be in FORCE LOGGING mode anymore. If NOLOGGING operations are necessary then they can be repaired on the Standby-DB. Previously complete datafiles had to be restored to repair NOLOGGING operations.

On Thursday I visited the panel discussion with the subject “Thinking clearly about Database Application Architecture”. Toon Koppelaars, Connor Mcdonald, Cary Milsap and Gerald Venzl discussed about the correct Application architecture when accessing data in an Oracle Database. The discussion was moderated by Bryn Llewellyn. Toon Koppelaars from the Real World Performance team at Oracle explained why the ThickDB approach by writing business logic (which need data processing) in PLSQL through set or bulk processing is the best method to have a well performing application (see also here). However, today the approach to process the data in layers outside the DB is being preferred (“data to processing” instead of “processing to data”). Unfortunately that results in row by row processing with lots of network roundtrips and higher CPU-usage on the DB-server due to the many times the whole stack on the DB-server has to be traversed.
It was clear and agreed in the audience that the ThickDB approach (“processing to data”) is correct, but why do developers not change their behavior since many years? The opinions on that differed, but also critical statements were expressed that “we as DBAs and DB-Consultants are part of the problem”, because there is no effort to change something in the base education of students to better understand the inner workings of a relational database system and the importance of “processing at the data”.

I’ll leave it to the reader to think about that and end my BLOGs about the Oracle Openworld 2016.

 

Cet article Oracle Open World 2016 – Day 4 and 5 est apparu en premier sur Blog dbi services.

Hadoop on IaaS - part 2

Pat Shuff - Fri, 2016-09-23 02:07
Today we are going to get our hands dirty and install a single instance standalong Hadoop Cluster on the Oracle Compute Cloud. This is a continuing series of installing public domain software on Oracle Cloud IaaS. We are going to base our installation on three components We are using Oracle Linux 6.7 because it is the easiest to install on Oracle Compute Cloud Services. We could have done Ubuntu or SUSE or Fedora and followed some of the tutorials from HortonWorks or Cloudera or Apache Single Node Cluster. Instead we are going old school and installing from the Hadoop home page by downloading a tar ball and configuring the operating system to run a single node cluster.

Step 1:

Install Oracle Linux 6.7 on an Oracle Compute Cloud instance. Note that you can do the same thing by installing on your favorite virtualization engine like VirtualBox, VMWare, HyperV, or any other cloud vendor. The only true dependency is the operating system beyond this point. If you are installing on the Oracle Cloud, go with the OL_67_3GB..... option, go with the smallest instance, delete the boot disk, replace it with a 60 GB disk, rename it and launch. The key reason that we need to delete the boot disk is that by default the 3 GB disk will not take the Hadoop binary. We need to grow it to at least 40 GB. We pad a little bit with a 60 GB disk. If you check the new disk as a boot disk it replaces the default Root disk and allows you to create an instance with a 60 GB disk.

Step 2:

Run yum to update the os, install w get, and java version 1.8. You need to login as opc to the instance so that you can run as root.

Note that we are going to diverge from the Hadoop for Dummies that we referenced yesterday. They suggest attaching to a yum repository and doing an install from the repository for the bigtop package. We don't have that option for Oracle Linux and need to do the install from the binaries by downloading a tar or src image. The bigtop package basically takes the Apache Hadoop bundle and translates them to rpm files for an operating system. Oracle does not provide this as part of the yum repository and Apache does not create one for Oracle Linux or RedHat. We are going to download the tar file from the links provided at Apache Hadoop homepage we are following install instructions for a single node cluster.

Step 3:

Get the tar.gz file by pulling it from http://apache.osuosl.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Step 4: We unpack the tar.gz file with the tar xvzf hadoop-2.7.2.tar.gz command

Step 5:

Next we add the following to the .bashrc file in the home directory to setup some environment variables. The java code is done in the same location by the yum command. The location of the hadoop code is based on downloading into the opc home directory.

export JAVA_HOME=/usr
export HADOOP_HOME=/home/opc/hadoop-2.7.3
export HADOOP_CONFIG_DIR=/home/opc/hadoop-2.7.3/etc/hadoop
export HADOOP_MAPRED_HOME=/home/opc/hadoop-2.7.3
export HADOOP_COMMON_HOME=/home/opc/hadoop-2.7.3
export HADOOP_HDFS_HOME=/home/opc/hadoop-2.7.3
export YARN_HOME=/home/opc/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin

Step 6

Source the .bashrc to pull in these environment variables

Step 7 Edit the /etc/hosts file to add namenode to the file.

Step 8

Setup ssh so that we can loop back to localhost and launch an agent. I had to edit the authorized_keys to add a return before the new entry. If you don't the ssh won't work.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
vi ~/.ssh/authorized_keys
ssh localhost
exit

Step 9 Test the configuration then configure the hadoop file system for single node.

cd $HADOOP_HOME
mkdir input
cp etc/hadoop/*.xml input
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
vi etc/hadoop/core-site.xml

When we ran this and there were a couple of warnings which we can ignore. The test should finish without error and generate a long output list. We then edit to core-site.xml file by changing the following lines at the end. (omit the spaces, the blog software masked them and the only way to show the full file was to add spaces)

< configuration >
 < property >
  < name >fs.defaultFS< /name >
  < value >hdfs://namenode:8020< /value >
 < /property >
< /configuration >

Step 10

Create the hadoop file system with the command hdfs namenode -format

Step 11

Verify the configuration with the command hdfs getconf -namenodes

Step 12

Start the hadoop file system with the command sbin/start-dfs.sh

At this point we have the hadoop filesystem up and running. We now need to configure MapReduce and test functionality. Step 13

Make the HDFS directories required to execute MapReduce jobs with the commands

  hdfs dfs -mkdir /user
  hdfs dfs -mkdir /user/opc
  hdfs dfs -mkdir input
  hdfs dfs -put etc/hadoop/*.xml input

Step 14 Run a MapReduce example and look at the output

  hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep 
    input output 'dfs[a-z.]+'
  hdfs dfs -get output output
  cat output/* output/output/*

Step 15

Create a test program to do a wordcount of two files. This example comes from an Apache MapReduce Tutorial

hdfs dfs -mkdir wordcount
hdfs dfs -mkdir wordcount/input
mkdir ~/wordcount
mkdir ~/wordcount/input
vi ~/wordcount/input/file01
 - add 
Hello World Bye World
vi ~/wordcount/input/file02
- add
Hello Hadoop Goodbye Hadoop
hdfs dfs -put ~/wordcount/input/* wordcount/input
vi ~/wordcount/WordCount.java

Create WordCount.java with the following code

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Step 16

Compile and run the WordCount.java code

cd ~/wordcount
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.101-3.b13.el6_8.x86_64
export HADOOP_CLASSPATH=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.101-3.b13.el6_8.x86_64/lib/tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount wordcount/input wordcount/output
hadoop fs -cat wordcount/output/part-r-00000

At this point we have a working system and can run more MapReduce jobs, look at results, and play around with Big Data foundations.

In summary, this is a relatively complex example. We have moved beyond a simple install of an Apache web server or Tomcat server and editing some files to get results. We have the foundations for a Big Data analytics solution running on the Oracle Compute Cloud Service. The steps to install are very similar to the other installation tutorials that we referenced earlier on Amazon and Virtual Machines. Oracle Compute is a good foundation for public domain code. Per core the processes are cheaper than other cloud vendors. Networking is non-blocking and higher performance. Storage throughput is faster and optimized for compute high I/O and tied to the compute engine. Hopefully this tutorial has given you the foundation to start playing with Hadoop on Oracle IaaS.

Links for 2016-09-22 [del.icio.us]

Categories: DBA Blogs

Pages

Subscribe to Oracle FAQ aggregator