Feed aggregator

DataStax OpsCenter upgrade (4.1 to 5.1) for Cassandra – issue and resolution

Pythian Group - Tue, 2016-03-29 09:42

For the Apache Cassandra cluster (version C* 1.2.16) that I support, monitoring is done through DataStax OpsCenter, version 4.1.2. As part of the effort to improve the monitoring capability for this cluster, my team decided first to upgrade OpsCenter to version 5.1.4, the latest available version of OpsCenter that is compatible with Cassandra 1.2.16. The same OpsCenter instance is also used to monitor another cluster running DataStax Enterprise (DSE) 4.5.2 (which corresponds to Apache Cassandra version 2.0.10).

During the upgrade we ran into an issue, and we couldn’t find any report of a similar problem on Google. We’d like to use this post to document the problems we faced, as well as the solutions and findings we came across during the upgrade.

 

Problem Overview

The OpsCenter upgrade procedure itself is straightforward, as described in the DataStax OpsCenter documentation. After the OpsCenter upgrade, the OpsCenter web portal detects mismatched versions of the datastax-agent on all nodes. Choosing the “FixMe” option from the portal upgrades the datastax-agent to version 5.1.4 on all monitored nodes. After the datastax-agent upgrade, we addressed some datastax-agent configuration issues in the “address.yaml” file to reflect the changes brought by the new version of OpsCenter/datastax-agent.

After all this was done, we double-checked the log files for OpsCenter and the datastax-agents. The OpsCenter log file was mostly clean, and the datastax-agent logs for the DSE 4.5.2 cluster were also clean, but the datastax-agent logs for the Cassandra 1.2.16 cluster were NOT. The corresponding OpsCenter web portal was not able to display Cassandra metrics for the C* 1.2.16 cluster.

On each of the datastax-agent log files in the C* 1.2.16 cluster, we saw a lot of repeating errors like the ones below:

          ERROR [async-dispatch-3] 2016-02-19 12:57:52,750 There was an error when attempting to load stored rollups.
          com.datastax.driver.core.exceptions.InvalidQueryException: Undefined name key in where clause (‘key EQ ‘<… …>”)
          at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
          at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:291)
          at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
          at clojurewerkz.cassaforte.client$execute.invoke(client.clj:289)
          … …
          ERROR [cassandra-processor-1] 2016-02-19 13:00:02,295 Error when proccessing cassandra call
          com.datastax.driver.core.exceptions.InvalidQueryException: Unknown identifier key

 

Problem Analysis and Solution

The fact that the errors showed up in the datastax-agent log files gave me a hint that they might be related to the datastax-agent failing to write collected metrics into the OpsCenter tables. So, as the first step of the analysis, I compared the schema of the “OpsCenter” keyspace between the two monitored clusters. Below is an example comparison of two OpsCenter table definitions between the two clusters.

C* 1.2.16 Cluster

CREATE TABLE events (
  "KEY" blob,
  column1 blob,
  value blob,
  PRIMARY KEY ("KEY", column1)
);

CREATE TABLE events_timeline (
  "KEY" blob,
  column1 bigint,
  value blob,
  PRIMARY KEY ("KEY", column1)
);

DSE 4.5.3 Cluster

CREATE TABLE events (
  key text,
  action bigint,
  level bigint,
  success boolean,
  time bigint,
  PRIMARY KEY ((key))
);

CREATE TABLE events_timeline (
  key text,
  column1 bigint,
  value blob,
  PRIMARY KEY ((key), column1)
);

 

From this comparison, we can clearly see that the upgrade of OpsCenter and the datastax-agents to version 5.1.4 somehow did not migrate the OpsCenter schema properly for the C* 1.2.16 cluster. The theory for the error is that the upgraded datastax-agent in the C* 1.2.16 cluster is trying to query or update Cassandra metrics in the OpsCenter tables in a fashion that matches the OpsCenter schema of the DSE 4.5.2 cluster. But the actual OpsCenter schema in the C* 1.2.16 cluster still has the old definitions, thus causing the invalid query exceptions presented in the log files.

Once the problem was clear, the solution was straightforward. The steps are summarized below, followed by a sketch of the corresponding commands:

In C* 1.2.16 cluster,

  1. Take a snapshot of the OpsCenter keyspace on all nodes.
  2. Stop the DataStax agents on all nodes, so they won’t try to write metrics into the OpsCenter tables.
  3. Use CQL to drop the OpsCenter tables and re-create them, matching the OpsCenter schema of the DSE 4.5.3 cluster. Make sure that all table properties are the same.
  4. Once the OpsCenter schema is recreated, start the DataStax agents on all nodes.
  5. Verify in the agent log files that the error messages are gone.
  6. Restart the OpsCenter service.
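
For illustration, a minimal sketch of the commands involved might look like the following (the keyspace name, service names and table definition are assumptions based on a default package install; adjust them to your environment):

# On each node: snapshot the OpsCenter keyspace
nodetool snapshot OpsCenter

# On each node: stop the agent so it stops writing metrics
sudo service datastax-agent stop

-- In cqlsh: drop and re-create each OpsCenter table to match the DSE cluster's schema, e.g.
DROP TABLE "OpsCenter".events;
CREATE TABLE "OpsCenter".events (
  key text,
  action bigint,
  level bigint,
  success boolean,
  time bigint,
  PRIMARY KEY ((key))
);
-- ...repeat for events_timeline and the remaining OpsCenter tables,
-- copying the table properties from the DSE cluster...

# Start the agents again, then restart the OpsCenter service
sudo service datastax-agent start
sudo service opscenterd restart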

 

After these steps, we double-checked the log files for all the datastax-agents and for OpsCenter, and confirmed that there were no more errors. The OpsCenter web portal was also able to display the Cassandra metrics properly.

Categories: DBA Blogs

The NoSQL-Relational Comparison Series Continues - Relational SQL vs MongoDB Methods

Chris Foot - Tue, 2016-03-29 09:01

When we compare relational and NoSQL systems, one of the critical analyses we have to perform is of the data access mechanisms. As we’ll learn over the next few articles of this series on data access, the SQL language used by relational database management systems is much more feature-rich and powerful than its NoSQL counterparts. This statement isn’t intended to sway readers toward relational systems; it is just the author’s evaluation of both systems’ access languages.

Index Usage

Jonathan Lewis - Tue, 2016-03-29 04:53

There are some questions about Oracle that are like the mythical Hydra – you think you’ve killed it, but for every head you cut off another two grow. The claim that “the optimizer will switch between using an index and doing a tablescan when you access more than X% of the data” re-appeared on the OTN database forum a little while ago – it doesn’t really matter what the specific value of X was – and it’s a statement that needs to be refuted very firmly because it’s more likely to cause problems than it is to help anyone understand what’s going on.

At a very informal level we may have an intuitive feeling that for a “precise” query accessing a “small” amount of data an indexed access path should make sense while for a “big” query accessing a “large” amount of data we might expect to see a tablescan, but any attempt to give a meaning to “small” and “large” that is both general purpose and strictly quantified will be wrong: there are too many variables involved.

Just as a quick demonstration of how badly we can be misled by a simple numeric claim, here’s a quick test I created on a newly installed instance of 11.2.0.4, which I happened to set up with a locally defined tablespace using uniform extents of 1MB and the default 8KB blocksize, but with manual (freelist) space management:


rem
rem     Script:   index_usage_pct.sql
rem     Dated:    March 2016
rem     Author:   J P Lewis
rem

drop table t1;

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual 
        connect by 
                level <= 1e4
)
select
        cast(rownum as number(8,0))                              id,
        cast(trunc(dbms_random.value(0,1e6)) as number(8,0))     n1,
        lpad(rownum,6,'0')              v1,
        rpad('x',10,'x')                small_vc,
        rpad('x',151,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create index t1_i1 on t1(id);

spool index_usage_pct.lst

select  num_rows, blocks, avg_row_len, round(num_rows/blocks) rows_per_block
from    user_tables
where   table_name = 'T1'
;

set autotrace on explain
select count(n1) from t1 where id between 1 and 245000;
set autotrace off

spool off

I’ve created a table with 1 million rows; the rows are about 180 bytes long (you’ll see the sizes a few lines further down the page), so it’s not an unreasonable model for lots of tables in typical systems – if you want to experiment further you can adjust the rpad() in the padding column; and I’ve created an index on a sequentially (rownum) generated column. My call to autotrace will produce a truthful execution plan for the query supplied – there’s no risk of unexpected type conversion and no problems from bind variable peeking. As you can easily infer, my query will access 245,000 rows in the table of 1,000,000 – nearly a quarter of the table. Would you expect to see Oracle use the index ?

Here’s the output from the script on MY brand new database, instance, and tablespace:


  NUM_ROWS     BLOCKS AVG_ROW_LEN ROWS_PER_BLOCK
---------- ---------- ----------- --------------
   1000000      25642         180             39

1 row selected.


 COUNT(N1)
----------
    245000

1 row selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 269862921

--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |    10 |  6843   (1)| 00:01:23 |
|   1 |  SORT AGGREGATE              |       |     1 |    10 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |   245K|  2392K|  6843   (1)| 00:01:23 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |   245K|       |   553   (1)| 00:00:07 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("ID">=1 AND "ID"<=245000)


There are no tricks involved here, no cunning fiddling with data structures or parameters – this is just a simple, straightforward, test.

Of course the result is probably a little counter-intuitive; 24.5% of the data seems a lot for the optimizer to pick an index. There are many reasons for this, the first being that the data is very well clustered relative to the index – the index’s clustering_factor is the smallest it could be for a B-tree indexing every row in this table.

Another important feature, though, is that I haven’t done anything with the system statistics so the optimizer was using various default values which tell it that a multiblock read will be quite small (eight blocks) and a lot slower than a single block read (26 ms vs. 12 ms). One simple change that many people might have made during or shortly after installation (though it shouldn’t really be done in any modern version of Oracle) is to set the db_file_multiblock_read_count parameter to 128 – with just this change the optimizer would assume that a multiblock read really would be 128 blocks, but that it would now take 266 ms. That means the optimizer will assume that the read will be ten times slower than it was, but will read 32 times as much data – a fairly significant relative improvement thanks to which the access path for my initial query will switch to a full tablescan and won’t switch back to an index range scan until I reduce the range from 245,000 to something like 160,000.

I can go further, of course. With a few background queries running to exercise the database I executed the dbms_stats.gather_system_stats() procedure with the ‘start’ and ‘stop’ options to collect some figures about the hardware and expected workload. This gave me the following results, interpreted from the sys.aux_stats$ table:


MBRC       :126
MREADTIM   :0.902
SREADTIM   :0.386
CPUSPEED   :976
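
For reference, workload statistics of this kind are gathered with calls along the following lines (a sketch only, not the exact sequence used to produce the figures above):

execute dbms_stats.gather_system_stats('start')

-- ... leave a representative workload running for a while ...

execute dbms_stats.gather_system_stats('stop')

-- the captured values can then be checked directly:
select pname, pval1 from sys.aux_stats$ where sname = 'SYSSTATS_MAIN';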

With the optimizer using these figures to compare the relative speed and size of single and multiblock reads, I had to reduce my selected range to roughly 51,000 before the optimizer would choose the index range scan.

I could go on to demonstrate the effects of the dbms_resource_manager.calibrate_io procedure and the effects of allowing different degrees of parallelism with different system stats, but I think I’ve probably made the point that there’s a lot of variation in the break point between index range scans and tablescans EVEN when you don’t change the data. With this very well ordered (perfect clustering_factor) data I’ve seen the break point vary between 51,000 rows and 245,000 rows (5% and 25%).

And finally …

Let’s just finish with a last (and probably the most important) variation: changing the pattern in the data we want from perfectly clustered to extremely scattered. If you check the query that generated the data you’ll see that we can do this by creating an index on column n1 instead of column id, and changing the where clause in the test query to n1 between 1 and 4500 (which, in my case, returned slightly more than 4,500 rows thanks to a small amount of duplication generated by the call to dbms_random.value()). At slightly under 0.5% of the data (and with my most recent settings for the system statistics) the optimizer chose to use a tablescan.

Remember, there are many factors involved in the optimizer choosing between a tablescan and an index range scan and one of the most significant factors in the choice is the (apparent) clustering of the data so, if you haven’t come across it before, you should examine the “table_cached_blocks” option that appeared in 11.2.0.4 for the procedure dbms_stats.set_table_prefs() as this allows you to give the optimizer a better idea of how well your data really is clustered.
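
By way of illustration only (assuming the t1 table and t1_i1 index from the test above), setting the preference and regathering the index statistics might look like this:

begin
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 'T1',
                pname   => 'table_cached_blocks',
                pvalue  => '16'
        );
        -- the clustering_factor is recalculated at the next stats gather
        dbms_stats.gather_index_stats(user, 'T1_I1');
end;
/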

Addendum (April 2016)

Following on from the demonstration of how changes in parameters, patterns and statistics can make a difference in what we (or the optimizer) might consider a “small” amount of data and whether an indexed access path would be appropriate, it’s worth mentioning that the Exadata technologies of smart scans and hybrid columnar compression and Oracle’s latest technology of In-Memory Column Store do not change the way you think about indexes – they only change the (unspecifiable) volume at which an index ceases to be the better option to use.

 


Deadpool

Tim Hall - Tue, 2016-03-29 04:11

Yesterday I went to a late showing of Deadpool. If you haven’t seen the trailers already, don’t watch them as they give too much away!

I’ve been wanting to see it for a while, but I can’t get my cinema mojo back. The last thing I went to see was the new Star Wars film. I used to really love going to the cinema, but these days I really don’t like it. Even when I enjoy the film, I find the process rather irritating.

The opening credits were funny and totally irreverent. They really set the mood for the whole film.

Probably the hardest thing about introducing a “new character” to the audience (let’s forget the X-Men Origins: Wolverine stuff, since this film has) is you have to go through the backstory, which is typically pretty boring. Deadpool also had to do the boring stuff, but at least it does it in a more interesting way. Switching between backstory and “current time” crazy action is probably the best way to get it done.

The Deadpool character is very different to every other super hero. He breaks the fourth wall (talks to the audience), which is especially interesting when the characters around him seem confused, like they don’t understand who he is talking to. Added to this he self-references, is edgier and funnier than most super heroes. Also, the content is significantly more adult than any of the X-Men films that came before it.

I guess the bit I liked the least was how they gave him his mutant powers. I preferred the idea they used in X-Men Origins: Wolverine. I thought the method they used in this film seemed kind-of lame.

Overall, I’m glad I went to see it. I just wish I hadn’t seen the trailers as they give away a lot of the good stuff, which I think would have been more epic and funny if I hadn’t already seen it.

Honourable mentions go out to the two female mutant characters (Negasonic Teenage Warhead and Angel Dust) who were awesome. I would have liked to see more of them in the film.

Cheers

Tim…


My Demo Application

Denes Kubicek - Tue, 2016-03-29 00:34
If you experience a problem accessing My Demo Application (old and new version) at apex.oracle.com, there is a reason for that. One of the users deleted all the applications in my workspace and installed some of his own. I don't think this was intended, but it is still a lot of work and I am trying to get all the important stuff back online. After that I will need to rethink the way access to the workspace is granted. This is the second time it has happened within the last four months.

Categories: Development

Another SQL Profile to the rescue!

Bobby Durrett's DBA Blog - Mon, 2016-03-28 18:57

We have had problems with a set of databases over the past few weeks. Our team does not support these databases, but my director asked me to help. These are 11.2.0.1 Windows 64 bit Oracle databases running on Windows 2008. The incident reports said that the systems stop working and that the main symptom was that the oracle.exe process uses all the CPU. They were bouncing the database server when they saw this behavior and it took about 30 minutes after the bounce for the CPU to go back down to normal. A Windows server colleague told me that at some point in the past a new version of virus software had apparently caused high CPU from the oracle.exe process.

At first I looked for some known bugs related to high CPU and virus checkers without much success. Then I got the idea of just checking for query performance. After all, a poorly performing query can eat up a lot of CPU. These Windows boxes only have 2 cores so it would not take many concurrently running high CPU queries to max it out. So, I got an AWR report covering the last hour of a recent incident. This was the top SQL:

Top SQL

The top query, sql id 27d8x8p6139y6, stood out as very inefficient and all CPU. It seemed clear to me from this listing that the 2 core box had a heavy load and a lot of waiting for CPU queuing. %IO was zero but %CPU was only 31%. Most likely the rest was CPU queue time.

I also looked at my sqlstat report to see which plans 27d8x8p6139y6 had used over time.

PLAN_HASH_VALUE END_INTERVAL_TIME     EXECUTIONS Elapsed ms
--------------- --------------------- ---------- -----------
     3067874494 07-MAR-16 09.00.50 PM        287  948.102286
     3067874494 07-MAR-16 10.00.03 PM        292  1021.68191
     3067874494 07-MAR-16 11.00.18 PM        244  1214.96161
     3067874494 08-MAR-16 12.00.32 AM        276  1306.16222
     3067874494 08-MAR-16 01.00.45 AM        183  1491.31307
      467860697 08-MAR-16 01.00.45 AM        125      .31948
      467860697 08-MAR-16 02.00.59 AM        285  .234073684
      467860697 08-MAR-16 03.00.12 AM        279  .214354839
      467860697 08-MAR-16 04.00.25 AM        246   .17147561
      467860697 08-MAR-16 05.00.39 AM         18        .192
     2868766721 13-MAR-16 06.00.55 PM         89    159259.9
     3067874494 13-MAR-16 06.00.55 PM          8  854.384125
     2868766721 13-MAR-16 07.00.50 PM         70  1331837.56

Plan 2868766721 seemed terrible but plan 467860697 seemed great.

Our group doesn’t support these databases so I am not going to dig into how the application gathers statistics, what indexes it uses, or how the vendor designed the application. But, it seems possible that forcing the good plan with a SQL Profile could resolve this issue without having any access to the application or understanding of its design.

But, before plunging headlong into the use of a SQL Profile I looked at the plan and the SQL text.  I have edited these to hide any proprietary details:

SELECT T.*
    FROM TAB_MYTABLE1 T,
         TAB_MYTABLELNG A,
         TAB_MYTABLE1 PIR_T,
         TAB_MYTABLELNG PIR_A
   WHERE     A.MYTABLELNG_ID = T.MYTABLELNG_ID
         AND A.ASSIGNED_TO = :B1
         AND A.ACTIVE_FL = 1
         AND T.COMPLETE_FL = 0
         AND T.SHORTED_FL = 0
         AND PIR_T.MYTABLE1_ID = T.PIR_MYTABLE1_ID
         AND ((PIR_A.FLOATING_PIR_FL = 1 
               AND PIR_T.COMPLETE_FL = 1)
              OR PIR_T.QTY_PICKED IS NOT NULL)
         AND PIR_A.MYTABLELNG_ID = PIR_T.MYTABLELNG_ID
         AND PIR_A.ASSIGNED_TO IS NULL
ORDER BY T.MYTABLE1_ID

The key thing I noticed is that there was only one bind variable. The innermost part of the good plan uses an index on the column that the query equates with the bind variable. The rest of the plan is a nice nested loops plan with range and unique index scans. I see plans in this format in OLTP queries where you are looking up small numbers of rows using an index and join to related tables.

-----------------------------------------------------------------
Id | Operation                        | Name                     
-----------------------------------------------------------------
 0 | SELECT STATEMENT                 |                          
 1 |  SORT ORDER BY                   |                          
 2 |   NESTED LOOPS                   |                          
 3 |    NESTED LOOPS                  |                          
 4 |     NESTED LOOPS                 |                          
 5 |      NESTED LOOPS                |                          
 6 |       TABLE ACCESS BY INDEX ROWID| TAB_MYTABLELNG           
 7 |        INDEX RANGE SCAN          | AK_MYTABLELNG_BY_USER    
 8 |       TABLE ACCESS BY INDEX ROWID| TAB_MYTABLE1             
 9 |        INDEX RANGE SCAN          | AK_MYTABLE1_BY_MYTABLELNG
10 |      TABLE ACCESS BY INDEX ROWID | TAB_MYTABLE1             
11 |       INDEX UNIQUE SCAN          | PK_MYTABLE1              
12 |     INDEX UNIQUE SCAN            | PK_MYTABLELNG            
13 |    TABLE ACCESS BY INDEX ROWID   | TAB_MYTABLELNG           
-----------------------------------------------------------------

The bad plan had a gross Cartesian merge join:

Plan hash value: 2868766721

----------------------------------------------------------------
Id | Operation                       | Name                     
----------------------------------------------------------------
 0 | SELECT STATEMENT                |                          
 1 |  NESTED LOOPS                   |                          
 2 |   NESTED LOOPS                  |                          
 3 |    MERGE JOIN CARTESIAN         |                          
 4 |     TABLE ACCESS BY INDEX ROWID | TAB_MYTABLE1             
 5 |      INDEX FULL SCAN            | PK_MYTABLE1              
 6 |     BUFFER SORT                 |                          
 7 |      TABLE ACCESS BY INDEX ROWID| TAB_MYTABLELNG           
 8 |       INDEX RANGE SCAN          | AK_MYTABLELNG_BY_USER    
 9 |    TABLE ACCESS BY INDEX ROWID  | TAB_MYTABLE1             
10 |     INDEX RANGE SCAN            | AK_MYTABLE1_BY_MYTABLELNG
11 |   TABLE ACCESS BY INDEX ROWID   | TAB_MYTABLELNG           
12 |    INDEX RANGE SCAN             | AK_MYTABLELNG_BY_USER    
----------------------------------------------------------------

Reviewing the SQL made me believe that there was a good chance that a SQL Profile forcing the good plan would resolve the issue. Sure, there could be some weird combination of data and bind variable values that make the bad plan the better one. But, given that this was a simple transactional application it seems most likely that the straightforward nested loops with index on the only bind variable plan would be best.
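
For anyone who has not manually created a SQL Profile from a known plan before, here is a rough sketch of one common approach (not necessarily the exact method used in this case): pull the outline hints of the good plan out of AWR and import them as a profile. The profile name below is made up for illustration.

declare
  v_sql_text  clob;
  v_hints     sys.sqlprof_attr;
begin
  -- outline hints recorded in AWR for the known-good plan
  select extractvalue(value(h), '.')
  bulk collect into v_hints
  from   dba_hist_sql_plan p,
         table(xmlsequence(
           extract(xmltype(p.other_xml), '/*/outline_data/hint'))) h
  where  p.sql_id          = '27d8x8p6139y6'
  and    p.plan_hash_value = 467860697
  and    p.other_xml is not null;

  -- full text of the statement
  select sql_text
  into   v_sql_text
  from   dba_hist_sqltext
  where  sql_id = '27d8x8p6139y6';

  dbms_sqltune.import_sql_profile(
    sql_text    => v_sql_text,
    profile     => v_hints,
    name        => 'PROFILE_27d8x8p6139y6',  -- illustrative name
    force_match => true
  );
end;
/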

We used the SQL Profile to force these plans on four servers and so far the SQL Profile has resolved the issues. I’m not saying that forcing a plan using a SQL Profile is the only or even best way to resolve query performance issues. But, this was a good example of where a SQL Profile makes sense. If modifying the application, statistics, parameters, and schema is not possible then a SQL Profile can come to your rescue in a heartbeat.

Bobby

Categories: DBA Blogs

printing from apex using pl/sql

Pat Shuff - Mon, 2016-03-28 15:55
As part of our exercise to convert an Excel spreadsheet to APEX, we ran into some simple logic that required more program logic and fewer database select statements. The question comes up: how do you do an if-then-else statement in APEX? In this example we are trying to advise between using compute on demand or dedicated compute resources. With dedicated compute resources, the smallest purchase is 50 OCPUs as an aggregate. With compute on demand we can go as low as 1 OCPU. If we request 14 systems with 4 OCPUs each, it might be better to request 50 dedicated OCPUs.

A typical question form would look like the following image, allowing us to ask for the processor shape as well as the quantity. If the total amount exceeds 50 processors, we output a message suggesting dedicated compute rather than compute on demand.

To get this message on the screen, we first had to pull in the questions that we ask using variables. In this example, we read in the UNMETERED_COMPUTE_SHAPE which is a pick list that allows you to select (1, 2, 4, 8, or 16) OCPU shapes. You can also type in a quantity number into UNMETERED_COMPUTE_QUANTITY. The product of these two values allows us to suggest dedicated or compute on demand for economic reasons.

To execute pl/sql commands, we have to change the content type. To create this area we first create a sub-region. We change the name of the sub-region to represent the question that we are trying to answer. For this example we use the title "Compute on Demand vs Dedicated Compute" as the sub-region header. We then change the type to "PL/SQL Dynamic Content". Under this we can then enter our dynamic code. The sub-region looks like this:

If you click on the expand button it opens a full editor allowing you to edit the code. In our example we are going to read the variables :UNMETERED_COMPUTE_SHAPE and :UNMETERED_COMPUTE_QUANTITY. Notice the colon in front of these names. This is how we treat the values as variables read from APEX. The code is very simple. It starts with a begin statement followed by an if statement. The if statement looks to see if we are allocating more than 50 processors. We then output a statement suggesting dedicated or compute on demand using the htp.p function call. This call prints what is passed to it to the screen. The code should look like this:

Overall, this is a simple way of producing output that requires control flow. In the previous example we used a select statement to output calculations. In this example we output different sections and different recommendations based on our selections. We could also set variables that would expose or hide different sub-regions below this section. This is done by setting :OUTPUT_VARIABLE = desired_value. If we set the value inside the pl/sql code block, we can hide or expose sections as we did in a previous blog by setting a value from a pull down menu.
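
As a quick sketch of that idea (the page item :RECOMMEND_DEDICATED is made up for illustration; it would be a hidden item referenced in the conditions of the sub-regions further down the page):

BEGIN
  if (:UNMETERED_COMPUTE_SHAPE * :UNMETERED_COMPUTE_QUANTITY > 50) THEN
    -- expose the dedicated compute sub-region
    :RECOMMEND_DEDICATED := 'Y';
  else
    :RECOMMEND_DEDICATED := 'N';
  end if;
END;

Each sub-region would then use a condition that checks the value of RECOMMEND_DEDICATED.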

The code used to output the recommendation is as follows

BEGIN
if (:UNMETERED_COMPUTE_SHAPE * :UNMETERED_COMPUTE_QUANTITY > 50) THEN
    htp.p('You might consider dedicated compute since you have ' 
     || :UNMETERED_COMPUTE_SHAPE * :UNMETERED_COMPUTE_QUANTITY 
     || ' OCPUs which is greater than the smallest dedicated compute of 50 OCPUs');
else
    htp.p('Compute on Demand for a total of ' 
     || :UNMETERED_COMPUTE_SHAPE * :UNMETERED_COMPUTE_QUANTITY || ' OCPUs');
end if;
END;

JRE 1.8.0.77 Certified with Oracle EBS 12.x

Steven Chan - Mon, 2016-03-28 13:46


Java Runtime Environment 1.8.0_77 (a.k.a. JRE 8u77-b3) and later updates on the JRE 8 codeline are now certified with Oracle E-Business Suite 12.1 and 12.2 for Windows desktop clients.

This certification addresses Oracle Security Alert for CVE-2016-0636.

All JRE 6, 7, and 8 releases are certified with EBS upon release

Our standard policy is that all E-Business Suite customers can apply all JRE updates to end-user desktops:

  • From JRE 1.6.0_03 and later updates on the JRE 6 codeline
  • From JRE 1.7.0_10 and later updates on the JRE 7 codeline 
  • From JRE 1.8.0_25 and later updates on the JRE 8 codeline
We test all new JRE releases in parallel with the JRE development process, so all new JRE releases are considered certified with the E-Business Suite on the same day that they're released by our Java team. 

You do not need to wait for a certification announcement before applying new JRE 6, 7, or 8 releases to your EBS users' desktops.

32-bit and 64-bit versions certified

This certification includes both the 32-bit and 64-bit JRE versions for various Windows operating systems. See the respective Recommended Browser documentation for your EBS release for details.

Where are the official patch requirements documented?

All patches required for ensuring full compatibility of the E-Business Suite with JRE 8 are documented in these Notes:

For EBS 12.1 & 12.2

EBS + Discoverer 11g Users

This JRE release is certified for Discoverer 11g in E-Business Suite environments with the following minimum requirements:

Implications of Java 6 and 7 End of Public Updates for EBS Users

The Oracle Java SE Support Roadmap and Oracle Lifetime Support Policy for Oracle Fusion Middleware documents explain the dates and policies governing Oracle's Java Support.  The client-side Java technology (Java Runtime Environment / JRE) is now referred to as Java SE Deployment Technology in these documents.

Starting with Java 7, Extended Support is not available for Java SE Deployment Technology.  It is more important than ever for you to stay current with new JRE versions.

If you are currently running JRE 6 on your EBS desktops:

  • You can continue to do so until the end of Java SE 6 Deployment Technology Extended Support in June 2017
  • You can obtain JRE 6 updates from My Oracle Support.  See:

If you are currently running JRE 7 on your EBS desktops:

  • You can continue to do so until the end of Java SE 7 Deployment Technology Premier Support in July 2016
  • You can obtain JRE 7 updates from My Oracle Support.  See:

If you are currently running JRE 8 on your EBS desktops:

Will EBS users be forced to upgrade to JRE 8 for Windows desktop clients?

No.

This upgrade is highly recommended but remains optional while Java 6 and 7 are covered by Extended Support. Updates will be delivered via My Oracle Support, where you can continue to receive critical bug fixes and security fixes as well as general maintenance for JRE 6 and 7 desktop clients. Note that there are different impacts of enabling JRE Auto-Update depending on your current JRE release installed, despite the availability of ongoing support for JRE 6 and 7 for EBS customers; see the next section below.

Impact of enabling JRE Auto-Update

Java Auto-Update is a feature that keeps desktops up-to-date with the latest Java release.  The Java Auto-Update feature connects to java.com at a scheduled time and checks to see if there is an update available.

Enabling the JRE Auto-Update feature on desktops with JRE 6 installed will have no effect.

With the release of the January Critical Patch Updates, the Java Auto-Update Mechanism will automatically update JRE 7 plug-ins to JRE 8.

Enabling the JRE Auto-Update feature on desktops with JRE 8 installed will apply JRE 8 updates.

Coexistence of multiple JRE releases on Windows desktops

The upgrade to JRE 8 is recommended for EBS users, but some users may need to run older versions of JRE 6 or 7 on their Windows desktops for reasons unrelated to the E-Business Suite.

Most EBS configurations with IE and Firefox use non-static versioning by default. JRE 8 will be invoked instead of earlier JRE releases if both are installed on a Windows desktop. For more details, see "Appendix B: Static vs. Non-static Versioning and Set Up Options" in Notes 290807.1 and 393931.1.

What do Mac users need?

JRE 8 is certified for Mac OS X 10.8 (Mountain Lion), 10.9 (Mavericks), and 10.10 (Yosemite) desktops.  For details, see:

Will EBS users be forced to upgrade to JDK 8 for EBS application tier servers?

No.

JRE is used for desktop clients.  JDK is used for application tier servers.

JRE 8 desktop clients can connect to EBS environments running JDK 6 or 7.

JDK 8 is not certified with the E-Business Suite.  EBS customers should continue to run EBS servers on JDK 6 or 7.

Known Issues

Internet Explorer Performance Issue

Launching JRE 1.8.0_73 through Internet Explorer will have a delay of around 20 seconds before the applet starts to load (Java Console will come up if enabled).

This issue is fixed in JRE 1.8.0_74.  Internet Explorer users are advised to move to that version of JRE 8.

Form Focus Issue

Clicking outside the frame during forms launch may cause a loss of focus when running with JRE 8 and can occur in all Oracle E-Business Suite releases. To fix this issue, apply the following patch:

References

Related Articles
Categories: APPS Blogs

JRE 1.7.0_99 Certified with Oracle E-Business Suite 12.x

Steven Chan - Mon, 2016-03-28 13:40


Java Runtime Environment 1.7.0_99 (a.k.a. JRE 7u99-b4) and later updates on the JRE 7 codeline are now certified with Oracle E-Business Suite Release 12.x for Windows-based desktop clients.

This certification addresses Oracle Security Alert for CVE-2016-0636.

All JRE 6, 7, and 8 releases are certified with EBS upon release

Our standard policy is that all E-Business Suite customers can apply all JRE updates to end-user desktops:

  • From JRE 1.6.0_03 and later updates on the JRE 6 codeline
  • From JRE 1.7.0_10 and later updates on the JRE 7 codeline 
  • From JRE 1.8.0_25 and later updates on the JRE 8 codeline
We test all new JRE releases in parallel with the JRE development process, so all new JRE releases are considered certified with the E-Business Suite on the same day that they're released by our Java team. 

You do not need to wait for a certification announcement before applying new JRE 6, 7, or 8 releases to your EBS users' desktops.

Effects of new support dates on Java upgrades for EBS environments

Support dates for the E-Business Suite and Java have changed.  Please review the sections below for more details:

  • What does this mean for Oracle E-Business Suite users?
  • Will EBS users be forced to upgrade to JRE 7 for Windows desktop clients?
  • Will EBS users be forced to upgrade to JDK 7 for EBS application tier servers?

32-bit and 64-bit versions certified

This certification includes both the 32-bit and 64-bit JRE versions for various Windows operating systems. See the respective Recommended Browser documentation for your EBS release for details.

Where are the official patch requirements documented?

EBS + Discoverer 11g Users

This JRE release is certified for Discoverer 11g in E-Business Suite environments with the following minimum requirements:

JRE 7 End of Public Updates

The JRE 7u79 release was the last JRE 7 update available to the general public.  Java is an integral part of the Oracle E-Business Suite technology stack, so EBS users will continue to receive Java SE 7 updates until the end of Java SE 7 Premier Support in July 2016.

How can EBS customers obtain Java 7 updates after the public end-of-life?

EBS customers can download Java 7 patches from My Oracle Support.  For a complete list of all Java SE patch numbers, see:

Both JDK and JRE packages are now contained in a single combined download.  Download the "JDK" package for both the desktop client JRE and the server-side JDK package. 

Coexistence of multiple JRE releases on Windows desktops

The upgrade to JRE 8 is recommended for EBS users, but some users may need to run older versions of JRE 6 or 7 on their Windows desktops for reasons unrelated to the E-Business Suite.

Most EBS configurations with IE and Firefox use non-static versioning by default. JRE 8 will be invoked instead of earlier JRE releases if both are installed on a Windows desktop. For more details, see "Appendix B: Static vs. Non-static Versioning and Set Up Options" in Notes 290807.1 and 393931.1.

Java Auto-Update Mechanism

With the release of the January 2015 Critical Patch Updates, the Java Auto-Update Mechanism will automatically update JRE 7 plug-ins to JRE 8.

What do Mac users need?

Mac users running Mac OS X 10.7 (Lion), 10.8 (Mountain Lion), 10.9 (Mavericks), and 10.10 (Yosemite) can run JRE 7 or 8 plug-ins.  See:

Will EBS users be forced to upgrade to JDK 7 for EBS application tier servers?

JRE is used for desktop clients.  JDK is used for application tier servers.

JDK upgrades for E-Business Suite application tier servers are highly recommended but currently remain optional while Java 6 is covered by Extended Support. Updates will be delivered via My Oracle Support, where you can continue to receive critical bug fixes and security fixes as well as general maintenance for JDK 6 for application tier servers. 

Java SE 6 is covered by Extended Support until June 2017.  All EBS customers with application tier servers on Windows, Solaris, and Linux must upgrade to JDK 7 by June 2017. EBS customers running their application tier servers on other operating systems should check with their respective vendors for the support dates for those platforms.

JDK 7 is certified with E-Business Suite 12.  See:

Known Issues

When using Internet Explorer, JRE 1.7.0_01 had a delay of around 20 seconds before the applet started to load. This issue is fixed in JRE 1.7.0_95.

References

Related Articles
Categories: APPS Blogs

Adrift in a Data Lake – an Oracle Developer’s first steps in Hadoop

The Anti-Kyte - Mon, 2016-03-28 11:49

We live in interesting times. As I write, the political life of a great nation is currently in thrall to a wealthy right-wing populist with a rather distinctive hairstyle.
But enough about Boris Johnson.

For someone used to the way things are done in Oracle, Hadoop can be something of a culture shock.
My initial introduction to Hadoop and its “vibrant ecosystem” was some internally written documentation.
As with most technical documentation, it was written with the assumption of a certain level of prior knowledge on the part of the reader. For this particular reader, it proved to be an erroneous assumption.

After half an hour of struggling through this text, I was left wondering what you would use a Khafka Banana Oozie for.

Maybe it’s for killing zombie processes ? Let’s face it, with all that Java running on your system there are bound to be a few knocking around.
I’m a little hazy on my Zombie Lore, so I’m not entirely clear as to why a rapid application of fresh fruit would pose an existential threat to your average zombie. Maybe they’re potassium intolerant ?

There are a bewildering array of tools associated with Hadoop, many of which appear to provide almost identical functionality.
For example, a relational database traditionally requires only one SQL Engine…and I’ll be taking a look at two of them.

Moving from Oracle to Hadoop can feel rather like trading your car in for a box of Lego.
If the box does contain any instructions they seem to have been translated from Java into English…by someone who doesn’t speak either.
Note to reader : please substitute your own language as appropriate.

Fortunately, there are Hadoop distributions available which bundle the core tools required to get up and running. We’ll gloss over the fact that these distributions – Cloudera and Hortonworks – don’t contain exactly the same tools.

In my search to find a coherent explanation of how to use Hadoop, I’ve found that the VM provided by Cloudera, together with the introductory tutorial, is a valuable aid to familiarising myself with the basics.

To start with then, I’ll cover getting hold of the Cloudera VM and setting it up in VirtualBox.
Then I’ll go through some of the tools available and what they do.
I’ll do this from the perspective of an Oracle developer (mainly because I don’t have much option) and will point out the driftwood of familiarity that you might be able to cling to in order to stay afloat in your Data Lake.

What I’ll cover is :

  • The core components of Hadoop
  • HDFS commands
  • Transferring data between a relational database and hadoop using SQOOP
  • Querying structured data using Hive and Impala
  • Uploading data using Kite

Ready to Dive in ?

Cloudera VM setup for Virtualbox

Before we head off to get the VM, there are a couple of points worth mentioning.

First of all, the VM is running on a 64-bit version of Centos 6.4. Therefore you need to make sure that your host operating system ( i.e. the one that you’ll be running Virtual Box on) is also 64-bit.

Secondly, the minimum recommended memory allocation for the VM is 4GB.

Finally, I’m using the Cloudera Quickstart 5.5 VM. The behaviour of some of the tools covered here differs between 5.4 and 5.5.
The Cloudera VM can be downloaded from here.

You’ll need to fill in some details about yourself before moving on.

Once you’re at the downloads page, select VirtualBox from the Platform drop-down list.
VMs for other platforms are also available.

Once you hit the “Download Now” button you’ll be asked for some further details. However, these do not appear to be validated.

The download is quite chunky, just over 5GB.

Once it’s completed you should have a zip file called :

cloudera-quickstart-vm-5.5.0-0-virtualbox.zip

You can use a standard unzip utility to extract this file ( e.g. Winzip on Windows, plain old zip on Linux).
Once extracted, you’ll see a directory called cloudera-quickstart-vm-5.5.0-0-virtualbox which contains two files :

cloudera-quickstart-vm-5.5.0-0-virtualbox-disk1.vmdk
cloudera-quickstart-vm-5.5.0-0-virtualbox.ovf

The .ovf file is the VirtualBox image and the .vmdk is the VM’s hard-drive image.

To set up the VM in Virtualbox, you’ll need to import the .ovf file as an Appliance.
The steps to do this are the same as those for importing an Oracle Developer Day 12c Image.

When you first start the VM it does take a while to come up. When it does, you should see

cloudera_vm_start

Tweaking the VM settings

If you don’t happen to live in Boston and/or you don’t have a US keyboard, you may want to make a couple of configuration changes to the VM.

To change the Location ( and consequently the Time Zone) :

  1. Click on the Date and Time on the Status Bar. It’s in the top right-hand corner of the screen.
  2. Hit the Edit button next to Locations
  3. Add your location by typing it in the Location Name dialog box (a drop-down list will appear as you start typing).
  4. Now highlight Boston and click the Remove button. Your new location should now show at the bottom of the Time window
  5. Move your mouse over your location and an Edit button should appear. Click this and the Date and Time should now reflect your new location

set_location

You will be prompted for a password to confirm this change. You need to enter cloudera

To change the keyboard layout :

  1. Go to the System Menu and select Preferences and then Keyboard
  2. Navigate to the Layouts tab and click Add
  3. Select your preferred layout from the drop-down list
  4. Once your new layout is shown, click the Default radio button next to it

set_keyboard

These changes should take effect without the need to re-start the VM.

OK, so let’s get started….

The Core components of Hadoop

Hadoop itself comprises three main components :

  • HDFS – a distributed file system
  • The MapReduce framework
  • Yarn – a Job scheduler / resource manager for the parallel execution of MapReduce jobs

MapReduce

The MapReduce framework consists of a Mapper and a Reducer.
In SQL terms, the Mapper program applies a query predicate against the data you are querying – essentially, it does the filtering.
The Reducer then does any aggregation on the result set produced from the Mapper process.
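
To put that in familiar terms, here is a purely illustrative SQL analogy (using a made-up EMP table; this is not how MapReduce code is actually written):

select deptno, count(*)     -- the aggregation corresponds to the Reducer
from   emp
where  sal > 1000           -- the filtering predicate corresponds to the Mapper
group by deptno;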

Yarn

Yarn is the default job scheduler and resource manager. It facilitates parallel execution of MapReduce jobs.

HDFS

HDFS – Hadoop File System – is a distributed file system. The idea is that datafiles are replicated across multiple nodes (physical servers) in a cluster. Essentially, any program can run on any node. By replicating the data to each node, network latency is minimised for these programs.
It comes with its own set of commands which you can use interactively.
These appear to be largely a subset of those you’d find on a Linux OS.

The format of these commands is :

hadoop fs -command [some arguments]

However, the way that the VM is configured, we need to run these commands as the hdfs user.
Therefore, in the examples that follow, the commands will take the format :

sudo -u hdfs hadoop fs -command [some arguments]

To start with, let’s see what we’d need to do to upload a csv file to HDFS.

The file in question is called tools.csv, which I’ve created in the cloudera user’s home directory on the vm.
It contains the following :

tools.csv

To start with, let’s see what’s currently in hdfs :

sudo -u hdfs hadoop fs -ls /

Found 5 items
drwxrwxrwx   - hdfs  supergroup          0 2015-11-18 10:57 /benchmarks
drwxr-xr-x   - hbase supergroup          0 2016-03-23 12:08 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2016-03-22 12:23 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2015-11-18 11:01 /user
drwxr-xr-x   - hdfs  supergroup          0 2015-11-18 11:00 /var

This shows us the top-level directories. Let’s take a look at what’s in /user :

sudo -u hdfs hadoop fs -ls /user

Found 8 items
drwxr-xr-x   - cloudera cloudera            0 2015-11-18 10:56 /user/cloudera
drwxr-xr-x   - mapred   hadoop              0 2015-11-18 10:57 /user/history
drwxrwxrwx   - hive     supergroup          0 2015-11-18 11:01 /user/hive
drwxrwxrwx   - hue      supergroup          0 2015-11-18 10:58 /user/hue
drwxrwxrwx   - jenkins  supergroup          0 2015-11-18 10:58 /user/jenkins
drwxrwxrwx   - oozie    supergroup          0 2015-11-18 10:59 /user/oozie
drwxrwxrwx   - root     supergroup          0 2015-11-18 10:58 /user/root
drwxr-xr-x   - hdfs     supergroup          0 2015-11-18 11:01 /user/spark

For the purposes of this test, I’ll create a directory under the /user/cloudera directory, and then check that it’s been created as expected :

sudo -u hdfs hadoop fs -mkdir /user/cloudera/test
sudo -u hdfs hadoop fs -ls /user/cloudera
Found 1 items
drwxr-xr-x   - hdfs cloudera          0 2016-03-23 14:25 /user/cloudera/test

Notice that only the directory owner has write permissions on the directory.
As I’m feeling reckless, I want to grant write permissions to everyone.
This can be done as follows :

sudo -u hdfs hadoop fs -chmod a+w /user/cloudera/test
sudo -u hdfs hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - hdfs cloudera          0 2016-03-23 14:25 /user/cloudera/test

In HDFS, the chmod command seems to accept the same arguments as its Linux counterpart.

To check that we can now see the directory :

sudo -u hdfs hadoop fs -ls /user/cloudera/test

The simplest way to load our csv is to use the put command :

sudo -u hdfs hadoop fs -put tools.csv /user/cloudera/test/put_tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 1 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 14:49 /user/cloudera/test/put_tools.csv

There is another way to do this :

sudo -u hdfs hadoop fs -copyFromLocal tools.csv /user/cloudera/test/tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 2 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 14:49 /user/cloudera/test/put_tools.csv
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 17:04 /user/cloudera/test/tools.csv

If we want to delete a file then :

sudo -u hdfs hadoop fs -rm /user/cloudera/test/put_tools.csv
16/03/23 17:06:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/cloudera/test/put_tools.csv
sudo -u hdfs hadoop fs -ls /user/cloudera/test
Found 1 items
-rw-r--r--   1 hdfs cloudera        301 2016-03-23 17:04 /user/cloudera/test/tools.csv

You can display the contents of a file in hdfs by using the cat command :

sudo -u hdfs hadoop fs -cat /user/cloudera/test/tools.csv
tool,description
hue,Web-based UI for Hadoop
sqoop,Transfer structured data between an RDBMS and Hadoop
flume,stream a file into Hadoop
impala,a query engine
hive,another query engine
spark,a query engine that is not hive or impala
khafka,a scheduler
banana,a web UI framework
oozie,another scheduler

In order to demonstrate copying a file from hdfs to the local filesystem in the VM, we’ll need to create a directory that the hdfs user has access to :

mkdir test
chmod a+rw test
cd test
ls -ld
drwxrwxrwx 2 cloudera cloudera 4096 Mar 23 17:13 .

Now, as the hdfs user, we can retrieve our file from hdfs onto the local file system using copyToLocal :

sudo -u hdfs hadoop fs -copyToLocal /user/cloudera/test/tools.csv /home/cloudera/test/welcome_back.csv
ls -l welcome_back.csv
-rw-r--r-- 1 hdfs hdfs 301 Mar 23 17:18 welcome_back.csv

Another method of doing this is using get :

sudo -u hdfs hadoop fs -get /user/cloudera/test/tools.csv /home/cloudera/test/got_it.csv
ls -l got_it.csv
-rw-r--r-- 1 hdfs hdfs 301 Mar 23 17:21 got_it.csv

One final hdfs command that may come in useful is du, which shows the amount of space used by a directory or file on hdfs :

sudo -u hdfs hadoop fs -du /user/cloudera
301  301  /user/cloudera/test

SQOOP

The VM comes with a MySQL database from which data is loaded into Hadoop via SQOOP.
SQOOP is a tool for transferring structured data between an RDBMS and Hadoop.

The documentation does say that SQOOP is capable of loading data into Oracle using the command (from the Local File System) :

sqoop import --connect jdbc:oracle:thin:@//db_name --table table_name

However, said documentation says that it’s been tested with Oracle 10.2 Express Edition, so you may want to have a play around with it before using it in anger.

The tutorial directs us to use SQOOP to ingest all of the data from the MySQL database by running the following command :

sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

There’s a fair amount going on here, we’re connecting to MySQL, then outputting the data as a compressed file onto hdfs in the /user/hive/warehouse directory.
The compression library being used is Snappy.

It’s instructive to see the output when we run this command as it shows both MapReduce and Yarn in action. You’ll probably see lines like :

...
16/03/23 17:36:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1458734644938_0001
16/03/23 17:36:08 INFO impl.YarnClientImpl: Submitted application application_1458734644938_0001
16/03/23 17:36:08 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1458734644938_0001/
16/03/23 17:36:08 INFO mapreduce.Job: Running job: job_1458734644938_0001
16/03/23 17:37:28 INFO mapreduce.Job: Job job_1458734644938_0001 running in uber mode : false
16/03/23 17:37:28 INFO mapreduce.Job:  map 0% reduce 0%
16/03/23 17:38:29 INFO mapreduce.Job:  map 100% reduce 0%
16/03/23 17:38:34 INFO mapreduce.Job: Job job_1458734644938_0001 completed successfully
...

After a fair amount of time, the command should end with :

...
16/03/23 17:53:53 INFO mapreduce.ImportJobBase: Transferred 46.1318 KB in 157.9222 seconds (299.1283 bytes/sec)
16/03/23 17:53:53 INFO mapreduce.ImportJobBase: Retrieved 1345 records.

If we now check, we can see that a directory has been created for each table :

hadoop fs -ls /user/hive/warehouse
Found 6 items
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:38 /user/hive/warehouse/categories
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:41 /user/hive/warehouse/customers
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:44 /user/hive/warehouse/departments
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:48 /user/hive/warehouse/order_items
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:51 /user/hive/warehouse/orders
drwxrwxrwx   - cloudera supergroup          0 2016-03-23 17:53 /user/hive/warehouse/products

Digging further into the categories directory and its children, we find that the table structure has also been transferred across :

hadoop fs -cat /user/hive/warehouse/categories/.metadata/schemas/1.avsc
{
  "type" : "record",
  "name" : "categories",
  "doc" : "Sqoop import of categories",
  "fields" : [ {
    "name" : "category_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_id",
    "sqlType" : "4"
  }, {
    "name" : "category_department_id",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "category_department_id",
    "sqlType" : "4"
  }, {
    "name" : "category_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "category_name",
    "sqlType" : "12"
  } ],
  "tableName" : "categories"
}

The data is stored in a .parquet file :

hadoop fs -ls /user/hive/warehouse/categories
Found 3 items
drwxr-xr-x   - cloudera supergroup          0 2016-03-23 17:35 /user/hive/warehouse/categories/.metadata
drwxr-xr-x   - cloudera supergroup          0 2016-03-23 17:38 /user/hive/warehouse/categories/.signals
-rw-r--r--   1 cloudera supergroup       1956 2016-03-23 17:38 /user/hive/warehouse/categories/a3db2e78-2861-4906-a769-e8035c03d7d2.parquet

There are a number of file formats you can use with Hadoop, each with their own pros and cons.
You can see a discussion of some of these formats here.

NOTE – the first time I ran this, I shut down the VM after I got to this point. When I started it again, Impala (see below) refused to see any databases. I had to trash and re-create the VM to get it to work. I’m not sure why this happened (I did have a hunt around) but just thought I’d mention it in case you’re thinking of doing the same.

Anyway, now we have data in Hadoop, it would be good if we could interrogate it…

Not Quite SQL – Hive and Impala

Cloudera comes with a Web based UI for Hadoop in the form of Hue.
Note that Hortonworks seem to be standardising on a different UI tool – Apache Ambari.

In the Cloudera VM, there is a link to Hue on the Bookmarks bar in Firefox.
Click on this link and then connect as cloudera (password cloudera).

The favoured query engine for interactive queries in Cloudera is called Impala.
I believe that, at the time of writing, Hortonworks are sticking with Hive.

As far as I can tell, it seems that Hive has been retained by Cloudera to handle what, in RDBMS terms, would be called the catalog.
In Oracle terms this would be the Data Dictionary.

This is essentially the metadata for the tables in the database.
This metadata seems to be read by all of the SQL Engines irrespective of where it is updated from.

Once you’re connected via Hue you can select either of these tools from the Query Editors drop-down.

Initially, the tutorial directs you to Impala.

The Data Dictionary

The first thing to note about Impala is that it doesn’t bother reading the table metadata unless you tell it to. Therefore, if you make any DML or DDL changes, you probably need to tell Impala to re-read the metadata for query results to pick up these changes.
For example, we know that we’ve got six tables in our database which we’ve created via SQOOP. However, if you ask Impala about it :

show tables;

…you get the not entirely helpful :

The operation has no results

By contrast, if you try this in Hive (Query Editors/Hive), the tables are all present and correct.

To persuade Impala to see these changes you need to run :

invalidate metadata;

Note that you can also run this command for individual tables should the need arise, e.g. :

invalidate metadata categories;
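
Incidentally, INVALIDATE METADATA discards and reloads all of the metadata for a table. If only the data files have changed (new files added to an existing table, say), the lighter-weight REFRESH statement should be enough :

refresh categories;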

Anyway, now we can see the tables in Impala, we can run some queries against them.
Whilst we’re at it, we can do a simple comparison between Impala and Hive in terms of how they process the same query.

Comparative performance

The query in question (taken from the Getting Started Tutorial provided with the VM) is :

-- Most popular product categories
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;

The syntax looks reassuringly familiar.

In Impala this ran in around 35 seconds.
I then ran the same query in Hive, which took about 7 minutes.

Once the query is run, Hue shows an Explain button. Click on this and you can see the execution plan for the query.
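
If you prefer to skip the button, both engines should also accept an EXPLAIN statement typed in front of the query itself, e.g. :

explain
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;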

In Hive the plan looks like this :

STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-3 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-3
  Stage-0 depends on stages: Stage-4

STAGE PLANS:
  Stage: Stage-9
    Map Reduce Local Work
      Alias -> Map Local Tables:
        c
          Fetch Operator
            limit: -1
        p
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        c
          TableScan
            alias: c
            Statistics: Num rows: 24 Data size: 2550 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: category_id is not null (type: boolean)
              Statistics: Num rows: 12 Data size: 1275 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col10 (type: int)
                  1 category_id (type: int)
        p
          TableScan
            alias: p
            Statistics: Num rows: 5737 Data size: 45896 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (product_id is not null and product_category_id is not null) (type: boolean)
              Statistics: Num rows: 1435 Data size: 11479 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 order_item_product_id (type: int)
                  1 product_id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: oi
            Statistics: Num rows: 206028 Data size: 1648231 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: order_item_product_id is not null (type: boolean)
              Statistics: Num rows: 103014 Data size: 824115 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 order_item_product_id (type: int)
                  1 product_id (type: int)
                outputColumnNames: _col3, _col10
                Statistics: Num rows: 113315 Data size: 906526 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col10 (type: int)
                    1 category_id (type: int)
                  outputColumnNames: _col3, _col20
                  Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col20 (type: string), _col3 (type: int)
                    outputColumnNames: _col20, _col3
                    Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(_col3)
                      keys: _col20 (type: string)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 124646 Data size: 997178 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
      Local Work:
        Map Reduce Local Work
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col1 (type: bigint)
              sort order: -
              Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col0 (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: bigint)
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 62323 Data size: 498589 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 10
            Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

The Impala plan looks a bit different :

Estimated Per-Host Requirements: Memory=4.16GB VCores=1
WARNING: The following tables are missing relevant table and/or column statistics.
default.categories, default.order_items, default.products

11:MERGING-EXCHANGE [UNPARTITIONED]
|  order by: count(order_item_quantity) DESC
|  limit: 10
|
06:TOP-N [LIMIT=10]
|  order by: count(order_item_quantity) DESC
|
10:AGGREGATE [FINALIZE]
|  output: count:merge(order_item_quantity)
|  group by: c.category_name
|
09:EXCHANGE [HASH(c.category_name)]
|
05:AGGREGATE
|  output: count(order_item_quantity)
|  group by: c.category_name
|
04:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: p.product_category_id = c.category_id
|
|--08:EXCHANGE [BROADCAST]
|  |
|  02:SCAN HDFS [default.categories c]
|     partitions=1/1 files=1 size=1.91KB
|
03:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: oi.order_item_product_id = p.product_id
|
|--07:EXCHANGE [BROADCAST]
|  |
|  01:SCAN HDFS [default.products p]
|     partitions=1/1 files=1 size=43.80KB
|
00:SCAN HDFS [default.order_items oi]
   partitions=1/1 files=1 size=1.57MB

As well as being somewhat shorter, the Impala plan appears more familiar to someone looking at an Oracle Query Plan.

To check for the possible effect of caching, I then ran these queries again.
For Impala, the runtime dropped to around 13 seconds.
For Hive, the runtime was the same as for the first run.

The Impala plan was unchanged, despite the fact that the runtime was drastically reduced.
From this behaviour I would infer that there is some caching effect for Impala, although a cache similar to the Buffer Cache in Oracle is not shared between the two Query Engines.
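
To dig into where Impala actually spends its time, impala-shell should be able to show the execution summary and the full runtime profile of the last statement you ran (a quick sketch, assuming you are running the query from impala-shell rather than Hue) :

summary;
profile;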

The explanation offered in the tutorial is that Hive compiles SQL queries into MapReduce jobs whereas Impala was designed as a SQL engine.
It would appear that Impala is better for small-scale interactive queries whereas Hive is more suited to large-scale ETL.

There’s a benchmarking comparison between Hive, Impala, and Spark that may be of interest here.

Optimizer Statistics

You’ll notice that, at the start of the Impala query plan, there’s a warning about missing statistics.

If you roll the mouse over one of the tables on the left-hand side of the Hue Query Editor Window, you’ll see an icon appear which enables you to explore the table’s metadata.

Looking at one of these tables we can see that there are no stats present :

[screenshot: table metadata showing no stats present]

We can fix this easily enough in Impala by gathering stats for each of our tables. For example :

compute stats categories;

If we now check the metadata for the CATEGORIES table, we can see that stats are present :

[screenshot: CATEGORIES table metadata with stats present]

If we now re-run the original query, the plan will no longer show the warning. However, the rest of the plan remains unchanged.
Given that these are quite small tables, this is probably not surprising.
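
You should also be able to inspect the gathered statistics directly in Impala :

show table stats categories;
show column stats categories;

The first shows row counts, file counts and sizes; the second shows the per-column distinct value estimates used by the planner.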

There’s more information about stats gathering in Impala here.

External Tables

One thing that Hive is apparently good for is creating External Tables.

In the tutorial, some unstructured data (a log file) is loaded and then external tables are created in Hive using the following code :


CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = '%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s')
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /usr/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;

Once again, it looks familiar for the most part.
One point of interest is the INSERT OVERWRITE command used to insert data into the table.
This has the effect of removing any pre-existing data in the table before inserting the new data.
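
For contrast, a plain INSERT INTO appends to whatever is already in the table, so the difference is roughly :

-- replaces the current contents of the target table
INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;

-- appends to the current contents instead
INSERT INTO TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;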

String functions and other SQL Stuff

We can query the new external table in Impala :

invalidate metadata tokenized_access_logs;
select * from tokenized_access_logs;

The values in the url column contain %20 characters instead of spaces.
Let’s see what Impala can do in terms of the standard SQL string manipulation functions.

Unlike Oracle, there’s no REPLACE function, there is however a REGEXP_REPLACE…

select regexp_replace(url, '%20', ' ')
from tokenized_access_logs;

When we run this, we can see that the REGEXP_REPLACE has done the job :

[screenshot: query output with the %20s replaced by spaces]

The logs we’re really interested in are where a specific product has been viewed.
If we can get the name of the product from the url, then maybe that will help when relating this data back to the most popular items in terms of sales…

select substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%';

When we run this, we can see that the INSTR function also works as expected…to a point. Note that the ‘/’ characters need to be escaped with a ‘\’.
The result looks like this :

[screenshot: product names extracted from the url column]

Now, there are a couple of records with the product name and “/add_to_cart” appended. With INSTR in Oracle, you can search for a pattern starting at the end of a string – e.g. :

select instr('/product/Pelican Sunstream 100 Kayak/add_to_cart', '/',-1,1)
from dual;

INSTR('/PRODUCT/PELICANSUNSTREAM100KAYAK/ADD_TO_CART','/',-1,1)
---------------------------------------------------------------
							     37

There is no such option in the Impala equivalent. It simply searches the string from the start and reports the first occurrence.
Fortunately, we want to strip out these results for the stuff we’re going to do in a minute. Therefore, I’ve just amended the query to be :

select substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%';
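
As an aside, if you ever do need the position of the last occurrence in Impala, one possible workaround (a sketch that leans on Impala's REVERSE, LOCATE and LENGTH functions) is to search the reversed string :

select length(url) - locate('/', reverse(url)) + 1 as last_slash_pos
from tokenized_access_logs
where url like '%\/product\/%';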

We can use fairly standard SQL to get a listing of the products by number of views :

select count(*), substr(regexp_replace( url, '%20', ' '), instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9)
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%'
group by url
order by 1 desc;

In the tutorial, it mentions that there is one product that has a lot of views but no sales. I wonder if we can find this just using SQL ?

with viewed_products as
(
select count(*) as times_viewed,
substr(regexp_replace( url, '%20', ' '),
       instr( regexp_replace(url, '%20', ' '), '\/product\/') + 9) as product_name
from tokenized_access_logs
where url like '%\/product\/%'
and url not like '%\/add_to_cart%'
group by url
)
select v.times_viewed, v.product_name
from viewed_products v
where upper(v.product_name) not in
(
  select upper(p.product_name)
  from products p
  inner join
  (
    select oi.order_item_product_id as product_id,
        count(oi.order_item_id) as units_sold
    from order_items oi
    inner join orders o
        on oi.order_item_order_id = o.order_id
    where o.order_status not in ('CANCELED', 'SUSPECTED_FRAUD')
    group by order_item_product_id
  ) s
    on p.product_id = s.product_id
);

OK, it’s not the most elegant SQL I’ve ever written but it does demonstrate that :

  • You can define in-line views using the WITH clause
  • You can use a NOT IN subquery

…better than that, it gives the correct answer :

adidas Kids' RG III Mid Football Cleat

In the tutorial, the reason for the lack of sales is a coding error in the product page. You can ask your own snarky question about Unit Testing practices here.

Going back to the csv files we were playing around with on hdfs earlier, how would we get those into our database ?

Go fly a Kite

Yes, it's another tool. This one is called Kite. Its purpose, in this case at least, is to help us create a table based on the tools.csv file and then make use of it in Impala.

First of all, I’ve created the tools.csv file on the local file system :

l tools.csv
-rw-rw-r-- 1 cloudera cloudera 303 Mar 26 17:21 tools.csv

Now to get kite to create a “table” definition based on the csv :

kite-dataset csv-schema tools.csv --class Tools -o tools.avsc
ls -l tools.*
-rw-rw-r-- 1 cloudera cloudera 373 Mar 26 17:22 tools.avsc
-rw-rw-r-- 1 cloudera cloudera 303 Mar 26 17:21 tools.csv

If we have a look at the new avro file that kite has generated, it looks similar in terms of format to the one that SQOOP generated earlier :

{
  "type" : "record",
  "name" : "Tools",
  "doc" : "Schema generated by Kite",
  "fields" : [ {
    "name" : "tool",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from 'hue'",
    "default" : null
  }, {
    "name" : "description",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from 'Web-based UI for Hadoop'",
    "default" : null
  } ]
}

The next step is to create the table metadata (in kite terminology this is a dataset)…

kite-dataset create tools -s tools.avsc

…and finally add the data itself to our table…

kite-dataset csv-import tools.csv tools
The url to track the job: http://localhost:8080/
Added 9 records to "tools"

To check in Impala, we can head over to Hue, select Impala from the Query Editors drop-down and run :

invalidate metadata tools;
select * from tools;

The result should look like this :

[screenshot: contents of the tools table returned by Impala]

Back on the command line, we can see that kite has added files to hdfs :

hadoop fs -ls /user/hive/warehouse/tools
Found 3 items
drwxr-xr-x   - cloudera supergroup          0 2016-03-26 17:32 /user/hive/warehouse/tools/.metadata
drwxr-xr-x   - cloudera supergroup          0 2016-03-26 17:35 /user/hive/warehouse/tools/.signals
-rw-r--r--   1 cloudera supergroup        622 2016-03-26 17:35 /user/hive/warehouse/tools/8baf8440-41b0-4f29-8750-16aeb3aec9b0.avro

The .avro file appears to have been compressed using the Snappy compression tool also used by SQOOP earlier.
You can however read the file by running :

hadoop fs -text /user/hive/warehouse/tools/8baf8440-41b0-4f29-8750-16aeb3aec9b0.avro
{"tool":{"string":"banana"},"description":{"string":"a web UI framework"}}
{"tool":{"string":"flume"},"description":{"string":"stream a file into Hadoop"}}
{"tool":{"string":"hive"},"description":{"string":"another query engine"}}
{"tool":{"string":"hue"},"description":{"string":"Web-based UI for Hadoop"}}
{"tool":{"string":"impala"},"description":{"string":"a query engine"}}
{"tool":{"string":"khafka"},"description":{"string":"a scheduler"}}
{"tool":{"string":"oozie"},"description":{"string":"another scheduler"}}
{"tool":{"string":"spark"},"description":{"string":"a query engine that is not hive or impala ?"}}
{"tool":{"string":"sqoop"},"description":{"string":"Transfer structured data between an RDBMS and Hadoop"}}

Summary

The Getting Started Tutorial goes on to cover various other tools available in the Cloudera distribution for doing data manipulation and analysis.
Additionally, you may find this presentation on Hadoop internals for Oracle Devs by Tanel Poder useful.

For now though, I have enough to keep my head above water.


Filed under: Oracle, SQL Tagged: Hadoop, hdfs, Hive, Impala, Kite, SQOOP

Customer Service Thinking

Floyd Teter - Mon, 2016-03-28 11:28
My definition of good customer service is pretty simple:  deliver what you promise.  Period.  You can over-deliver and still make me happy.  But other than that, it's pretty simple.  Tell me what you promise you'll do.  If I sign up for the service, then deliver that service:  on-time, at the price you promised, and make it as easy for me as you promised.  That's it.

Earlier this week, I encountered back-to-back customer service failures with my preferred airline.  The first fail took place when winter weather struck on the runway...waited in the plane on the runway for two hours because the airline failed to have the wing de-icers ready to go.  The second failure occurred when the plane failed a brake inspection prior to boarding; lost two hours and rebooked a later flight.

In both cases, the airline did quite well in providing details for the cause of the delays and expressed profound apologies.  But here is a tip for the airline:  that is not customer service.  You blew it when you failed to deliver your service on-time at the price you promised.  Providing details and status is about mitigating the damages from your failure to provide customer service...it's not good customer service.  In fact, the line defining a customer service failure has already been crossed.

One more customer experience failure this week; I went to my favorite hardware store to shop for a few tools and place a larger order for some building materials (more house remodeling).  The tool shopping went well, but it took two hours to place the order for materials.  The cause of the delay?  Not one of the five service representatives knew how to enter the order into the store's order entry system.  An utter failure of customer service due to a lack of effective training for the team that works with customers.

Now you may be thinking that I've just had a crummy week and I'm using this forum to vent.  On the contrary, it was a good week...because this experience got me to thinking.

You see, Software-as-a-Service is not just about software hosted on a public cloud.  It's about delivering a service.  Telling customers what you promise to do.  Then delivering on that promise.  Including provisioning, implementing, and support...all the activities and exchanges that go into the "Service".

We see too many service fails in the SaaS world.  All the time.  Every day.  Regardless of software vendor.  The industry is still working through the transition from thinking about providing software applications to providing a service...a much, much wider scope of responsibility to our customers.  We need to up the standard - quickly!

Thoughts?  Find the comments.

Video : XMLTABLE : Convert XML into Rows and Columns using SQL

Tim Hall - Mon, 2016-03-28 11:21

Here’s an Easter present for everyone out there.
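
If you want a taste of the subject before watching, here is a minimal sketch of the sort of thing XMLTABLE does, shredding a small XML document into rows and columns :

select x.emp_name
from   xmltable('/employees/employee'
         passing xmltype('<employees><employee><name>Scott</name></employee><employee><name>Kim</name></employee></employees>')
         columns emp_name varchar2(30) path 'name') x;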

A Designer’s Introduction to Oracle JET

I’m thrilled to be presenting at Collaborate 2016 with my colleague John Sim, on the recently open-sourced Oracle JET! We front-end developers had been seeking a better UI/UX solution from Oracle for quite some time, and they have delivered in a big way.

[image: charts built with Oracle JET]

Oracle JavaScript Extension Toolkit (JET) is a powerful and intuitive framework that provides a modular toolkit for front-end developers. Rather than re-inventing the wheel, Oracle JET integrates tried and tested open-source frameworks and libraries like jQuery, KnockoutJS, and RequireJS with JET-specific pre-built UI components. Oracle JET is a flexible client-side framework that can integrate with a variety of back-end services. The combination makes for speedy assembly of client-side applications that integrate with Oracle products and services, especially Oracle Cloud services.

[image: open-source libraries integrated by Oracle JET]

While Java may be the leading language in the industry, JavaScript has quickly gained popularity, especially for developing client-side user interactions. JET was initially built for Oracle in-house use, but Oracle customers and partners continued to ask for JavaScript-based UI development that could interact with existing Oracle platforms. JET provides the capabilities and flexibility of JS UI development, while complying with Oracle standards (internationalization and accessibility, for example) for product delivery.

Part of the beauty of JET is its modularity. It allows developers to use as much or as little as they need for a particular project. In addition, different libraries can be incorporated. As JS libraries evolve and new frameworks are developed, the idea is that they can be brought in as well. Oracle JET's flexibility ensures that it can change with the JS development world.

Our session is targeted at front-end developers looking to leverage their JavaScript, CSS3, and HTML5 front-end design skills to create modular single page applications with JET. We'll include an overview of the JET framework, and discuss the features and benefits along with a quick comparison with ADF (Application Development Framework). We will walk through and explain JET's integrated open source libraries and frameworks, and then take developers through the creation of a simple JET application using Oracle DoCS REST API.

Please join us at 10:30AM on April 11 in Jasmine D for a dive into this flexible, open-source(!) and exciting new JS framework, and please come see all of the Fishbowl Solutions gang at Booth #1028!

The post A Designer’s Introduction to Oracle JET appeared first on Fishbowl Solutions' C4 Blog.

Categories: Fusion Middleware, Other

Amazon Database Migration Service – first try

Pythian Group - Mon, 2016-03-28 09:04

Recently, while checking Amazon Web Services, I stumbled upon a service I hadn't tested before: the Database Migration Service (DMS). I read the documentation and checked other resources, and found a good, fresh blog post, AWS Database Migration Service, written by Jeff Barr. It was really interesting, and I decided to give the service a try.

I created an Oracle RDS instance on AWS as the target, and an Oracle Linux box on Azure with Oracle 12c EE as the source database for the migration. The source database SID was "test" and the destination was "orcl". I created tablespaces and users named "testmig" on both sides, and created a table on the source database. Initially I loaded 1000000 records into the table and created an index. The schema on the destination database was empty. I also enabled archivelog mode on the source database.

Creating user and table on the source:

test> create user testmig identified by welcome1 default tablespace testmig temporary tablespace temp;

User created.

test> grant connect,resource to testmig;

Grant succeeded.

test> conn test

test> create table test_tab_1 (pk_id number, rnd_str_1 varchar2(15),use_date date,rnd_str_2 varchar2(15), acc_date date);

Table created.

test>

Loading the data:

[oracle@oradb1 patchdepot]$ head test_tab_1.dat
340,MLBO07LV,10/30/13 15:58:04,NABCFVAQ,12/08/17 18:22:48
341,M48R4107,12/09/13 12:30:41,ACA79WO8,12/15/16 08:13:40
342,KARMF0ZQ,04/21/14 08:53:33,JE6SOE0K,06/18/17 07:12:29
343,8NTSYDIS,11/09/14 23:41:48,FBJXWQNX,08/28/15 20:47:39
344,0LVKBJ8T,09/28/12 06:52:05,VBX3FWQG,10/28/15 06:10:42
345,Z22W1QKW,06/06/13 11:14:32,26BCTA9L,08/21/17 08:35:15
346,CGGQO9AL,08/27/14 02:37:41,15SRXZSJ,11/09/17 19:58:58
347,WKHINIUK,07/02/13 14:31:53,65WSGVDG,08/02/15 10:45:50
348,HAO9X6IC,11/17/12 12:08:18,MUQ98ESS,12/03/15 20:37:20
349,D613XT63,01/24/15 16:49:11,3ELW98N2,07/03/16 11:03:40
[oracle@oradb1 patchdepot]$ export NLS_DATE_FORMAT="MM/DD/YY HH24:MI:SS"
[oracle@oradb1 patchdepot]$ sqlldr userid=testmig table=test_tab_1
Password:

SQL*Loader: Release 12.1.0.1.0 - Production on Wed Mar 16 13:07:50 2016

Copyright (c) 1982, 2013, Oracle and/or its affiliates.  All rights reserved.

Express Mode Load, Table: TEST_TAB_1
Path used:      External Table, DEGREE_OF_PARALLELISM=AUTO

Table TEST_TAB_1:
  100000 Rows successfully loaded.

Check the log files:
  test_tab_1.log
  test_tab_1_%p.log_xt
for more information about the load.
[oracle@oradb1 patchdepot]$

On the target system:

rdsorcl> create tablespace testmig;

Tablespace TESTMIG created.

rdsorcl> create user testmig identified by welcome1 default tablespace testmig;

User TESTMIG created.

rdsorcl>

In the blog post mentioned above, the migration was done without replication, and I was curious to test it with some ongoing DML activity on the source database. I set up a Linux box with JMeter and started a load at a pace of about 15 transactions per second. The transactions were inserts and updates on the created table.

Everything was working fine so far, so I switched to the Database Migration Service on AWS. The service has a pretty easy and clear workflow. You just need to push the "Create migration" button and it will guide you through the process. In general, you need to create a replication instance, endpoints for the source and target, and a task to start the initial load and replication.

I created a replication instance and, while it was being created (it took some time), I was asked to set up endpoints for the source and target. I hit the first issue when I tried to use a DNS name for my Azure instance. The test connection was failing with a timeout, and it was not clear where the problem was: it could have been either a connection or a DNS problem. The issue was solved by providing an IP address instead of the domain name for my Azure instance.
[screenshot: DMS source endpoint test connection]
The test for the target endpoint failed with the same timeout, but the reason was totally different. It was not DNS, but rather a connection issue. At first I couldn't figure it out, because I was able to connect to my RDS instance from my laptop using the server name and port, yet the test endpoint in DMS was not working. Eventually I figured out that the problem was in the security groups for the RDS endpoint. By default the AWS RDS instance was created with a security group allowing connections from outside, but somehow restricting connections from DMS. I changed the security group for AWS RDS to "default" and was able to successfully test the endpoint in DMS.

The next step was to create a task. I created a task with an initial load and ongoing replication for my testmig schema. The task was supposed to drop any tables on the target (you can choose truncate instead if you want), create objects, move the data, and keep replicating until cutover day, when you will be able to switch your applications to the new database. It will tell you that you need to set up supplemental logging for replication. Unfortunately, it doesn't tell you what kind of supplemental logging you have to set up.

So, I enabled minimal data supplemental logging on my Azure test instance.

test> alter database add supplemental log data;
Database add SUPPLEMENTAL altered.

test> exec dbms_capture_adm.prepare_table_instantiation('testmig.test_tab_1','keys')

PL/SQL procedure successfully completed.

test>

It was not enough, and I got an error. By default you don't get any logging for your task, only configuration details and statistics about the replicated and loaded objects, so if you get an error it is not clear where to look. I enabled supplemental logging for the primary key on my replicated table and recreated the task, this time ticking the logging checkbox. I got the error again, but now I had a log and was able to see what was causing the issue.

2016-03-16T19:41:11 [SOURCE_CAPTURE  ]I:  Oracle compatibility version is 12.1.0.0.0  (oracle_endpoint_conn.c:86)
2016-03-16T19:41:11 [SOURCE_CAPTURE  ]I:  Oracle capture start time: now  (oracle_endpoint_capture.c:701)
2016-03-16T19:41:12 [SOURCE_CAPTURE  ]I:  New Log Miner boundaries in thread '1' : First REDO Sequence is '4', Last REDO Sequence is '4'  (oracdc_reader.c:589)
2016-03-16T19:41:18 [SOURCE_UNLOAD   ]W:  Supplemental logging is not defined for table with no key 'TESTMIG.TEST_TAB_1'  (oracle_endpoint_utils.c:831)
2016-03-16T19:41:18 [SOURCE_UNLOAD   ]E:  Supplemental logging for table 'TESTMIG.TEST_TAB_1' is not enabled properly [122310] Supplemental logging is not correct (oracle_endpoint_unload.c:245)
2016-03-16T19:41:18 [SOURCE_UNLOAD   ]I:  Unload finished for table 'TESTMIG'.'TEST_TAB_1' (Id = 1). 0 rows sent.  (streamcomponent.c:2567)
2016-03-16T19:41:18 [SOURCE_UNLOAD   ]E:  Failed to init unloading table 'TESTMIG'.'TEST_TAB_1' [122310] Supplemental logging is not correct (oracle_endpoint_unload.c:441)

It looked like my supplemental logging was still not enough. So, I added supplemental logging for all columns and for the entire testmig schema, then recreated the task and started it again.

test> exec dbms_capture_adm.prepare_table_instantiation('testmig.test_tab_1','all');
PL/SQL procedure successfully completed.

test> exec dbms_capture_adm.prepare_schema_instantiation('testmig');
PL/SQL procedure successfully completed.

test>
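
To double-check what is actually enabled on the source before restarting the task, the usual dictionary views can be queried, for example:

select supplemental_log_data_min, supplemental_log_data_pk, supplemental_log_data_all
from v$database;

select owner, table_name, log_group_type
from dba_log_groups
where owner = 'TESTMIG';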

This time it worked fine and was able to perform the initial load.

2016-03-16T19:49:19 [SOURCE_CAPTURE  ]I:  Oracle capture start time: now  (oracle_endpoint_capture.c:701)
2016-03-16T19:49:20 [SOURCE_CAPTURE  ]I:  New Log Miner boundaries in thread '1' : First REDO Sequence is '4', Last REDO Sequence is '4'  (oracdc_reader.c:589)
2016-03-16T19:49:31 [SOURCE_UNLOAD   ]I:  Unload finished for table 'TESTMIG'.'TEST_TAB_1' (Id = 1). 100723 rows sent.  (streamcomponent.c:2567)
2016-03-16T19:49:31 [TARGET_LOAD     ]I:  Load finished for table 'TESTMIG'.'TEST_TAB_1' (Id = 1). 100723 rows received. 0 rows skipped. Volume transfered 45929688  (streamcomponent.c:2787)

What about ongoing changes? Replication kept running and the tables stayed in sync. The replication lag in my case was minimal, but keep in mind that it was just one table with a low transaction rate. At the end I switched my load to the AWS RDS database, then stopped and deleted the DMS task. The migration was complete. I compared the data in the tables by running a couple of simple checks on counts and rows, and by running one table "minus" the other. Everything was fine.

rdsorcl> select max(pk_id) from testmig.test_tab_1;

      MAX(PK_ID)
----------------
         1000843

rdsorcl> select * from testmig.test_tab_1 where pk_id=1000843;

           PK_ID RND_STR_1       USE_DATE                    RND_STR_2       ACC_DATE
---------------- --------------- --------------------------- --------------- ---------------------------
         1000843 OUHRTHQ8        02/11/13 07:27:44           NFIAODAU        05/07/15 03:49:29

rdsorcl>

----------------

test> select max(pk_id) from testmig.test_tab_1;

      MAX(PK_ID)
----------------
         1000843

test> select * from testmig.test_tab_1 where pk_id=1000843;

           PK_ID RND_STR_1       USE_DATE                    RND_STR_2       ACC_DATE
---------------- --------------- --------------------------- --------------- ---------------------------
         1000843 OUHRTHQ8        02/11/13 07:27:44           NFIAODAU        05/07/15 03:49:29

test>

test> select count(*) from (select * from test_tab_1 minus select * from test_tab_1@rdsorcl);

        COUNT(*)
----------------
               0

test>

A summary of DMS:

    • You may need to adjust security groups for the target RDS or EC2 systems; otherwise they may block connections.
    • It is better to use an IP address for source endpoints, since DNS may not be reliable.
    • Enable logging when you create the task.
    • If you enable replication from an Oracle database, you have to set up full supplemental logging for the replicated schemas on your source system (see the sketch after this list).
    • You need a basic knowledge of replication and how it works in order to understand and fix any errors.
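
As referenced in the list above, here is a minimal sketch of setting up full supplemental logging on the source, either per table or database-wide (an alternative to the DBMS_CAPTURE_ADM calls used earlier; the database-wide option generates more redo, so check the impact first):

-- per table
alter table testmig.test_tab_1 add supplemental log data (all) columns;

-- or for the whole database
alter database add supplemental log data (all) columns;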

Next time I will try heterogeneous replication from MySQL to Oracle and then the other way around.

Categories: DBA Blogs

Visit Our Gadget Lab Virtually over the Web

Oracle AppsLab - Mon, 2016-03-28 08:59

This is 2016, and it seems this is the year for VR. Of course, we at the AppsLab can't miss the beat!

While Osvaldo (@vaini11a) started to look into Unity-based VR capabilities and prototypes, I wanted to take a look at the WebVR-based approach. The prospect of delivering a VR experience in a browser, over the web, suddenly makes VR much more accessible – WebVR can be designed to work with or without a VR headset. In a sense it is an extension of the responsive web, adjusting gracefully to different renderers/viewers.

Longtime friend of the 'Lab, Ed Jones (@edhjones), gave us a shout about A-Frame, which was released very recently. As its tagline puts it, it provides "building blocks for the virtual reality web", using markup and JavaScript! Now it makes creating VR content so much more accessible too.

The first thing that came to my mind was to VR-enable some of our visualization demos, and I picked John's Transforming Table for the first try. After a series of hacks, I got a half-baked, non-functional result; if we had an Oculus Rift, we could get it working. A-Frame is at a very early stage, and there is still a lot to be desired.

I realized I needed a perspective change – instead of trying to fit the existing presentation and behavior into VR, WebVR/A-Frame is better suited to creating a new presentation and behavior that blend with VR naturally.

Jake (@jkuramot) and Noel (@noelportugal) had just come back from a trip to Sydney, and told us a story about someone following our team and reading our posts, from far far away – the other side of planet

MySQL Memory Usage Docs Get a FaceLift

Pythian Group - Mon, 2016-03-28 08:59

The MySQL Documentation team recently gave these docs on how MySQL uses memory a much-needed face-lift. The new page provides a much clearer overview of how MySQL allocates memory, and adds many helpful links for digging deeper.

For instance, if you weren’t aware of how Performance Schema memory utilization changed in 5.7, there is this helpful paragraph (emphasis mine):

The MySQL Performance Schema is a feature for monitoring MySQL server execution at a low level. As of MySQL 5.7, the Performance Schema dynamically allocates memory incrementally, scaling its memory use to actual server load, instead of allocating required memory during server startup. Once memory is allocated, it is not freed until the server is restarted. For more information, see Section 22.14, “The Performance Schema Memory-Allocation Model”.

Therefore, if you are starting a new project on MySQL 5.7, or upgrading an existing environment, and you have the Performance Schema enabled, you might see your memory footprint rising inexplicably. According to the linked Performance Schema Memory-Allocation Model documentation, one reason might be the auto-scaling Performance Schema variables:


performance_schema_accounts_size
performance_schema_hosts_size
performance_schema_max_cond_instances
performance_schema_max_file_instances
performance_schema_max_index_stat
performance_schema_max_metadata_locks
performance_schema_max_mutex_instances
performance_schema_max_prepared_statements_instances
performance_schema_max_program_instances
performance_schema_max_rwlock_instances
performance_schema_max_socket_instances
performance_schema_max_table_handles
performance_schema_max_table_instances
performance_schema_max_table_lock_stat
performance_schema_max_thread_instances
performance_schema_users_size

Of course, you can limit each variable by supplying a value to prevent autoscaling beyond a point.
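
To see where that memory is actually going before deciding on limits, one option (assuming the sys schema is installed, which it should be by default on 5.7) is a query like:

select event_name, current_alloc
from sys.memory_global_by_current_bytes
where event_name like 'memory/performance_schema/%';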

There might be some areas missing, such as explicit MEMORY tables, but by and large it is a vast improvement.

Other honorable mentions where I've seen documentation updates include Limiting memory utilization of range optimizations and Configuring innodb_buffer_pool_size.

Happy reading!

Categories: DBA Blogs

Log Buffer #466: A Carnival of the Vanities for DBAs

Pythian Group - Mon, 2016-03-28 08:47

This Log Buffer Edition covers weekly round up of blog posts from Oracle, SQL Server and MySQL.

Oracle:

The Universal Theme introduced with APEX 5 is immensely good looking and comes with a powerful Theme Roller to customize it.

The implementation of Ksplice has been greatly simplified. Now you just need to register your system(s) with Unbreakable Linux Network (ULN), subscribe to the appropriate Ksplice channel, use the yum command to install the uptrack package, and perform any custom configuration. Your systems will be automatically updated with the latest kernel and user space patches.

Every business book you read talks about delegation. It’s a core requirement for successful managers: surround yourself with good people, delegate authority and responsibility to them, and get out of their way.

Accelerating SQL Queries that Span Hadoop and Oracle Database

Oracle Big Data SQL 3.0 adds support for Hortonworks Data Platform and commodity clusters

SQL Server:

Instant File Initialization : Impact During Setup

Enumerate Windows Group Members

How to execute an SSIS package from the command line or a batch file

When AUTO_UPDATE_STATISTICS Doesn’t Happen

SQL Server Table Smells

MySQL:

MySQL replication primer with pt-table-checksum / pt-table-sync, part 2

How do you dig down into the JSON data, say like in comments on a blog post?

Percona XtraBackup 2.3.4 is now available

Connection timeout parameters in MySQL

What have we learnt in two decades of MySQL?

Categories: DBA Blogs

Links for 2016-03-27 [del.icio.us]

Categories: DBA Blogs

GNW01: In-Memory Processing for Databases

Tanel Poder - Mon, 2016-03-28 00:39

Hi, it took a bit longer than I had planned, but here’s the first Gluent New World webinar recording!

You can also subscribe to our new Vimeo channel here – I will announce the next event with another great speaker soon ;-)

A few comments:

  • Slides are here
  • I’ll figure a good way to deal with offline follow-up Q&A later on, after we’ve done a few of these events

If you like this stuff, please share it too – let’s make this series totally awesome!

 

 

NB! If you want to move to the "New World" - offload your data and workloads to Hadoop, without having to re-write your existing applications - check out Gluent. We are making history! ;-)

Create a PrimaryKey on existing table without unique values

Learn DB Concepts with me... - Mon, 2016-03-28 00:25
Let's assume that there is a table ABC already existing in the database, and you want to add an additional column with unique primary key values.


SQL to create table ABC:

  CREATE TABLE "ABC"
   (    "USERNAME" VARCHAR2(30 BYTE) NOT NULL ENABLE,
    "USER_ID" NUMBER NOT NULL ENABLE,
    "CREATED" DATE NOT NULL ENABLE )
   TABLESPACE "USERS" ;

Now we can add an additional column, ID, which will be populated with unique values for the primary key.

alter table abc add(ID NUMBER);

We will now create a sequence, get values from it, and insert them into the new ID column of table ABC:

CREATE SEQUENCE SEQ_ID
START WITH 1
INCREMENT BY 1
MAXVALUE 999999
MINVALUE 1
NOCYCLE;

Now populate the new column with unique values using the SQL below:
UPDATE abc SET ID = SEQ_ID.NEXTVAL;


Now you can add a unique constraint or a primary key constraint to the table, so that it won't accept any more duplicate values.

alter table abc add primary key (ID);
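
Note that the sequence above only back-fills the existing rows. If you also want future inserts to pick up an ID automatically, a simple before-insert trigger will do it. A sketch (the trigger name is just an example; on releases older than 11g, use SELECT SEQ_ID.NEXTVAL INTO :NEW.ID FROM DUAL instead of the direct assignment):

CREATE OR REPLACE TRIGGER TRG_ABC_ID
BEFORE INSERT ON ABC
FOR EACH ROW
WHEN (NEW.ID IS NULL)
BEGIN
  -- populate ID from the sequence for any row inserted without one
  :NEW.ID := SEQ_ID.NEXTVAL;
END;
/
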
Categories: DBA Blogs
