Feed aggregator

The ten commandments of IT projects

Steve Jones - Mon, 2016-08-01 13:42
And lo a new project did start and there was much wailing and gnashing of teeth, for up on the board had been nailed ten commandments that the project must follow and the developers were sore afraid. Thou shalt put everything in version control, yeah even the meeting minutes, presentations and "requirements documents that aren't even finished yet" for without control everything is chaos Thou
Categories: Fusion Middleware

Named Parameters w/REF CURSOR

Tom Kyte - Mon, 2016-08-01 12:46
Hi Tom, This question is about the named parameter syntax. I've successfully called functions using this method: v_order_id := CREATE_CUSTOMER_ORDER ( p_customer_id => p_customer_id, p_total_amount =...
Categories: DBA Blogs

HASH JOIN Vs SORT MERGE

Tom Kyte - Mon, 2016-08-01 12:46
Hello Asktom Team, Hope you could throw some light on this. We have two Oracle 11g instances where one MERGE statement has radically different execution plans. One finishes in less than a minute and one times out with ORA-01652: unable to exten...
Categories: DBA Blogs

Distribution of DOP

Tom Kyte - Mon, 2016-08-01 12:46
I'm using Oracle 12c. I'm calling multiple procedures from Python in parallel (usually 4-5). I make sure nothing else is running at that time. All the target and source tables have a PARALLEL 32 clause. I'm using hints ENABLE_PARALLEL_DML, PARALLEL in...
Categories: DBA Blogs

Count groups when next value = current value plus one

Tom Kyte - Mon, 2016-08-01 12:46
I have this table: create table test (myname varchar2(1), dts_id number); With this data: insert into test (myname, dts_id) values ('A',1); insert into test (myname, dts_id) values ('A',2); insert into test (myname, dts_id) values ('A',3); in...
Categories: DBA Blogs

Pivot table row to column convert

Tom Kyte - Mon, 2016-08-01 12:46
Hi Tom, Please consider below scenario. Table A ID Attribute Value 1 'X' 3 1 'Y' 4 2 'X' 5 3 'X' 6 3 'Y' 7 3 'Z' 8 3 'A' 9 Expected output ID X Y Z A 1 3 4 nul...
Categories: DBA Blogs

Please give the SQL to get query performance in an Oracle database; I don't want to use Statspack.

Tom Kyte - Mon, 2016-08-01 12:46
I was a SQL Server DBA for over a year and am now an Oracle DBA, so I used to run the query below against a SQL Server database to get query performance. SELECT TOP 5 ObjectName = OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.obje...
Categories: DBA Blogs

DBMS_DATAPUMP

Tom Kyte - Mon, 2016-08-01 12:46
Can one export and import data only using dbms_datapump? I do not want to specify a schema on the export nor do I want it to default to the user's schema. I do not want to specify a "from" schema on the import. Is this possible?
Categories: DBA Blogs

Email attachment but not able to open it

Tom Kyte - Mon, 2016-08-01 12:46
I am trying to send mail with a PDF file attachment. I receive the mail, but when I try to open the PDF file it returns an error message. Please help, I am stuck. DECLARE v_From VARCHAR2(80) :='test@ttt.net'; v_Recipient VARCHAR2(80) := 'r...
Categories: DBA Blogs

IO throughput problem

Tom Kyte - Mon, 2016-08-01 12:46
Dear Sirs, My question is about a throughput/bandwidth problem. We have a data warehouse solution, database size around 1 TB, with nightly/morning run times around 15 hours. The optimal performance would be that users can have their daily generated repo...
Categories: DBA Blogs

Oracle Buys NetSuite

OracleApps Epicenter - Mon, 2016-08-01 11:26
Oracle set the ball rolling with the NetSuite acquisition. In his Oracle OpenWorld keynote on cloud innovation, Mark Hurd (Oracle CEO) predicted that by 2025 "two software-as-a-service (SaaS) suite providers will have 80 percent of the cloud enterprise application market. I volunteer us to be one of them." Everything came full circle with NetSuite last week, when Oracle […]
Categories: APPS Blogs

Oracle 12c New Features for Developers

Complete IT Professional - Mon, 2016-08-01 06:00
Oracle 12c is the latest version of Oracle's database engine. Learn about all of the Oracle 12c new features for developers in this massive guide. Introduction Oracle 12c comes with a range of new features. Many of these features are great improvements for developers, and some of them are more focused on database administrators. […]
Categories: Development

PeopleSoft PUBLIC User Security

PeopleSoft Public users are not required to authenticate (sign on). These are generic accounts created for specific purposes, for example, informational pages or company directories. Public users are also not subject to session inactivity timeouts. Because no authentication is required, no sensitive data should be accessible to these users. It also goes without saying that if you don't need Public accounts, you should not use them.

When performing a PeopleSoft security audit, Integrigy identifies Public users and analyzes their authorization privileges. To do this yourself, use the SQL below to list your public users and then query the application or database to look at their authorization privileges.

--List the public users
SELECT O.OPRID, O.OPRDEFNDESC, O.ACCTLOCK, O.LASTPSWDCHANGE, O.FAILEDLOGINS, O.ENCRYPTED, O.EMPLID
FROM SYSADM.PSWEBPROFILE P, SYSADM.PSOPRDEFN O
WHERE P.BYPASSSIGNON = 'Y'
AND P.DEFAULTUSERID = O.OPRID;
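
To then review what those accounts are allowed to do, a query along the following lines can be used as a starting point. This is only a sketch: it assumes the delivered PSROLEUSER table that maps users to roles, and a full audit would also drill down into the permission lists behind each role.

--List roles granted to the public users (sketch; assumes the delivered PSROLEUSER table)
SELECT O.OPRID, R.ROLENAME
FROM SYSADM.PSWEBPROFILE P, SYSADM.PSOPRDEFN O, SYSADM.PSROLEUSER R
WHERE P.BYPASSSIGNON = 'Y'
AND P.DEFAULTUSERID = O.OPRID
AND R.ROLEUSER = O.OPRID
ORDER BY O.OPRID, R.ROLENAME;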

If you have questions, please contact us at info@integrigy.com

Michael A. Miller, CISSP-ISSMP, CCSP

References

PeopleSoft Security Quick Reference

Auditing, Oracle PeopleSoft
Categories: APPS Blogs, Security Blogs

What's new in IaaS

Pat Shuff - Mon, 2016-08-01 02:07
After taking the month of July off (don't ask), we are going to restart our discussion in August with a focus on infrastructure. The best place to start is to look at what has changed in June and July, and the best place to do that is the documentation's What's New page, which lists the new features. Over the next few weeks we will dive into some of these changes and look at how to capitalize on them. We will compare and contrast these features with features available from Amazon and Azure as usual. We will start with Orchestrations, since this is probably the biggest-impact item for provisioning services as well as for scaling and controlling instances. The goal for August is to dive deeper into the infrastructure and look not only at how you can leverage it in the Oracle Cloud, but also at how you can utilize it with the Oracle Cloud Machine if you have one installed in your data center. All of the features and functionality detailed in the IaaS updates are made available in the Oracle Cloud Machine a month or two after their release in the public cloud.

File Upload Improvements in Apex 5.1

Jeff Kemp - Sun, 2016-07-31 22:28

Warning: this is based on the Apex 5.1 Early Adopter and details may change.

[Image: file_upload_5_1_ea]

The standard File Upload item type is getting a nice little upgrade in Apex 5.1. By simply changing attributes on the item, you can allow users to select multiple files (from a single directory) at the same time.

In addition, you can now restrict the type of file they may choose, according to the MIME type of the file, e.g. image/jpeg. This file type restriction can use a wildcard, e.g. image/*, and can have multiple patterns separated by commas, e.g. image/png,application/pdf.

[Image: file_upload_5_1_ea_demo]

Normally, to access the file that was uploaded you would query APEX_APPLICATION_TEMP_FILES with a predicate like name = :P1_FILE_ITEM. If multiple files are allowed, however, the item will be set to a comma-delimited list of names, so the suggested code to get the files is:

declare
  arr apex_global.vc_arr2;
begin
  arr := apex_util.string_to_table(:P1_MULTIPLE_FILES);
  for i in 1..arr.count loop
    select t.whatever
    into   your_variable
    from   apex_application_temp_files t
    where  t.name = arr(i);
  end loop;
end;

You can play with a simple demo here: https://apexea.oracle.com/pls/apex/f?p=UPLOAD_DEMO&cs=JK64 (login as demo / demodemo).

If you want to support drag-and-drop, image copy&paste, load large files asynchronously, or restrict the maximum file size that may be uploaded, you will probably want to consider a plugin instead, like Daniel Hochleitner’s DropZone.


Filed under: APEX Tagged: APEX, apex-5.1

Test Driven Development and PL/SQL – The Odyssey Begins

The Anti-Kyte - Sun, 2016-07-31 14:58

In the aftermath of the Brexit vote, I’m probably not alone in being a little confused.
Political discourse in the UK has focused on exactly who it was who voted to Leave.
The Youth Spokesperson you get on a lot of political programs right now will talk accusingly of older voters "ruining" their future by opting to Leave.
Other shocked Remainers will put it down to people without a University Degree.
I’m not sure where that leaves me as someone who is ever so slightly over the age of 30, does not have a degree…and voted to Remain. I’m pretty sure I’m not Scottish…unless there’s some dark family secret my parents haven’t let me in on.
I suppose I must be a member of the “Metropolitan Elite” the Leave side was always muttering darkly about.
After all, I do pay a great deal of money to be driven from my country residence to London to work every day…although I do have to share the train with the odd one or two fellow elitists who’ve made the same extravagant choice.
This does of course assume that Milton Keynes qualifies as being “in the country” and that my living there is a matter of choice rather than a question of being able to afford living any closer to London.

With all the excrement…er…excitement of the Referendum Campaign and its aftermath, I somehow never got around to writing my application to track the progress of the Euros (or the Copa America for that matter).
Whenever a major football tournament comes around, I always resolve to do this, if only to evoke memories of my youth when a large part of my bedroom wall was taken up with a World Cup Wallchart where you could fill in the results as they happened. That’s without mentioning the months leading up to the tournament and trying to complete the Panini collection – the only time you’d ever hear a conversation such as “OK, I’ll let you have Zico in exchange for Mick Mills”.

In order to prevent this happening again, I’ve resolved to write an application capable of holding details of any major international football tournament.
In the course of writing this application, I’d like to take the opportunity to have a look at an aspect of PL/SQL development that maybe isn’t as commonly used as it should be – Unit Testing.

Over the next few weeks, I plan to take a look at some of the Testing Frameworks available for PL/SQL and see how they compare.
The objective here is not so much to find which framework is the best/most suitable, but to perform an objective comparison between them using the same set of tests which implement fairly commonly encountered functionality.

If you’re looking for recommendations for a framework, then this article by Jacek Gebal is probably a good place to start.

In this post, I’ll be outlining the functionality that I’ll be testing in the form of User Stories, together with the application data model (or at least, the bit of it I need to execute the tests).
I’ll also have a look at the common pattern that tests written in these frameworks tend to follow.
Just to highlight why using a Test Framework might be useful, I’ll also script a couple of simple tests in SQL to see just how much code you have to write to implement tests without using a framework.

Unit Testing Approach

Taking the purist approach to Test-Driven Development, we’d need to :

  1. Write the test first and ensure that it fails
  2. Write the minimum amount of code required for the test to pass
  3. Run the test to make sure it passes

Additionally, we’d need to make sure that the tests were independent of each other – that the execution of one test is not dependent on the successful execution of a prior test.

Following this approach to the letter would cause one or two issues.
Firstly, if the procedure you're testing does not exist, your test will not run and fail. It will error.
As the point of this step is, essentially, to ensure that your test code is sound (i.e. it won’t pass irrespective of the code it runs against), this is not what we’re after.
The second issue is specific to PL/SQL.
When defining PL/SQL procedures that interact with database tables, it’s usually a good idea to use anchored declarations where appropriate.
Even if we write a stub procedure, if the tables it will interact with do not exist, we’d have to use native types for our parameters and update the signature of the procedure once the tables had been created.
There is always the danger that this additional step would be missed.

So, in terms of PL/SQL then, I’d suggest that the pre-requisites for writing our test are :

  • The data model components (tables, views, RI constraints) that the procedure will interact with
  • A stub procedure with correctly typed parameters

Many testing frameworks seem to adopt four basic steps for each test, some of which are optional. They are :

  1. Setup – put the application into a known state from which the test can run
  2. Execute – run the code that you want to test
  3. Verify – check that what you expected to happen did happen
  4. Teardown – return the system to the state it was in before this test was run

This is the general pattern that I’ll be following for my tests.

The Stories to Test

My story finding is complete and the backlog has been populated.
The stories selected for Sprint 1 are on the board. OK, they’ve been chosen so that they cover some of the more common scenarios that we might need to test.
The stories are :

  1. Add a competition – tests insert into a table with a Primary Key
  2. Add a Tournament – insert into a table with a Foreign Key constraint
  3. Remove Tournament – delete from a table
  4. Remove Competitions – delete from a Parent table
  5. Edit a Tournament – update of a record in a table
  6. View Tournaments by Competition – select multiple records using an in/out ref cursor
  7. Bulk Upload Competitions – insert records in bulk using LOG ERRORS
  8. Add a Team to a Tournament – insert a record using a synthetic foreign key

Each of these stories has multiple Acceptance Criteria.
It’s worth noting that, as some of the functionality (i.e. the data integrity) is implemented in the data model ( Primary Keys, Foreign Keys etc), the Acceptance Criteria for these stories needs to cover this as well as the functionality implemented in the PL/SQL code itself.

The Data Model

I’ve taken the approach that, for a story to be Sprint Ready, the Data Model to support it must already be in place.
Currently, the data model looks like this :

[Diagram: sprint1_data_model]

The DDL to create the application owner is :

create user footie identified by password
/

alter user footie default tablespace USERS
/

grant create session, create table, create procedure, create sequence to footie
/

alter user footie quota unlimited on users
/

grant read, write on directory my_files to footie
/

…where password is the password you want to give this user.
Note that a directory called MY_FILES already exists in my database.

The DDL to create the Data Model includes the COMPETITIONS table…

create table competitions
(
    comp_code varchar2(5) constraint comp_pk primary key,
    comp_name varchar2(50) not null,
    description varchar2(4000)
)
/

comment on table competitions is
    'International Football competitions for which tournament data can be added'
/

comment on column competitions.comp_code is
    'Internal code to uniquely identify this competition'
/

comment on column competitions.comp_name is
    'The name of the competition'
/

comment on column competitions.description is
    'A description of the competition'
/

…an external table to facilitate the bulk upload of COMPETITIONS…

create table competitions_xt
(
    comp_code varchar2(5),
    comp_name varchar2(50),
    description varchar2(4000)
)
    organization external
    (
        type oracle_loader
        default directory my_files
        access parameters
        (
            records delimited by newline
            skip 1
            fields terminated by ','
            badfile 'competitions.bad'
            logfile 'competitions.log'
            (
                comp_code char(5),
                comp_name char(50),
                description char(4000)
            )
        )
            location('competitions.csv')
    )
reject limit unlimited
/

….the TOURNAMENTS table…

create table tournaments
(
    id number constraint tourn_pk primary key,
    comp_code varchar2(5),
    year_completed number(4) not null,
    host_nation varchar2(100),
    year_started number(4),
    number_of_teams number(3) not null,

    constraint tourn_uk unique( comp_code, year_completed, host_nation)
)
/

comment on table tournaments is
    'A (finals) tournament of an International Football competition. Table alias is tourn'
/

comment on column tournaments.id is
    'Synthetic PK for the table as the Natural Key includes host_nation, which may be null. Values taken from sequence tourn_id_seq'
/
comment on column tournaments.comp_code is
    'The Competition that this tournament was part of (e.g. World Cup). Part of the Unique Key. FK to COMPETITIONS(comp_code)'
/

comment on column tournaments.year_completed is
    'The year in which the last match of this tournament took place. Mandatory. Part of Unique Key'
/

comment on column tournaments.host_nation is
    'The nation where the tournament was played (if a finals tournament). Part of Unique Key but is optional'
/
comment on column tournaments.year_started is
    'The year in which the first match was played ( if applicable). Cannot be later than the value in YEAR_COMPLETED'
/

comment on column tournaments.number_of_teams is
    'The number of teams taking part in the tournament'
/

…the TOURNAMENT_TEAMS table…

create table tournament_teams
(
    tourn_id number,
    team_name varchar2(100),
    group_stage varchar2(2),

    constraint team_pk primary key (tourn_id, team_name)
)
/

comment on table tournament_teams is
    'Teams participating in the Tournament. Alias is TEAM'
/

comment on column tournament_teams.tourn_id is
    'The ID of the tournament the team is participating in. Foreign Key to TOURNAMENTS(id).'
/

comment on column tournament_teams.team_name is
    'The name of the participating team'
/

comment on column tournament_teams.group_stage is
    'If the tournament has an initial group stage, the group identifier that the team is drawn in'
/

…this being 11g, a sequence to provide the values for the TOURNAMENTS.ID synthetic key (in 12c you can define this as part of the table DDL)…

create sequence tourn_id_seq
/
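
For comparison, here is a sketch of the 12c-only alternative (not used in this 11g build), where the id column is declared as an identity and no separate sequence is needed :

create table tournaments
(
    -- 12c only : the identity clause replaces the tourn_id_seq sequence
    id number generated always as identity constraint tourn_pk primary key,
    comp_code varchar2(5),
    year_completed number(4) not null,
    host_nation varchar2(100),
    year_started number(4),
    number_of_teams number(3) not null,

    constraint tourn_uk unique( comp_code, year_completed, host_nation)
)
/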

… a Foreign Key constraint from TOURNAMENTS to COMPETITIONS….

alter table tournaments
    add constraint tourn_comp_fk foreign key
        (comp_code) references competitions(comp_code)
/

…and a Foreign Key from TOURNAMENT_TEAMS to TOURNAMENTS…

alter table tournament_teams
    add constraint team_tourn_fk foreign key
        (tourn_id) references tournaments(id)
/

I’ve kept the Foreign Keys in separate files to make the initial deployment of the application simpler. By doing this, I can create the tables in any order without worrying about RI constraints. I can then add these as a separate step after all of the tables have been created.
The tables’ non-RI constraints (Primary, Unique Keys, Not Null constraints etc.) are created along with the table.

One other point to note is that I know there are one or two issues with the first-cut of the DDL above. This is so that I can see how well the tests I write highlight these issues.
As we know, before we begin writing a test, we’ll need to have a stub procedure for it to compile against.

The first of these is :

create or replace package footie.manage_competitions
as
    procedure add_competition
    (
        i_code footie.competitions.comp_code%type,
        i_name footie.competitions.comp_name%type,
        i_desc footie.competitions.description%type
    );
end manage_competitions;
/

create or replace package body footie.manage_competitions
as
    procedure add_competition
    (
        i_code footie.competitions.comp_code%type,
        i_name footie.competitions.comp_name%type,
        i_desc footie.competitions.description%type
    )
    is
    begin
        null;
    end add_competition;
end manage_competitions;
/

Scripting the Tests

I want to write a test script for my first story – Add a Competition.
There are two Acceptance Criteria that I need to test :

  • A new competition is added
  • A competition cannot be added more than once

That’s pretty simple, so the test should be fairly straight-forward. Using a SQL script, the first test would probably look something like this :

set serveroutput on size unlimited
declare
    l_result varchar2(4);
    l_err_msg varchar2(4000) := null;

    l_rec_count pls_integer;
    l_code footie.competitions.comp_code%type;
    l_name footie.competitions.comp_name%type := 'UT1 Comp';
    l_desc footie.competitions.description%type := 'Test';

    l_counter pls_integer := 1;

begin
    -- Setup - make sure that the competition we're using for the test does not already exist
    select 'UT'||numeral
    into l_code
    from
    (
        select max( to_number( substr( comp_code, regexp_instr( comp_code, '[[:digit:]]')))) + 1 as numeral
        from footie.competitions
        where comp_code like 'UT%'
        and regexp_instr( substr( comp_code, -1, 1), '[[:digit:]]') = 1
        union
        select 1 from dual
    )
    where numeral is not null;
    -- execute - nested block so that we can handle/report any exceptions
    begin
        footie.manage_competitions.add_competition(l_code, l_name, l_desc);
        l_result := 'PASS';
    exception
        when others then
            l_err_msg := dbms_utility.format_error_backtrace;
            l_result := 'FAIL';
    end; -- execute block
    -- validate
    if l_result = 'PASS' then
        select count(*) into l_rec_count
        from footie.competitions
        where comp_code = l_code
        and comp_name = l_name
        and description = l_desc;

        if l_rec_count != 1 then
            l_result := 'FAIL';
            l_err_msg := 'Record not added';
        end if;
    end if;
    -- teardown
    rollback;
    -- Display Result
    dbms_output.put_line('Add Competition : '||l_result);
    if l_result = 'FAIL' then
        dbms_output.put_line(l_err_msg);
    end if;
end;
/

If we run this, we’d expect it to fail, as things stand :

PL/SQL procedure successfully completed.

Add Competition : FAIL
Record not added

Our second test will probably look like this :

set serveroutput on size unlimited
declare
    l_result varchar2(4);
    l_err_msg varchar2(4000) := null;
    l_err_code number;

    l_rec_count pls_integer;
    l_code footie.competitions.comp_code%type := 'UT2';
    l_name footie.competitions.comp_name%type := 'UT2 Comp';
    l_desc footie.competitions.description%type := null;

    l_counter pls_integer := 1;

begin
    -- Setup - make sure that the competition we're using for the test exists
    merge into footie.competitions
    using dual
    on (comp_code = l_code)
    when not matched then
    insert( comp_code, comp_name, description)
    values(l_code, l_name, l_desc);

    -- execute - nested block so that we can handle/report any exceptions
    begin
        footie.manage_competitions.add_competition(l_code, l_name, l_desc);
        l_result := 'FAIL';
    exception
        when others then
            l_err_msg := dbms_utility.format_error_backtrace;
            l_result := 'PASS';
    end; -- execute block
    -- validate
    if l_result = 'PASS' then
        select count(*) into l_rec_count
        from footie.competitions
        where comp_code = l_code
        and comp_name = l_name
        and nvl(description, 'X') = nvl(l_desc, 'X');

        if l_rec_count > 1 then
            l_result := 'FAIL';
            l_err_msg := 'Duplicate record has been added';
        end if;
    end if;
    -- teardown
    rollback;
    -- Display Result
    dbms_output.put_line('Add Competition : '||l_result);
    if l_result = 'FAIL' then
        dbms_output.put_line(l_err_msg);
    end if;
end;
/

That looks rather similar to our first test. Furthermore, the two test scripts combined add up to quite a lot of code.
At least when we run it, it fails as expected…

PL/SQL procedure successfully completed.

Add Competition : FAIL

The next step is to write the code to pass the tests…

create or replace package body manage_competitions
as
	procedure add_competition
	(
		i_code competitions.comp_code%type,
		i_name competitions.comp_name%type,
		i_desc competitions.description%type default null
	)
	is
	begin
        insert into competitions( comp_code, comp_name, description)
        values( i_code, i_name, i_desc);
	end add_competition;
end manage_competitions;
/

Now when we run our tests we get …

PL/SQL procedure successfully completed.

Add Competition : PASS

PL/SQL procedure successfully completed.

Add Competition : PASS

Now, I could carry on writing SQL scripts for Acceptance Criteria for all of the other stories in the sprint, but I think you get the idea.
Writing tests this way requires a large amount of code, much of which is being replicated in separate tests.

Of course, you could go down the route of moving some of your repeating test routines into PL/SQL packages and deploying your test code along with your application code to non-production environments.
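
For example, a minimal helper package along the following lines (the names here are hypothetical, not part of the application) could centralise the verify-and-report logic that both scripts above repeat :

create or replace package footie_test_utils
as
    -- hypothetical helper holding the repeated verify/report logic
    procedure check_row_count
    (
        i_test_name in varchar2,
        i_expected in pls_integer,
        i_actual in pls_integer
    );
end footie_test_utils;
/

create or replace package body footie_test_utils
as
    procedure check_row_count
    (
        i_test_name in varchar2,
        i_expected in pls_integer,
        i_actual in pls_integer
    )
    is
    begin
        if i_actual = i_expected then
            dbms_output.put_line(i_test_name||' : PASS');
        else
            dbms_output.put_line(i_test_name||' : FAIL - expected '||i_expected||' row(s), found '||i_actual);
        end if;
    end check_row_count;
end footie_test_utils;
/
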
Before going to all of that extra effort though, it’s probably worth checking to see if there’s a framework out there that can help reduce the testing burden.
Fortunately, I’ve got a couple of years until the next major international tournament so, I’ll be taking some time to do just that.
So, tune in next week (or sometime soon after) when I'll be taking a look at the first of these frameworks…SQLDeveloper Unit Testing.


Filed under: Oracle, PL/SQL, SQL Tagged: external table, foreign key, Test Driven Development, unique key

Notes on Spark and Databricks — technology

DBMS2 - Sun, 2016-07-31 09:30

During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:

  • Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
  • A lot of this is via a focus on simplified APIs, based on Spark DataFrames:
    • Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
    • Machine learning and Spark Streaming both work with Spark DataFrames.
  • There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
  • SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-DS queries.

The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:

  • Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
  • Databricks has been working on security, and even on the associated certifications.

Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
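
In generic terms (this is the textbook online update rather than Databricks' exact formulation), each arriving data point (x, y) triggers a single step of the form w ← w − η∇ℓ(w; x, y), where w is the current model, η is the learning rate and ℓ is the per-example loss.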

The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.

Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.

  • Some of the reason is that SparkSQL is too immature to be great.
  • Some is annoyance that Databricks isn’t putting everything it has into open source.
  • Some is that everything has its architectural trade-offs.

To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.

Categories: Other

Notes on Spark and Databricks — generalities

DBMS2 - Sun, 2016-07-31 09:29

I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:

  • Spark is indeed the replacement for Hadoop MapReduce.
  • Spark is becoming the default platform for machine learning.
  • SparkSQL (nee’ Shark) is puttering along predictably.
  • Databricks reports good success in its core business of cloud-based machine learning support.
  • Spark Streaming has strong adoption, but its position is at risk.
  • Databricks, the original authority on Spark, is not keeping a tight grip on that role.

I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.

Spark is the replacement for Hadoop MapReduce.

This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.

The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy: 

  • Data-transformation-only use cases are important, but they don’t dominate.
  • Most other use cases typically have a data transformation element as well …
  • … which has to be started before any other work can be done.

And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.

Spark is becoming the default platform for machine learning.

Largely, this is a corollary of:

  • The previous point.
  • The fact that Spark was originally designed with machine learning as its principal use case.

To do machine learning you need two things in your software:

  • A collection of algorithms. Spark, I gather, is one of numerous good alternatives there.
  • Support for machine learning workflows. That’s where Spark evidently stands alone.

And thus I have conversations like:

  • “Are you doing anything with Spark?”
  • “We’ve gotten more serious about machine learning, so yes.”

SparkSQL (nee’ Shark) is puttering along.

SparkSQL is pretty much following the Hive trajectory.

  • Useful from Day One as an adjunct to other kinds of processing.
  • A tease and occasionally useful as a SQL engine for its own sake, but really not very good, pending years to mature.

Databricks reports good success in its core business of cloud-based machine learning support.

Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:

  • As you might expect based on my comments above, the majority of usage is for data transformation, but a lot of that is in anticipation of doing machine learning/predictive modeling in the near future.
  • Databricks customers typically already have their data in the Amazon cloud.
  • Naturally, a lot of Databricks customers are internet companies — ad tech startups and the like. Databricks also reports “strong” traction in the segments:
    • Media
    • Financial services (especially but not only insurance)
    • Health care/pharma
  • The main languages Databricks customers use are R and Python. Ion said that Python was used more on the West Coast, while R was used more in the East.

Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.

Spark Streaming’s long-term success is not assured.

To a first approximation, things look good for Spark Streaming.

  • Spark Streaming is definitely the leading companion to Kafka, and perhaps also to cloud equivalents (e.g. Amazon Kinesis).
  • The “traditional” alternatives of Storm and Samza are pretty much done.
  • Newer alternatives from Twitter, Confluent and Flink aren’t yet established.
  • Cloudera is a big fan of Spark Streaming.
  • Even if Spark Streaming were to generally decline, it might keep substantial “good enough” usage, analogously to Hive and SparkSQL.
  • Cool new Spark Streaming technology is coming out.

But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?

Databricks is not keeping a tight grip on Spark leadership.

For starters:

  • Databricks’ main business, as noted above, is its cloud service. That seems to be going well.
  • Databricks’ secondary business is licensing stuff to Spark distributors. That doesn’t seem to amount to much; it’s too easy to go straight to the Apache distribution and bypass Databricks. No worries; this never seemed it would be a big revenue opportunity for Databricks.

At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:

  • If you want the story on where Spark is going, you do what I did — you ask Databricks.
  • Similarly, if you’re thinking of pushing the boundaries on Spark use, and you have access to the Databricks folks, that’s who you’ll probably talk to.
  • Databricks employs ~1/3 of Spark committers.
  • Databricks organizes the Spark Summit.

But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.

Related links

Starting with my introduction to Spark, previous overview posts include those in:

Categories: Other

Terminology: Data scientists vs. data engineers

DBMS2 - Sun, 2016-07-31 09:10

I learned some newish terms on my recent trip. They’re meant to solve the problem that “data scientists” used to be folks with remarkably broad skill sets, few of whom actually existed in ideal form. So instead now it is increasingly said that:

  • “Data engineers” can code, run clusters, and so on, in support of what’s always been called “data science”. Their knowledge of the math of machine learning/predictive modeling and so on may, however, be limited.
  • “Data scientists” can write and run scripts on single nodes; anything more on the engineering side might strain them. But they have no-apologies skills in the areas of modeling/machine learning.

Related link

Categories: Other

Partner Webcast – Enterprise Database Backup on Oracle Cloud

With today’s 24x7 always-on business pace, failure is simply not an option. SLAs are getting more demanding, and application owners are struggling, realizing that data protection is arguably one of the...

We share our skills to maximize your revenue!
Categories: DBA Blogs
