

The worst database developers in the world?

DBMS2 - 2 hours 49 min ago

If the makers of MMO RPGs (Massive Multi-Player Online Role-Playing Games) aren’t quite the worst database application developers in the world, they’re at least on the short list for consideration. The makers of Guild Wars didn’t even try to have decent database functionality. A decade later, when they introduced Guild Wars 2, the database-oriented functionality (auction house, real-money store, etc.) would crash for days at a time. Lord of the Rings Online evidently had multiple issues with database functionality. Now I’m playing Elder Scrolls Online, which on the whole is a great game, but which may have the most database screw-ups of all.

ESO has been live for less than 3 weeks, and in that time:

1. There’s been a major bug in which players’ “banks” shrank, losing items and so on. Days later, the data still hasn’t been recovered. After a patch, the problem if anything worsened.

2. Guild functionality has at times been taken down while the rest of the game functioned.

3. Those problems aside, bank and guild bank functionality are broken, via what might be considered performance bugs. Problems I repeatedly encounter include:

  • If you deposit a few items, the bank soon goes into a wait state where you can’t use it for a minute or more.
  • Similarly, when you try to access a guild — i.e. group — bank, you often find it in an unresponsive state.
  • If you make a series of updates a second apart, the game tells you you’re doing things too quickly, and insists that you slow down a lot.
  • Items that are supposed to “stack” appear in 2 or more stacks; i.e., a very simple kind of aggregation is failing. There are also several other related recurring errors, which I conjecture have the same underlying cause.

In general, it seems that what should be a collection of database records is really just a list, parsed each time an update occurs, periodically flushed in its entirety to disk, with all the performance problems you’d expect from that kind of choice.

4. Even stupider are the in-game stores, where fictional items are sold for fictional money. They have an e-commerce interface that is literally 15+ years out of date — items are listed with VERY few filtering options, and there is no way to change the sort. But even that super-primitive interface doesn’t work; in particular, filter queries frequently return incorrect empty-set responses.

5. Much as in other games, over 10 minutes of state changes can be lost.
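Going back to the conjecture under item 3: the stacking symptom is what you would expect if deposits are appended to a flat list rather than upserted into keyed records. The sketch below is only an illustration of that difference. Every name in it (`depositIntoList`, `depositIntoKeyed`, the `iron_ore` item) is hypothetical, and nothing here is known about how ESO actually stores banks.

```javascript
// Hypothetical illustration only; none of these names come from ESO.

// List-style bank: every deposit appends an entry, so identical items
// end up in separate "stacks" unless something re-scans and merges them.
function depositIntoList(bank, itemId, qty) {
  bank.push({ itemId, qty });
  return bank;
}

// Record-style bank: one entry per item key, so a deposit is a keyed
// upsert and stacking is a simple aggregation on that key.
function depositIntoKeyed(bank, itemId, qty) {
  bank.set(itemId, (bank.get(itemId) || 0) + qty);
  return bank;
}

const listBank = [];
depositIntoList(listBank, "iron_ore", 10);
depositIntoList(listBank, "iron_ore", 5);

const keyedBank = new Map();
depositIntoKeyed(keyedBank, "iron_ore", 10);
depositIntoKeyed(keyedBank, "iron_ore", 5);

console.log(listBank.length);            // separate stacks of the same item
console.log(keyedBank.get("iron_ore"));  // one stack holding both deposits
```

With keyed records, a single deposit touches one entry instead of forcing a scan of the whole list.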

Except perhaps for #5, these are all functions that surely are only loosely coupled to the rest of the game. Hence the other difficulties of game scaling and performance should have no bearing on them. Hence there’s no excuse for doing such a terrible job of development on large portions of gameplay functionality.

Based on job listings, ESO developer Zenimax doesn’t see database functionality as a major area to fix. This makes me sad.

Categories: Other

Collaborate 14: Taking the WebCenter Portal User Experience to the Next Level!

Come join me and Ted Zwieg at Collaborate14 for our presentation, Taking UX and development to the next level.
Fri, Apr 11, 2014 (09:45 AM – 10:45 AM) : Level 3, San Polo 3501A. 

Here is our session overview -

Taking the WebCenter Portal User Experience to the Next Level!


Learn techniques to create unique, award-winning portals that not only support today’s need for mobile-responsive and adaptive content but also take the next steps towards innovative design, enhancing both the user journey and the experience of today’s modern portal, as well as the way in which developers can expand the reach and potential of the portal with these modern techniques.
Attendees will not only learn about new approaches but will also be shown live portals using these techniques today to create a modern experience. Learn how to develop your portal for the future and enable marketing/design teams to react and generate interactive content fast with no ADF knowledge.


Target Audience

Designed for users wanting to learn the art of the possible and discover what is achievable with WebCenter Portal and ADF: creating compelling user experiences and keeping up to date with modern techniques and design approaches that can be combined to create faster, more interactive ways of navigating through portlets and the Portal.


Executive Summary

This session will demonstrate a couple of award-winning examples of live clients who have taken their ADF WebCenter Portal environment to the next level, showing how combining HTML5 techniques, third-party libraries and responsive/adaptive design with ADF, when done in the correct way, can improve not only performance but also the way in which users and developers interact with the portal using modern web design techniques.


Learners will be able to:

  • Identify the art of the possible with ADF. (everything is achievable…)
  • Discuss achievable concepts and methods for enhancing the ways in which users can interact with Portal.
  • Gain an improved understanding of Responsive and Adaptive techniques, not only for mobile devices
  • Understand how to structure the portal for faster response times with new frontend techniques
  • Integrate with non-ADF third-party components for a more dynamic experience
  • Learn new methods to manage and maintain key core components

The post Collaborate 14: Taking the WebCenter Portal User Experience to the Next Level! appeared first on C4 Blog by Fishbowl Solutions.

Categories: Fusion Middleware, Other

NoSQL vs. NewSQL vs. traditional RDBMS

DBMS2 - Fri, 2014-03-28 08:09

I frequently am asked questions that boil down to:

  • When should one use NoSQL?
  • When should one use a new SQL product (NewSQL or otherwise)?
  • When should one use a traditional RDBMS (most likely Oracle, DB2, or SQL Server)?

The details vary with context — e.g. sometimes MySQL is a traditional RDBMS and sometimes it is a new kid — but the general class of questions keeps coming. And that’s just for short-request use cases; similar questions for analytic systems arise even more often.

My general answers start:

  • Sometimes something isn’t broken, and doesn’t need fixing.
  • Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
  • Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.

In particular, migration away from legacy DBMS raises many issues: 

  • Feature incompatibility (especially in stored-procedure languages and/or other vendor-specific SQL).
  • Your staff’s programming and administrative skill-sets.
  • Your investment in DBMS-related tools.
  • Your supply of hockey tickets from the vendor’s salesman.

Except for the first, those concerns can apply to new applications as well. So if you’re going to use something other than your enterprise-standard RDBMS, you need a good reason.

Commonly, the good reason to change DBMS is one or more of:

  • Programming model. Increasingly often, dynamic schemas seem preferable to fixed ones. Internet-tracking nested data structures are just one of the reasons.
  • Performance (scale-out). DBMS written in this century often scale out better than ones written in the previous millennium. Also, DBMS with fewer features find it easier to scale than more complex ones; distributed join performance is a particular challenge.
  • Geo-distribution. A special kind of scale-out is geo-distribution, which is sometimes a compliance requirement, and in other cases can be a response time nice-to-have.
  • Other stack choices. Couchbase gets a lot of its adoption from existing memcached users (although they like to point out that the percentage keeps dropping). HBase gets a lot of its adoption as a Hadoop add-on.
  • Licensing cost. Duh.

NoSQL products commonly make sense for new applications. NewSQL products, to date, have had a harder time crossing that bar. The chief reasons for the difference are, I think:

  • Programming model!
  • Earlier to do a good and differentiated job in scale-out.
  • Earlier to be at least somewhat mature.

And that brings us to the 762-gigabyte gorilla — in-memory DBMS performance – which is getting all sorts of SAP-driven marketing attention as a potential reason to switch. One can of course put any database in memory, providing only that it is small enough to fit in a single server’s RAM, or else that the DBMS managing it knows how to scale out. Still, there’s a genuine category of “in-memory DBMS/in-memory DBMS features”, principally because:

  • In-memory database managers can and should have a very different approach to locking and latching than ones that rely on persistent storage.
  • Not all DBMS are great at scale-out.

But Microsoft has now launched Hekaton, about which I long ago wrote:

I lack detail, but I gather that Hekaton has some serious in-memory DBMS design features. Specifically mentioned were the absence of locking and latching.

My level of knowledge about Hekaton hasn’t improved in the interim; still, it would seem that in-memory short-request database management is not a reason to switch away from Microsoft SQL Server. Oracle has vaguely promised to get to a similar state one of these years as well.

Of course, HANA isn’t really a short-request DBMS; it’s an analytic DBMS that SAP plausibly claims is sufficiently fast and feature-rich for short-request processing as well.* It remains to be seen whether that difference in attitude will drive enough sustainable product advantages to make switching make sense.

*Most obviously, HANA is columnar. And it has various kinds of integrated analytics as well.

Categories: Other

How to get Google Glass, Cordova, HTML5 APIs Working (WebRTC, WebSockets).

One of the challenges I’ve noticed when developing Cordova hybrid apps is the lack of support for HTML5 APIs, like WebSockets and WebRTC, that work in desktop browsers but not in the default Android browser (Ice Cream Sandwich, Jelly Bean).

Now, Google Glass currently runs on Android 4.0.3 Ice Cream Sandwich, and from the rumours circulating, the next update (fingers crossed) will put us onto 4.4 KitKat. (Can’t wait!!)

The problem with Android 4.0.3 is that it’s running an old edition of WebKit, customised to support Google Glass with added gesture support. Unfortunately, if you’re developing rich internet applications with Cordova and want to use the latest HTML5 techniques supported by desktop browsers today, you just can’t, as Cordova uses the default browser that is part of Android.

Read on to find out how to get HTML5 APIs working with Cordova.

There are plugins out there, e.g. here’s one for WebSockets.

Now, for those looking purely for WebRTC support, you could set up appRTC, but I’ve seen a lot of bugs and issues reported with using this on Glass. Another option would be to look at Addlive and try out their SDK, which you can see working here with Google Glass.

A third option, which I’ve gone for, is to package Chromium with Cordova and use it as the default browser, as it supports the majority of HTML5 APIs (e.g. WebSockets, WebRTC and more).

To do this there is a project called Crosswalk, designed for HTML5 mobile apps, that brings in support for Chromium with the Blink engine on Android 4.x+.

I’ve used this with 3 custom apps I’ve built and it works great with Glass and Cordova! Now, when Glass KitKat arrives, you should have the use of the Chromium browser as the default browser with Cordova without the need to package Crosswalk. Packaging Crosswalk does have the added benefit of Blink, though, and if you are like me and have created a single Android app to support 4.x+ devices, I feel it just makes sense to spend that extra time to get Crosswalk packaged in with your app.

Here’s a quick, simple example of getting the camera on either your phone or Glass to stream into the app/browser viewport.

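<!DOCTYPE html>
<html>
<head>
	<meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=1">
	<title>Video Cam Display</title>
</head>
<body>
	<video autoplay></video>

	<!-- adapter.js supplies the cross-browser getUserMedia used below -->
	<script src='js/lib/adapter.js'></script>
	<script>
	// Request video only
	var constraints = {
		video : true
	};

	function successCallback(stream) {
		window.stream = stream; // stream available to console
		var video = document.querySelector("video");
		if ("srcObject" in video) {
			video.srcObject = stream; // newer browsers
		} else {
			video.src = window.URL.createObjectURL(stream); // older WebKit/Chromium
		}
	}

	function errorCallback(error) {
		console.log("getUserMedia error: ", error);
	}

	getUserMedia(constraints, successCallback, errorCallback);
	</script>
</body>
</html>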


You can grab the adapter js file -from
This provides additional browser support for WebRTC.

For those using Glass, make sure to update the constraints to reduce the frame rate and camera resolution; this helps prevent overheating and lag.
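As a rough sketch of what reduced constraints might look like, the helper below builds a constraints object that caps resolution and frame rate using the standard `width`, `height` and `frameRate` properties. The function name `buildLowPowerConstraints` and the numbers 640, 360 and 15 are assumptions chosen for illustration, not values tested on Glass.

```javascript
// Build getUserMedia constraints that cap resolution and frame rate.
// The helper name and the numbers passed in below are illustrative
// assumptions, not tested Glass settings.
function buildLowPowerConstraints(maxWidth, maxHeight, maxFrameRate) {
  return {
    audio: false,
    video: {
      width: { max: maxWidth },
      height: { max: maxHeight },
      frameRate: { max: maxFrameRate }
    }
  };
}

const constraints = buildLowPowerConstraints(640, 360, 15);
console.log(JSON.stringify(constraints));
```

The resulting object can be passed as the first argument to `getUserMedia` in place of `{ video : true }`. Older Chromium builds may expect the legacy form `video: { mandatory: { maxWidth, maxHeight, maxFrameRate } }` instead, so check which syntax your packaged browser accepts.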

You can play around with constraints by editing the values from here:







The post How to get Google Glass, Cordova, HTML5 APIs Working (WebRTC, WebSockets). appeared first on C4 Blog by Fishbowl Solutions.

Categories: Fusion Middleware, Other

My Experience Developing for Google Glass using Cordova

I managed to get my hands on a Google Glass Explorer kit R2 last week!

Last week in the post, all the way from Minneapolis Fishbowl HQ, my Google Glass kit arrived. I’ve been fortunate to have had the opportunity to play around with and try these out, thanks to the Oracle AppsLab team sporting theirs at many of the Oracle events, but never the opportunity to write an app.

So as you can imagine ideas began forming – what could I develop with these?

Read on to find out some of my first experiences developing for Google Glass using Cordova.

So the first thing: why not push Fishbowl Connect (Fishbowl Solutions’ hybrid Cordova-based Android app) onto Google Glass and see what happens.. “Gulp”!

Plugged in Google Glass and installed the drivers -

Android SDK Manager

 Checked to see if Glass was detected -

adb devices

Then side loaded Fishbowl Connect into the Google Glass.

adb install fishbowlConnect.apk

The app installed successfully but unfortunately nothing displayed under the installed app list.

Next step: update the Cordova app with the Google Glass plugin -

Provided by the clever guys at Sencha.

cordova plugin add

Updated the {app}/platforms/android/res/values/glass.xml

<string name="app_launch_voice_trigger">Launch Fishbowl Connect</string>

Package and side load the app in!

Using the voice command -

“OK Glass, Launch Fishbowl Connect”

Wow, it works!.. Well, when I say works – the voice command “Launch Fishbowl Connect” is recognised and it displays the first screen of Fishbowl Connect (Win!) but the problem is that there’s no screen touch support with glass. (don’t try it! you might poke your eye out).

With Glass you just get the following inputs: voice commands, and gesture support via the touch pad (tap, longpress, swipeup, swipeleft, swiperight, twotap, twolongpress, twoswipeup, twoswipedown, twoswipeleft, twoswiperight, threetap, threelongpress).

Luckily, with adaptive design it was possible to extend the app and supply a new set of templates, targeted at and loaded for Glass, to allow you to swipe through the Fishbowl Connect options.
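To give a feel for the swipe-through idea, here is a minimal sketch that maps two of the gesture names listed above onto moving a selection through a list of options. The option labels, the `nextIndex` helper and the way gestures are fed in are all hypothetical; this is not the plugin’s API or the actual Fishbowl Connect code.

```javascript
// Hypothetical sketch: the gesture names come from the list above, but the
// helper, the option labels and the wiring are assumptions for illustration.
const options = ["Search", "Recent documents", "Favourites"];

function nextIndex(current, gesture, count) {
  switch (gesture) {
    case "swiperight": return (current + 1) % count;          // forward, wrapping
    case "swipeleft":  return (current - 1 + count) % count;  // back, wrapping
    default:           return current;                        // ignore other gestures
  }
}

let selected = 0;
for (const gesture of ["swiperight", "swiperight", "swipeleft", "tap"]) {
  selected = nextIndex(selected, gesture, options.length);
}
console.log(options[selected]);
```

Keeping the navigation logic in a pure function like this makes it easy to test away from the device, with the gesture events wired in separately.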

It’s still early days for Fishbowl Connect and Google Glass, but it’s a great start. And as ADF-Mobile is Cordova-based, it should be possible to install it onto Google Glass. Unfortunately the AMX views don’t support the gestures, but I wouldn’t be surprised if Oracle has been working on that behind the scenes, just waiting to release a wearables interface for ADF-Mobile.. Not only for Google Glass but for other wearable devices, smart watches, etc…


If you’re familiar with developing apps for Android, then when you jump over to Glass, well, at least for me, the experience of writing/migrating an existing app was very, very easy!



The post My Experience Developing for Google Glass using Cordova appeared first on C4 Blog by Fishbowl Solutions.

Categories: Fusion Middleware, Other

Collaborate 14 Preview: Oracle WebCenter 11g Upgrades – What You Need to Know

Title: #996 – A Successful Oracle WebCenter Upgrade: What You Need to Know

Date: Monday, April 7th
Time: 9:00 am to 3:00 pm
Location: Level 3, San Polo 3405

Upgrading to the next major release of software can sometimes be a complex and arduous task for organizations. In determining the how and when to perform the upgrade, organizations typically go through an evaluation process that includes new feature/function analysis, new technology and architecture analysis, and the overall time they expect the upgrade to take. This is especially the case for software upgrades that add a new layer of complexity and technology architecture that organizations have to plan and adapt for. Such is the case with Oracle WebCenter, as it added WebLogic Server as the application server from 10g to 11g. This addition, although beneficial in many areas, came with a set of new technologies and complexities that organizations with little to no exposure to them first had to understand, perhaps get trained on, procure the necessary hardware to run, and in many cases deploy a separate team to manage. In considering all of these steps to perform the upgrade, organizations have undoubtedly gone through the process of “trade-off” analysis, where they weigh the pros and cons of performing the upgrade immediately versus putting it off until, for example, support for their current version runs out. This “trade-off” analysis describes many WebCenter customers, as a great number have still not upgraded to 11g.

If this sounds like your organization, then make plans now to attend this session to receive overviews, examples, and actionable tasks that can be used when planning and performing your WebCenter upgrade.

Fishbowl Solutions is happy to be joined by WebCenter customers Ryan Companies, Cascade, Aleris International and AAM as they share their stories on upgrading to Oracle WebCenter 11g. Join us and hear directly from these customers to learn the tips, tricks and best practices for a successful WebCenter upgrade. Here is the tentative schedule for the presentation:

More information on this presentation, as well as all of Fishbowl’s activities at Collaborate, can be found here.

We hope to see you at Collaborate 14!

The post Collaborate 14 Preview: Oracle WebCenter 11g Upgrades – What You Need to Know appeared first on C4 Blog by Fishbowl Solutions.

Categories: Fusion Middleware, Other

DBMS2 revisited

DBMS2 - Sun, 2014-03-23 05:52

The name of this blog comes from an August, 2005 column. 8 1/2 years later, that analysis holds up pretty well. Indeed, I’d keep the first two precepts exactly as I proposed back then:

  • Task-appropriate data managers. Much of this blog is about task-appropriate data stores, so I won’t say more about them in this post.
  • Drastic limitations on relational schema complexity. I think I’ve been vindicated on that one by, for example:
    • NoSQL and dynamic schemas.
    • Schema-on-read, and its smarter younger brother schema-on-need.
    • Limitations on the performance and/or allowed functionality of joins in scale-out short-request RDBMS, and the relative lack of complaints about same.
    • Funky database design from major Software as a Service (SaaS) vendors such as Workday and
    • A whole lot of logs.

I’d also keep the general sense of the third precept, namely appropriately-capable data integration, but for that one the specifics do need some serious rework.

For starters, let me say:

  • I’ve mocked the concept of “logical data warehouse” in the past, for its implausible grandiosity, but Gartner’s thoughts on the subject are worth reviewing even so.
  • I generally hear that internet businesses have SOAs (Service-Oriented Architectures) loosely coupling various aspects of their systems, and this is going well. Indeed, it seems to be going so well that it’s not worth talking about, and so I’m unclear on the details; evidently it just works. However …
  • … evidently these SOAs are not set up for human real-time levels of data freshness.
  • ETL (Extract/Transform/Load) is criticized for two reasons:
    • People associate it with the kind of schema-heavy relational database design that’s now widely hated, and the long project cycles it is believed to bring with it.
    • Both analytic RDBMS and now Hadoop offer the alternative of ELT, in which the loading comes before the transformation.
  • There are some welcome attempts to automate aspects of ETL/ELT schema design. I’ve written about this at greatest length in the context of ClearStory’s “Data Intelligence” pitch.
  • Schema-on-need defangs other parts of the ETL/ELT schema beast.
  • If you have a speed-insensitive problem with the cost or complexity of your high-volume data transformation needs, there’s a good chance that Hadoop offers the solution. Much of Hadoop’s adoption is tied to data transformation.
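The ETL/ELT distinction in the list above comes down to where the transformation runs. The toy sketch below shows both orderings against an in-memory stand-in for a data store. The sample lines, the field layout and the function names are all made up for illustration, and real systems would of course do this at scale inside the target platform.

```javascript
// Toy in-memory "warehouse"; all names and sample data are illustrative assumptions.
const rawLines = ["2014-03-23,alice,12.50", "2014-03-23,bob,7.25"];

// The transformation step shared by both approaches.
function transform(line) {
  const [date, user, amount] = line.split(",");
  return { date, user, amount: Number(amount) };
}

// ETL: transform first, then load only the shaped rows.
function etl(lines) {
  return { rows: lines.map(transform) };
}

// ELT: load the raw lines untouched and transform later inside the target,
// so the schema can be decided (or changed) when a query needs it.
function elt(lines) {
  const store = { raw: lines.slice() };
  store.query = () => store.raw.map(transform);
  return store;
}

const etlTotal = etl(rawLines).rows.reduce((sum, r) => sum + r.amount, 0);
const eltTotal = elt(rawLines).query().reduce((sum, r) => sum + r.amount, 0);
console.log(etlTotal, eltTotal);
```

Both paths give the same answer here. The difference is that the ELT store still holds the raw lines, which is what lets schema decisions be deferred.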

Next, I’d like to call out what is generally a non-problem: when a query can go to two or more systems for the same information, which one should it tap? In theory, that’s a much harder problem than ordinary DBMS optimization. But in practice, only the simplest forms of the challenge tend to arise, because when data is stored in more than one system, they tend to have wildly different use cases, performance profiles and/or permissions.

So what I’m saying is that most traditional kinds of data integration problems are well understood and on their way to being solved in practice. We have our silos; data is replicated as needed between silos; and everything is more or less cool. But of course, as traditional problems get solved, new ones arise, and those turn out to be concentrated among real-time requirements.

“Real-time” of course means different things in different contexts, but for now I think we can safely partition it into two buckets:

  • Human real-time — fast enough so that it doesn’t make a human wait.
  • Machine real-time — as fast as ever possible, because machines are racing other machines.

The latter category arises in the case of automated bidding, famously in high-frequency securities trading, but now in real-time advertising auctions as well. But those vertical markets aside, human real-time integration generally is fast enough.

Narrowing the scope further, I’d say that real-time transactional integration has worked for a while. I date it back to the initially clunky EAI (Enterprise Application Integration) vendors of the latter 1990s. The market didn’t turn out to be that big, but neither did the ETL market, so it’s all good. SOAs, as previously noted, are doing pretty well.

Where things still seem to be dicier is in the area of real-time analytic integration. How can analytic processing be tougher in this regard than transactional? Two ways. One, of course, is data volume. The second is that it’s more likely to involve machine-generated data streams. That said, while I hear a lot about a BI need-for-speed, I often suspect it of being a want-for-speed instead. So while I’m interested in writing a more focused future post on real-time data integration, there may be a bit of latency before it comes out.

Categories: Other

Wants vs. needs

DBMS2 - Sun, 2014-03-23 05:51

In 1981, Gerry Chichester and Vaughan Merlyn did a user-survey-based report about transaction-oriented fourth-generation languages, the leading application development technology of their day. The report included top-ten lists of important features during the buying cycle and after implementation. The items on each list were very similar — but the order of the items was completely different. And so the report highlighted what I regard as an eternal truth of the enterprise software industry:

What users value in the product-buying process is quite different from what they value once a product is (being) put into use.

Here are some thoughts about how that comes into play today.

Wants outrunning needs

1. For decades, BI tools have been sold in large part via demos of snazzy features the CEO would like to have on his desk. First it was pretty colors; then it was maps; now sometimes it’s “real-time” changing displays. Other BI features, however, are likely to be more important in practice.

2. In general, the need for “real-time” BI data freshness is often exaggerated. If you’re a human being doing a job that’s also often automated at high speed — for example network monitoring or stock trading — there’s a good chance you need fully human real-time BI. Otherwise, how much does a 5-15 minute delay hurt? Even if you’re monitoring website sell-through — are your business volumes really high enough that 5 minutes matters much? eBay answered “yes” to that question many years ago, but few of us work for businesses anywhere near eBay’s scale.

Even so, the want for speed keeps growing stronger. :)

3. Similarly, some desires for elastic scale-out are excessive. Your website selling koi pond accessories should always run well on a single server. If you diversify your business to the point that that’s not true, you’ll probably rewrite your app by then as well.

4. Some developers want to play with cool new tools. That doesn’t mean those tools are the best choice for the job. In particular, boring old SQL has merits — such as joins! — that shiny NoSQL hasn’t yet replicated.

5. Some developers, on the other hand, want to keep using their old tools, on which they are their employers’ greatest experts. That doesn’t mean those tools are the best choice for the job either.

6. More generally, some enterprises insist on brand labels that add little value but lots of expense. Yes, there are many benefits to vendor consolidation, and you may avoid many headaches if you stick with not-so-cutting-edge technology. But “enterprise-grade” hardware failure rates may not differ enough from “consumer-grade” ones to be worth paying for.

7. Some enterprises still insist on keeping their IT operations on-premises. In a number of cases, that perceived need is hard to justify.

8. Conversely, I’ve steered clients away from data warehouse appliances and toward, say, Vertica, because they had a clear desire to be cloud-ready. However, I’m not aware that any of those companies ever actually deployed Vertica in the cloud.

Needs ahead of wants

1. Enterprises often don’t realize how much their lives can be improved via a technology upgrade. Those queries that take 6 hours on your current systems, but only 6 minutes on the gear you’re testing? They’d probably take 15 minutes or less on any competitive product as well. Just get something reasonably modern, please!

2. Every application SaaS vendor should offer decent BI. Despite their limited scope, dashboards specific to the SaaS application will likely provide customer value. As a bonus, they’re also apt to demo well.

3. If your customer personal-identity data that resides on internet-facing systems isn’t encrypted – why not? And please don’t get me started on passwords that are stored and mailed around in plain text.

4. Notwithstanding what I said above about elasticity being overrated, buyers often either underrate their needs for concurrent usage, or else don’t do a good job of testing concurrency. A lot of performance disappointments are really problems with concurrency.

5. As noted above, it’s possible to underrate one’s need for boring old SQL goodness.

Wants and needs in balance

1. Twenty years ago, I thought security concerns were overwrought. But in an internet-connected world, with customer data privacy and various forms of regulatory compliance in play, wants and needs for security seem pretty well aligned.

2. There also was a time when ease of set-up and installation tended to be underrated. Not any more, however; people generally understand its great importance.

Categories: Other

Real-time analytics for everybody, uniquely from us!!

DBMS2 - Tue, 2014-03-18 21:54

In my latest post, I noted that

The “real-time analytics” gold rush I called out last year continues.

I also recently mocked the slogan

Analytics for everybody!

So when I saw today an email subject line

[Vendor X] to announce real-time analytics for everyone …

I laughed. Indeed, I snorted so loudly that Linda — who was on a different floor of our house — called to check that I was OK. :)

As the day progressed, I had a consulting call with a client, and what did I see on the first substantive slide? There were references to:

broader audience


real-time data analysis

The trends — real or imaginary — are melting into each other!

Categories: Other

Notes and comments, March 17, 2014

DBMS2 - Mon, 2014-03-17 01:09

I have ever more business-advice posts up on Strategic Messaging. Recent subjects include pricing and stealth-mode marketing. Other stuff I’ve been up to includes:

The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.

Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing. :D

The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.

Basho had a major change in leadership. A Twitter exchange ensued. :) Joab Jackson offered a more sober — figuratively and literally — take.

Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.

*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.

WibiData is partnering with DataStax. WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.

Disclosure: My fingerprints are all over that deal.

In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.

I encountered another apparently-popular machine-learning term — bandit model. It seems to be glorified A/B testing, and it seems to be popular. I think the point is that it tries to optimize for just how much you invest in testing unproven (for good or bad) alternatives.
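For a concrete picture of the trade-off between testing unproven alternatives and sticking with the current best, here is a small sketch of epsilon-greedy, which is one simple bandit strategy among several. It is picked purely for illustration, not because anyone mentioned above uses it. The conversion rates, the seed and every name in the code are assumptions.

```javascript
// Epsilon-greedy: one simple bandit strategy, chosen here for illustration.
// `rng` is injected so the example is reproducible; Math.random works too.
function createBandit(armCount, epsilon, rng) {
  const pulls = new Array(armCount).fill(0);
  const totals = new Array(armCount).fill(0);
  const mean = (i) => (pulls[i] === 0 ? 0 : totals[i] / pulls[i]);

  return {
    // With probability epsilon explore a random arm; otherwise exploit the
    // arm with the best observed average reward so far.
    choose() {
      if (rng() < epsilon) return Math.floor(rng() * armCount);
      let best = 0;
      for (let i = 1; i < armCount; i++) if (mean(i) > mean(best)) best = i;
      return best;
    },
    // Record the reward observed for the chosen arm.
    update(arm, reward) {
      pulls[arm] += 1;
      totals[arm] += reward;
    },
    pulls
  };
}

// Deterministic pseudo-random generator (a small LCG) so runs are repeatable.
function makeRng(seed) {
  let state = seed;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

const rng = makeRng(42);
const conversionRates = [0.05, 0.20]; // hypothetical true rates for variants A and B
const bandit = createBandit(2, 0.1, rng);
for (let i = 0; i < 5000; i++) {
  const arm = bandit.choose();
  bandit.update(arm, rng() < conversionRates[arm] ? 1 : 0);
}
console.log(bandit.pulls); // how many times each variant was shown
```

Here `epsilon` is the knob for how much traffic goes to unproven alternatives. A plain A/B test would instead split traffic evenly for the whole run.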

I had an awkward set of interactions with Gooddata, including my longest conversations with them since 2009. Gooddata is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix — but please don’t hold me to those details.

I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.

*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.

Categories: Other

Node.js running on Oracle ADF-Mobile!

I recently came across this project, which provides the ability to run node.js on a JVM, and it got me thinking: “Hey, wait a sec, ADF-Mobile has a JVM.” True, it’s not the latest and greatest, but I’m sure it’s only a matter of time before Oracle upgrades ADF-Mobile’s JVM.

For those new to the node.js framework: it basically allows you to write server/device-side JavaScript, and is built on top of Google Chrome’s JavaScript runtime, aka the V8 JS engine. Its event-driven, non-blocking I/O model makes it lightweight and very efficient. Its success has grown and grown within the community and with big players like LinkedIn, Microsoft, Walmart, PayPal, eBay, Yahoo, etc. – More here

Nodyn, a project sponsored by Red Hat via its Project:Odd team, works by leveraging two other projects: the DynJS project, which provides the actual JavaScript runtime (ECMAScript, actually) for the JVM, and the Vert.x application platform/event bus system.

So you’re probably all reading this and thinking: why would I want to write JS on the device and not through the webview? What’s the point?

It’s not for the novelty of writing server/device-side JS (although I personally really like this idea) or even importing prebuilt node.js packages onto the device; it’s about allowing node.js apps to work directly with existing Java apps or even apps that may also be running on the JVM.

Now, before you all get your hopes up: if you write node.js packages, you need to be aware that Nodyn isn’t a direct port of node.js, and not everything works. But I feel it is a step in the right direction, with the backing of Red Hat.

This might be a pipe dream of mine at the moment, but the hope is that in the next year or so we may be able to use node.js on ADF-Mobile. It may even be possible to run it now, although that is doubtful with the current JVM release running on ADF-Mobile.



The post Node.js running on Oracle ADF-Mobile! appeared first on C4 Blog by Fishbowl Solutions.

Categories: Fusion Middleware, Other

Splunk and inverted-list indexing

DBMS2 - Thu, 2014-03-06 06:55

Some technical background about Splunk

In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):

Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.

It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.

I also wrote:

The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.

That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.

I further wrote:

Splunk has a simple ILM (Information Lifecycle Management) story based on time. I didn’t probe for details.

Splunk’s ILM story turns out to be simple indeed.

  • As data streams in, Splunk adds it to the most recent — “hot” — bucket. Once a bucket is full, it becomes immutable — “warm” — and a new hot bucket is opened to receive data.
  • Splunk executes queries against whichever of these time-slice buckets make sense, then unions results together as needed.
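The two bullets above can be sketched in a few lines of Python. This is a toy model of time-sliced buckets, not Splunk’s implementation: the `Bucket` and `BucketStore` names, the tiny capacity of 3, and the assumption that events arrive in timestamp order are all mine.

```python
from dataclasses import dataclass, field

@dataclass
class Bucket:
    events: list = field(default_factory=list)   # (timestamp, text) pairs, in time order
    warm: bool = False                            # True once the bucket has been sealed

    def overlaps(self, start, end):
        """True if this bucket's time slice can intersect [start, end]."""
        return bool(self.events) and self.events[0][0] <= end and self.events[-1][0] >= start

class BucketStore:
    def __init__(self, capacity=3):               # tiny capacity, for illustration only
        self.capacity = capacity
        self.buckets = [Bucket()]                 # the last bucket is always the hot one

    def append(self, timestamp, text):
        """Add an event to the hot bucket; seal it and open a new one when full."""
        hot = self.buckets[-1]
        hot.events.append((timestamp, text))
        if len(hot.events) >= self.capacity:
            hot.warm = True                       # by convention, never written again
            self.buckets.append(Bucket())

    def search(self, start, end, term):
        """Query only the buckets whose time slice overlaps, then union the results."""
        hits = []
        for bucket in self.buckets:
            if bucket.overlaps(start, end):
                hits.extend(e for e in bucket.events
                            if start <= e[0] <= end and term in e[1])
        return hits

store = BucketStore()
for t in range(1, 8):
    store.append(t, f"event {t} status=ok")
store.append(8, "event 8 status=error")
hits = store.search(5, 8, "error")
```

A search for timestamps 5 through 8 never touches the first bucket at all, which is the whole point of slicing by time.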

Finally, I wrote:

I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.


I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.

The point of what I in October, 2013 called

a high(er)-performance data store into which you can selectively copy columns of data

and which Splunk enthusiastically calls its “High Performance Analytic Store” is to meet that latter need.

Inverted-list indexing

Inverted list technology is confusing for several reasons, which start: 

  • It has two names that — rightly or wrongly — are used fairly interchangeably: inverted index and inverted list.
  • Inverted indexes have played different roles at different times. In particular:
    • They were the architecture of the best pre-relational general-purpose DBMS, namely ADABAS, Datacom/DB, and Model 204.
    • They are the core architecture of text search.
    • They are the architecture of certain document- or object-oriented DBMS, such as MarkLogic.
    • They are the core architecture of Splunk. :)

What’s more, inverted list technology can take several different forms.

  • In the simplest case, for each of many keywords, the inverted index lists the documents that contain it. Splunk does a form of this, where the “keyword” is the field — i.e. name — in a (field, value) pair.
  • Another option is to store, for each keyword or name, not just document_IDs, but additional information.
    • In the case of (field, value) pairs, the value can be stored. Splunk sometimes does that too.
    • In the case of text documents, the index can store the position(s) in the document that the word occurs. This is irrelevant to Splunk.
  • When you list all the records that have a certain field in them, and the list mentions the values, you’re getting pretty close to having a column-group NoSQL DBMS (e.g. Cassandra or HBase). Indeed, you might even be on your way to a columnar RDBMS; after all, SAP HANA grew out of a text indexing system.
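As a rough illustration of the first two forms, here is a small Python sketch that builds both a field-only inverted list and a (field, value) inverted list over a few made-up records. The record layout and the names `build_indexes`, `by_field` and `by_field_value` are assumptions for the example, not anything Splunk-specific.

```python
from collections import defaultdict

def build_indexes(records):
    """Build two inverted lists over records shaped like {doc_id: {field: value}}."""
    by_field = defaultdict(set)          # field -> doc_ids that contain the field
    by_field_value = defaultdict(set)    # (field, value) -> doc_ids with that exact pair
    for doc_id, fields in records.items():
        for name, value in fields.items():
            by_field[name].add(doc_id)
            by_field_value[(name, value)].add(doc_id)
    return by_field, by_field_value

# Hypothetical log records with varying fields.
records = {
    1: {"source": "web01", "status": "200"},
    2: {"source": "web02", "status": "500", "user": "alice"},
    3: {"source": "web01", "user": "bob"},
}
by_field, by_field_value = build_indexes(records)

docs_with_user = by_field["user"]                       # which records have a "user" field
docs_from_web01 = by_field_value[("source", "web01")]   # which records have source=web01
```

With only the first index, a filter on source=web01 still has to open every record that has a "source" field; with the second, the index answers it directly.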

Splunk, HPAS, and inverted indexes

With all that background, we can finally summarize Splunk’s “High Performance Analytic Store” story.

  • Splunk’s classic data store is an inverted list system that:
    • Tracks (field, value) pairs for a few fields that are always the same, such as Source_System.
    • Otherwise tracks fields only.
  • Splunk HPAS is an inverted list system that tracks (field, value) pairs for arbitrary fields. This gives much higher performance for queries that SELECT on or GROUP BY those fields.
  • As of Splunk 6, Splunk Classic and Splunk HPAS are tightly and almost transparently integrated.

While I haven’t probed for full specifics, I did gather:

  • Queries execute against both data stores at once, without any syntax change. At least, they do if you press some button; that’s the “almost” in the transparency.
  • HPAS time-slices the data it stores by the same time intervals that Splunk Classic does. Hence for each time range, integrated Splunk can interrogate the HPAS first and, if it can’t answer, go to the slower traditional Splunk store.
  • There are two basic ways to populate the HPAS:
    • As the data streams in.
    • Via the result sets of Splunk queries. Splunk talks as if this is the preferred way, which fits with Splunk’s long-time argument that it’s nice not to have to make any schema choices before you start streaming the data in.
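The "try the fast store first, per time slice, then fall back" behavior can be sketched as follows. I haven’t probed how Splunk actually does this, so treat it as a guess at the general shape: the dictionary layouts and the `count_by_value` function are hypothetical.

```python
def count_by_value(field, time_slices, fast_store, slow_store):
    """Count field values per time slice, preferring the fast (field, value) store."""
    totals = {}
    served_fast = 0
    for ts in time_slices:
        summary = fast_store.get((ts, field))    # pre-aggregated {value: count}, if present
        if summary is not None:
            served_fast += 1
        else:
            # Fall back to scanning raw events in the slower store for this slice.
            summary = {}
            for event in slow_store.get(ts, []):
                if field in event:
                    summary[event[field]] = summary.get(event[field], 0) + 1
        for value, n in summary.items():
            totals[value] = totals.get(value, 0) + n
    return totals, served_fast

# Hypothetical data: two hourly slices, only one of which has been summarized.
slow_store = {
    "09:00": [{"status": "200"}, {"status": "500"}],
    "10:00": [{"status": "200"}, {"status": "200"}],
}
fast_store = {("09:00", "status"): {"200": 1, "500": 1}}
totals, served_fast = count_by_value("status", ["09:00", "10:00"], fast_store, slow_store)
```

Because both stores are sliced by the same time intervals, the caller gets one answer and never needs to know which store served which slice.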
Categories: Other

Analytics for everybody!

DBMS2 - Wed, 2014-03-05 15:39

For quite some time, one of the most frequent marketing pitches I’ve heard is “Analytics made easy for everybody!”, where by “quite some time” I mean “over 30 years”. “Uniquely easy analytics” is a claim that I meet with the greatest of skepticism.*  Further confusing matters, these claims are usually about what amounts to business intelligence tools, but vendors increasingly say “Our stuff is better than the BI that came before, so we don’t want you to call it ‘BI’ as well.”

*That’s even if your slide deck doesn’t contain a picture of a pyramid of user kinds; if there actually is such a drawing, then the chance that I believe you is effectively nil.

All those caveats notwithstanding, there are indeed at least three forms of widespread analytics:

  • Fairly standalone, eas(ier) to use business intelligence tools, sometimes marketed as focusing on “data exploration” or “data discovery”.
  • Charts and graphs integrated or at least well-embedded into production applications. This technology is on a long-term rise. But in some sense, integrated reporting has been around since the invention of accounting.
  • Predictive analytics built into automated systems, for example ad selection. This is not what is usually meant by the “easy analytics” claim, and I’ll say no more about it in this post.

It would be nice to say that the first two bullet points represent a fairly clean operational/investigative BI split, but that would be wrong; human real-time dashboards can at once be standalone and operational.

Often, the message “Our BI is easy to use by everybody, unlike every other BI offering in the past 40 years” is unsupported by facts; vendors just offer me-too BI technology and falsely claim it’s something special. But sometimes there is actual substance, usually in one or more aspects of time-to-answer. For example:

  • Sometimes the BI itself has a particularly good interface for navigation.
  • I think it’s still possible to be differentiated in mobile BI delivery.
  • It’s definitely still possible to be differentiated in real-time/streaming BI interfaces.
  • Sometimes the visible BI is just part of a specialized stack, whose other elements make it much easier to set up working UI than in the traditional model.
    • Some claims along these lines are bogus, drawing false comparisons to worst-case scenarios in which enterprises take a year or two setting up their first-ever data warehouse.
    • Some of these claims, however, are more legitimate, at least to the extent that the stack includes leading-edge smart data integration, schema-on-need data management, and so on.

One item I’m leaving off the list is the capability to easily design charts, graphs or whole dashboards. When BI vendors add that functionality, they often present it as an industry innovation; but it’s been years since I saw something in that vein beyond the me-too.

Categories: Other

Android – Bridging the Gap Between Native and HTML5 Mobile Apps (ADF-Mobile And Cordova)

At Fishbowl Solutions, we’ve been looking at ways to bring the Android experience closer to the native experience that every user wants. Here are some solutions that we are using today to help bridge the Android <4.4 gap and also bring in the latest web technology, like WebSocket and WebRTC support, that is not available with either Cordova or ADF Mobile today.

When it comes to mobile development with ADF-Mobile or Cordova (HTML5), you will notice a significant difference between Android and iOS in performance and in support for the HTML5 and CSS3 standards when building HTML5 hybrid mobile apps.

Today, iOS is closer to bridging the gap, giving a near-native experience in iOS 7. With Android, however, if you’re running a device on anything lower than KitKat you will still notice a big performance hit. This is because ADF-Mobile and Cordova surface the standard webview, which uses the old WebKit engine of Android’s default browser rather than the Chromium browser (now part of KitKat).


For those developing with ADF-Mobile, be aware of the following:

- An old version of jQuery (1.7.1) is used with the AMX views (1.7.1 was not designed for mobile). Hopefully Oracle will upgrade this to the latest supported jQuery release targeted for mobile, or alternatively swap to a mobile framework with jQuery-like syntax, such as Zepto.js.

- You can push the updates in manually, which will improve response times and animations. However, be aware that some jQuery methods may have been deprecated and may cause you some issues, although I have not come across anything major when manually enhancing ADF Mobile.

There is also a bug with the initial load times of apps in ADF Mobile (Android). I believe Oracle are working to fix this with the next ADF-Mobile update. This is separate from the Cordova or webview issue, and I believe it to be more related to the JVM setup. (Correct me if I’m wrong, anyone…)

Cordova 2.2.0 is also used on ADF Mobile (the current Cordova release is 3.4.0). I’m hoping that in the future Oracle will make it easier for us to upgrade the Cordova release, and supply better release notes on ADF Mobile compatibility with Cordova,
i.e. “2.x is supported in the current release; 3.x support is on the roadmap in 9 months’ time”, etc.

- If you run into any issues with ADF Mobile, it’s worth taking a look to see what issues/bugs were in the Cordova 2.2.0 release, and following up with Oracle Support to supply a fix for the issue, or risk patching the framework yourself.


Creating that Native UX with Android (Cordova and potentially ADF-Mobile)

At Fishbowl Solutions we split our apps, with a Single Page App view for content management sitting outside of the AMX business component view. This allows our clients’ web and marketing teams to quickly enhance content and brand mobile apps, using best practices for developing hybrid HTML5 mobile applications, without the need to learn ADF-Mobile. This approach also allows us to deploy our core apps to Cordova or lifecycle management systems like IBM Worklight when clients do not need the power of ADF-Mobile and Java support to integrate with other Oracle systems.

Suggested Frontend JS Libraries (Cordova)
After working with PhoneGap/Cordova for the last 4 years, these are the libraries I recommend for mobile app development outside of ADF-Mobile AMX views.

Rethinking Best Practices

1. ReactJS, developed by Facebook/Instagram, is a perfect open source library for developing Single Page Apps optimised for mobile; its virtual DOM and JS engine make animations and transitions effortless. If you are new to ReactJS, you need to watch Rethinking Best Practices to get an underlying understanding and appreciation of why a virtual DOM makes complete sense when developing mobile apps: there is no need for acceleration in your browser to create clean transitions on touch events. Alternatively you could take a look at AngularJS by Google, but in terms of performance for mobile I personally believe ReactJS is the way to go, even though it’s still fairly new to the industry. It will provide a closer-to-native experience if used correctly.

2. Director, part of the Flatiron framework, is a great match for ReactJS. It’s a great URL router to handle page history and template requests for single-page mobile apps.

3. RequireJS is a module loader that will improve the speed and quality of your code; its optimizer can combine and minify your JS libraries and CSS files.

4. The i18next translation library is a great solution for marketing teams to manage internationalisation strings for your apps.

5. jQuery 2.1.0, the latest release, is now optimised for mobile development. A year or so ago I would have recommended Zepto.js, but today jQuery’s latest release is just as good for mobile development.

Getting Rid of Android’s Old WebKit Browser and Enhancing with Chromium and Blink!!

So this is where things get interesting!…

I’ve been working with the Crosswalk-Project runtime this last month, upgrading Fishbowl Solutions’ mobile Cordova apps. In effect this has given me the OOTB power and experience achieved with iOS 7 Cordova apps, and more!

Crosswalk Overview

At the heart of Crosswalk is the Blink rendering and layout engine. This provides the same HTML5 features and capabilities you would expect to find in any modern web browser, e.g. WebSocket, WebRTC, etc.

Building on Blink, Crosswalk uses the Chromium Content module to provide a multi-process architecture, designed for security and performance.

For anyone developing Cordova or hybrid apps, I’d recommend taking a look at this project and incorporating the runtime if you are working on Android mobile apps. This month we’ll be looking at the potential to incorporate this runtime with ADF-Mobile; we’ll let you know how we get on.



The post Android – Bridging the Gap Between Native and HTML5 Mobile Apps (ADF-Mobile And Cordova) appeared first on C4 Blog by Fishbowl Solutions.

Categories: Fusion Middleware, Other

Confusion about metadata

DBMS2 - Sun, 2014-02-23 00:50

A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.

“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:

  • Data about data structure. This is the classical sense of the term. But please note:
    • In a relational database, structural metadata is rather separate from the data itself.
    • In a document database, each document might carry structure information with it.
  • Other inputs to core data management functions. Two major examples are:
    • Column statistics that inform RDBMS optimizers.
    • Value ranges that inform partition pruning or, more generally, data skipping.
  • Inputs to ancillary data management functions — for example, security privileges.
  • Support for human decisions about data — for example, information about authorship or lineage.
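To illustrate the "value ranges that inform partition pruning" kind of metadata, here is a minimal Python sketch. Each partition carries a min and a max alongside its values, and a range query consults only that metadata to decide what to skip. The layout and the `scan_with_skipping` name are made up for the example.

```python
def scan_with_skipping(partitions, low, high):
    """Return matching values, reading only partitions whose range can overlap."""
    matches, partitions_read = [], 0
    for part in partitions:
        # The (min, max) metadata alone tells us whether the partition can be skipped.
        if part["max"] < low or part["min"] > high:
            continue
        partitions_read += 1
        matches.extend(v for v in part["values"] if low <= v <= high)
    return matches, partitions_read

# Hypothetical partitions, each with value-range metadata next to the data.
partitions = [
    {"min": 1, "max": 10, "values": [1, 4, 10]},
    {"min": 11, "max": 20, "values": [11, 15, 20]},
    {"min": 21, "max": 30, "values": [21, 27, 30]},
]
matches, partitions_read = scan_with_skipping(partitions, 12, 22)
```

The min and max here are not "about" the structure of the data at all; they are simply extra inputs that let the engine do less work.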

What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.

And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:

  • Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
  • Some document databases store structural metadata right with the document data itself.
  • Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
  • Actual text documents carry the structure imposed by grammar and syntax.

Categories: Other

ADF Mobile – Packaged WebCenter Accelerator (Cordova Application)

At Fishbowl Solutions we’ve been focusing heavily on mobile solutions over the last 4 years, working on both award-winning native and hybrid apps as well as adaptive and responsive web sites and portals for tablet and mobile devices. Last year we decided to expand our reach with ADF Mobile and Cordova-based hybrid applications targeting the WebCenter Suite for Android and iOS (as well as BlackBerry and Microsoft, for those using plain Cordova or IBM Worklight).

Both Oracle and Fishbowl offer a suite of powerful native apps, available on Google Play for Android and the App Store for iOS. However, to date these solutions are either not extendible or brandable (on Oracle’s side), or require customisations and consultancy services from Fishbowl Solutions to extend and enhance the native apps to meet unique client requirements.

With the Oracle ADF Mobile framework, Java developers and clients can now easily pull together hybrid applications within JDeveloper’s visual app designer. Unfortunately, there is currently no Oracle-supported open ADF Mobile application for the WebCenter suite that allows clients to easily extend and enhance existing functionality for their corporate users running on mobile devices. And currently the WebCenter apps are not targeted to provide a clean user experience for mobile devices, and lack features like offline or native device support.

All existing WebCenter apps that clients want mobile access to now have to be rebuilt for mobile devices, as either native or hybrid applications (for example with the ADF Mobile framework). So at Fishbowl we’ve decided to build out Fishbowl-supported mobile app accelerators for WebCenter Spaces, Content and BPM/SOA (more to come on our mobile roadmap). These will allow both your marketing teams (web designers with no ADF knowledge) and ADF Mobile development teams to quickly and easily customise, extend, enhance and build mobile applications that provide all of the key out-of-the-box features that come with the WebCenter Suite. More importantly, if you don’t have a requirement for any of the modules, they can easily be switched off until you need to enable them in the future. They can also be packaged as a single unified app or as multiple individual applications, based on your requirements.

The core Fishbowl WebCenter accelerators (Portal, Content, BPM/SOA) are packaged and cannot be customised directly. This is so that we can send out software updates to all our clients as we follow our roadmap for more integrations and accelerators, without needing to worry that our core accelerators have been tweaked. What we do provide is a key set of APIs and templates that enable you to easily extend, enhance and brand the application, as well as create or extend the UI or user flow of the app to support and integrate with customised bespoke client WebCenter services, custom portal taskflow services, or even third-party core business applications. This empowers you to manage and maintain the app, whilst we at Fishbowl provide the key core support, services and UI shell for future WebCenter OOTB releases.

For more information or a demo – Please contact Fishbowl Solutions here.







The post ADF Mobile – Packaged WebCenter Accelerator (Cordova Application) appeared first on C4 Blog by Fishbowl Solutions.

Categories: Fusion Middleware, Other

MemSQL 3.0

DBMS2 - Mon, 2014-02-10 14:38

Memory-centric data management is confusing. And so I’m going to clarify a couple of things about MemSQL 3.0 even though I don’t yet have a lot of details.* They are:

  • MemSQL has historically been an in-memory row store, which as of last year scales out.
  • It turns out that the MemSQL row store actually has two table types. One is scaled out. The other — called “reference” — is replicated on every node.
  • MemSQL has now added a third table type, which is columnar and which resides in flash memory.
  • If you want to keep data in, for example, both the scale-out row store and the column store, you’d have to copy/replicate it within MemSQL. And if you wanted to access data from both versions at once (e.g. because different copies cover different time periods), you’d likely have to do a UNION or something like that.
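Here is roughly what the "UNION or something like that" in the last bullet looks like. Since I don’t have MemSQL 3.0 details, this sketch uses Python’s built-in SQLite module as a stand-in; both tables are ordinary SQLite tables, the table and column names are hypothetical, and MemSQL’s actual DDL for the different table types will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_recent  (order_day TEXT, amount INTEGER);  -- stands in for the row store
    CREATE TABLE orders_history (order_day TEXT, amount INTEGER);  -- stands in for the column store
    INSERT INTO orders_recent  VALUES ('2014-02-09', 40), ('2014-02-10', 60);
    INSERT INTO orders_history VALUES ('2013-12-31', 100), ('2014-01-15', 25);
""")

# One query over both copies: UNION ALL stitches the two time periods together.
total, = conn.execute("""
    SELECT SUM(amount) FROM (
        SELECT amount FROM orders_recent
        UNION ALL
        SELECT amount FROM orders_history
    )
""").fetchone()
```

The application (or a view) has to know that the data lives in two places, which is the cost of not having the two table types integrated.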

*MemSQL’s first columnar offering sounds pretty basic; for example, there’s no columnar compression yet. (Edit: Oops, that’s not accurate. See comment below.) But at least they actually have one, which puts them ahead of many other row-based RDBMS vendors that come to mind.

And to hammer home the contrast:

  • IBM, Oracle and Microsoft, which all sell row-based DBMS meant to run on disk or other persistent storage, have added or will add columnar options that run in RAM.
  • MemSQL, which sells a row-based DBMS that runs in RAM, has added a columnar option that runs in persistent solid-state storage.
Categories: Other

Distinctions in SQL/Hadoop integration

DBMS2 - Sun, 2014-02-09 12:50

Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):

  • Are the SQL engine and Hadoop:
    • Necessarily on the same cluster?
    • Necessarily or at least most naturally on different clusters?
  • How, if at all, is Hadoop invoked by the SQL engine? Specifically, what is the role of:
    • HDFS (Hadoop Distributed File System)?
    • Hadoop MapReduce?
    • HCatalog?
  • How, if at all, is the SQL engine invoked by Hadoop?

In particular:

  • If something is called a “connector”, then Hadoop and the SQL engine are most likely on separate clusters. Good features include (but these can partially contradict each other):
    • A way of making data transfer maximally parallel.
    • Query planning that is smart about when to process on the SQL engine and when to use Hadoop’s native SQL (Hive or otherwise).
  • If something is called “SQL-on-Hadoop”, then Hadoop and the SQL engine are or should be on the same cluster, using the same nodes to store and process data. But while that’s a necessary condition, I’d prefer that it not be sufficient.

Let’s go to some examples.

Hive is the closest example of SQL/Hadoop integration known. Hive executes a somewhat low-grade dialect of SQL — HQL (Hive Query Language) — via very standard Hadoop: Hadoop MapReduce, all HDFS file formats, etc. HCatalog is an enhancement/replacement for the Hive metadata store. HQL is just another language that can be used to write (parts of) Hadoop jobs.
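To show what "just another language for writing Hadoop jobs" means in practice, here is a toy Python sketch of how a simple GROUP BY decomposes into map, shuffle and reduce steps. It illustrates the general idea only, not Hive’s actual query planner; the table, column names and function names are invented.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit one (key, value) pair per input row."""
    for row in rows:
        yield row["country"], row["amount"]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values."""
    return {key: sum(values) for key, values in groups.items()}

rows = [
    {"country": "US", "amount": 10},
    {"country": "DE", "amount": 5},
    {"country": "US", "amount": 7},
]
# Roughly: SELECT country, SUM(amount) FROM sales GROUP BY country
result = reduce_phase(shuffle(map_phase(rows)))
```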

Impala is Cloudera’s replacement for Hive. Impala is and/or is planned to be much like Hive, but much better, for example in performance and in SQL functionality. Impala has its own custom execution engine, including a daemon on every Hadoop data node, and seems to run against a variety of but not all HDFS file formats.

Stinger is Hortonworks’ (and presumably also Apache’s) answer to Impala, but is more of a Hive upgrade than an outright replacement. In particular, Stinger’s answer to the new Impala engine is a port of Hive to the new engine Tez.

Teradata SQL-H is an RDBMS-Hadoop connector that uses HCatalog, and plans queries across the two clusters. Microsoft Polybase is like SQL-H, but it seems more willing than Teradata or Teradata Aster to (optionally) coexist on the same nodes as Hadoop.

Hadapt runs on the Hadoop cluster, putting PostgreSQL* and other software on each Hadoop data node. It has two query engines, one that invokes Hadoop MapReduce (the original one, still best for longer-running queries) and one that doesn’t (more analogous to Impala). When last I looked, Hadapt didn’t query or update against the HDFS API, but there was an interesting future in preloading data from HDFS into Hadapt PostgreSQL tables, and I think that Hadapt’s PostgreSQL tables are technically HDFS files. I don’t think Hadapt makes much use of HCatalog.

*Hacked to allow Hadapt to offer more than just SQL/Hadoop integration.

Splice Machine is a new entrant (public beta is imminent) that has put Apache Derby over an HBase back end. (Apache Derby is the former Cloudscape, an embeddable Java RDBMS that was acquired by Informix and hence later by IBM.) Splice Machine runs on your Hadoop nodes as an HBase coprocessor. Its relationship to non-HBase parts of Hadoop is arm’s-length. I wish this weren’t called “SQL-on-Hadoop”.

Related links

  • Dan Abadi and Dave Dewitt opined last June about how to categorize Hadapt and Polybase.
  • My most detailed discussions of Impala and Stinger were last June and August, respectively.
Categories: Other

Some stuff I’m thinking about (early 2014)

DBMS2 - Sun, 2014-02-02 12:51

From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:

Other stuff on my mind includes but is not limited to:

1. Certain categories of buying organizations are inherently leading-edge.

  • Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
  • US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
  • Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
  • … as have national-security agencies …
  • … as have pharmaceutical research departments.

Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.

2. In particular, I hope to figure out where Hadoop is or soon will be getting major adoption.

  • Widespread Hadoop adoption at ordinary large enterprises is, I think, inevitable and imminent. But it hasn’t quite happened yet.
  • I think that part of the “enterprise data hub” story is a great bet to come true — Hadoop is becoming a key destination for data to land and be transformed. MapReduce was invented for data transformation; Hadoop was invented to do MapReduce; data transformation workloads have already been moving from expensive analytic RDBMS to cheaper Hadoop.
  • I also think Hadoop — enhanced with Spark or whatever — will win as a platform for sophisticated predictive modeling; Hadoop’s (and Spark’s) flexibility is at least as useful for the purpose as RDBMS’ SQL execution speed.
  • I’m still skeptical about ordinary enterprises’ adoption of Hadoop as a business intelligence platform, but it’s definitely another area to track.

3. Analytic RDBMS and data warehouse appliance pricing is always a big deal. Hadoop’s great price advantage doesn’t have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or Infobright.

Speaking of that, it turns out Teradata now publishes per-terabyte pricing. Please note that those are uncompressed prices; actual prices can be assumed to be lower, at least for databases that compress well.

Analytic RDBMS prices are still shaking out.

4. As I previously noted, ensemble models have become the norm for machine learning. I want to learn more about the implications of that.

One conjecture — everything we learned in school about statistics is wrong, or at least it’s less important than we thought. Predictive modeling is not mainly about least squares, regressions, curve-fitting, etc. Rather, it’s first and foremost about data segmentation and clustering, with all the curve-fitting stuff being secondary.

Besides fitting — as it were — what I hear, this hypothesis also matches common sense. How do businesses use predictive modeling? For each customer/prospect/site-visitor/whatever, they decide which of a limited number of possible actions to take. At its core, that’s an exercise in segmentation.

5. I think data integration is getting a lot smarter than it was. Hadoop-based transformation is the obvious example. But there’s also ClearStory’s data intelligence pitch. (And yes, I know I need to talk with Paxata. There’s been a lot of ball-dropping on that one, including by me.)

6. There’s a meta-theme in the above — stuff that’s not exactly a DBMS or DBMS-like data store. Streaming fits into that. So does smart data integration. So, arguably, does Spark. So do data grids, another of those topics I’d like to know more about but haven’t nailed down yet.

Data management is getting ever more complex.

Categories: Other

Spark and Databricks

DBMS2 - Sun, 2014-02-02 12:50

I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.

  • Spark is very new. All Spark adoption is recent.
  • Databricks was founded to commercialize Spark. It is very much in stealth mode …
  • … except insofar as Databricks folks are going out and trying to drum up Spark adoption. :)
  • Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated. :)
  • Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
  • Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.

The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:

  • Spark is a distributed execution engine for analytic processes …
  • … which works well with Hadoop.
  • Spark is distinguished by a flexible in-memory data model …
  • … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
  • Intended analytic use cases for Spark include:
    • SQL data manipulation.
    • ETL-like data manipulation.
    • Streaming-like data manipulation.
    • Machine learning.
    • Graph analytics.

Except for certain low-latency operations,* anything you can do in Spark can also be done in straight Hadoop; Spark can simply have advantages in performance and programming ease. Spark RDDs (Resilient Distributed Datasets) are immutable at this time, so Spark is not suited for short-request update workloads.

*A new Spark task requires a thread, not a whole Java Virtual Machine.
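
To make the immutability point concrete, here is a toy sketch in plain Python. It is not Spark's API; the `ImmutableDataset` class and its method names are hypothetical stand-ins chosen for illustration. The idea it shows is that every transformation produces a new dataset rather than changing an existing one, which suits analytic pipelines but gives you no cheap way to update a single record in place.

```python
from typing import Callable, Iterable, Tuple


class ImmutableDataset:
    """Toy stand-in for an immutable in-memory dataset (hypothetical, not Spark's RDD API)."""

    def __init__(self, records: Iterable):
        # A tuple cannot be modified in place once it has been created.
        self._records: Tuple = tuple(records)

    def map(self, fn: Callable) -> "ImmutableDataset":
        # Builds and returns a new dataset; the original is left untouched.
        return ImmutableDataset(fn(r) for r in self._records)

    def filter(self, pred: Callable) -> "ImmutableDataset":
        # Also returns a new dataset holding only the matching records.
        return ImmutableDataset(r for r in self._records if pred(r))

    def collect(self) -> list:
        # Hands back a copy, so callers cannot mutate the stored records.
        return list(self._records)


prices = ImmutableDataset([10, 25, 40])
doubled = prices.map(lambda p: p * 2)
big = doubled.filter(lambda p: p > 30)

print(prices.collect())  # [10, 25, 40]
print(big.collect())     # [50, 80]
```

Changing one value here means deriving a whole new dataset, which is fine for batch-style analytics and a poor fit for a stream of small updates.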

Everybody agrees that machine learning is a top Spark use case. In particular:

  • Cloudera sees machine learning as the major area of Spark adoption to date.
  • Ion gave me the impression machine learning is one of the major areas of Spark adoption to date.
  • Mike gave me the impression that machine learning was a core intended use case for Spark the first time we talked about it.
  • There’s a machine learning library for Spark, and also a way to use Spark to do distributed R.

I believe data transformation is a major Spark use case as well.

  • Ion gave me that impression, although Cloudera surprisingly did not. Edit: Actually, see Matt Brandwine’s comment below.
  • I have one client (ClearStory) using Spark that way, and a second that’s likely to.
  • It makes sense that the #1 Hadoop use case (to date), which is something Spark also is well-suited for, would be an important early Spark use case as well.

Spark Streaming is fairly new, but is already getting some adoption. Notes on that include:

  • The actual technology is a form of micro-batching. I plan to learn more about that in the future.
  • Cloudera sees streaming as one of the two big Spark use cases, and praises Spark Streaming for its fault tolerance and its great ease of coding.
  • Mike Franklin knows a lot about streaming.
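
For readers unfamiliar with the term, micro-batching in general means chopping an incoming stream into small time windows and processing each window as an ordinary batch. The following is a minimal sketch of that general idea in plain Python, assuming events arrive in timestamp order. The function name `micro_batches` and the one-second interval are illustrative assumptions, and nothing here is taken from Spark Streaming's actual implementation.

```python
from typing import Iterable, Iterator, List, Tuple


def micro_batches(
    events: Iterable[Tuple[float, str]], interval: float
) -> Iterator[List[str]]:
    """Group (timestamp, payload) events into consecutive fixed-width windows.

    Assumes events arrive in timestamp order. Empty windows are skipped.
    """
    batch: List[str] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            # The first event opens the first window.
            window_end = ts + interval
        # Close the current window, and step past any empty ones,
        # before placing this event.
        while ts >= window_end:
            if batch:
                yield batch
                batch = []
            window_end += interval
        batch.append(payload)
    if batch:
        # Emit whatever is left when the stream ends.
        yield batch


events = [(0.0, "a"), (0.4, "b"), (1.1, "c"), (1.9, "d"), (3.2, "e")]
counts = [len(b) for b in micro_batches(events, interval=1.0)]
print(counts)  # [2, 2, 1]
```

Each yielded list can then be handed to the same code that would process a normal batch, which is the appeal of the approach; the cost is that latency can never be lower than the window width.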

Part of that story is a sudden decline in the reputation of Storm, whose troubles seem to include:

  • Project founder and Twitter employee Nathan Marz seems to be no longer associated with Storm or employed at Twitter.
  • I am told that in general the Storm community is not all that vibrant.
  • Various aspects of Storm’s technology are disappointing people.

Other notes on Spark use cases include:

  • Impala-loving Cloudera doesn’t plan to support Shark. Duh.
  • Cloudera also won’t at first support any Spark predictive modeling add-on.
  • Ion’s other company, Conviva, is doing some real-time decisioning in Spark.

Spark data management has been enhanced by a project called Tachyon.* The main point of Tachyon is that Spark RDDs (Resilient Distributed Datasets) now persist in memory beyond the life of a job; besides offering the RDDs to other Spark jobs, Tachyon also opens them to Hadoop via an HDFS emulator.

*If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov — i.e., the radiation that is measured to detect the passage of tachyons. :)

And finally, some metrics and so on:

  • Databricks has between 10 and 20 employees.
  • Spark has >100 individual contributors from >25 different companies.
  • There was a Spark Summit with >450 attendees (from >180 organizations), and an earlier Spark-mainly conference with >200 attendees.
  • The Spark meet-up group in San Francisco has >1500 members signed up.
  • Various Spark users and subprojects are identified on the Apache Spark pages.

Related link

  • Most of the current substance on Databricks’ website is in its blog.

Categories: Other